COMPARISON OF CURVE ESTIMATION REGRESSION METHODS IN PREDICTING PROTEIN AMOUNT FROM TOTAL MILK YIELD IN HOLSTEIN DAIRY CATTLE

Milk yield is of great economic importance to milk processors in the dairy industry and to consumers, and milk composition plays a major role in determining the price of milk. Protein is a major constituent of milk, so this study focused on predicting protein amount from total milk yield. Total milk yield and protein amount are generally treated as linearly related, so it is worth examining whether nonlinear models describe the relationship better. This work aimed to investigate the relationship between protein amount and milk production, to predict protein amount from total milk yield, and to choose the best-fitting model for this purpose. Besides the linear model, ten nonlinear regression models were used, including the power, quadratic and cubic models. Lactation records of 1300 Holstein dairy cattle belonging to Dina farms, on the Alexandria-Cairo desert road, Egypt, were used. The regression models (curve estimation regression method) were fitted using SPSS version 26. The best-fitting model was taken to be the one with the highest R square and adjusted R square (intuitive measures that are inadequate on their own) together with the lowest standard error of the estimate and the lowest AIC (a more accurate measure). Of the 11 regression models, the power model was the best fit for predicting protein amount from total milk yield: it had the highest R square (0.856) and adjusted R square (0.856), and the lowest standard error of the estimate (0.230) and AIC (-13135.84). The power model can be used for prediction through the equation protein amount = 0.130 * (total milk yield ** 0.815), obtained after 15 iterations.


INTRODUCTION
Milk producers concentrate on milk composition because of its economic importance and its importance to milk consumers. Milk constituents have been recorded for many years, with the Holstein breed averaging 3.6% fat, 3.2% protein, and 4.7% lactose (Young et al., 1986).
Many factors affect milk constituents, such as breed, genetic variation within breed, health, environment, management, and feed.
The total amount of protein in milk is determined by analyzing milk for nitrogen and multiplying by a factor of 6.38. The total protein content of milk is about 3.5 percent, of which 94 to 95 percent is in the form of true protein.
Prediction techniques are an important part of statistics, and regression methods of various types are among the most important of them. These methods are applied when the relationship between the dependent and explanatory variables takes a linear or nonlinear functional form. Nonlinear regression methods are widely applied in studies of animal behavior and breeding (Sengül and Kiraz, 2005). Linear and nonlinear regression models are widely used to study and predict relationships between quantitative variables (dependent and independent) in animal research (Cankaya, 2009).
Studying and predicting the relationship between the milk yield and milk constituents of dairy animals is very important for dairy managers and for human health (Nguyen et al., 2020).
Numerous modeling techniques have been used to forecast milk constituents from total milk yield with good forecasting power (Lehmann et al., 2019).
As a general rule, there is a linear relationship between fat/protein yield and milk production (they increase together slowly) (Nguyen et al., 2020).
Nonlinear regression fits a curve of a chosen functional form to the data in order to find the best-fitting curve shape; the fitting algorithm requires initial guesses for the parameters, known as starting values.
The objective of this work was to compare eleven regression models (linear and nonlinear) in order to choose suitable models for predicting protein amount from total milk production, a purpose for which the linear model has commonly been applied. The parameter estimates were used for comparison. The linear model and ten nonlinear models (inverse, S-curve, logarithmic, quadratic, cubic, power, compound, growth, exponential and logistic) were applied.

Data source:
Data were obtained from the lactation records of Dina farms, on the Alexandria-Cairo desert road, Egypt (n = 1300), for 2010. The variables under study were total milk yield and protein amount.
-The independent variable is TMY (total milk yield).
-The dependent variable is PA (protein amount).

Handling and analysis of data:
The statistical analysis was divided into two steps. The first step applied curve estimation regression to choose the best-fitting model. The second step applied nonlinear regression to form the prediction equation for the models suggested by the first step.
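The two-step workflow can be sketched for the power model as follows. This is a minimal illustration on synthetic data, not the study's records: the function name `fit_power_gauss_newton`, the generating coefficients (0.13, 0.8) and the noise level are all hypothetical. Gauss-Newton is used purely for illustration and is not claimed to be the algorithm SPSS applies. The starting values come from a log-log straight-line fit, which plays the role of the screening step.

```python
import numpy as np

def fit_power_gauss_newton(x, y, a0, b0, max_iter=50, tol=1e-10):
    """Fit y = a * x**b by Gauss-Newton iterations from starting values (a0, b0)."""
    a, b = float(a0), float(b0)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        pred = a * x**b
        resid = y - pred
        # Jacobian of the prediction with respect to the parameters (a, b).
        jac = np.column_stack([x**b, a * x**b * np.log(x)])
        # Least-squares solution of jac @ step = resid gives the update.
        step, *_ = np.linalg.lstsq(jac, resid, rcond=None)
        a, b = a + step[0], b + step[1]
        if np.max(np.abs(step)) < tol:
            break
    return a, b, iteration

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(10.0, 60.0, size=200)
y = 0.13 * x**0.8 * np.exp(rng.normal(0.0, 0.05, size=200))

# Step 1: a straight-line fit on the log-log scale supplies starting values.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

# Step 2: nonlinear least squares on the original scale refines them.
a_hat, b_hat, n_iter = fit_power_gauss_newton(x, y, np.exp(intercept), slope)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, iterations = {n_iter}")
```

Poor starting values can make such iterative fits slow or divergent, which is one reason a screening step before the nonlinear fit is useful.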

Statistical models:
Plotting the data is an important first step for checking whether the variables are linearly related. If they are not, the data can be transformed, or different curve estimation methods can be applied to suggest the best one, using SPSS version 26 (SPSS, 2020; Hassan and Mansour, 2021).
Curve estimation is the procedure of finding the mathematical formula, or drawing the curve, that best fits a set of data. It is used to predict the dependent variable from the independent variable while avoiding the multicollinearity problem, which lowers the accuracy of the model (Tırınk et al., 2020 and Kurnaz et al., 2021).
Linear and nonlinear regression models (inverse, S-curve, logarithmic, quadratic, cubic, power, compound, growth, exponential and logistic) were used to study the relationship between milk production and protein amount.
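A curve estimation screen of this kind can be sketched by fitting each model as a straight line after a suitable transformation. The sketch below covers only five of the eleven models, runs on synthetic data with hypothetical coefficients, and computes R² on each model's own fitted scale (the log scale for the power and exponential models). That choice of scale is an assumption of this example, and `curve_estimation` is a hypothetical helper name.

```python
import numpy as np

def r_squared(y, fitted):
    """Coefficient of determination, 1 - SSE / SST."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def curve_estimation(x, y):
    """Fit several linearizable models by OLS; return {name: (b0, b1, R2)}."""
    identity = lambda v: v
    # (transform of x, transform of y) under which each model is a straight line.
    transforms = {
        "linear": (identity, identity),             # y = b0 + b1*x
        "logarithmic": (np.log, identity),          # y = b0 + b1*ln(x)
        "inverse": (lambda v: 1.0 / v, identity),   # y = b0 + b1/x
        "power": (np.log, np.log),                  # ln(y) = b0 + b1*ln(x)
        "exponential": (identity, np.log),          # ln(y) = b0 + b1*x
    }
    results = {}
    for name, (fx, fy) in transforms.items():
        tx, ty = fx(x), fy(y)
        b1, b0 = np.polyfit(tx, ty, 1)
        results[name] = (b0, b1, r_squared(ty, b0 + b1 * tx))
    return results

# Synthetic data generated from a power law (hypothetical values).
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 100.0, size=300)
y = 0.2 * x**0.5 * np.exp(rng.normal(0.0, 0.05, size=300))

results = curve_estimation(x, y)
best = max(results, key=lambda name: results[name][2])
for name, (b0, b1, r2) in results.items():
    print(f"{name:12s} R2 = {r2:.3f}")
print("best by R2:", best)
```

For the power and exponential rows, b0 is the intercept on the log scale, so the multiplicative constant of the original model is exp(b0).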

Robust estimators:
The ordinary least squares method is not suitable when outliers or extreme values are present, because they produce large errors. Robust estimation measures reduce the effect of outliers by identifying them, giving a more accurate estimate (Almetwally and Almongy, 2018). Huber's M-estimator, Tukey's biweight, Hampel's M-estimator and Andrews' wave are good robust estimators that need little computational effort and converge rapidly (Guo, 2003).
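As an illustration of how such an estimator downweights outliers, the sketch below computes a Huber M-estimate of location by iteratively reweighted averaging. The tuning constant `k=1.339`, the use of the rescaled median absolute deviation as a fixed scale, and the sample values are assumptions of this example, not details taken from the study.

```python
import numpy as np

def huber_location(values, k=1.339, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means."""
    values = np.asarray(values, dtype=float)
    mu = np.median(values)
    # Robust scale: median absolute deviation, rescaled for normal data.
    scale = np.median(np.abs(values - mu)) / 0.6745
    if scale == 0:
        return mu
    for _ in range(max_iter):
        u = np.abs(values - mu) / scale
        # Full weight inside the threshold k, weight k/u beyond it.
        weights = np.where(u <= k, 1.0, k / np.maximum(u, 1e-12))
        new_mu = np.sum(weights * values) / np.sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical sample with one gross outlier (9.0).
sample = [2.0, 2.1, 1.9, 2.2, 2.05, 1.95, 9.0]
estimate = huber_location(sample)
print(f"mean = {np.mean(sample):.3f}, Huber estimate = {estimate:.3f}")
```

The ordinary mean is pulled toward the outlier, while the Huber estimate stays close to the bulk of the data. When several robust estimators agree with each other, as reported in the Discussion, that suggests outliers are not driving the results.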

Statistical Hypotheses:
The first hypothesis:
Null Hypothesis: Protein amount cannot be predicted from total milk yield.

Alternative Hypothesis:
Protein amount can be predicted from total milk yield.

The second hypothesis:
Null Hypothesis: There is no difference between the linear and nonlinear models in predicting the dependent variable from the independent variable.

Alternative Hypothesis:
There is a difference between the linear and nonlinear models in prediction (Kira et al., 2019).

The models mathematics:
Hassan and Mansour (2021) gave the following mathematical formulas for the different models. 1. Linear regression model: y = β0 + β1x + e, where β0 is the y-intercept and β1 is the slope of the regression line β0 + β1x, and e is the deviation (error) of the actual y value from that line. This model assumes that the error values are independent and normally distributed with zero mean, E(e) = 0, and constant variance (the variance of y is σ² and is the same for all x values) (Mason et al., 2003).

Nonlinear regression models:
The nonlinear regression models used to study the relationship between protein amount and total milk yield are listed in Table 1.

Table 1: Regression models (linear and nonlinear) and their equations.
Where Y: protein amount in the prediction equation; β0: the y-intercept; β1: the change in protein amount for a one-unit change in total milk yield; β2: the regression coefficient of squared total milk yield; β3: the regression coefficient of cubed total milk yield; X: the independent variable (total milk yield); X²: the square of total milk yield; X³: the cube of total milk yield; and ln: the natural logarithm. β1, β2, β3 and βk are the regression coefficients for the k independent variables, respectively.
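For orientation, the eleven models can be written in the symbols defined above. The forms below are the ones commonly used by curve estimation procedures, and they are given here as an assumption. The study's own table may have used different notation. In the logistic model, u denotes an upper bound, which is not defined in this paper. When no bound is specified, the 1/u term is taken as zero, which is consistent with ln(1/PA) being the fitted dependent variable.

```latex
\begin{align*}
\text{Linear:}      \quad & Y = \beta_0 + \beta_1 X \\
\text{Logarithmic:} \quad & Y = \beta_0 + \beta_1 \ln X \\
\text{Inverse:}     \quad & Y = \beta_0 + \beta_1 / X \\
\text{Quadratic:}   \quad & Y = \beta_0 + \beta_1 X + \beta_2 X^2 \\
\text{Cubic:}       \quad & Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \\
\text{Power:}       \quad & Y = \beta_0 X^{\beta_1} \\
\text{Compound:}    \quad & Y = \beta_0 \beta_1^{X} \\
\text{S-curve:}     \quad & Y = \exp(\beta_0 + \beta_1 / X) \\
\text{Growth:}      \quad & Y = \exp(\beta_0 + \beta_1 X) \\
\text{Exponential:} \quad & Y = \beta_0 \exp(\beta_1 X) \\
\text{Logistic:}    \quad & Y = \frac{1}{1/u + \beta_0 \beta_1^{X}}
\end{align*}
```

Taking logarithms of the power, compound, S-curve, growth and exponential forms gives equations that are linear in the parameters, which is why these models are fitted with ln(Y) as the dependent variable.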

Fitting measures of model selection:
Several goodness-of-fit measures are available for suggesting the best model, such as the coefficient of determination (the square of the correlation coefficient), adjusted R-squared, the Akaike information criterion (AIC) and the standard error of the estimate (where low values are preferred).
The coefficient of determination is R² = SS explained (regression) / SS Total.
SS explained is the regression sum of squares and SS Total is the total sum of squares. R² is frequently applied, but it is not considered a suitable measure of performance for nonlinear models for several reasons (it does not account for the number of parameters, and the full model does not contain a single-parameter model), so other model selection criteria have been suggested (Wallach, 2006).
In the adjusted R-squared, N is the sample size, R² is the coefficient of determination and P is the number of regression parameters.
Mean square error: MSE = SSE/(n−k), where n is the number of data values, SSE is the error sum of squares and k is the number of parameters.
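These measures can be computed as in the sketch below. Two assumptions are made here that the text does not state explicitly. First, the adjusted R² uses the common form 1 − (1 − R²)(N − 1)/(N − P − 1), with P counted as the number of predictors excluding the intercept. Second, k in the MSE is taken as P + 1. R² is computed as 1 − SSE/SST, which equals SS explained / SS Total for a least-squares fit with an intercept. The function name and the data are hypothetical.

```python
import numpy as np

def fit_measures(y, fitted, n_predictors):
    """Return (R2, adjusted R2, MSE) for observed y and fitted values."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = y.size
    sse = np.sum((y - fitted) ** 2)        # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    r2 = 1.0 - sse / sst
    # Assumed form: P excludes the intercept.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    # Assumed: k = number of predictors + intercept.
    mse = sse / (n - (n_predictors + 1))
    return r2, adj_r2, mse

# Hypothetical observed and fitted values for a one-predictor model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.9, 4.9]
r2, adj_r2, mse = fit_measures(observed, fitted, n_predictors=1)
print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, MSE = {mse:.4f}")
```

With a large sample and few parameters, as in this study (n = 1300), R² and adjusted R² differ very little, which matches the nearly identical values reported for each model.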

Akaike Information Criterion:
It is a statistic for choosing a suitable model when comparing different models: AIC = n*ln(SSe/n) + 2k, where n is the number of data values, k is the number of regression parameters and SSe is the error sum of squares, for which small values are preferable (Akaike, 1974). The AIC value is a guide for selecting the better model: a lower value is preferable to a higher one.
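The formula translates directly into code. In the sketch below the function name `aic` and the two example models are hypothetical; only the formula itself comes from the text.

```python
import math

def aic(n, sse, k):
    """AIC = n * ln(SSE / n) + 2k for a least-squares fit."""
    return n * math.log(sse / n) + 2 * k

# Two hypothetical models fitted to the same 100 observations.
aic_simple = aic(n=100, sse=50.0, k=2)   # fewer parameters, larger error
aic_complex = aic(n=100, sse=49.5, k=4)  # more parameters, slightly smaller error
print(f"simple: {aic_simple:.2f}, complex: {aic_complex:.2f}")
```

In this example the extra parameters of the second model reduce the error only slightly, so the 2k penalty leaves the simpler model with the lower, and therefore preferred, AIC. Because AIC values depend on the scale of the dependent variable, comparisons are most direct between models fitted to the same response.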

Standard error of the estimate:
It is a measure of how well a regression model fits for prediction. When comparing models, smaller values are preferable. It is the square root of the average squared deviation.
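Written out, and assuming the averaging uses the same n − k denominator as the MSE defined earlier (the text says only "average squared deviation"), the standard error of the estimate is:

```latex
\[
  S_e \;=\; \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k}}
      \;=\; \sqrt{\frac{SSE}{n - k}}
      \;=\; \sqrt{MSE}
\]
```

Here y_i is an observed value and ŷ_i the corresponding fitted value. S_e is expressed in the units of the fitted dependent variable, so for models fitted on a log scale it is not directly comparable with models fitted on the original scale.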

RESULTS
Protein amount and milk yield are described statistically in Table 2. Correlation measures were applied to describe the strength of association between protein amount and milk yield (Table 3). The coefficients of the curve estimation models for predicting the dependent variable from the independent variable are shown in Table 6. After the curve estimation step was used to choose suitable models, as described above, nine nonlinear regression procedures were applied to form the prediction equations (Table 7).
-All models were completed after one model evaluation step.
-The power model was completed after 15 iterations.
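The power-model prediction equation reported in the abstract, protein amount = 0.130 * (total milk yield ** 0.815), can be applied as in the sketch below. The function name is hypothetical, and the units are whatever units the study's records used, which the text does not state.

```python
def predict_protein(total_milk_yield, a=0.130, b=0.815):
    """Predicted protein amount from total milk yield, power model a * x**b."""
    if total_milk_yield <= 0:
        raise ValueError("total milk yield must be positive")
    return a * total_milk_yield ** b

# Effect of doubling the yield (the yield value itself is hypothetical).
base = predict_protein(5000.0)
doubled = predict_protein(10000.0)
ratio = doubled / base
print(f"doubling the yield multiplies predicted protein by {ratio:.3f}")
```

Because the exponent is below 1, predicted protein rises less than proportionally with yield: doubling the yield multiplies the prediction by 2 ** 0.815, about 1.76, whatever the starting yield.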

DISCUSSION
There was a strong, positive and highly significant correlation between the independent and dependent variables: the correlation coefficients were 0.857, 0.719 and 0.886 for Pearson's correlation, Kendall's tau and Spearman's rank correlation, respectively (Table 3).
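The three coefficients measure different things, which the sketch below illustrates on synthetic data. The implementations are simplified and assume no tied values (the Kendall version is tau-a). The function names and data are hypothetical, and none of this reproduces the study's calculations.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: strength of the linear association."""
    return np.corrcoef(x, y)[0, 1]

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    v = np.asarray(v)
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    # Each pair appears twice in the full matrix, hence n * (n - 1).
    return np.sum(dx * dy) / (n * (n - 1))

# A perfectly monotonic but curved (power-law) relationship.
x = np.arange(1.0, 21.0)
y = x ** 0.8
p, s, t = pearson(x, y), spearman(x, y), kendall_tau_a(x, y)
print(f"Pearson = {p:.4f}, Spearman = {s:.4f}, Kendall = {t:.4f}")
```

For a monotonic but curved relationship the rank-based coefficients reach 1 while Pearson's coefficient stays below 1. A Spearman value above the Pearson value, as reported here, is therefore compatible with a monotonic nonlinear relationship.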
Huber's M-estimator, Tukey's biweight, Hampel's M-estimator and Andrews' wave gave nearly the same results, a good indication that outliers had little effect. These results agree with Okagbue et al. (2019).
Table 5 shows the model summaries and the ANOVA used to test the significance of the models in order to suggest the most suitable one. The P-values for all models were less than 0.05, meaning that all models were significant, and a higher R² indicates a better fit.
Based on R², the power, cubic, quadratic and linear models had high values (0.856, 0.781, 0.770 and 0.735, respectively), indicating that they are suitable for predicting protein amount from total milk yield. That is, 85.6%, 78.1%, 77.0% and 73.5% of the total variation is explained by the respective models. The adjusted R square values were 0.856, 0.780, 0.770 and 0.735, respectively, and the smaller standard errors of the estimate also indicated the suitability of these models, but it is wrong to depend on these measures alone (they are inadequate or rough measures).
The S-curve and inverse nonlinear regression models were not suitable for predicting protein amount from total milk yield because of their low R² and adjusted R² values (0.441 and 0.440 for the S-curve, and 0.152 and 0.151 for the inverse model, respectively).
The remaining models predicted protein amount moderately well, with adjusted R² above 0.5.
Based on AIC, the power model was the best fit, with the lowest AIC value (-13135.84), followed by the logistic, exponential, growth and compound models, which all had the same value (-11949.59), as shown in Table 5.
Based on the highest R² and adjusted R² and the lowest AIC value, the power model is the best fit for predicting protein amount from total milk yield.
Finally, according to the lowest AIC value, which is considered an important measure of goodness of fit, and the highest R², the prediction models can be ranked as follows: power > (logistic, exponential, growth and compound) > cubic > quadratic > linear > logarithmic.
Table 6 shows that all parameters had a significant effect according to the t test, whose P values were highly significant, so the null hypothesis was rejected and the alternative was accepted.
Figure 1 shows the nine suggested models (of the eleven) for studying the relationship between the dependent and independent variables and for prediction, while Figure 2 shows the power model, the best-fitting one.

CONCLUSION
This research examined different regression models, using two different methods of regression analysis, to suggest which models are suitable for predicting protein amount from total milk yield. Although the linear model is the one commonly assumed for the relationship between milk yield and protein amount, the power model was found to be more suitable than the linear model. Other nonlinear regression models, namely the logistic, exponential, growth, compound, logarithmic, quadratic and cubic models, were also suitable for prediction. The S-curve and inverse nonlinear regression models were not suitable for these data.

Figure 1: Curve estimation regression for nine regression models suggested for predicting the relationship between protein amount and total milk yield.

Figure 2: Curve estimation regression for power regression model (the best fitted model) for predicting the relationship between protein amount and total milk yield.

Table 2: Descriptive statistics of total milk yield (independent variable) and protein amount (dependent variable).

Table 3: Correlation between protein amount and total milk yield.

Table 4: Robust estimators of protein amount and milk yield of 1300 animals.

Table 5: Curve estimation regression model summaries for predicting protein amount.

Table 6: Parameter estimates (coefficients) and t test with significance for eleven regression models for predicting protein amount.
-The dependent variable is ln(PA) in the compound, S, growth, exponential and power models.
-The dependent variable is ln(1 / PA) in the logistic model.
-The dependent variable in the other models is PA.
-The independent variable is TMY.
The suggested models for the prediction process are shown in the following chart.

Table 7: Prediction equations of protein amount from total milk yield.