Development of a forecast using the least squares method. An example problem solution. Linear pair regression analysis and least squares calculation

Approximation of experimental data is a method based on replacing experimentally obtained data with an analytical function that passes as closely as possible to, or coincides with, the initial values at the nodal points (the data obtained during a trial or experiment). There are currently two ways to define such an analytic function:

By constructing an interpolation polynomial of degree n that passes directly through all points of the given data array. In this case, the approximating function is represented as an interpolation polynomial in the Lagrange form or in the Newton form.

By constructing an approximating polynomial of degree n that passes close to the points of the given data array. The approximating function thus smooths out the random noise (or errors) that may occur during the experiment: the measured values depend on random factors that fluctuate according to their own random laws (measurement or instrument errors, inaccuracy or experimental errors). In this case, the approximating function is determined by the least squares method.

The least squares method (in the English literature, Ordinary Least Squares, OLS) is a mathematical method based on determining an approximating function that is built in the closest proximity to the points of a given array of experimental data. The closeness of the initial and approximating functions F(x) is determined by a numerical measure, namely: the sum of the squared deviations of the experimental data from the approximating curve F(x) should be the smallest.

Fitting curve constructed by the least squares method

The least squares method is used:

To solve overdetermined systems of equations when the number of equations exceeds the number of unknowns;

To search for a solution in the case of ordinary (not overdetermined) nonlinear systems of equations;

For approximating point values by some approximating function.

The approximating function in the least squares method is determined from the condition of the minimum of the sum of squared deviations of the calculated approximating function from the given array of experimental data. This criterion of the least squares method is written as the following expression:

S = Σ (F(x_i) − y_i)² → min, i = 1 … N,

where F(x_i) are the values of the calculated approximating function at the nodal points x_i,

and y_i is the specified array of experimental data at the nodal points x_i.

The quadratic criterion has a number of "good" properties, such as differentiability and the provision of a unique solution to the approximation problem with polynomial approximating functions.

Depending on the conditions of the problem, the approximating function is a polynomial of degree m:

F(x) = a0 + a1·x + a2·x² + … + am·x^m

The degree of the approximating function does not depend on the number of nodal points, but its degree must always be less than the dimension (number of points) of the given array of experimental data.

∙ If the degree of the approximating function is m=1, then we approximate the table function with a straight line (linear regression).

∙ If the degree of the approximating function is m=2, then we approximate the table function with a quadratic parabola (quadratic approximation).

∙ If the degree of the approximating function is m=3, then we approximate the table function with a cubic parabola (cubic approximation).

In the general case, when an approximating polynomial of degree m must be constructed for given tabular values, the condition of the minimum of the sum of squared deviations over all nodal points is rewritten in the following form:

S = Σ (a0 + a1·x_i + a2·x_i² + … + am·x_i^m − y_i)² → min, i = 1 … N,

where a0, a1, …, am are the unknown coefficients of the approximating polynomial of degree m,

and N is the number of specified table values.

A necessary condition for the existence of a minimum of the function is that its partial derivatives with respect to the unknown variables a0, a1, …, am are equal to zero. As a result, we obtain the following system of equations:

Let us transform the resulting system of equations: expand the brackets and move the free terms to the right-hand side of the expression. As a result, the system of linear algebraic equations is written in the following form:

This system of linear algebraic equations can be rewritten in matrix form:

As a result, a system of linear equations of dimension m + 1 in m + 1 unknowns is obtained. This system can be solved by any method for solving systems of linear algebraic equations (for example, the Gauss method). As a result of the solution, the unknown parameters of the approximating function are found that provide the minimum sum of squared deviations of the approximating function from the original data, i.e. the best approximation in the least squares sense. It should be remembered that if even one value of the initial data changes, all coefficients change their values, since they are completely determined by the initial data.

Approximation of initial data by linear dependence

(linear regression)

As an example, consider the method for determining the approximating function given as a linear dependence. In accordance with the least squares method, the condition of the minimum of the sum of squared deviations is written as follows:

S = Σ (a + b·x_i − y_i)² → min,

where x_i, y_i are the coordinates of the nodal points of the table,

and a, b are the unknown coefficients of the approximating function, which is given as a linear dependence.

A necessary condition for the existence of a minimum of the function is that its partial derivatives with respect to the unknown variables a and b are equal to zero. As a result, we obtain the following system of equations:

Let us transform the resulting linear system of equations.

We solve the resulting system of linear equations. In analytical form (Cramer's method), the coefficients of the approximating function are determined as follows:

b = (N·Σ(x_i·y_i) − Σx_i·Σy_i) / (N·Σx_i² − (Σx_i)²),  a = (Σy_i − b·Σx_i) / N.

These coefficients provide the construction of a linear approximating function in accordance with the criterion of minimizing the sum of squared deviations of the approximating function from the given tabular values (experimental data).
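As an illustration, a minimal Python sketch of this closed-form calculation is given below; the function name and the sample data are illustrative, not taken from the article.

def linear_least_squares(x, y):
    """Return coefficients a, b of the fit y ≈ a + b*x (Cramer's formulas)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    d = n * sxx - sx * sx          # determinant of the normal system
    b = (n * sxy - sx * sy) / d    # slope
    a = (sy - b * sx) / n          # intercept
    return a, b

# Illustrative nodal points (not from the article).
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
a, b = linear_least_squares(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")   # a ≈ 1.050, b ≈ 0.990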

Algorithm for implementing the method of least squares

1. Initial data:

Given an array of experimental data with the number of measurements N

The degree of the approximating polynomial (m) is given

2. Calculation algorithm:

2.1. The coefficients of the system of equations of dimension (m + 1) × (m + 1) are determined:

A[j][k] = Σ x_i^(j+k) — coefficients of the system of equations (left-hand side), where k is the index of the column number of the square matrix of the system of equations;

B[j] = Σ y_i·x_i^j — free terms of the system of linear equations (right-hand side), where j is the index of the row number of the square matrix of the system of equations (j, k = 0 … m, i = 1 … N).

2.2. Formation of the system of linear equations of dimension (m + 1) × (m + 1).

2.3. Solution of a system of linear equations in order to determine the unknown coefficients of the approximating polynomial of degree m.

2.4. Determination of the sum of squared deviations of the approximating polynomial from the initial values over all nodal points:

S = Σ (F(x_i) − y_i)², i = 1 … N.

The found value of the sum of squared deviations is the minimum possible.
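A short Python sketch of steps 2.1–2.4 is given below, assuming the normal-equation coefficients described above; numpy's linear solver stands in for a hand-written Gauss elimination, and the function name and data are illustrative.

import numpy as np

def polyfit_lsm(x, y, m):
    """Least squares polynomial fit of degree m via the normal equations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # 2.1-2.2: matrix A[j][k] = sum(x_i**(j+k)), free terms B[j] = sum(y_i * x_i**j)
    A = np.array([[np.sum(x ** (j + k)) for k in range(m + 1)] for j in range(m + 1)])
    B = np.array([np.sum(y * x ** j) for j in range(m + 1)])
    coeffs = np.linalg.solve(A, B)               # 2.3: a0, a1, ..., am
    fitted = sum(c * x ** k for k, c in enumerate(coeffs))
    sse = float(np.sum((fitted - y) ** 2))       # 2.4: minimal sum of squared deviations
    return coeffs, sse

# Illustrative data; the degree m must be less than the number of points N.
coeffs, sse = polyfit_lsm([0, 1, 2, 3, 4, 5], [1.2, 1.9, 4.3, 9.1, 15.8, 26.0], m=2)
print(coeffs, sse)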

Approximation with Other Functions

It should be noted that when approximating the initial data in accordance with the least squares method, a logarithmic function, an exponential function, and a power function are sometimes used as an approximating function.

Logarithmic approximation

Consider the case when the approximating function is given by a logarithmic function of the form F(x) = a·ln(x) + b. Substituting t = ln(x) reduces the problem to the linear case considered above.
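Assuming the form above, a brief Python sketch of this reduction might look as follows; the function name and the positivity requirement on x are illustrative assumptions.

import math

def log_fit(x, y):
    """Fit y ≈ a*ln(x) + b by substituting t = ln(x) and applying the linear LSM."""
    t = [math.log(xi) for xi in x]      # requires all x_i > 0
    n, st, sy = len(t), sum(t), sum(y)
    stt = sum(ti * ti for ti in t)
    sty = sum(ti * yi for ti, yi in zip(t, y))
    a = (n * sty - st * sy) / (n * stt - st * st)   # coefficient of ln(x)
    b = (sy - a * st) / n                           # free term
    return a, b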

The essence of the least squares method lies in finding the parameters of a trend model that best describes the development tendency of some random phenomenon in time or space (a trend is a line characterizing the tendency of this development). The task of the least squares method (LSM) is not simply to find some trend model, but to find the best or optimal model. This model is optimal if the sum of the squared deviations between the observed actual values and the corresponding calculated trend values is minimal (smallest):

S = Σ (Yf − Yr)² → min,

where (Yf − Yr)² is the squared deviation between the observed actual value and the corresponding calculated trend value,

Yf is the actual (observed) value of the phenomenon under study,

Yr is the calculated value of the trend model,

n is the number of observations of the phenomenon under study.

The LSM is rarely used on its own. As a rule, it is most often used only as a necessary tool in correlation studies. It should be remembered that the information basis of the LSM can only be a reliable statistical series, and the number of observations should not be less than 4; otherwise, the smoothing procedures of the LSM may lose their meaning.

The LSM toolkit reduces to the following procedures:

First procedure. It is established whether there is any tendency at all for the resultant attribute to change when the selected factor-argument changes, or, in other words, whether there is a connection between "y" and "x".

Second procedure. It is determined which line (trajectory) is best able to describe or characterize this trend.

Third procedure. The parameters of the regression equation describing this line are calculated.

Example. Suppose we have information on the average sunflower yield for the farm under study (Table 9.1).

Table 9.1

Observation number | Yield, centners per hectare (c/ha)

Since the level of technology in sunflower production in our country has not changed much over the past 10 years, the fluctuations in yield over the analyzed period most likely depended chiefly on fluctuations in weather and climatic conditions. Is this true?

First LSM procedure. The hypothesis about the existence of a trend in the change in sunflower yield depending on changes in weather and climatic conditions over the analyzed 10 years is tested.

In this example, it is advisable to take the sunflower yield as "y" and the number of the observed year in the analyzed period as "x". The hypothesis about the existence of any relationship between "x" and "y" can be tested in two ways: manually and with the help of computer programs. Of course, with computer technology available, this problem solves itself. But in order to better understand the LSM toolkit, it is advisable to test the hypothesis about the existence of a relationship between "x" and "y" manually, when only a pen and an ordinary calculator are at hand. In such cases, the hypothesis of the existence of a trend is best checked visually by the location of the graphic image of the analyzed time series, the correlation field:

In our example, the correlation field is located around a slowly rising line. This in itself indicates the existence of a certain trend in the change in sunflower yield. It is impossible to speak of any trend only when the correlation field looks like a circle, a ring, a strictly vertical or strictly horizontal cloud, or consists of randomly scattered points. In all other cases, the hypothesis of a relationship between "x" and "y" should be confirmed and the research continued.

Second LSM procedure. It is determined which line (trajectory) best describes or characterizes the trend in sunflower yield changes over the analyzed period.

With the availability of computer technology, the selection of the optimal trend occurs automatically. With "manual" processing, the choice of the optimal function is carried out, as a rule, in a visual way - by the location of the correlation field. That is, according to the type of chart, the equation of the line is selected, which is best suited to the empirical trend (to the actual trajectory).

As you know, in nature there is a huge variety of functional dependencies, so it is extremely difficult to visually analyze even a small part of them. Fortunately, in real economic practice, most relationships can be accurately described either by a parabola, or a hyperbola, or a straight line. In this regard, with the "manual" option for selecting the best function, you can limit yourself to only these three models.

Straight line: Yr = a + b·x;

Hyperbola: Yr = a + b/x;

Second-order parabola: Yr = a + b·x + c·x².

It is easy to see that in our example the trend in sunflower yield changes over the analyzed 10 years is best characterized by a straight line, so the regression equation will be the equation of a straight line.

Third procedure. The parameters of the regression equation that characterizes this line are calculated, or in other words, an analytical formula is determined that describes the best trend model.

Finding the values of the parameters of the regression equation, in our case the parameters a and b, is the core of the LSM. This process is reduced to solving a system of normal equations.

ΣYf = a·Σx + n·b,
Σ(Yf·x) = a·Σx² + b·Σx.        (9.2)

This system of equations is quite easily solved by the Gauss method. Recall that, as a result of the solution, the values of the parameters a and b are found in our example, and the regression equation describing the trend is thereby determined.

Extrapolation is a method of scientific research based on extending past and present trends, patterns, and relationships to the future development of the object of forecasting. Extrapolation methods include the moving average method, the exponential smoothing method, and the least squares method.

The essence of the least squares method consists in minimizing the sum of squared deviations between the observed and calculated values. The calculated values are found from the selected equation, the regression equation. The smaller the distance between the actual values and the calculated ones, the more accurate the forecast based on the regression equation.

The basis for choosing a curve is a theoretical analysis of the essence of the phenomenon under study, whose change is reflected by the time series. Considerations about the nature of the growth of the levels of the series are also sometimes taken into account. So, if output is expected to grow in an arithmetic progression, then smoothing is performed with a straight line. If the growth turns out to be exponential, then smoothing should be done with an exponential function.

The working formula of the least squares method: Y(t+1) = a·X + b, where t + 1 is the forecast period; Y(t+1) is the predicted indicator; a and b are coefficients; X is the symbol of time.

The coefficients a and b are calculated using the following formulas:

a = (n·Σ(Yf·X) − ΣX·ΣYf) / (n·ΣX² − (ΣX)²),  b = (ΣYf − a·ΣX) / n,

where Yf are the actual values of the time series and n is the number of levels in the time series.

The smoothing of time series by the least squares method serves to reflect the patterns of development of the phenomenon under study. In the analytic expression of a trend, time is considered as an independent variable, and the levels of the series act as a function of this independent variable.

The development of a phenomenon does not depend on how many years have passed since the starting point, but on what factors influenced its development, in what direction and with what intensity. From this it is clear that the development of a phenomenon in time appears as a result of the action of these factors.

Correctly determining the type of curve, i.e. the type of analytical dependence on time, is one of the most difficult tasks of pre-forecast analysis.

The type of function describing the trend, whose parameters are determined by the least squares method, is in most cases selected empirically, by constructing a number of functions and comparing them with each other by the value of the root-mean-square error, calculated by the formula:

σ = √( Σ(Yf − Yr)² / (n − p) ),

where Yf are the actual values of the time series, Yr are the calculated (smoothed) values of the time series, n is the number of levels in the time series, and p is the number of parameters defined in the formulas describing the trend (development tendency).

Disadvantages of the least squares method :

  • when attempting to describe the economic phenomenon under study with a mathematical equation, the forecast is accurate only for a short period of time, and the regression equation should be recalculated as new information becomes available;
  • the complexity of selecting the regression equation, which can be overcome using standard computer programs.

An example of using the least squares method to develop a forecast

Task. There are data characterizing the unemployment rate in the region, %.

  • Build a forecast of the unemployment rate in the region for the months of November, December, January, using the methods: moving average, exponential smoothing, least squares.
  • Calculate the errors in the resulting forecasts using each method.
  • Compare the results obtained, draw conclusions.

Least squares solution

For the solution, we will compile a table in which we will make the necessary calculations:

Let us define the symbol of time as a consecutive numbering of the periods of the forecast base (column 3) and calculate columns 4 and 5. The calculated values of the series Yr are determined by the formula Y(t+1) = a·X + b, where t + 1 is the forecast period, Y(t+1) is the predicted indicator, a and b are the coefficients, and X is the symbol of time.

The coefficients a and b are determined by the following formulas:

a = (n·Σ(Yf·X) − ΣX·ΣYf) / (n·ΣX² − (ΣX)²),  b = (ΣYf − a·ΣX) / n,

where Yf are the actual values of the time series and n is the number of levels in the time series.
a = −0.17
b = 22.13/10 − (−0.17)·55/10 = 3.15

We calculate the average relative error using the formula:

ε = (1/n)·Σ |(Yf − Yr)/Yf| · 100%

ε = 28.63/10 = 2.86%; the forecast accuracy is high.
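A Python sketch of the same workflow is shown below. Since the article's data table is not reproduced here, the unemployment series in the example is purely illustrative; only the structure of the calculation follows the text (the article's own results are a = −0.17 and b = 3.15).

def lsm_forecast(y_actual, periods_ahead):
    """Fit the trend Y = a*X + b to a series and forecast the next periods."""
    n = len(y_actual)
    x = list(range(1, n + 1))                       # symbol of time: 1, 2, ..., n
    sx, sy = sum(x), sum(y_actual)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y_actual))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # trend slope
    b = (sy - a * sx) / n                           # trend level
    smoothed = [a * xi + b for xi in x]
    # Average relative error of the smoothed series, in percent.
    eps = sum(abs((yf - yr) / yf) for yf, yr in zip(y_actual, smoothed)) / n * 100
    forecasts = [a * (n + k) + b for k in range(1, periods_ahead + 1)]
    return a, b, eps, forecasts

# Purely illustrative series (NOT the article's data).
series = [9.5, 9.2, 9.0, 8.8, 8.7, 8.5, 8.3, 8.2, 8.0, 7.9]
print(lsm_forecast(series, periods_ahead=3))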

Conclusion: comparing the results obtained by the moving average method, the exponential smoothing method, and the least squares method, we can say that the average relative error for the exponential smoothing method falls within 20-50%. This means that the forecast accuracy in this case is only satisfactory.

In the first and third cases, the forecast accuracy is high, since the average relative error is less than 10%. But the moving average method made it possible to obtain more reliable results (forecast for November: 1.52%, for December: 1.53%, for January: 1.49%), since the average relative error when using this method is the smallest, 1.13%.

Linear regression is widely used in econometrics because of the clear economic interpretation of its parameters.

Linear regression reduces to finding an equation of the form

y = a + b·x + ε, or ŷ = a + b·x.

An equation of this form allows one, for given values of the factor x, to obtain theoretical values of the effective feature by substituting the actual values of the factor x into it.

Building a linear regression comes down to estimating its parameters, a and b. Estimates of the linear regression parameters can be found by different methods.

The classical approach to estimating linear regression parameters is based on the least squares method (OLS).

OLS allows one to obtain such estimates of the parameters a and b at which the sum of the squared deviations of the actual values of the resultant trait y from the calculated (theoretical) values ŷ is minimal:

To find the minimum of a function, it is necessary to calculate the partial derivatives with respect to each of the parameters a and b and equate them to zero.

Denote the sum of squared deviations by S; then:

∂S/∂a = −2·Σ(y − a − b·x) = 0,
∂S/∂b = −2·Σ(y − a − b·x)·x = 0.

Transforming these formulas, we obtain the following system of normal equations for estimating the parameters a and b:

Σy = n·a + b·Σx,
Σ(y·x) = a·Σx + b·Σx².

Solving the system of normal equations (3.5) either by the method of successive elimination of variables or by the method of determinants, we find the desired estimates of the parameters a and b.

The parameter b is called the regression coefficient. Its value shows the average change in the result when the factor changes by one unit.

The regression equation is always supplemented with an indicator of the tightness of the relationship. When using linear regression, the linear correlation coefficient acts as such an indicator. There are various modifications of the linear correlation coefficient formula. Some of them are listed below:

As is known, the linear correlation coefficient lies within the limits −1 ≤ r ≤ 1.

To assess the quality of the fit of the linear function, the square of the linear correlation coefficient, called the coefficient of determination, is calculated. The coefficient of determination characterizes the proportion of the variance of the effective feature y explained by the regression in the total variance of the effective feature:

Accordingly, the value 1 − R² characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model.

Questions for self-control

1. What is the essence of the least squares method?

2. How many variables does pairwise regression involve?

3. What coefficient determines the tightness of the connection between the changes?

4. Within what limits is the coefficient of determination defined?

5. How is the parameter b estimated in correlation-regression analysis?


Nonlinear economic models. Nonlinear regression models. Transformation of variables. The elasticity coefficient.

If there are non-linear relationships between economic phenomena, they are expressed using the corresponding non-linear functions: for example, an equilateral hyperbola (y = a + b/x), a second-degree parabola (y = a + b·x + c·x²), etc.

There are two classes of non-linear regressions:

1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

Polynomials of various degrees: y = a + b1·x + b2·x², y = a + b1·x + b2·x² + b3·x³;

Equilateral hyperbola: y = a + b/x;

Semilogarithmic function: y = a + b·ln(x).

2. Regressions that are non-linear in the estimated parameters, for example:

Power function: y = a·x^b;

Exponential function: y = a·b^x;

Exponential (e-based): y = e^(a + b·x).
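As a sketch of how such a model can still be estimated by ordinary least squares, the power function can be linearized by taking logarithms: ln(y) = ln(a) + b·ln(x). The Python fragment below assumes this standard log-log transformation; the function name and positivity requirements are illustrative.

import math

def fit_power_model(x, y):
    """Estimate a and b in y ≈ a * x**b; requires all x_i > 0 and y_i > 0."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(v * v for v in lx)
    sxy = sum(u * v for u, v in zip(lx, ly))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # exponent of the power model
    a = math.exp((sy - b * sx) / n)                 # back-transformed intercept
    return a, b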

The total sum of the squared deviations of the individual values of the resulting attribute y from the average value ȳ is caused by the influence of many factors. Let us conditionally divide the entire set of causes into two groups: the studied factor x and other factors.

If the factor does not affect the result, then the regression line on the graph is parallel to the Ox axis and ŷ = ȳ.

In that case, the entire dispersion of the resulting attribute is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is related to x functionally, and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression coincides with the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter is always present, both due to the influence of the factor x, i.e. the regression of y on x, and due to the action of other causes (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of the trait y is accounted for by the explained variation.

Obviously, if the sum of squared deviations due to regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the outcome y.


The significance of the regression equation as a whole is assessed using Fisher's F-criterion. In this case, a null hypothesis is put forward that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x does not affect the result y.

The direct calculation of the F-criterion is preceded by an analysis of variance. Central to it is the decomposition of the total sum of squared deviations of the variable y from the average value ȳ into two parts, "explained" and "unexplained":

Σ(y − ȳ)² — total sum of squared deviations;

Σ(ŷ − ȳ)² — sum of squared deviations explained by the regression;

Σ(y − ŷ)² — residual sum of squared deviations.

Any sum of squared deviations is related to the number of degrees of freedom, i.e. to the number of degrees of freedom of independent variation of the feature. The number of degrees of freedom is related to the number of population units n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of n possible are required to form a given sum of squares.

The dispersion per degree of freedom, D, is obtained by dividing each sum of squared deviations by its number of degrees of freedom.

The F-ratio (F-criterion): F = D_fact / D_resid.

If the null hypothesis is true, then the factor and residual variances do not differ from each other. To refute H0, the factor variance must exceed the residual variance several times. The English statistician Snedecor developed tables of critical values of F-ratios at different levels of significance of the null hypothesis and different numbers of degrees of freedom. The tabular value of the F-criterion is the maximum value of the ratio of variances that can occur if they diverge randomly for a given level of probability of the null hypothesis. The computed value of the F-ratio is recognized as reliable if it is greater than the tabular one.

In this case, the null hypothesis about the absence of a relationship between the features is rejected and a conclusion is drawn about the significance of this relationship: if F_fact > F_table, H0 is rejected.

If the value is less than the tabular one, F_fact < F_table, then the probability of the null hypothesis is higher than the given level, and it cannot be rejected without a serious risk of drawing the wrong conclusion about the presence of a relationship. In this case, the regression equation is considered statistically insignificant; H0 is not rejected.
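A small Python sketch of this F-test for a paired linear regression is given below, assuming the standard decomposition with 1 and n − 2 degrees of freedom; scipy is used only to look up the tabular (critical) value, and the function name is illustrative.

import numpy as np
from scipy.stats import f as f_dist

def f_test_paired_regression(x, y, alpha=0.05):
    """F-test of the significance of a paired linear regression (H0: b = 0)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    b, a = np.polyfit(x, y, 1)                       # slope and intercept
    y_hat = a + b * x
    ss_explained = np.sum((y_hat - y.mean()) ** 2)   # 1 degree of freedom
    ss_residual = np.sum((y - y_hat) ** 2)           # n - 2 degrees of freedom
    f_fact = (ss_explained / 1.0) / (ss_residual / (n - 2))
    f_table = f_dist.ppf(1.0 - alpha, 1, n - 2)      # tabular (critical) value
    return f_fact, f_table, f_fact > f_table         # True -> H0 is rejected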

Standard error of the regression coefficient

To assess the significance of the regression coefficient, its value is compared with its standard error; that is, the actual value of Student's t-test (the ratio of the coefficient to its standard error) is determined and then compared with the tabular value at a given level of significance and number of degrees of freedom (n − 2).

Standard error of the parameter a:

The significance of the linear correlation coefficient is checked on the basis of the magnitude of the error of the correlation coefficient r:

Total variance of a feature X:

Multiple Linear Regression

Model building

Multiple regression is a regression of an effective feature with two or more factors, i.e. a model of the form y = f(x_1, x_2, …, x_p).

Regression can give a good result in modeling if the influence of other factors affecting the object of study can be neglected. The behavior of individual economic variables cannot be controlled, i.e. it is not possible to ensure the equality of all other conditions for assessing the influence of a single factor under study. In this case, one should try to identify the influence of other factors by introducing them into the model, i.e. build a multiple regression equation: y = a + b_1·x_1 + b_2·x_2 + … + b_p·x_p + ε.

The main goal of multiple regression is to build a model with a large number of factors, determining the influence of each of them individually as well as their cumulative impact on the modeled indicator. The specification of the model includes two sets of questions: the selection of factors and the choice of the type of regression equation.

The least squares method is used to estimate the parameters of the multiple regression equation.
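As an illustration, the least squares estimate of a multiple regression can be written compactly in matrix form, beta = (X'X)^(-1) X'y. The Python sketch below assumes this standard formulation; the function name and data shapes are illustrative.

import numpy as np

def multiple_ols(X, y):
    """Estimate [a, b1, ..., bp] in y = a + b1*x1 + ... + bp*xp + e."""
    X = np.asarray(X, dtype=float)                    # shape (n, p): factor values
    y = np.asarray(y, dtype=float)                    # shape (n,): effective feature
    X_design = np.column_stack([np.ones(len(y)), X])  # prepend a column of ones for a
    # Normal equations in matrix form: (X'X) beta = X'y
    beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
    return beta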

One of the methods for studying stochastic relationships between features is regression analysis.
Regression analysis is the derivation of a regression equation used to find the average value of a random variable (the feature-result) when the value of another variable (or other variables, the feature-factors) is known. It includes the following steps:

  1. choice of the form of connection (type of analytical regression equation);
  2. estimation of equation parameters;
  3. evaluation of the quality of the analytical regression equation.
Most often, a linear form is used to describe the statistical relationship between features. Attention to the linear relationship is explained by the clear economic interpretation of its parameters, the limited variation of the variables, and the fact that in most cases non-linear forms of relationship are converted (by taking logarithms or changing variables) into a linear form in order to perform the calculations.
In the case of a linear pair relationship, the regression equation takes the form: y_i = a + b·x_i + u_i. The parameters a and b of this equation are estimated from the data of statistical observations of x and y. The result of such an estimation is the equation ŷ_i = a + b·x_i, where ŷ_i is the value of the effective feature (variable) obtained from the regression equation (the calculated value), and a and b here denote the estimates of the parameters.

The least squares method (LSM) is most commonly used for parameter estimation.
The least squares method gives the best (consistent, efficient, and unbiased) estimates of the regression equation parameters, but only if certain assumptions about the random term (u) and the independent variable (x) are met (see the OLS assumptions).

The problem of estimating the parameters of a linear pair equation by the least squares method is as follows: obtain estimates of the parameters a and b at which the sum of the squared deviations of the actual values of the effective feature y_i from the calculated values ŷ_i is minimal.
Formally, the OLS criterion can be written as: S = Σ (y_i − ŷ_i)² → min.

Classification of least squares methods

  1. The least squares method (OLS).
  2. The maximum likelihood method (for the normal classical linear regression model, normality of the regression residuals is postulated).
  3. Generalized least squares (GLS), used in the case of error autocorrelation and in the case of heteroscedasticity.
  4. Weighted least squares (a special case of GLS with heteroscedastic residuals).

Let us illustrate the essence of the classical least squares method graphically. To do this, we build a scatter plot from the observational data (x_i, y_i, i = 1…n) in a rectangular coordinate system (such a scatter plot is called a correlation field). Let us try to find the straight line that is closest to the points of the correlation field. According to the least squares method, this line is chosen so that the sum of the squared vertical distances between the points of the correlation field and the line is minimal.

The mathematical notation of this problem: S = Σ (y_i − a − b·x_i)² → min.
The values of y_i and x_i, i = 1…n, are known to us; they are observational data. In the function S they are constants. The variables in this function are the required parameter estimates a and b. To find the minimum of a function of two variables, it is necessary to calculate the partial derivatives of this function with respect to each of the parameters and equate them to zero, i.e. ∂S/∂a = 0, ∂S/∂b = 0.
As a result, we obtain a system of two normal linear equations:
Σy_i = n·a + b·Σx_i,  Σ(y_i·x_i) = a·Σx_i + b·Σx_i².
Solving this system, we find the required parameter estimates:
b = (Σ(y_i·x_i) − n·x̄·ȳ) / (Σx_i² − n·x̄²),  a = ȳ − b·x̄.

The correctness of the calculation of the parameters of the regression equation can be checked by comparing the sums Σy_i and Σŷ_i (some discrepancy is possible due to rounding of the calculations).
To calculate parameter estimates , you can build Table 1.
The sign of the regression coefficient b indicates the direction of the relationship (if b > 0, the relationship is direct; if b < 0, the relationship is inverse). The value of b shows by how many units, on average, the feature-result y changes when the feature-factor x changes by one unit of its measurement.
Formally, the value of the parameter a is the average value of y for x equal to zero. If the feature-factor does not have and cannot have a zero value, then the above interpretation of the parameter a does not make sense.

The tightness of the relationship between the features is assessed using the coefficient of linear pair correlation r_xy. It can be calculated using the formula: r_xy = cov(x, y)/(σ_x·σ_y). In addition, the coefficient of linear pair correlation can be determined in terms of the regression coefficient b: r_xy = b·σ_x/σ_y.
The range of admissible values of the linear coefficient of pair correlation is from −1 to +1. The sign of the correlation coefficient indicates the direction of the relationship: if r_xy > 0, the relationship is direct; if r_xy < 0, the relationship is inverse.
If this coefficient is close to one in absolute value, then the relationship between the features can be interpreted as a fairly close linear one. If its absolute value equals one, |r_xy| = 1, then the relationship between the features is functional and linear. If the features x and y are linearly independent, then r_xy is close to 0.
Table 1 can also be used to calculate r_xy.

To assess the quality of the obtained regression equation, the theoretical coefficient of determination R²_yx is calculated:

R²_yx = d²/s²_y = 1 − e²/s²_y,

where d² is the variance of y explained by the regression equation;
e² is the residual variance of y (not explained by the regression equation);
s²_y is the total variance of y.
The coefficient of determination characterizes the share of the variation (dispersion) of the resulting feature y explained by the regression (and, consequently, by the factor x) in the total variation (dispersion) of y. The coefficient of determination R²_yx takes values from 0 to 1. Accordingly, the value 1 − R²_yx characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model and by specification errors.
With paired linear regression, R²_yx = r²_yx.
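A short Python sketch illustrating this relationship is given below; it computes r_xy and R² for a paired linear fit using the standard definitions above (the function name is illustrative).

import numpy as np

def pair_correlation_and_r2(x, y):
    """Linear pair correlation r_xy and coefficient of determination R^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r_xy = np.corrcoef(x, y)[0, 1]
    b, a = np.polyfit(x, y, 1)                 # paired linear regression y = a + b*x
    y_hat = a + b * x
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r_xy, r2                            # for this model r2 ≈ r_xy**2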
