An example of using regression analysis. Correlation and regression analysis in Excel: instructions for execution. Hyperbolic, linear and logarithmic models

1. The term “regression” was first introduced by the founder of biometrics F. Galton (19th century), whose ideas were developed by his follower K. Pearson.

Regression analysis is a method of statistical data processing that makes it possible to measure the relationship between one or more causes (factor attributes) and a consequence (the resultant attribute).

An attribute is the main distinguishing feature or property of the phenomenon or process being studied.

The resultant attribute is the indicator under study.

A factor attribute is an indicator that influences the value of the resultant attribute.

The purpose of regression analysis is to estimate the functional dependence of the average value of the resultant attribute (y) on the factor attributes (x1, x2, …, xn), expressed as a regression equation

y = f(x1, x2, …, xn). (6.1)

There are two types of regression: paired and multiple.

Paired (simple) regression is an equation of the form:

y = f(x). (6.2)

In paired regression, the resultant attribute is treated as a function of a single argument, i.e. of one factor attribute.

Regression analysis includes the following steps:

· determining the type of function;

· determining the regression coefficients;

· calculating the theoretical values of the resultant attribute;

· checking the statistical significance of the regression coefficients;

· checking the statistical significance of the regression equation.

Multiple regression is an equation of the form:

y = f(x1, x2, …, xn). (6.3)

The resultant attribute is treated as a function of several arguments, i.e. of many factor attributes.

2. In order to determine the type of function correctly, it is necessary to establish the direction of the relationship on theoretical grounds.

By the direction of the relationship, regression is divided into:

· direct regression, which arises when, as the independent variable "x" increases or decreases, the values of the dependent variable "y" also increase or decrease accordingly;

· inverse regression, which arises when, as the independent variable "x" increases, the dependent variable "y" decreases, and vice versa.

To characterize connections, the following types of paired regression equations are used:

· y = a + bx – linear;

· y = e^(ax + b) – exponential;

· y = a + b/x – hyperbolic;

· y = a + b1·x + b2·x² – parabolic;

· y = a·b^x – exponential, etc.

where a, b1, b2 are the coefficients (parameters) of the equation; y is the resultant attribute; x is the factor attribute.

3. Constructing a regression equation comes down to estimating its coefficients (parameters); for this, the least squares method (OLS) is used.

The least squares method makes it possible to obtain parameter estimates for which the sum of the squared deviations of the actual values of the resultant attribute "y" from the theoretical values "y_x" is minimal, that is

S = Σ(y − y_x)² → min. (6.4)

The parameters of the regression equation y = a + bx are estimated by the least squares method using the formulas:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), (6.5)

a = ȳ − b·x̄, (6.6)

where a is the free term (intercept) and b is the regression coefficient, which shows by how much the resultant attribute "y" changes, on average, when the factor attribute "x" changes by one unit of measurement.
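To make formulas (6.5)–(6.6) concrete, here is a minimal Python sketch; the data are illustrative stand-ins, not taken from the text.

```python
# A minimal sketch of paired OLS, following formulas (6.5)-(6.6).
def fit_paired_regression(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n  # equivalent to a = y-bar - b * x-bar
    return a, b

x = [1, 2, 3, 4, 5]                  # illustrative factor attribute
y = [2.1, 3.9, 6.2, 8.1, 9.8]        # illustrative resultant attribute
a, b = fit_paired_regression(x, y)
print(f"y_x = {a:.3f} + {b:.3f}x")   # prints: y_x = 0.140 + 1.960x
```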

4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

Scheme for testing the significance of regression coefficients:

1) H0: a = 0, b = 0 — the regression coefficients do not differ significantly from zero.

H1: a ≠ 0, b ≠ 0 — the regression coefficients differ significantly from zero.

2) P = 0.05 – significance level.

3) The calculated values of the criterion: t_a = a / m_a, t_b = b / m_b,

where m_a, m_b are the random (standard) errors of the parameters:

m_b = √(Σ(y − y_x)² / (n − 2)) / √(Σ(x − x̄)²); m_a = m_b · √(Σx² / n). (6.7)

4) t_table(P; f) — the tabular value,

where f = n − k − 1 is the number of degrees of freedom, n is the number of observations, and k is the number of factor attributes "x".

5) If |t_calc| > t_table, then H0 is rejected, i.e. the coefficient is significant.

If |t_calc| < t_table, then H0 is accepted, i.e. the coefficient is insignificant.
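The whole scheme can be sketched in Python with scipy, which reports the slope estimate and its standard error; the data below are illustrative, not the worked example from the text.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([2.3, 2.1, 3.8, 4.0, 4.9, 6.1, 6.0])

res = stats.linregress(x, y)
t_b = res.slope / res.stderr            # calculated t for coefficient b
f = len(x) - 2                          # degrees of freedom, k = 1 factor
t_table = stats.t.ppf(1 - 0.05 / 2, f)  # two-sided critical value, P = 0.05
print(abs(t_b) > t_table)               # True -> b is significant
```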

5. To check the correctness of the constructed regression equation, Fisher's criterion (F-test) is used.

Scheme for testing the significance of the regression equation:

1) H 0: The regression equation is not significant.

H 1: The regression equation is significant.

2) P = 0.05 – significance level.

3) F_calc = (r² / (1 − r²)) · ((n − k − 1) / k), (6.8)

where n is the number of observations; k is the number of parameters attached to the variables "x"; y is the actual value of the resultant attribute; y_x is the theoretical value of the resultant attribute; r is the pair correlation coefficient.

4) F_table(P; f1; f2) — the tabular value,

where f1 = k and f2 = n − k − 1 are the numbers of degrees of freedom.

5) If F_calc > F_table, the regression equation is chosen correctly and can be used in practice.

If F_calc < F_table, the regression equation is chosen incorrectly.
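A matching sketch of formula (6.8) in Python, again on illustrative data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([2.3, 2.1, 3.8, 4.0, 4.9, 6.1, 6.0])
n, k = len(x), 1                     # one factor attribute

r = np.corrcoef(x, y)[0, 1]          # pair correlation coefficient
f_calc = (r ** 2 / (1 - r ** 2)) * (n - k - 1) / k
f_table = stats.f.ppf(1 - 0.05, k, n - k - 1)
print(f_calc > f_table)              # True -> the equation is significant
```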

6. The main indicator reflecting the quality of the regression analysis is the coefficient of determination (R²).

The coefficient of determination shows what proportion of the variation of the dependent variable "y" is taken into account in the analysis and is caused by the influence of the factors included in the analysis.

The coefficient of determination (R²) takes values in the interval [0; 1]. The regression equation is of good quality if R² ≥ 0.8.

The coefficient of determination is equal to the square of the correlation coefficient, i.e. R² = r².

Example 6.1. Based on the following data, construct and analyze a regression equation:

Solution.

1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the attributes is direct and moderate.

2) Construct a paired linear regression equation.

2.1) Create a calculation table.

x y xy x² y_x (y − y_x)²
55.89 47.54 65.70
45.07 15.42 222.83
54.85 34.19 8.11
51.36 5.55 11.27
42.28 45.16 13.84
47.69 1.71 44.77
45.86 9.87 192.05
Sum 159.45 558.55
Average 77519.6 22.78 79.79 2990.6

Substituting the column sums into formulas (6.5) and (6.6) gives b ≈ 0.087 and a ≈ 25.17.

Paired linear regression equation: y_x = 25.17 + 0.087x.

3) Find the theoretical values "y_x" by substituting the actual values of "x" into the regression equation.

4) Plot the actual values "y" and the theoretical values "y_x" of the resultant attribute (Figure 6.1).

5)–6) Check the statistical significance of the regression coefficients and of the equation; the tests show that the equation is insignificant, which is explained by the weak relationship between the attributes (r_xy = 0.47) and the small number of observations.

7) Calculate the coefficient of determination: R² = (0.47)² = 0.22. The constructed equation is of poor quality.

Since the calculations involved in regression analysis are quite extensive, it is recommended to use specialized programs (Statistica 10, SPSS, etc.).

Figure 6.2 shows a table with the results of regression analysis carried out using the Statistica 10 program.

Figure 6.2. Results of regression analysis carried out using the Statistica 10 program


The purpose of regression analysis is to measure the relationship between a dependent variable and one (paired regression analysis) or more (multiple regression analysis) independent variables. Independent variables are also called factor, explanatory, determining, regressor or predictor variables.

The dependent variable is sometimes called the determined, explained, or “response” variable. The extremely widespread use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective method for modeling and forecasting.

Let's start explaining the principles of working with regression analysis with the simpler of the two - the paired method.

Paired Regression Analysis

The first steps when using regression analysis will be almost identical to those we took in calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis using the Pearson method - normal distribution of variables, interval measurement of variables, linear relationship between variables - are also relevant for multiple regression. Accordingly, at the first stage, scatterplots are constructed, a statistical and descriptive analysis of the variables is carried out, and a regression line is calculated. As in the framework of correlation analysis, regression lines are constructed using the least squares method.

To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already discussed with the variables "SPS support" and "rural population share". The source data are identical. The difference in the scatterplots is that in regression analysis it is correct to plot the dependent variable - in our case, "SPS support" - on the Y-axis, whereas in correlation analysis this does not matter. After cleaning outliers, the scatterplot looks like this:

The fundamental idea of ​​regression analysis is that, having a general trend for the variables - in the form of a regression line - it is possible to predict the value of the dependent variable, given the values ​​of the independent one.

Let's recall an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula

y = a + bx,

where a is a constant that specifies the displacement along the ordinate axis, and b is a coefficient that determines the slope of the line.

Knowing the slope and constant, you can calculate (predict) the value of y for any x.

This simplest function forms the basis of the regression analysis model, with the caveat that we predict the value of y not exactly, but within a certain confidence interval, i.e. approximately.

The constant is the point of intersection of the regression line with the Y-axis (the Y-intercept, usually labeled "intercept" in statistical packages). In our example with voting for the Union of Right Forces, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship - direct or inverse). Thus, the resulting model has the form SPS = -0.1 × rural pop. + 10.55.

Thus, for the case of the "Republic of Adygea" with a rural population share of 47%, the predicted value is 5.63:

SPS = -0.10 × 47 + 10.55 = 5.63.

The difference between the original and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the "Republic of Adygea" the residual is 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less accurate the prediction.
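The calculation is easy to reproduce in a few lines of Python. Note that the rounded coefficients give 5.85; the 5.63 and -1.71 quoted above come from the unrounded estimates.

```python
# Prediction and residual for one case, with the rounded coefficients.
a, b = 10.55, -0.1
rural_share, sps_actual = 47.0, 3.92

sps_predicted = a + b * rural_share       # 5.85 with rounded a and b
residual = sps_actual - sps_predicted
print(sps_predicted, residual)
```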

We calculate the predicted values ​​and residuals for all cases:
Case | Rural pop. | SPS (original) | SPS (predicted) | Residuals
Republic of Adygea | 47 | 3.92 | 5.63 | -1.71
Altai Republic | 76 | 5.4 | 2.59 | 2.81
Republic of Bashkortostan | 36 | 6.04 | 6.78 | -0.74
Republic of Buryatia | 41 | 8.36 | 6.25 | 2.11
Republic of Dagestan | 59 | 1.22 | 4.37 | -3.15
Republic of Ingushetia | 59 | 0.38 | 4.37 | -3.99
Etc.

Analysis of the ratio of the initial and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the original and predicted values of the dependent variable. In paired regression analysis it is equal to the usual Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-squared (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "rural population share" explains approximately 40% of the variation in the variable "SPS support". The larger the coefficient of determination, the higher the quality of the model.

Another indicator of model quality is the standard error of estimate. It is a measure of how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of the estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the scatter and the worse the model. In our case, the standard error is 2.18: by this amount our model "errs on average" when predicting the value of the "SPS support" variable.

Regression statistics also include analysis of variance (ANOVA). With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). ANOVA statistics are especially important for sample studies, where they show how likely it is that the relationship between the independent and dependent variables also exists in the population. For complete-population studies (as in our example), the results of analysis of variance are less useful: there they test whether the identified statistical pattern is caused by a coincidence of random circumstances and how characteristic it is of the set of conditions in which the examined population finds itself, i.e. what is established is not whether the result holds for some broader general aggregate, but the degree of its regularity and freedom from random influences.

In our case, the ANOVA statistics are as follows:

 | SS | df | MS | F | p-value
Regression | 258.77 | 1 | 258.77 | 54.29 | 0.000000001
Residual | 395.59 | 83 | 4.77 | |
Total | 654.36 | 84 | | |

The F-ratio of 54.29 is significant at the 0.000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).

The t criterion performs a similar function, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t criterion, we test the hypothesis that in the general population the regression coefficients are equal to zero. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1·X1 + b2·X2 + … + bp·Xp + a.

If there are more than two independent variables, we cannot form a visual image of their relationship; in this respect, multiple regression is less "visual" than paired regression. With two independent variables, it can be useful to display the data in a 3D scatterplot. Professional statistical packages (for example, Statistica) offer an option to rotate the three-dimensional chart, which helps to visualize the structure of the data.

When working with multiple regression, unlike paired regression, it is necessary to choose an analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm involves the sequential inclusion (or exclusion) of independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables: it "cleanses" the model of frankly weak predictors, making it more compact and concise. A toy sketch of the forward variant is given below.
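The sketch shows the forward (inclusion) variant; the R-squared-gain criterion and the min_gain threshold are our simplifying assumptions — real packages typically use F-to-enter/F-to-remove tests.

```python
import numpy as np

def r2(X, y):
    """R-squared of an OLS fit of y on the columns of X plus a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

def forward_stepwise(X, y, min_gain=0.01):
    """Greedily add the predictor that raises R-squared the most."""
    chosen, best = [], 0.0
    while True:
        gains = {j: r2(X[:, chosen + [j]], y) - best
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            return chosen
        j, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain < min_gain:              # stop: no predictor helps enough
            return chosen
        chosen.append(j)
        best += gain
```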

An additional condition for the correctness of multiple regression (along with interval measurement, normality and linearity) is the absence of multicollinearity - strong correlations between the independent variables.

The interpretation of multiple regression statistics includes all the elements we considered for the case of pairwise regression. In addition, there are other important components to the statistics of multiple regression analysis.

We will illustrate the work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across Russian regions. Specific empirical studies have suggested that voter turnout levels are influenced by:

National factor (variable “Russian population”; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

Urbanization factor (the “urban population” variable; operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor as part of the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable, "intensity of electoral activity" ("active"), is operationalized through average turnout data by region in the federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable looks as follows:

Case | Active | Urban pop. | Russian pop.
Republic of Adygea | 64.92 | 53 | 68
Altai Republic | 68.60 | 24 | 60
Republic of Buryatia | 60.75 | 59 | 70
Republic of Dagestan | 79.92 | 41 | 9
Republic of Ingushetia | 75.05 | 41 | 23
Republic of Kalmykia | 68.52 | 39 | 37
Karachay-Cherkess Republic | 66.68 | 44 | 42
Republic of Karelia | 61.70 | 73 | 73
Komi Republic | 59.60 | 74 | 57
Mari El Republic | 65.19 | 62 | 47

Etc. (after removing outliers, 83 of the 88 cases remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-squared = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the "electoral activity" variable.

2. The standard error of estimate is 3.38. This is how much, on average, the constructed model "errs" when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis about the randomness of the identified relationships is rejected.

4. The t criterion for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels of 0.0000001, 0.00005 and 0.007 respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the relationship between the original and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all the independent variables for a given case deviates from the mean values of all the independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line differently, and Cook's distance allows them to be compared on this indicator. This can be useful when cleaning outliers (an outlier can be thought of as an overly influential case).

In our example, unique and influential cases include Dagestan.

Case | Original values | Predicted values | Residuals | Mahalanobis distance | Cook's distance
Adygea | 64.92 | 66.33 | -1.40 | 0.69 | 0.00
Altai Republic | 68.60 | 69.91 | -1.31 | 6.80 | 0.01
Republic of Buryatia | 60.75 | 65.56 | -4.81 | 0.23 | 0.01
Republic of Dagestan | 79.92 | 71.01 | 8.91 | 10.57 | 0.44
Republic of Ingushetia | 75.05 | 70.21 | 4.84 | 6.73 | 0.08
Republic of Kalmykia | 68.52 | 69.59 | -1.07 | 4.20 | 0.00

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. Final formula:

Active = -0.1 × urban pop. - 0.06 × Russian pop. + 75.99.

Can we compare the "explanatory power" of the predictors based on the values of the coefficients b? In this case, yes, since both independent variables have the same percentage format. However, most often multiple regression deals with variables measured on different scales (for example, income in rubles and age in years). Therefore, in the general case, it is incorrect to compare the predictive capabilities of variables by their regression coefficients. Multiple regression statistics have a special beta coefficient (β) for this purpose, calculated separately for each independent variable. It is the partial (calculated after taking into account the influence of all other predictors) correlation coefficient between the factor and the response and shows the independent contribution of the factor to the prediction of the response values. In paired regression analysis, the beta coefficient is, understandably, equal to the paired correlation coefficient between the dependent and independent variable.

In our example, beta (urban population) = -0.43, beta (Russian population) = -0.28. Thus, both factors negatively affect the level of electoral activity, while the importance of the urbanization factor is noticeably higher than that of the national factor. The combined influence of the two factors explains about 38% of the variation in the "electoral activity" variable (see the R-squared value).
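Beta coefficients can be reproduced by fitting OLS to z-scored variables. A sketch using only the first six cases from the table above (so the numbers will differ from the full-sample -0.43 and -0.28); no intercept is needed because standardized variables have zero mean.

```python
import numpy as np

turnout = np.array([64.92, 68.60, 60.75, 79.92, 75.05, 68.52])
urban   = np.array([53.0, 24.0, 59.0, 41.0, 41.0, 39.0])
russian = np.array([68.0, 60.0, 70.0, 9.0, 23.0, 37.0])

def zscore(v):
    return (v - v.mean()) / v.std()

Z = np.column_stack([zscore(urban), zscore(russian)])
betas, *_ = np.linalg.lstsq(Z, zscore(turnout), rcond=None)
print(betas)    # standardized coefficients, directly comparable
```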

Regression analysis

Regression (linear) analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and the dependent variable is called the criterion variable. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables (see Spurious correlation), not cause-and-effect relationships.

Goals of Regression Analysis

  1. Determination of the degree of determination of the variation of a criterion (dependent) variable by predictors (independent variables)
  2. Predicting the value of a dependent variable using the independent variable(s)
  3. Determining the contribution of individual independent variables to the variation of the dependent variable

Regression analysis cannot be used to determine whether there is a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

Mathematical Definition of Regression

A regression relationship can be defined strictly as follows. Let Y, X1, X2, …, Xp be random variables with a given joint probability distribution. If for each set of values x1, …, xp a conditional mathematical expectation is defined,

y(x1, …, xp) = E(Y | X1 = x1, …, Xp = xp) (regression equation in general form),

then the function y(x1, …, xp) is called the regression of Y on X1, …, Xp, and its graph is the regression line of Y on X1, …, Xp, or the regression equation.

The dependence of Y on X1, …, Xp is manifested in the change of the average values of Y as x1, …, xp change. For each fixed set of values, however, Y remains a random variable with a certain spread.

To clarify how accurately regression analysis estimates the change in Y as x1, …, xp change, the average value of the variance of Y over different sets of values x1, …, xp is used (in fact, this is a measure of the spread of the dependent variable around the regression line).

Least squares method (calculation of coefficients)

In practice, the regression line is most often sought in the form of a linear function y = b0 + b1x (linear regression) that best approximates the desired curve. This is done using the least squares method, by minimizing the sum of the squared deviations of the actually observed values y_k from their estimates ŷ_k (estimates obtained with the straight line that claims to represent the desired regression relationship):

Σ (y_k − ŷ_k)² → min (k = 1, 2, …, M; M is the sample size).

This approach is based on the well-known fact that the sum of squares Σ (y_k − c)² attains its minimum precisely when c is the mean of the values y_k.

To solve the regression analysis problem by the least squares method, the concept of the residual function is introduced:

σ(b0, b1) = Σ (y_k − b0 − b1·x_k)².

The condition for the minimum of the residual function:

∂σ/∂b0 = 0, ∂σ/∂b1 = 0.

The resulting system is a system of linear equations in the unknowns b0 and b1.

If we represent the free terms from the left-hand sides of the equations as a matrix C, and the coefficients of the unknowns from the right-hand sides as a matrix A, we obtain the matrix equation A·B = C, which is easily solved by the Gauss method. The resulting matrix B contains the coefficients of the regression line equation: B = A⁻¹·C.

To obtain the best estimates, the preconditions of OLS (the Gauss–Markov conditions) must be fulfilled. In the English-language literature, such estimates are called BLUE (Best Linear Unbiased Estimators).
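A sketch of this matrix route for a straight line y = b0 + b1·x, with np.linalg.solve standing in for Gaussian elimination; the data are illustrative.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Normal equations in matrix form A . B = C:
#   n*b0  + (sum x)*b1  = sum y
#   (sum x)*b0 + (sum x^2)*b1 = sum xy
A = np.array([[len(x), x.sum()],
              [x.sum(), (x ** 2).sum()]])
C = np.array([y.sum(), (x * y).sum()])

b0, b1 = np.linalg.solve(A, C)
print(b0, b1)   # approx. 1.04 and 1.99
```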

Interpretation of Regression Parameters

The parameters βi are partial correlation coefficients; βi² is interpreted as the proportion of the variance of Y explained by Xi when the influence of the remaining predictors is fixed, i.e. it measures the individual contribution of Xi to the explanation of Y. In the case of correlated predictors, the problem of uncertainty in the estimates arises: they become dependent on the order in which the predictors are included in the model. In such cases, it is necessary to use the methods of correlation analysis and stepwise regression analysis.

When speaking about nonlinear regression models, it is important to distinguish whether the nonlinearity is in the independent variables (from a formal point of view, this case is easily reduced to linear regression) or in the estimated parameters (this case causes serious computational difficulties). In the case of nonlinearity of the first type, from a substantive point of view it is important to single out the appearance in the model of terms of the form x1·x2, x1·x3, etc., indicating the presence of interactions between the features (see Multicollinearity).


Regression analysis is one of the most popular methods of statistical research. It can be used to establish the degree of influence of independent variables on the dependent variable. Microsoft Excel has tools designed to perform this type of analysis. Let's look at what they are and how to use them.

But in order to use the function that allows you to perform regression analysis, you first need to activate the Analysis ToolPak add-in. Only then will the tools necessary for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, we will see a new button in the "Analysis" toolbox on the ribbon - "Data Analysis".

Types of Regression Analysis

There are several types of regression:

  • parabolic;
  • power;
  • logarithmic;
  • exponential;
  • exponential (y = ab^x);
  • hyperbolic;
  • linear regression.

We will talk in more detail below about performing the last type of regression analysis, linear regression, in Excel.

Linear Regression in Excel

Below, as an example, is a table showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Let's use regression analysis to find out exactly how weather conditions, in the form of air temperature, can affect the attendance of a retail outlet.

The general linear regression equation looks as follows: Y = a0 + a1·x1 + … + ak·xk. In this formula, Y is the variable whose dependence on the factors we are trying to study; in our case, it is the number of buyers. The values x are the various factors influencing the variable, the parameters a are the regression coefficients that determine the significance of each factor, and the index k denotes the total number of factors.


Analyzing the results

The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-squared, which indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%. This is an acceptable level of quality; a dependence below 0.5 is considered poor.

Another important indicator is located in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It indicates what value Y will have - in our case, the number of buyers - with all other factors equal to zero. In this table, that value is 58.04.

The value at the intersection of the "Variable X1" row and the "Coefficients" column shows the level of dependence of Y on X - in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly high indicator of influence.

As you can see, it is quite easy to create a regression analysis table in Microsoft Excel. But only a trained person can work with the output data and understand its essence.
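For comparison, the same output can be reproduced outside Excel; here is a sketch with the statsmodels package, using hypothetical temperature/customer figures rather than the table's actual numbers.

```python
import numpy as np
import statsmodels.api as sm

temperature = np.array([14.0, 16.0, 19.0, 22.0, 25.0, 27.0])   # hypothetical
customers   = np.array([80.0, 85.0, 88.0, 92.0, 95.0, 99.0])   # hypothetical

model = sm.OLS(customers, sm.add_constant(temperature)).fit()
print(model.rsquared)   # analogue of Excel's "R-square"
print(model.params)     # analogue of the "Coefficients" column
```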

Regression analysis is a statistical research method that allows you to show the dependence of a particular parameter on one or more independent variables. In the pre-computer era, applying it was quite difficult, especially for large volumes of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of Regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • exponential (y = ab^x);
  • logarithmic.

Example 1

Let's consider the problem of determining the dependence of the number of employees who quit on the average salary at six industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who quit voluntarily were analyzed. In tabular form we have:

Number of people who quit | Salary
 | 30,000 rubles
 | 35,000 rubles
 | 40,000 rubles
 | 45,000 rubles
 | 50,000 rubles
 | 55,000 rubles
 | 60,000 rubles

For the task of determining the dependence of the number of quitting workers on the average salary at the six enterprises, the regression model has the form of the equation Y = a0 + a1·x1 + … + ak·xk, where the xi are the influencing variables, the ai are the regression coefficients, and k is the number of factors.

For this problem, Y is the indicator of quitting employees, and the influencing factor is salary, which we denote by X.

Using the capabilities of the Excel spreadsheet processor

Regression analysis in Excel can be performed with built-in functions applied to the existing tabular data; however, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the "Add-ins" line;
  • click the "Go" button located below, to the right of the "Manage" line;
  • check the box next to the "Analysis ToolPak" name and confirm your actions by clicking "OK".

If everything is done correctly, the required button will appear on the right side of the “Data” tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have all the necessary virtual tools at hand to carry out econometric calculations, we can begin to solve our problem. For this:

  • Click on the “Data Analysis” button;
  • in the window that opens, click on the “Regression” button;
  • in the tab that appears, enter the range of values for Y (the number of quitting employees) and for X (their salaries);
  • We confirm our actions by pressing the “Ok” button.

As a result, the program will automatically fill a new sheet of the spreadsheet with the regression analysis data. Note! Excel allows you to manually set the location you prefer for this output. For example, it could be the same sheet where the Y and X values are located, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-squared

In Excel, the data obtained from processing the example under consideration looks like this:

First of all, you should pay attention to the R-squared value. It represents the coefficient of determination. In this example, R-squared = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the parameters under consideration by 75.5%. The higher the value of the coefficient of determination, the more suitable the selected model is for the specific task. It is considered to describe the real situation correctly when the R-squared value is above 0.8. If R-squared < 0.5, then such a regression analysis in Excel cannot be considered reasonable.

Coefficient Analysis

The number 64.1428 shows what the value of Y will be if all the variables xi in the model we are considering are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors that are not described in the specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that, within the model under consideration, the average monthly salary affects the number of quitters with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract or quit.
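As a sketch, the fitted model can be evaluated at the tabulated salary levels; treating X as salary in thousands of rubles is our assumption, since the text does not state the unit.

```python
# Evaluating the fitted model Y = 64.1428 - 0.16285 * X, where X is
# assumed to be the salary in thousands of rubles (not stated in the text).
a0, a1 = 64.1428, -0.16285

def predicted_quits(salary_thousands):
    return a0 + a1 * salary_thousands

for s in (30, 35, 40, 45, 50, 55, 60):
    print(s, round(predicted_quits(s), 2))   # e.g. 30 -> 59.26
```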

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, … xm) + ε, where y is the resultant characteristic (the dependent variable) and x1, x2, … xm are factor characteristics (independent variables).

Parameter Estimation

For multiple regression (MR), parameters are estimated using the least squares method (OLS). For linear equations of the form Y = a + b1·x1 + … + bm·xm + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case y = a + b1·x1 + b2·x2 + ε. Solving the normal equations leads, in particular, to the relations

b1 = β1·(σy / σx1); b2 = β2·(σy / σx2),

where σ is the standard deviation of the feature indicated in the subscript, and β1, β2 are the standardized regression coefficients.

OLS is also applicable to the MR equation on a standardized scale. In this case we obtain the equation

ty = β1·tx1 + β2·tx2 + … + βm·txm,

in which ty, tx1, …, txm are standardized variables with mean 0 and standard deviation 1, and the βi are the standardized regression coefficients.

Please note that all the βi in this case are normalized and centered, so comparing them with each other is considered correct and acceptable. In addition, it is customary to screen out factors by discarding those with the smallest βi values.

Problem Using Linear Regression Equation

Suppose we have a table of the price dynamics of a specific product N over the past 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

Month number | Month name | Product N price
1 |  | 1750 rubles per ton
2 |  | 1755 rubles per ton
3 |  | 1767 rubles per ton
4 |  | 1760 rubles per ton
5 |  | 1770 rubles per ton
6 |  | 1790 rubles per ton
7 |  | 1810 rubles per ton
8 |  | 1840 rubles per ton

To solve this problem in Excel, use the "Data Analysis" tool, already familiar from the example presented above. Next, select the "Regression" section and set the parameters. Remember that the "Input Y Range" field must contain the range of values of the dependent variable (in this case, the prices of the product in specific months of the year), and the "Input X Range" field the independent variable (the month number). Confirm the action by clicking "OK". On a new sheet (if so specified) we obtain the regression data.

Using them, we construct a linear equation of the form y = ax + b, where the parameters a and b are taken from the sheet with the regression results: a from the "Coefficients" row named after the month-number variable, and b from the "Y-intercept" row. Thus, the linear regression (LR) equation for task 3 is written as:

Product N price = 11.714 × month number + 1727.54,

or in algebraic notation

y = 11.714 x + 1727.54
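The coefficients can be verified from the eight prices in the table; the sketch below reproduces them and extends the trend one month ahead for comparison with the offered 1850 rubles/t.

```python
import numpy as np

months = np.arange(1, 9)
prices = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840])

b, a = np.polyfit(months, prices, 1)   # slope, intercept
print(round(b, 3), round(a, 2))        # 11.714  1727.54
print(round(a + b * 9, 2))             # trend forecast for month 9: 1832.96
```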

Analysis of results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as Fisher's test and Student's t-test. In the Excel table with the regression results they appear as multiple R, R-squared, F-statistic and t-statistic, respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the proportion of the total scatter that is explained: it shows what part of the experimental data, i.e. of the values of the dependent variable, is described by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e. the statistical data are described by the obtained linear equation with a high degree of accuracy.

The F-statistic, also called Fisher's test, is used to evaluate the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's test) helps to evaluate the significance of the coefficient at the unknown and of the free term of the linear relationship. If the value of the t-test > t_cr, the hypothesis about the insignificance of the free term of the linear equation is rejected.

In the problem under consideration, Excel tools give t = 169.20903 and p = 2.89E-12 for the free term, i.e. the probability that the correct hypothesis about its insignificance would be rejected is effectively zero. For the coefficient at the unknown, t = 5.79405 and p = 0.001158; in other words, the probability that the correct hypothesis about the insignificance of the coefficient at the unknown would be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the feasibility of purchasing a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Let's consider a specific application problem.

The management of the NNN company must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the package (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the block of shares according to such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover volume (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the enterprise's wage arrears (VZP), in thousands of US dollars, are used as a parameter.

Solution using Excel spreadsheet processor

First of all, you need to create a table of source data. It looks like this:

  • call up the "Data Analysis" window;
  • select the "Regression" section;
  • in the "Input Y Range" box, enter the range of values of the dependent variable from column G;
  • click the red-arrow icon to the right of the "Input X Range" window and select the range of all the values in columns B, C, D and F on the sheet.

Mark the “New worksheet” item and click “Ok”.

We obtain the regression analysis results for the given problem.

Study of results and conclusions

We "assemble" the regression equation from the rounded data presented above on the Excel worksheet:

SP = 0.103·SOF + 0.541·VO − 0.031·VK + 0.405·VD + 0.691·VZP − 265.844.

In more familiar mathematical form, it can be written as:

y = 0.103·x1 + 0.541·x2 − 0.031·x3 + 0.405·x4 + 0.691·x5 − 265.844

Data for MMM JSC are presented in the table:

Substituting them into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of MMM JSC are not worth purchasing, since their asking price of 70 million US dollars is rather inflated.
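A sketch of this substitution; the input figures below are placeholders, since the MMM data table is not reproduced here — only the coefficients come from the fitted equation.

```python
# The fitted valuation equation; all inputs in millions of USD except
# vzp (wage arrears), which is in thousands of USD.
def estimate_sp(sof, vo, vk, vd, vzp):
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Placeholder inputs, not the actual MMM figures:
print(estimate_sp(sof=500.0, vo=400.0, vk=100.0, vd=150.0, vzp=20.0))
# approx. 73.53 for these placeholder values
```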

As you can see, the use of the Excel spreadsheet processor and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The Excel examples discussed above will help you solve practical problems in the field of econometrics.
