Regression model analysis example. Regression analysis in Microsoft Excel. Mathematical definition of regression

The main goal of regression analysis is to determine the analytical form of the relationship in which the change in the resultant attribute is due to the influence of one or more factor attributes, while all other factors that also affect the resultant attribute are held at constant, average values.
Tasks of regression analysis:
a) Establishing the form of the dependence. By the nature and form of the relationship between phenomena, one distinguishes positive linear and non-linear regression, and negative linear and non-linear regression.
b) Definition of the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values of the dependent variable. Using the regression function, you can reproduce the values of the dependent variable within the interval of given values of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation relating two variables y and x: y = f(x), where y is the dependent variable (resultant attribute) and x is the independent, explanatory variable (factor attribute).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in the explanatory variables:

  • polynomials of different degrees: y = a + b·x + c·x² + ε
  • equilateral hyperbola: y = a + b/x + ε

Regressions that are non-linear in the estimated parameters:

  • power: y = a·x^b·ε
  • exponential: y = a·b^x·ε
  • exponential: y = e^(a+bx)·ε
The construction of the regression equation reduces to estimating its parameters. For regressions that are linear in the parameters, the least squares method (LSM) is used. LSM yields parameter estimates for which the sum of squared deviations of the actual values of the resultant attribute y from the theoretical values ŷₓ is minimal, i.e.

∑(y − ŷₓ)² → min.
For linear equations and for non-linear equations reducible to linear ones, the following system is solved for a and b:

n·a + b·∑x = ∑y
a·∑x + b·∑x² = ∑y·x
You can use the ready-made formulas that follow from this system:

b = (avg(x·y) − x̄·ȳ) / (avg(x²) − x̄²),  a = ȳ − b·x̄
The closeness of the connection between the phenomena under study is estimated by the linear pair correlation coefficient r_xy for linear regression (−1 ≤ r_xy ≤ 1):

r_xy = b·(σₓ/σ_y) = (avg(x·y) − x̄·ȳ) / (σₓ·σ_y)
and by the correlation index ρ_xy for non-linear regression (0 ≤ ρ_xy ≤ 1):

ρ_xy = √( 1 − ∑(y − ŷₓ)² / ∑(y − ȳ)² )
An assessment of the quality of the constructed model will be given by the coefficient (index) of determination, as well as the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n)·∑|(y − ŷₓ)/y|·100%.
The permissible limit for Ā is no more than 8-10%.
The average coefficient of elasticity Ē shows by how many percent, on average, the result y changes from its average value when the factor x changes by 1% from its average value:

Ē = b·(x̄/ȳ).

The task of analysis of variance is to decompose the variance of the dependent variable:

∑(y − ȳ)² = ∑(ŷₓ − ȳ)² + ∑(y − ŷₓ)²

where ∑(y − ȳ)² is the total sum of squared deviations;
∑(ŷₓ − ȳ)² is the sum of squared deviations due to the regression ("explained" or "factorial");
∑(y − ŷₓ)² is the residual sum of squared deviations.
The share of the variance explained by the regression in the total variance of the resultant attribute y is characterized by the coefficient (index) of determination R²:

R² = ∑(ŷₓ − ȳ)² / ∑(y − ȳ)² = 1 − ∑(y − ŷₓ)² / ∑(y − ȳ)²
The coefficient of determination is the square of the correlation coefficient (or of the correlation index).

The F-test, an assessment of the quality of the regression equation, consists in testing the hypothesis H₀ that the regression equation and the indicator of closeness of connection are statistically insignificant. For this, the actual value F_fact is compared with the critical (tabular) value F_table of Fisher's F-criterion. F_fact is determined from the ratio of the factorial and residual variances, each calculated per one degree of freedom:

F_fact = [∑(ŷₓ − ȳ)²/m] / [∑(y − ŷₓ)²/(n − m − 1)] = (r²/(1 − r²))·(n − m − 1)/m,
where n is the number of population units; m is the number of parameters for variables x.
F_table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting the hypothesis H₀ when it is in fact true. Usually α is taken to be 0.05 or 0.01.
If F_table < F_fact, then H₀, the hypothesis of the random nature of the estimated characteristics, is rejected, and their statistical significance and reliability are recognized. If F_table > F_fact, then H₀ is not rejected, and the regression equation is recognized as statistically insignificant and unreliable.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is used and confidence intervals for each indicator are calculated. The hypothesis H₀ of the random nature of the indicators is put forward, i.e., of their insignificant difference from zero. The significance of the regression and correlation coefficients is assessed with Student's t-test by comparing their values with the magnitude of the random error:

t_a = a/m_a;  t_b = b/m_b;  t_r = r_xy/m_r.
The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

m_b = √( ∑(y − ŷₓ)²/(n − 2) ) / √( ∑(x − x̄)² )

m_a = m_b · √( ∑x²/n )

m_r = √( (1 − r²_xy)/(n − 2) )
Comparing the actual and critical (tabular) values of the t-statistics, t_fact and t_table, we accept or reject the hypothesis H₀.
The relationship between Fisher's F-test and Student's t-statistic is expressed by the equality

t²_r = t²_b = F.
If t_table < t_fact, then H₀ is rejected, i.e., a, b, and r_xy do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If t_table > t_fact, then H₀ is not rejected, and the random nature of the formation of a, b, or r_xy is recognized.
To calculate the confidence interval, we determine the marginal error Δ for each indicator:

Δ_a = t_table·m_a,  Δ_b = t_table·m_b.

The formulas for calculating the confidence intervals are as follows:

γ_a = a ± Δ_a;  γ_a min = a − Δ_a;  γ_a max = a + Δ_a
γ_b = b ± Δ_b;  γ_b min = b − Δ_b;  γ_b max = b + Δ_b
If zero falls within the boundaries of the confidence interval, i.e., if the lower limit is negative and the upper limit is positive, the estimated parameter is taken to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value y_p is determined by substituting the corresponding (forecast) value x_p into the regression equation ŷₓ = a + b·x. The average standard error of the forecast m_ŷₓ is calculated:

m_ŷₓ = σ_res · √( 1 + 1/n + (x_p − x̄)² / ∑(x − x̄)² ),

where

σ²_res = ∑(y − ŷₓ)² / (n − m − 1),

and the confidence interval of the forecast is built:

γ_ŷₓ = y_p ± Δ_yp;  γ_ŷₓ min = y_p − Δ_yp;  γ_ŷₓ max = y_p + Δ_yp,

where Δ_ŷₓ = t_table·m_ŷₓ.
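The whole chain above, from the LSM estimates to the confidence intervals, can be collected into one small routine. Below is a minimal Python sketch, assuming numpy and scipy are available; all function and variable names are illustrative:

```python
import numpy as np
from scipy import stats

def pair_regression_report(x, y, alpha=0.05):
    """Pair linear regression y = a + b*x with the diagnostics described above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # LSM estimates via the ready-made formulas: b = cov(x, y) / var(x)
    b = ((x * y).mean() - x.mean() * y.mean()) / x.var()
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    r = b * x.std() / y.std()                    # linear pair correlation coefficient
    r2 = r ** 2                                  # coefficient of determination
    A = np.mean(np.abs((y - y_hat) / y)) * 100   # average approximation error, %
    E = b * x.mean() / y.mean()                  # average coefficient of elasticity

    # F-test (one factor, m = 1)
    F_fact = r2 / (1 - r2) * (n - 2)
    F_table = stats.f.ppf(1 - alpha, 1, n - 2)

    # Random errors of the parameters and the t-statistics
    s_res = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
    m_b = s_res / np.sqrt(np.sum((x - x.mean()) ** 2))
    m_a = m_b * np.sqrt(np.mean(x ** 2))
    m_r = np.sqrt((1 - r2) / (n - 2))
    t_table = stats.t.ppf(1 - alpha / 2, n - 2)

    return {
        "a": a, "b": b, "r": r, "R2": r2, "A_percent": A, "E": E,
        "F_fact": F_fact, "F_table": F_table,
        "t_a": a / m_a, "t_b": b / m_b, "t_r": r / m_r,
        "ci_a": (a - t_table * m_a, a + t_table * m_a),
        "ci_b": (b - t_table * m_b, b + t_table * m_b),
    }
```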

Solution Example

Task No. 1. For seven territories of the Ural region in 199X, the values of two attributes are known.
Table 1

Territory | y, share of expenditures on the purchase of food products in total income, % | x, average daily wage of one worker, rub.
1 | 68.8 | 45.1
2 | 61.2 | 59.0
3 | 59.9 | 57.2
4 | 56.7 | 61.8
5 | 55.0 | 58.8
6 | 54.3 | 47.2
7 | 49.3 | 55.2
Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power (first perform the linearization of the variables by taking the logarithm of both sides);
c) exponential;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model through the average approximation error A and Fisher's F-test.

Solution (Option #1)

1a. To calculate the parameters a and b of the linear regression y = a + b·x (the calculation can be done with a calculator), we solve the system of normal equations for a and b:

n·a + b·∑x = ∑y
a·∑x + b·∑x² = ∑y·x

Based on the initial data, we calculate ∑y, ∑x, ∑y·x, ∑x², ∑y²:
No. | y | x | y·x | x² | y² | ŷₓ | y − ŷₓ | Aᵢ, %
1 | 68.8 | 45.1 | 3102.88 | 2034.01 | 4733.44 | 61.3 | 7.5 | 10.9
2 | 61.2 | 59.0 | 3610.80 | 3481.00 | 3745.44 | 56.5 | 4.7 | 7.7
3 | 59.9 | 57.2 | 3426.28 | 3271.84 | 3588.01 | 57.1 | 2.8 | 4.7
4 | 56.7 | 61.8 | 3504.06 | 3819.24 | 3214.89 | 55.5 | 1.2 | 2.1
5 | 55.0 | 58.8 | 3234.00 | 3457.44 | 3025.00 | 56.5 | −1.5 | 2.7
6 | 54.3 | 47.2 | 2562.96 | 2227.84 | 2948.49 | 60.5 | −6.2 | 11.4
7 | 49.3 | 55.2 | 2721.36 | 3047.04 | 2430.49 | 57.8 | −8.5 | 17.2
Total | 405.2 | 384.3 | 22162.34 | 21338.41 | 23685.76 | 405.2 | 0.0 | 56.7
Mean (Total/n) | 57.89 | 54.90 | 3166.05 | 3048.34 | 3383.68 | × | × | 8.1
σ | 5.74 | 5.86 | × | × | × | × | × | ×
σ² | 32.92 | 34.34 | × | × | × | × | × | ×


b = (avg(y·x) − ȳ·x̄) / (avg(x²) − x̄²) = (3166.05 − 57.89·54.90) / (3048.34 − 54.90²) ≈ −0.35

a = ȳ − b·x̄ = 57.89 + 0.35·54.9 ≈ 76.88

Regression equation: ŷ = 76.88 − 0.35·x. Thus, with an increase in the average daily wage by 1 rub., the share of expenditures on the purchase of food products decreases by an average of 0.35 percentage points.
We calculate the linear pair correlation coefficient:

r_xy = b·(σₓ/σ_y) = −0.35·(5.86/5.74) ≈ −0.357
The connection is moderate and inverse.
Let's determine the coefficient of determination: r²_xy = (−0.357)² = 0.127.
Thus, 12.7% of the variation in the result is explained by the variation in the factor x. Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values ŷₓ. Let us find the value of the average approximation error Ā:

Ā = (1/n)·∑|(y − ŷₓ)/y|·100% = 56.7/7 = 8.1%
On average, the calculated values ​​deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

F_fact = (r²/(1 − r²))·(n − 2) = (0.127/0.873)·5 ≈ 0.7,

which is below F_table = 6.61 (α = 0.05; k₁ = 1; k₂ = 5).
The obtained value indicates that the hypothesis H₀ about the random nature of the revealed dependence must be accepted: the parameters of the equation and the indicator of closeness of connection are statistically insignificant.
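For a quick check, the same figures can be reproduced with a short numpy script; this is a sketch using the seven territories from Table 1:

```python
import numpy as np

x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])  # average daily wage, rub.
y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])  # share of food expenditures, %

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()   # slope
a = y.mean() - b * x.mean()                            # intercept
y_hat = a + b * x
r = b * x.std() / y.std()                              # pair correlation coefficient
A = np.mean(np.abs((y - y_hat) / y)) * 100             # average approximation error, %
F = r**2 / (1 - r**2) * (len(x) - 2)                   # F-criterion (m = 1)
print(f"a={a:.2f}, b={b:.3f}, r={r:.3f}, A={A:.1f}%, F={F:.2f}")
# expected, up to rounding: a ≈ 76.9, b ≈ -0.35, r ≈ -0.35, A ≈ 8.1%, F ≈ 0.7
```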
1b. The construction of the power model y = a·x^b is preceded by linearization of the variables, done by taking the logarithm of both sides of the equation:

lg y = lg a + b·lg x
Y = C + b·X,

where Y = lg(y), X = lg(x), C = lg(a).

For the calculations, we use the data in Table 1.3.
Table 1.3

No. | Y | X | Y·X | Y² | X² | ŷₓ | y − ŷₓ | (y − ŷₓ)² | Aᵢ, %
1 | 1.8376 | 1.6542 | 3.0398 | 3.3768 | 2.7364 | 61.0 | 7.8 | 60.8 | 11.3
2 | 1.7868 | 1.7709 | 3.1642 | 3.1927 | 3.1361 | 56.3 | 4.9 | 24.0 | 8.0
3 | 1.7774 | 1.7574 | 3.1236 | 3.1592 | 3.0885 | 56.8 | 3.1 | 9.6 | 5.2
4 | 1.7536 | 1.7910 | 3.1407 | 3.0751 | 3.2077 | 55.5 | 1.2 | 1.4 | 2.1
5 | 1.7404 | 1.7694 | 3.0795 | 3.0290 | 3.1308 | 56.3 | −1.3 | 1.7 | 2.4
6 | 1.7348 | 1.6739 | 2.9039 | 3.0095 | 2.8019 | 60.2 | −5.9 | 34.8 | 10.9
7 | 1.6928 | 1.7419 | 2.9487 | 2.8656 | 3.0342 | 57.4 | −8.1 | 65.6 | 16.4
Total | 12.3234 | 12.1587 | 21.4003 | 21.7078 | 21.1355 | 403.5 | 1.7 | 197.9 | 56.3
Mean | 1.7605 | 1.7370 | 3.0572 | 3.1011 | 3.0194 | × | × | 28.27 | 8.0
σ | 0.0425 | 0.0484 | × | × | × | × | × | × | ×
σ² | 0.0018 | 0.0023 | × | × | × | × | × | × | ×

We calculate b and C:

b = (avg(Y·X) − Ȳ·X̄) / σ²_X ≈ −0.298

C = Ȳ − b·X̄ = 1.7605 + 0.298·1.7370 = 2.278

We get the linear equation: Y = 2.278 − 0.298·X.
Taking antilogarithms, we get: ŷₓ = 10^2.278 · x^(−0.298)
Substituting the actual values of x into this equation, we obtain the theoretical values of the result. From them we calculate the indicators of closeness of connection (the correlation index ρ_xy) and the average approximation error Ā:

ρ_xy = √(1 − 197.9/230.4) ≈ 0.376,  Ā = 8.0%
The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.

1c. The construction of the equation of the exponential curve y = a·b^x is preceded by linearization of the variables by taking the logarithm of both sides of the equation:

lg y = lg a + x·lg b
Y = C + B·x,

where Y = lg(y), C = lg(a), B = lg(b).
For calculations, we use the table data.

No. | Y | x | Y·x | Y² | x² | ŷₓ | y − ŷₓ | (y − ŷₓ)² | Aᵢ, %
1 | 1.8376 | 45.1 | 82.8758 | 3.3768 | 2034.01 | 60.7 | 8.1 | 65.61 | 11.8
2 | 1.7868 | 59.0 | 105.4212 | 3.1927 | 3481.00 | 56.4 | 4.8 | 23.04 | 7.8
3 | 1.7774 | 57.2 | 101.6673 | 3.1592 | 3271.84 | 56.9 | 3.0 | 9.00 | 5.0
4 | 1.7536 | 61.8 | 108.3725 | 3.0751 | 3819.24 | 55.5 | 1.2 | 1.44 | 2.1
5 | 1.7404 | 58.8 | 102.3355 | 3.0290 | 3457.44 | 56.4 | −1.4 | 1.96 | 2.5
6 | 1.7348 | 47.2 | 81.8826 | 3.0095 | 2227.84 | 60.0 | −5.7 | 32.49 | 10.5
7 | 1.6928 | 55.2 | 93.4426 | 2.8656 | 3047.04 | 57.5 | −8.2 | 67.24 | 16.6
Total | 12.3234 | 384.3 | 675.9974 | 21.7078 | 21338.41 | 403.4 | −1.8 | 200.78 | 56.3
Mean | 1.7605 | 54.9 | 96.5711 | 3.1011 | 3048.34 | × | × | 28.68 | 8.0
σ | 0.0425 | 5.86 | × | × | × | × | × | × | ×
σ² | 0.0018 | 34.339 | × | × | × | × | × | × | ×

The values of the regression parameters A and B amounted to:

B = (avg(Y·x) − Ȳ·x̄) / σ²ₓ = (96.5711 − 1.7605·54.9)/34.34 ≈ −0.0023

A = Ȳ − B·x̄ = 1.7605 + 0.0023·54.9 = 1.887

The linear equation obtained is Y = 1.887 − 0.0023·x. Taking antilogarithms, we write it in the usual form:

ŷₓ = 10^1.887 · 10^(−0.0023·x) = 77.1 · 0.9947^x
We estimate the closeness of the connection through the correlation index ρ_xy:

ρ_xy = √(1 − 200.78/230.4) ≈ 0.36

1d. The equilateral hyperbola y = a + b/x is linearized by the substitution z = 1/x, which gives the linear model y = a + b·z. Of its calculation table, only a fragment survives in the source (columns: No., y, z, y·z, z², y², ŷₓ, y − ŷₓ, (y − ŷₓ)², Aᵢ; the first rows are lost):

3 | … | … | … | … | 3588.01 | 56.9 | 3.0 | 9.00 | 5.0
4 | 56.7 | 0.0162 | 0.9175 | 0.000262 | 3214.89 | 55.5 | 1.2 | 1.44 | 2.1
5 | 55.0 | 0.0170 | 0.9354 | 0.000289 | 3025.00 | 56.4 | −1.4 | 1.96 | 2.5
6 | 54.3 | 0.0212 | 1.1504 | 0.000449 | 2948.49 | 60.8 | −6.5 | 42.25 | 12.0
7 | 49.3 | 0.0181 | 0.8931 | 0.000328 | 2430.49 | 57.5 | −8.2 | 67.24 | 16.6
Total | 405.2 | 0.1291 | 7.5064 | 0.002413 | 23685.76 | 405.2 | 0.0 | 194.90 | 56.5
Mean | 57.9 | 0.0184 | 1.0723 | 0.000345 | 3383.68 | × | × | 27.84 | 8.1
σ | 5.74 | 0.002145 | × | × | × | × | × | × | ×
σ² | 32.9476 | 0.000005 | × | × | × | × | × | × | ×
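All three non-linear models can be fitted through the same linearizations. Here is a small Python sketch (numpy assumed; the printed coefficients should agree with the document's up to intermediate rounding, and the hyperbola's coefficients, not given in the surviving text, are computed from the data):

```python
import numpy as np

x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])
y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])

def lsm(u, v):
    """Intercept and slope of v = c + b*u by least squares."""
    b = ((u * v).mean() - u.mean() * v.mean()) / u.var()
    return v.mean() - b * u.mean(), b

# Power model: lg y = lg a + b*lg x
C, b = lsm(np.log10(x), np.log10(y))
print(f"power: y = {10**C:.1f} * x^{b:.3f}")            # ≈ 190 * x^-0.30

# Exponential model: lg y = lg a + x*lg b
C, B = lsm(x, np.log10(y))
print(f"exponential: y = {10**C:.1f} * {10**B:.4f}^x")  # ≈ 77.1 * 0.9947^x

# Equilateral hyperbola: y = a + b*z with z = 1/x
a, b = lsm(1.0 / x, y)
print(f"hyperbola: y = {a:.1f} + {b:.0f}/x")            # ≈ 38.4 + 1054/x
```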

Regression and correlation analysis are statistical research methods. They are the most common ways to show how a parameter depends on one or more independent variables.

Below, using concrete practical examples, we consider these two analyses, which are very popular among economists, and give an example of obtaining results when they are combined.

Regression Analysis in Excel

Regression analysis shows the influence of some values (independent variables) on a dependent variable. For example, how the number of economically active people depends on the number of enterprises, wages, and other parameters. Or: how foreign investment, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to set priorities and, based on the main factors, to predict and plan the development of priority areas and make managerial decisions.

Regression can be:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx²);
  • exponential (y = a·exp(bx));
  • power (y = a·x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b·ln(x) + a);
  • exponential with base b (y = a·b^x).

Consider an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At six enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine the dependence of the number of employees who quit on the average salary.

The linear regression model has the following form:

Y = a₀ + a₁x₁ + … + a_k·x_k,

Where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the number of employees who quit. The influencing factor is the salary (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.

Activate this powerful analytical tool:

  1. Click "File" → "Options" → "Add-ins".
  2. In the "Manage" box at the bottom, select "Excel Add-ins" and click "Go".
  3. Check "Analysis ToolPak" and click OK.

Once activated, the add-in is available on the "Data" tab as the "Data Analysis" button.

Now let's deal directly with the regression analysis. Open "Data" → "Data Analysis" and select the "Regression" tool. In the dialog, specify the Input Y Range (the column with the number of employees who quit) and the Input X Range (the column with salaries), choose an output location, and click OK. Excel prints a summary report.
First of all, we pay attention to the R-square and coefficients.

R-square is the coefficient of determination. In our example, it is 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the parameters under study. The higher the coefficient of determination, the better the model: above 0.8 is good; below 0.5 is poor (such an analysis can hardly be considered reasonable). In our example it is "not bad".

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration equal 0. That is, the value of the analyzed parameter is also affected by other factors not described in the model.

The coefficient −0.16285 shows the weight of variable X on Y. That is, within this model, the average monthly salary affects the number of employees who quit with a weight of −0.16285 (a small degree of influence). The "−" sign indicates a negative impact: the higher the salary, the fewer quit. Which is fair.
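The two numbers that matter most in this report, the coefficients and R-square, can also be reproduced outside Excel. A sketch with made-up data for six enterprises (the task's actual source table is not shown in this text, so the figures below are purely illustrative):

```python
import numpy as np

# Hypothetical data for 6 enterprises
salary = np.array([35.0, 42.0, 50.0, 38.0, 55.0, 45.0])   # average monthly salary, x
quit   = np.array([58.0, 52.0, 56.0, 60.0, 55.0, 57.0])   # number of employees who quit, Y

# The same quantities the Analysis ToolPak reports
b, a = np.polyfit(salary, quit, 1)            # slope and intercept
r2 = np.corrcoef(salary, quit)[0, 1] ** 2     # R-square
print(f"Y = {a:.4f} + ({b:.4f})*x, R^2 = {r2:.3f}")
```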



Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative correlation) in the other? Correlation analysis helps the analyst determine whether the value of one indicator can predict the possible value of another.

The correlation coefficient is denoted r and varies from +1 to −1. The classification of correlation strength differs from field to field. When the coefficient is 0, there is no linear relationship between the samples.

Consider how to use Excel to find the correlation coefficient.

The CORREL function is used to find the paired coefficients.

Task: determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Put the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. Argument "Array 1" is the first range of values, the operating time of the machine: A2:A14.
  3. Argument "Array 2" is the second range of values, the cost of repairs: B2:B14. Click OK.

To determine the type of connection, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than two), it is more convenient to use "Data Analysis" (the "Analysis ToolPak" add-in). Select "Correlation" in the list and designate the data array. That's all.

The resulting coefficients will be displayed in a correlation matrix, like this one:
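Outside Excel, the same matrix comes from numpy. A sketch with hypothetical columns (machine hours, repair cost, equipment age; all values invented for illustration):

```python
import numpy as np

# Each row below is one observation: [machine hours, repair cost, equipment age]
observations = np.array([
    [12, 300, 2],
    [18, 420, 3],
    [23, 510, 4],
    [30, 690, 5],
    [36, 720, 7],
])

# np.corrcoef expects variables in rows, so transpose; the result is a
# symmetric matrix of pairwise r values with ones on the diagonal.
print(np.corrcoef(observations.T))
```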

Correlation-regression analysis

In practice, these two techniques are often used together.


Regression analysis

Regression (linear) analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. The independent variables are also called regressors or predictors, and the dependent variables are called criterion variables. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables (see spurious correlation), not a cause-and-effect relationship.

Goals of regression analysis

  1. Determination of the degree of determinism of the variation of the criterion (dependent) variable by predictors (independent variables)
  2. Predicting the value of the dependent variable using the independent variable(s)
  3. Determination of the contribution of individual independent variables to the variation of the dependent

Regression analysis cannot be used to determine whether there is a relationship between variables, since the existence of such a relationship is a prerequisite for applying the analysis.

Mathematical definition of regression

Strictly, a regression dependence can be defined as follows. Let X₁, X₂, …, Xₚ, Y be random variables with a given joint probability distribution. If for each set of values x₁, …, xₚ a conditional expectation

y(x₁, …, xₚ) = E(Y | X₁ = x₁, …, Xₚ = xₚ)   (the general regression equation)

is defined, then the function y(x₁, …, xₚ) is called the regression of Y on X₁, …, Xₚ, and its graph is called the regression line of Y on X₁, …, Xₚ, or the regression equation.

The dependence of Y on X₁, …, Xₚ is manifested in the change in the average values of Y as x₁, …, xₚ change, although for each fixed set of values the quantity Y remains a random variable with a certain dispersion.

To clarify how accurately regression analysis estimates the change in Y as x₁, …, xₚ change, the average value of the dispersion of Y for different sets of values x₁, …, xₚ is used (in fact, this is the measure of dispersion of the dependent variable around the regression line).

Least squares method (calculation of coefficients)

In practice, the regression line is most often sought as a linear function Y = b₀ + b₁·x (linear regression) that best approximates the desired curve. This is done by the method of least squares, which minimizes the sum of squared deviations of the actually observed yᵢ from their estimates ŷᵢ (meaning estimates using the straight line that claims to represent the desired regression dependence):

∑ᵢ₌₁..M (yᵢ − ŷᵢ)² → min

(M is the sample size). This approach rests on the well-known fact that a sum of squared deviations ∑(yᵢ − c)² attains its minimum when c is the mean of the yᵢ; least squares carries the same idea over to the regression line.

To solve the problem of regression analysis by the least squares method, the concept of the residual function is introduced:

φ(b₀, b₁) = ∑ᵢ₌₁..M (yᵢ − (b₀ + b₁xᵢ))²

The condition for the minimum of the residual function:

∂φ/∂b₀ = 0,  ∂φ/∂b₁ = 0

The resulting system is a system of linear equations in the unknowns b₀, b₁:

b₀·M + b₁·∑xᵢ = ∑yᵢ
b₀·∑xᵢ + b₁·∑xᵢ² = ∑xᵢyᵢ

Representing the free terms by the vector B = (∑yᵢ; ∑xᵢyᵢ) and the coefficients of the unknowns by the matrix

A = ( M, ∑xᵢ ; ∑xᵢ, ∑xᵢ² ),

we get the matrix equation A·b = B, which is easily solved by the Gauss method. The resulting vector b contains the coefficients of the regression line equation: b = (b₀; b₁).
To obtain the best estimates, the LSM prerequisites (the Gauss-Markov conditions) must be fulfilled. In the English-language literature, such estimates are called BLUE (Best Linear Unbiased Estimators).
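A sketch of this matrix route in Python (numpy assumed; the data are illustrative): build the design matrix, form the normal equations, and solve them by Gaussian elimination.

```python
import numpy as np

# Illustrative observations: one predictor x, response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.2, 7.1, 8.9, 11.2, 12.8])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X^T X) b = X^T y, solved directly
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [b0, b1]; for these data, close to [3.25, 1.93]
```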

Interpreting Regression Parameters

The parameters bᵢ are partial correlation coefficients; bᵢ² is interpreted as the proportion of the variance of Y explained by xᵢ with the influence of the remaining predictors fixed, i.e., it measures the individual contribution of xᵢ to the explanation of Y. In the case of correlated predictors, there is a problem of uncertainty in the estimates, which become dependent on the order in which the predictors are included in the model. In such cases, it is necessary to apply the methods of correlation analysis and stepwise regression analysis.

Speaking about non-linear regression models, it is important to distinguish non-linearity in the independent variables (which, from a formal point of view, is easily reduced to linear regression) from non-linearity in the estimated parameters (which causes serious computational difficulties). With non-linearity of the first type, it is important from a substantive point of view to single out terms of the form x₁·x₂, x₁², etc., which indicate the presence of interactions between features (see Multicollinearity).


The main feature of regression analysis is that it can be used to obtain specific information about the form and nature of the relationship between the variables under study.

The sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

    Task formulation. At this stage, preliminary hypotheses about the dependence of the studied phenomena are formed.

    Definition of dependent and independent (explanatory) variables.

    Collection of statistical data. Data must be collected for each of the variables included in the regression model.

    Formulation of a hypothesis about the form of connection (simple or multiple, linear or non-linear).

    Determination of the regression function (calculating the numerical values of the parameters of the regression equation).

    Evaluation of the accuracy of regression analysis.

    Interpretation of the obtained results. The results of the regression analysis are compared with preliminary hypotheses. The correctness and plausibility of the obtained results are evaluated.

    Prediction of unknown values ​​of the dependent variable.

With the help of regression analysis, it is possible to solve forecasting and classification problems. Forecast values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved in this way: the regression line divides the entire set of objects into two classes; the part of the set where the value of the function is greater than zero belongs to one class, and the part where it is less than zero belongs to the other. A sketch of this reading follows.
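A toy illustration of that classification reading (assumed setup: class labels are coded as −1/+1 and a linear function fitted to them serves as the decision boundary; all data invented):

```python
import numpy as np

# Objects described by one feature x, labeled -1 or +1
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labels = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

# Fit a linear function to the labels by least squares
b, a = np.polyfit(x, labels, 1)

# Classify new objects by the sign of the fitted function
predict = lambda t: np.sign(a + b * t)   # > 0 -> class +1, < 0 -> class -1
print(predict(np.array([2.5, 7.5])))     # [-1.  1.]
```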

Tasks of regression analysis

Consider the main tasks of regression analysis: establishing the form of the dependence, determining the regression function, and estimating unknown values of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

    positive linear regression (expressed as a uniform growth of the function);

    positive uniformly accelerating regression;

    positive uniformly decelerating regression;

    negative linear regression (expressed as a uniform drop in function);

    negative uniformly accelerated decreasing regression;

    negative uniformly decelerated decreasing regression.

However, the varieties described are usually not found in pure form, but in combination with each other. In this case, one speaks of combined forms of regression.

Definition of the regression function.

The second task is to find out the effect on the dependent variable of the main factors or causes, all other things being equal, and with the impact of random elements on the dependent variable excluded. The regression function is defined as a mathematical equation of one type or another.

Estimation of unknown values ​​of the dependent variable.

The solution of this problem is reduced to solving a problem of one of the following types:

    Estimation of the values ​​of the dependent variable within the considered interval of the initial data, i.e. missing values; this solves the problem of interpolation.

    Estimating the future values ​​of the dependent variable, i.e. finding values ​​outside the given interval of the initial data; this solves the problem of extrapolation.

Both problems are solved by substituting the found estimates of the parameters of the values ​​of the independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

The linearity assumption: it is assumed that the relationship between the variables under consideration is linear. So, in this example, we built a scatterplot and were able to see a clear linear relationship. If the scatterplot shows a clear absence of a linear relationship, i.e., the relationship is non-linear, non-linear methods of analysis should be used.

The assumption of normality of residuals: it is assumed that the distribution of the differences between predicted and observed values is normal. To visually determine the nature of the distribution, you can use histograms of the residuals.

When using regression analysis, one should take into account its main limitation. It consists in the fact that regression analysis allows you to detect only dependencies, and not the relationships that underlie these dependencies.

Regression analysis makes it possible to assess the degree of association between variables by calculating the expected value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y = a + b·X

Using this equation, the variable Y is expressed in terms of the constant a and the slope of the line b multiplied by the value of the variable X. The constant a is also called the intercept, and the slope b the regression coefficient or B coefficient.

In most cases (if not always) there is a certain scatter of observations about the regression line.

A residual is the deviation of an individual point (observation) from the regression line (from its predicted value).

To solve a regression analysis problem in MS Excel, select "Data Analysis" (the Analysis ToolPak) from the Tools menu and then the "Regression" analysis tool. Specify the X and Y input intervals. The Y input interval is the range of dependent data being analyzed; it must consist of one column. The X input interval is the range of independent data to be analyzed. The number of input ranges must not exceed 16.

At the output of the procedure, we get the report given in tables 8.3a-8.3c.

RESULTS

Table 8.3a. Regression statistics

Regression statistics
Multiple R           0.998364
R-square             0.99673
Normalized R-square  …
Standard error       …
Observations         …

First, consider the upper part of the calculations presented in table 8.3a, the regression statistics.

The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line, i.e., the degree of correspondence between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

In most cases, the R-square value lies between these extreme values, i.e., between zero and one.

If the R-square value is close to one, the constructed model explains almost all of the variability of the relevant variables. Conversely, an R-square value close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the multiple correlation coefficient, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R equals the square root of the coefficient of determination; it takes values in the range from zero to one.

In simple linear regression analysis, multiple R equals the Pearson correlation coefficient. Indeed, multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

               Coefficients    Standard error    t-statistic
Y-intercept    2.694545455     …                 …
Variable X1    2.305454545     …                 …

* A truncated version of the calculations is given.

Now consider the middle part of the calculations, presented in table 8.3b. Here the regression coefficient b (2.305454545) and the offset along the y-axis, i.e., the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545·x + 2.694545455

The direction of the relationship between the variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the output of the residuals. For these results to appear in the report, the "Residuals" checkbox must be activated when launching the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation    Predicted Y    Residuals    Standard residuals
…              …              …            …

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute value of a residual in our case is 0.778, the smallest is 0.043. For a better interpretation of these data, we use the plot of the original data and the constructed regression line presented in Fig. 8.3. As you can see, the regression line is fitted quite accurately to the values of the original data.

It should be borne in mind that the example under consideration is quite simple, and it is far from always possible to construct a linear regression line of such quality.

Fig. 8.3. Initial data and regression line

It remains to consider the problem of estimating unknown future values of the dependent variable based on known values of the independent variable, i.e., the forecasting problem.

Given the regression equation, the forecasting problem reduces to evaluating Y = 2.305454545·x + 2.694545455 at known values of x. The results of predicting the dependent variable Y six steps ahead are presented in table 8.4.
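A sketch of that forecasting step in Python, using the equation fitted above (the last observed x value is not shown in this text, so x = 12 is assumed purely for illustration):

```python
# Forecasting with the fitted equation Y = a + b*x from the report above
a, b = 2.694545455, 2.305454545

def predict(x):
    return a + b * x

# Six steps ahead of an assumed last known value x = 12
for x in range(13, 19):
    print(x, round(predict(x), 2))
```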

Table 8.4. Y variable prediction results

Y (predicted)    …

Thus, as a result of using regression analysis in the Microsoft Excel package, we:

    built a regression equation;

    established the form of dependence and the direction of the relationship between the variables - a positive linear regression, which is expressed in a uniform growth of the function;

    established the direction of the relationship between the variables;

    assessed the quality of the resulting regression line;

    were able to see the deviations of the calculated data from the data of the original set;

    predicted the future values ​​of the dependent variable.

If a regression function is defined, interpreted and justified, and the assessment of the accuracy of the regression analysis meets the requirements, we can assume that the constructed model and predictive values ​​are sufficiently reliable.

The predicted values ​​obtained in this way are the average values ​​that can be expected.

In this paper, we reviewed the main characteristics of descriptive statistics, among them such concepts as the mean, median, maximum, minimum, and other characteristics of data variation.

There was also a brief discussion of the concept of outliers. The characteristics considered refer to so-called exploratory data analysis; its conclusions may apply not to the general population but only to a data sample. Exploratory data analysis is used to draw primary conclusions and form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities of practical use were also considered.

1. The term "regression" was first introduced by the founder of biometrics, F. Galton (19th century), whose ideas were developed by his follower K. Pearson.

Regression analysis is a method of statistical data processing that makes it possible to measure the relationship between one or more causes (factor attributes) and a consequence (the resultant attribute).

An attribute is the main distinguishing feature or property of the phenomenon or process being studied.

The resultant attribute is the indicator under study.

A factor attribute is an indicator that affects the value of the resultant attribute.

The purpose of regression analysis is to evaluate the functional dependence of the average value of the resultant attribute (y) on the factor attributes (x₁, x₂, ..., xₙ), expressed as a regression equation

y = f(x₁, x₂, ..., xₙ). (6.1)

There are two types of regression: paired and multiple.

Paired (simple) regression is an equation of the form:

y = f(x). (6.2)

In paired regression, the resultant attribute is considered as a function of one argument, i.e., of one factor attribute.

Regression analysis includes the following steps:

determining the type of function;

determining the regression coefficients;

calculating the theoretical values of the resultant attribute;

checking the statistical significance of the regression coefficients;

checking the statistical significance of the regression equation.

Multiple regression is an equation of the form:

y = f(x₁, x₂, ..., xₙ). (6.3)

The resultant attribute is considered as a function of several arguments, i.e., of several factor attributes.

2. In order to correctly determine the type of function, it is necessary to find the direction of the connection based on theoretical data.

According to the direction of the connection, the regression is divided into:

· direct regression, which arises when, as the independent variable x increases or decreases, the dependent variable y also correspondingly increases or decreases;

· reverse regression, which arises when, as the independent variable x increases or decreases, the dependent variable y correspondingly decreases or increases.

To characterize the relationships, the following types of paired regression equations are used:

· y = a + bx – linear;

· y = e^(ax + b) – exponential;

· y = a + b/x – hyperbolic;

· y = a + b₁x + b₂x² – parabolic;

· y = a·b^x – exponential (base b), etc.,

where a, b, b₁, b₂ are the coefficients (parameters) of the equation; y is the resultant attribute; x is the factor attribute.

3. The construction of the regression equation reduces to estimating its coefficients (parameters); for this, the least squares method (LSM) is used.

The least squares method makes it possible to obtain parameter estimates for which the sum of squared deviations of the actual values of the resultant attribute y from the theoretical values ŷₓ is minimal, that is,

∑(y − ŷₓ)² → min.

The parameters of the regression equation y = a + bx are estimated by least squares using the formulas:

b = (avg(x·y) − x̄·ȳ) / (avg(x²) − x̄²),  a = ȳ − b·x̄,

where a is the free term and b is the regression coefficient, which shows by how much the resultant attribute y changes on average when the factor attribute x changes by one unit of measure.

4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

Scheme for checking the significance of regression coefficients:

1) H₀: a = 0, b = 0 (the regression coefficients differ insignificantly from zero).

H₁: a ≠ 0, b ≠ 0 (the regression coefficients differ significantly from zero).

2) α = 0.05 (significance level).

3) t_a = a/m_a,  t_b = b/m_b,

where m_a, m_b are the random errors:

m_b = √( ∑(y − ŷₓ)²/(n − 2) ) / √( ∑(x − x̄)² );  m_a = m_b · √( ∑x²/n ). (6.7)

4) t_table(α; f),

where f = n − k − 1 is the number of degrees of freedom (a table value), n is the number of observations, and k is the number of factor attributes "x".

5) If |t_calc| > t_table, then H₀ is rejected, i.e., the coefficient is significant.

If |t_calc| < t_table, then H₀ is accepted, i.e., the coefficient is insignificant.

5. To check the correctness of the constructed regression equation, the Fisher criterion is used.

Scheme for checking the significance of the regression equation:

1) H₀: the regression equation is not significant.

H₁: the regression equation is significant.

2) α = 0.05 (significance level).

3) F_calc = (r² / (1 − r²)) · ((n − k − 1) / k), (6.8)

where n is the number of observations; k is the number of parameters for the variables "x"; y is the actual value of the resultant attribute; ŷₓ is the theoretical value of the resultant attribute; r is the pair correlation coefficient.

4) F_table(α; f₁; f₂),

where f₁ = k and f₂ = n − k − 1 are the numbers of degrees of freedom (table values).

5) If F_calc > F_table, the regression equation is chosen correctly and can be applied in practice.

If F_calc < F_table, the regression equation is chosen incorrectly.

6. The main indicator reflecting the quality of the regression analysis is the coefficient of determination (R²).

The coefficient of determination shows what proportion of the variation of the dependent variable y is accounted for in the analysis and caused by the influence of the factors included in it.

The coefficient of determination (R²) takes values in the range [0, 1]. The regression equation is of good quality if R² ≥ 0.8.

The coefficient of determination is equal to the square of the correlation coefficient, i.e., R² = r².

Example 6.1. Based on the following data, construct and analyze the regression equation:

Solution.

1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the attributes is direct and moderate.

2) Build a paired linear regression equation.

2.1) Make a calculation table.

x | y | x·y | x² | ŷₓ | (y − ŷₓ)²

(The x, y, x·y, and x² data columns did not survive in the source; the surviving per-row values are:)

55.89  47.54  65.70
45.07  15.42  222.83
54.85  34.19  8.11
51.36  5.55  11.27
42.28  45.16  13.84
47.69  1.71  44.77
45.86  9.87  192.05
Sum: 159.45  558.55
Average: 77519.6  22.78  79.79  2990.6

The paired linear regression equation: ŷₓ = 25.17 + 0.087·x.

3) Find the theoretical values ŷₓ by substituting the actual values of x into the regression equation.

4) Plot the actual values y and the theoretical values ŷₓ of the resultant attribute (Figure 6.1). The regression equation turns out to be statistically insignificant, given the weak correlation (r_xy = 0.47) and the small number of observations.

7) Calculate the coefficient of determination: R² = (0.47)² = 0.22. The constructed equation is of poor quality.

Because calculations during regression analysis are quite voluminous, it is recommended to use special programs ("Statistica 10", SPSS, etc.).

Figure 6.2 shows a table with the results of the regression analysis carried out using the program "Statistica 10".

Figure 6.2. The results of the regression analysis carried out using the program "Statistica 10"

