Solving Problems in Econometrics: Solution and Analysis

Suppose we have found these estimates and we can write the equation:

ŷ = a + bX,

where a is the regression constant, the point of intersection of the regression line with the OY axis;

b is the regression coefficient, the slope of the regression line, characterizing the ratio Δy/Δx;

ŷ is the theoretical value of the variable being explained.

As is known, in pair regression the choice of the type of mathematical model can be carried out in three ways:

1. Graphical.

2. Analytical.

3. Experimental.

A graphical method can be used to select a function that describes the observed values. The initial data are plotted on the coordinate plane: the values of the factor attribute on the abscissa axis, the values of the resulting attribute on the ordinate axis. The location of the points shows the approximate shape of the relationship. As a rule, this relationship is curvilinear. If the curvature of the line is small, we can accept the hypothesis of a linear relationship.

Let us represent the consumption function as a scatterplot. To do this, we plot income on the abscissa axis and expenditure on consumption of a conditional product on the ordinate axis. The location of the points corresponding to the pairs of values "income - expenditure on consumption" shows an approximate form of the relationship (Figure 1).

Visually, the diagram alone almost never identifies the best dependence unambiguously.

Let's move on to estimating the parameters a and b of the selected function by the least squares method.

The estimation problem can be reduced to the "classical" problem of finding a minimum. The variables now turn out to be the estimates a and b of the unknown parameters of the proposed relationship between y and x. To find the smallest value of a function, we first find its first-order partial derivatives, then equate each of them to zero and solve the resulting system of equations with respect to the variables. In our case, such a function is the sum of squared deviations S, and the variables are a and b. That is, we must find ∂S/∂a = 0 and ∂S/∂b = 0 and solve the resulting system of equations with respect to a and b.

Let us derive the parameter estimates using the least squares method, assuming that the regression equation has the form ŷ = a + bX. Then the function S has the form

S = Σ(y − a − bx)².

Differentiating S with respect to a, we obtain the first normal equation; differentiating with respect to b, the second:

∂S/∂a = −2Σ(y − a − bx) = 0,

∂S/∂b = −2Σx(y − a − bx) = 0.

After appropriate transformations, we get:

na + bΣx = Σy,
aΣx + bΣx² = Σxy.   (*)

There are simplified rules for constructing a system of normal equations. Let's apply them to a linear function:

1) Multiply each term of the equation ŷ = a + bX by the coefficient of the first parameter (a), that is, by one.

2) We put the summation sign in front of each variable.

3) We multiply the free term of the equation by n.

4) We get the first normal equation: na + bΣx = Σy.

5) We multiply each term of the original equation by the coefficient of the second parameter (b), that is, by x.

6) We put the summation sign in front of each variable.

7) We get the second normal equation: aΣx + bΣx² = Σxy.

According to these rules, a system of normal equations is compiled for any linear function. The rules were first formulated by the English economist R. Pearl.

Equation parameters are calculated using the following formulas:

b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²),   a = ȳ − b·x̄.

Using the initial data in Table 1, let's build the system of normal equations (*) and solve it for the unknowns a and b:


1677 = 11a + 4950b,

790 400 = 4950a + 2 502 500b,

whence a = 93.95 and b = 0.13.

The regression equation looks like:

ŷ = 93.95 + 0.13x.
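As a cross-check, the system can be solved numerically. A minimal sketch, assuming the Table 1 incomes are 200 through 700 in steps of 50 (these values reproduce the sums Σx = 4950, Σx² = 2 502 500 and Σxy = 790 400 used above):

```python
import numpy as np

# Income (x) is assumed from the printed sums; spending (y) is taken from Table 2
x = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700])
y = np.array([120, 129, 135, 140, 145, 151, 155, 160, 171, 182, 189])

n = len(x)
# Least-squares estimates from the normal equations
b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)
a = y.mean() - b * x.mean()
print(a, b)  # approximately 93.95 and 0.13
```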

Let's compare the actual and estimated consumption costs of good A (Table 2).

Table 2. Comparison of actual and calculated expenditures on consumption of good A with a linear relationship:

Group number | Actual spending (y) | Calculated spending (ŷ) | Deviation (y − ŷ)
1 | 120 | 119.95 | 0.05
2 | 129 | 126.45 | 2.55
3 | 135 | 132.95 | 2.05
4 | 140 | 139.45 | 0.55
5 | 145 | 145.95 | −0.95
6 | 151 | 152.45 | −1.45
7 | 155 | 158.95 | −3.95
8 | 160 | 165.45 | −5.45
9 | 171 | 171.95 | −0.95
10 | 182 | 178.45 | 3.55
11 | 189 | 184.95 | 4.05
Total | — | — | ≈0

Let's plot the resulting function ŷ together with a scatterplot of the actual values (y) and the calculated values (ŷ).

The calculated values ​​deviate from the actual ones due to the fact that the relationship between the signs is correlational.

As a measure of the tightness of the relationship, the correlation coefficient is used:

r = b·σx/σy = Σ(x − x̄)(y − ȳ) / (n·σx·σy).

Using the initial data from Table 1, we obtain:

σx = 158;

σy = 20.76;

r = 0.990.

The linear correlation coefficient can take any value ranging from minus 1 to plus 1. The closer the absolute value of the correlation coefficient is to 1, the closer the relationship between the features. The sign of the linear correlation coefficient indicates the direction of the relationship - the positive sign corresponds to the direct relationship, and the minus sign corresponds to the inverse relationship.
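A short sketch of this computation, under the same data assumptions as above:

```python
import numpy as np

x = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700])
y = np.array([120, 129, 135, 140, 145, 151, 155, 160, 171, 182, 189])

sigma_x = x.std()                  # population standard deviation, ~158.11
sigma_y = y.std()                  # ~20.75
b = 0.13
r = b * sigma_x / sigma_y          # ~0.990
print(r, np.corrcoef(x, y)[0, 1])  # the two agree up to rounding
```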

Conclusion: the relationship between the values of x and the corresponding values of y is close and direct.

In our example, the coefficient of determination d = r² = 0.9801.

This means that 98.01% of the variation in expenditure on good A is explained by the variation in income.

The remaining 1.99% may result from:

1) an insufficiently well-chosen form of the relationship;

2) the influence on the dependent variable of other unaccounted-for factors.

Statistical testing of hypotheses.

We put forward a null hypothesis that the regression coefficient is statistically insignificant:

H0: b = 0.

The statistical significance of the regression coefficient is checked using Student's t-test. To do this, we first determine the residual variance per degree of freedom:

s²ost = Σ(yi − ŷi)² / (n − 2) ≈ 9.91,

its square root s ≈ 3.15, and the standard error of the regression coefficient:

se(b) = s / √(Σ(x − x̄)²) ≈ 0.006.

The actual value of Student's t-test for the regression coefficient:

t_b = b / se(b) ≈ 0.13 / 0.006 ≈ 21.7.

The value |t_b| > t_cr (t_cr = 2.26 at the 5% significance level with n − 2 = 9 degrees of freedom) allows us to conclude that the regression coefficient is different from zero (at the corresponding significance level) and, therefore, that there is an influence (relationship) between x and y.

Conclusion: the actual value of Student's t-test exceeds the tabulated one, which means that the null hypothesis is rejected, and with 95% confidence the alternative hypothesis about the statistical significance of the regression coefficient is accepted.

[b − t_cr·se(b), b + t_cr·se(b)] is the 95% confidence interval for b.

Confidence interval covers the true value of the parameter b with a given probability (in this case, 95%).

0.116 < b < 0.144.
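A sketch of the t-test and confidence interval under the same data assumptions as above (scipy is used only for the critical value):

```python
import numpy as np
from scipy import stats

x = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700])
y = np.array([120, 129, 135, 140, 145, 151, 155, 160, 171, 182, 189])
n = len(x)

b = 0.13
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

s2 = np.sum(resid ** 2) / (n - 2)                 # residual variance, ~9.91
se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of b, ~0.006
t_b = b / se_b                                    # ~21.7

t_cr = stats.t.ppf(0.975, n - 2)                  # two-sided 5% critical value, ~2.262
print(t_b > t_cr)                                 # True: b is significant
print(b - t_cr * se_b, b + t_cr * se_b)           # ~(0.116, 0.144)
```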

Let's move on to testing the statistical significance of the correlation and determination coefficients:

r = 0.990;

d = r² = 0.9801.

We put forward a null hypothesis that the regression equation as a whole is statistically insignificant:

H0: r² = 0.

The assessment of the statistical significance of the constructed regression model as a whole is made using Fisher's F-test. The actual value of the F-test for a pair regression equation that is linear in its parameters is defined as

F = s²fact / s²rest = (r² / (1 − r²))·(n − 2),

where s²fact is the variance of the theoretical values ŷ (the explained variation) per degree of freedom;

s²rest is the residual variance per degree of freedom;

r² is the coefficient of determination.

The actual value of Fisher's F-test:

F_f = 443.26 > F_table(0.05; 1; 9) ≈ 5.12.

Conclusion: we reject the null hypothesis and with a probability of 95% accept the alternative hypothesis about the statistical significance of the regression equation.
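A minimal sketch of this F-test, using only the rounded R² quoted above:

```python
from scipy import stats

r2, n, m = 0.9801, 11, 1
F = (r2 / (1 - r2)) * (n - m - 1) / m   # ~443.26
F_cr = stats.f.ppf(0.95, m, n - m - 1)  # ~5.12
print(F, F > F_cr)                      # True: the equation is significant
```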

Variance in statistics is found as the mean square of the deviations of the individual values of a feature from their mean. Depending on the initial data, it is determined by the simple or the weighted variance formula:

1. Simple variance (for ungrouped data) is calculated by the formula:

σ² = Σ(x − x̄)² / n.

2. Weighted variance (for a variation series):

σ² = Σ(x − x̄)²·f / Σf,

where f is the frequency (the number of repetitions of the value x).
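Both formulas as a short sketch (the sample numbers are illustrative only):

```python
import numpy as np

def variance_simple(x):
    """Simple variance for ungrouped data: mean squared deviation from the mean."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 2)

def variance_weighted(values, freqs):
    """Weighted variance for a variation series: values x with frequencies f."""
    x = np.asarray(values, dtype=float)
    f = np.asarray(freqs, dtype=float)
    xbar = np.sum(x * f) / f.sum()
    return np.sum((x - xbar) ** 2 * f) / f.sum()

print(variance_simple([2, 4, 4, 4, 5, 5, 7, 9]))            # 4.0
print(variance_weighted([2, 4, 5, 7, 9], [1, 3, 2, 1, 1]))  # same series grouped: 4.0
```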

An example of finding the variance

Below is a standard example of finding the variance.

Example 1. We have data on the height of a group of 20 correspondence students (values from 159 to 192 cm). It is necessary to build an interval series of the feature distribution, calculate the mean value of the feature, and study its variance.

Let's build an interval grouping. We determine the width of the interval by the formula:

h = (Xmax − Xmin) / n,

where Xmax is the maximum value of the grouping feature;
Xmin is the minimum value of the grouping feature;
n is the number of intervals.

We accept n = 5. The step is h = (192 − 159) / 5 = 6.6.

Let's make an interval grouping

For further calculations, we will build an auxiliary table:

Here x′i is the middle of the interval (for example, the middle of the interval 159–165.6 is (159 + 165.6)/2 = 162.3).

The average height of the students is determined by the weighted arithmetic mean formula:

x̄ = Σx′·f / Σf.

We determine the variance by the formula:

σ² = Σ(x′ − x̄)²·f / Σf.

The variance formula can be transformed as follows:

σ² = Σx′²·f / Σf − x̄².

From this formula it follows that the variance is the difference between the mean of the squares of the values and the square of their mean.
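A quick numerical check of this identity; the midpoints are those of the example above, the frequencies are hypothetical:

```python
import numpy as np

x = np.array([162.3, 168.9, 175.5, 182.1, 188.7])  # interval midpoints
f = np.array([3, 6, 7, 3, 1])                      # hypothetical frequencies, sum = 20

xbar = np.sum(x * f) / f.sum()
direct = np.sum((x - xbar) ** 2 * f) / f.sum()      # definition
shortcut = np.sum(x ** 2 * f) / f.sum() - xbar ** 2 # mean of squares minus squared mean
print(np.isclose(direct, shortcut))                 # True
```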

The variance in a variation series with equal intervals can be calculated by the method of moments, using the second property of variance (dividing all values by the interval width). The variance calculated by the method of moments with the following formula is less laborious:

σ² = i²·(m2 − m1²),

where i is the width of the interval;
A is the conditional zero, for which it is convenient to use the middle of the interval with the highest frequency;
m1 = Σ((x − A)/i)·f / Σf is the first-order moment;
m2 = Σ((x − A)/i)²·f / Σf is the second-order moment.
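A sketch of the method of moments with A and i as parameters; the data are the same hypothetical series as above:

```python
import numpy as np

def variance_moments(midpoints, freqs, A, i):
    """sigma^2 = i^2 * (m2 - m1^2), with z = (x - A) / i."""
    z = (np.asarray(midpoints, dtype=float) - A) / i
    f = np.asarray(freqs, dtype=float)
    m1 = np.sum(z * f) / f.sum()        # first-order moment
    m2 = np.sum(z ** 2 * f) / f.sum()   # second-order moment
    return i ** 2 * (m2 - m1 ** 2)

# A = 175.5 (middle of the densest interval), i = 6.6; the result matches
# the direct weighted-variance formula exactly
print(variance_moments([162.3, 168.9, 175.5, 182.1, 188.7], [3, 6, 7, 3, 1], A=175.5, i=6.6))
```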

The variance of an alternative attribute (if in a statistical population an attribute changes so that there are only two mutually exclusive options, such variability is called alternative) can be calculated by the formula:

σ² = p·q.

Substituting q = 1 − p into this formula, we get:

σ² = p·(1 − p).
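A tiny simulation illustrating σ² = p(1 − p), with a hypothetical share p:

```python
import numpy as np

p = 0.3                                                 # hypothetical share with the attribute
sample = np.random.default_rng(0).random(100_000) < p   # 0/1 alternative attribute
print(sample.var(), p * (1 - p))                        # both ~0.21
```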

Types of variance

Total variance measures the variation of a trait over the entire population as a whole under the influence of all the factors that cause this variation. It is equal to the mean square of the deviations of the individual values of the attribute x from the overall mean x̄ and can be calculated as a simple or a weighted variance.

Within-group variance characterizes random variation, i.e. the part of the variation that is due to the influence of unaccounted-for factors and does not depend on the factor attribute underlying the grouping. It is equal to the mean square of the deviations of the individual values of the attribute x within a group from the arithmetic mean of that group and can be calculated as a simple or a weighted variance.

Thus, within-group variance measures the variation of a trait within a group and is determined by the formula:

σ²i = Σ(x − x̄i)² / ni,

where x̄i is the group mean;
ni is the number of units in the group.

For example, the within-group variances that need to be determined in the task of studying the effect of workers' qualifications on the level of labor productivity in a shop show the variation in output within each group caused by all possible factors (technical condition of equipment, availability of tools and materials, age of workers, labor intensity, etc.), except for differences in qualification category (within a group, all workers have the same qualification).

The average of the within-group variances reflects the random part of the variation, i.e. that which occurred under the influence of all factors except the grouping factor. It is calculated by the formula:

σ̄² = Σσ²i·ni / Σni.

Intergroup variance characterizes the systematic variation of the resulting trait, which is due to the influence of the factor attribute underlying the grouping. It is equal to the mean square of the deviations of the group means from the overall mean and is calculated by the formula:

δ² = Σ(x̄i − x̄)²·ni / Σni.

Variance addition rule in statistics

According to the variance addition rule, the total variance is equal to the sum of the average of the within-group variances and the intergroup variance:

σ² = σ̄² + δ².

The meaning of this rule is that the total variance, which arises under the influence of all factors, is equal to the sum of the variance that arises under the influence of all factors other than the grouping factor and the variance that arises due to the grouping factor.

Using the variance addition rule, one can determine the third, unknown variance from two known ones, and also judge the strength of the influence of the grouping attribute.
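A numerical illustration of the addition rule on hypothetical output data for two qualification groups:

```python
import numpy as np

groups = [np.array([12.0, 14.0, 16.0]),        # hypothetical output, group 1
          np.array([20.0, 22.0, 24.0, 26.0])]  # hypothetical output, group 2
pooled = np.concatenate(groups)

n_i = np.array([g.size for g in groups])
means = np.array([g.mean() for g in groups])
within = np.array([g.var() for g in groups])             # within-group variances

avg_within = np.sum(within * n_i) / n_i.sum()            # average within-group variance
between = np.sum((means - pooled.mean()) ** 2 * n_i) / n_i.sum()  # intergroup variance
print(np.isclose(pooled.var(), avg_within + between))    # True: the addition rule holds
```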

Properties of variance

1. If all values of the attribute are decreased (increased) by the same constant amount, the variance does not change.
2. If all values of the attribute are decreased (increased) by the same factor n, the variance accordingly decreases (increases) by a factor of n².

Econometrics is a science that gives quantitative expression to the interrelationships of economic phenomena and processes. Typical problem topics include:

Correlation-regression method of analysis

Non-parametric measures of association

Heteroscedasticity of the random component

Autocorrelation

  1. Autocorrelation of time series levels. Checking for autocorrelation with the construction of a correlogram;

Econometric methods for conducting expert research

  1. Using analysis of variance, check the null hypothesis about the influence of the factor on the quality of the object.


Time series components

  1. Analytical levelling can be used for analytical smoothing of a time series (with a straight line) and for finding the parameters of the trend equation.
  2. Calculation of the trend equation parameters.
    When choosing the type of trend function, the finite difference method can be used: if the general trend is expressed by a second-order parabola, the second-order finite differences are constant; if the growth rates are approximately constant, an exponential function is used for smoothing (see the sketch after this list).
    When choosing the form of the equation, one should proceed from the amount of information available: the more parameters the equation contains, the more observations are needed for the same degree of estimation reliability.
  3. Smoothing by the moving average method.
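A sketch of the finite-difference check mentioned in item 2, on a hypothetical series that follows a second-order parabola:

```python
import numpy as np

t = np.arange(1.0, 9.0)
y = 2 + 3 * t + 0.5 * t ** 2   # hypothetical trend: second-order parabola

print(np.diff(y))              # first differences: still growing
print(np.diff(y, n=2))         # second differences: constant -> fit a parabola

growth = y[1:] / y[:-1]        # if these ratios were roughly constant,
print(growth)                  # an exponential trend would be appropriate
```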

    Correlation dependence between the factor x (average per capita subsistence level per day of one able-bodied person) and the resulting feature y (average daily wage): the parameters of the linear regression equation and the economic interpretation of the regression coefficient.

y = f(x) + ε, where y_t = f(x) is the theoretical function and ε = y − y_t.

y_t = a + bx is the correlation dependence of the average daily wage (y) on the average per capita subsistence minimum per day of one able-bodied person (x).

The parameters are found from the system of normal equations:

na + bΣx = Σy,

aΣx + bΣx² = Σxy,

whence

b = (Σxy/n − x̄·ȳ) / (Σx²/n − x̄²) — the regression coefficient.

It shows by how many units the average daily wage (y) changes, on average, when the average per capita subsistence minimum per day of one able-bodied person (x) increases by 1 unit.

b = 0.937837482.

This means that with an increase in the per capita subsistence minimum per day of one able-bodied person (x) by 1 unit, the average daily wage will increase by an average of 0.937 units.

a = ȳ − b·x̄, a = 135.4166667 − 0.937837482·86.75 = 54.05926511.

3) Coefficient of variation

The coefficient of variation shows what share of the mean value of a random variable its average spread constitutes.

υx = σx/x̄ = 0.144982838, υy = σy/ȳ = 0.105751299.

4) Correlation coefficient

The correlation coefficient is used to assess the tightness of the linear relationship between the average per capita subsistence minimum per day of one able-bodied person and the average daily wage.

r_xy = b·σx/σy = 0.823674909. Since r_xy > 0, the correlation between the variables is direct.

All this shows the dependence of the average daily wage on the average per capita subsistence minimum per day of one able-bodied person.

5) Coefficient of determination

The coefficient of determination is used to assess the goodness of fit of the linear regression equation.

The coefficient of determination characterizes the share of the variance of the effective attribute Y (average daily wages), explained by regression in the total variance of the effective attribute.

R²xy = Σ(y_t − ȳ)² / Σ(y − ȳ)² = 0.678440355. Since 0.5 < R² < 0.7,

the strength of the relationship is noticeable, close to high, and the regression equation is well chosen.

6) Estimation of the accuracy of the model (the approximation error).

Ā = (1/n)·Σ|(yi − y_t)/yi|·100% is the average approximation error.

An error of less than 5-7% indicates a good selection of the model.

If the error is more than 10%, you should think about choosing a different type of model equation.

Approximation error = 0.015379395·100% ≈ 1.54%, which indicates a good fit of the model to the original data.
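A small helper implementing this error measure (the function name is illustrative):

```python
import numpy as np

def mean_approximation_error(y, y_fit):
    """Average approximation error, in percent: (1/n) * sum|(y - y_fit) / y| * 100."""
    y = np.asarray(y, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    return np.mean(np.abs((y - y_fit) / y)) * 100.0
```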

7) Analysis of variance scheme.

Σ(y − ȳ)² = Σ(y_t − ȳ)² + Σ(yi − y_t)², where n is the number of observations and m is the number of parameters for the variable x.

Variance component | Sum of squares | Degrees of freedom | Variance per degree of freedom
Total | Σ(y − ȳ)² | n − 1 | S²total = Σ(y − ȳ)² / (n − 1)
Factorial | Σ(y_t − ȳ)² | m | S²fact = Σ(y_t − ȳ)² / m
Residual | Σ(yi − y_t)² | n − m − 1 | S²rest = Σ(yi − y_t)² / (n − m − 1)


8) Checking the adequacy of the model by Fisher's F-test (α = 0.05).

The assessment of the statistical significance of the regression equation as a whole is carried out using Fisher's F-test.

H0 — the regression equation is statistically insignificant.

H1 — the regression equation is statistically significant.

F calculated is determined as the ratio of the factor and residual variances, each computed per one degree of freedom.

F calc = S²fact / S²rest = (Σ(y_t − ȳ)² / m) / (Σ(yi − y_t)² / (n − m − 1)) = 1669.585177 / 79.13314895 = 21.09842966.

F table is the maximum possible value of the criterion that could arise under the influence of random factors with the given degrees of freedom, k1 = m, k2 = n − m − 1, and significance level α (α = 0.05).

F table(0.05; 1; n − 2) = F table(0.05; 1; 10) = 4.964602701.

If F table < F calc, then the hypothesis H0 about the random nature of the estimated characteristics is rejected, and the statistical significance and reliability of the regression equation are recognized. Otherwise H0 is not rejected, and the regression equation is recognized as statistically insignificant and unreliable. In our case, F table < F calc; therefore, the statistical significance and reliability of the regression equation are recognized.

9) Evaluation of the statistical significance of the regression and correlation coefficients by Student's t-test (α = 0.05).

Let's check the statistical significance of the parameter b.

Hypothesis H0: b = 0. t_b(calc) = |b| / m_b, where m_b = S_rest / (σx·√n), S_rest = √(S²rest), and n is the number of observations.

m_b = √79.13314895 / (12.57726123·√12) = 0.204174979.

t_b(calc) = 0.937837482 / 0.204174979 = 4.593302697.

t table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom (k = n − 2) and significance level α (α = 0.05); t table = 2.2281. If t(calc) > t table, the hypothesis H0 is rejected, and the significance of the equation parameter is recognized.

In our case, t_b(calc) > t table; therefore, the hypothesis H0 is rejected, and the statistical significance of the parameter b is recognized.

Let's check the statistical significance of the parameter a. Hypothesis H0: a = 0. t_a(calc) = |a| / m_a, where

m_a = S_rest·√(Σx²) / (n·σx), m_a = 17.89736655, t_a(calc) = 54.05926511 / 17.89736655 = 3.020515055.

Since t_a(calc) > t table, the hypothesis H0 is rejected, and the statistical significance of the parameter a is recognized.

Assessment of the significance of the correlation. Let's check the statistical significance of the correlation coefficient.

m_rxy = √((1 − r²xy) / (n − 2)), m_rxy = √((1 − 0.678440355) / 10) = 0.179320842, t_rxy = 0.823674909 / 0.179320842 = 4.593302697.

t_r = t_b and t_r > t table; therefore, the statistical significance of the correlation coefficient is recognized.
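The t-statistics of this example can be reproduced from the summary values quoted above; a sketch:

```python
import numpy as np

n = 12
b, a = 0.937837482, 54.05926511
sigma_x, r = 12.57726123, 0.823674909
S2_rest = 79.13314895                              # residual variance per degree of freedom

m_b = np.sqrt(S2_rest) / (sigma_x * np.sqrt(n))    # ~0.2042
t_b = b / m_b                                      # ~4.593
m_r = np.sqrt((1 - r ** 2) / (n - 2))              # ~0.1793
t_r = r / m_r                                      # ~4.593, equal to t_b in pair regression
print(t_b, t_r)
```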

1. The essence of correlation-regression analysis and its tasks.

2. Definition of regression and its types.

3. Features of the model specification. Reasons for the existence of a random variable.

4. Methods for choosing a paired regression.

5. Method of least squares.

6. Indicators for measuring the closeness and strength of the connection.

7. Estimates of statistical significance.

8. The predicted value of the variable y and the confidence intervals of the forecast.

1. The essence of correlation-regression analysis and its tasks. Economic phenomena, being very diverse, are characterized by many features that reflect certain properties of these processes and phenomena and are subject to interdependent changes. In some cases, the relationship between features turns out to be very close (for example, the hourly output of an employee and his salary), while in other cases such a relationship is not expressed at all or is extremely weak (for example, the gender of students and their academic performance). The closer the relationship between these features, the more accurate the decisions made.

There are two types of dependencies between phenomena and their features:

    functional (deterministic, causal) dependence. It is given in the form of a formula that associates each value of one variable with a strictly defined value of another variable (the influence of random factors is neglected). In other words, a functional dependence is a relationship in which each value of the independent variable x corresponds to a precisely defined value of the dependent variable y. In economics, functional relationships between variables are exceptions to the general rule;

    statistical (stochastic, non-deterministic) dependence - this is a connection of variables, on which the influence of random factors is superimposed, i.e. this is a relationship in which each value of the independent variable x corresponds to a set of values ​​of the dependent variable y, and it is not known in advance which value y will take.

Correlation dependence is a special case of statistical dependence.

Correlation dependence - this is a relationship in which each value of the independent variable x corresponds to a certain mathematical expectation (average value) of the dependent variable y.

Correlation dependence is an “incomplete” dependence, which does not appear in each individual case, but only in average values with a sufficiently large number of cases. For example, it is known that improving the skills of an employee leads to an increase in labor productivity. This statement is often confirmed in practice, but does not mean that two or more workers of the same category/level, engaged in a similar process, will have the same labor productivity.

Correlation dependence is investigated using the methods of correlation and regression analysis.

Correlation-regression analysis allows you to establish the tightness, the direction of the connection and the form of this connection between the variables, i.e. its analytic expression.

The main task of correlation analysis consists in quantitatively determining the closeness of the connection between two signs with a paired connection and between effective and several factor signs with a multifactorial connection and a statistical assessment of the reliability of the established connection.

2. Definition of regression and its types. Regression analysis is the main mathematical and statistical tool of econometrics. Regression is customarily defined as the dependence of the average value of a quantity (y) on some other quantity or on several quantities (x_i).

Depending on the number of factors included in the regression equation, it is customary to distinguish between simple (paired) and multiple regressions.

Simple (paired) regression is a model where the mean value of the dependent (explained) variable y is considered as a function of one independent (explanatory) variable x. Implicitly, pair regression is a model of the form:

ŷ = f(x).

Explicitly:

ŷ = a + bx,

where a and b are estimates of the regression coefficients.

Multiple regression is a model where the average value of the dependent (explained) variable y is considered as a function of several independent (explanatory) variables x1, x2, …, xn. Implicitly, multiple regression is a model of the form:

ŷ = f(x1, x2, …, xn).

Explicitly:

ŷ = a + b1x1 + b2x2 + … + bnxn,

where a and b1, b2, …, bn are estimates of the regression coefficients.

An example of such a model is the dependence of an employee's salary on his age, education, qualifications, length of service, industry, etc.

Regarding the form of dependence, there are:

      linear regression;

      non-linear regression, which implies the existence of non-linear relationships between the factors, expressed by the corresponding non-linear function. Often, models that are non-linear in appearance can be reduced to a linear form, which allows them to be classified as linear (see the sketch below).
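For example, an exponential trend y = A·e^(bx) becomes linear in its parameters after taking logarithms; a sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 30)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(0.0, 0.05, x.size)  # hypothetical data

# ln(y) = ln(A) + b*x is a linear model, so ordinary least squares applies
b_hat, lnA_hat = np.polyfit(x, np.log(y), 1)
print(b_hat, np.exp(lnA_hat))   # close to 0.3 and 2.0
```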

3. Features of the model specification. Reasons for the existence of a random variable. Any econometric study starts with the model specification, i.e. with formulating the type of model based on the relevant theory of the relationship between the variables.

First of all, from the whole range of factors influencing the resulting feature, it is necessary to single out the most significant ones. Pair regression is sufficient if there is a dominant factor that is used as the explanatory variable. The simple regression equation characterizes the relationship between two variables, which manifests itself as a certain regularity only on average over the whole set of observations. In the regression equation, the correlation is represented as a functional dependence expressed by the corresponding mathematical function. In almost every individual case, the value of y consists of two terms:

y = ŷ + ε,

where y is the actual value of the effective feature;

ŷ is the theoretical value of the effective feature, found on the basis of the regression equation;

ε is a random variable that characterizes the deviation of the real value of the resulting feature from the theoretical value found by the regression equation.

The random variable ε is also called the disturbance. It includes the influence of factors not taken into account in the model, random errors, and measurement peculiarities. The presence of a random variable in the model is generated by three sources:

    model specification,

    the selective nature of the source data,

    features of measuring variables.

Specification errors include not only the wrong choice of mathematical function, but also the omission of a significant factor from the regression equation (the use of pair regression instead of multiple regression).

Along with specification errors, sampling errors can occur, since the researcher most often deals with sample data when establishing patterns of relationships between features. Sampling errors also occur due to the heterogeneity of the data in the initial statistical population, which, as a rule, happens when studying economic processes. If the population is heterogeneous, the regression equation has no practical meaning. To obtain a good result, units with abnormal values of the studied traits are usually excluded from the population; even then, the regression results remain sample characteristics.

However, the greatest danger in the practical use of regression methods is measurement errors. If specification errors can be reduced by changing the form of the model (the type of mathematical formula), and sampling errors can be reduced by increasing the amount of initial data, then measurement errors practically nullify all efforts to quantify the relationship between features.

4. Methods for choosing a paired regression. Assuming that measurement errors are kept to a minimum, econometric studies focus on model specification errors. In pair regression, the choice of the type of mathematical function can be done in three ways:

    graphic;

    analytical, i.e. based on the theory of the studied relationship;

    experimental.

When studying the relationship between two features, the graphical method of selecting the type of regression equation is quite illustrative. It is based on the correlation field. (Figure: the main types of curves used in quantifying relationships.)
The class of mathematical functions for describing the relationship between two variables is quite wide; other types of curves are also used.

The analytical method of choosing the type of regression equation is based on studying the material nature of the relationship between the studied features, as well as on a visual assessment of the nature of the relationship. For example, if we are talking about the Laffer curve, which shows the relationship between the progressivity of taxation and budget revenues, then we are dealing with a parabolic curve, while in microanalysis isoquants are hyperbolas.
