Small-sample statistics. Types of samples.

In practice, one quite often has to deal with very small samples, whose size is significantly less than twenty to thirty units. In statistics such samples are called small samples. The need to treat small samples separately arises because the methods of point and interval estimation of sample characteristics discussed above require a sufficiently large sample size.

The concept of small samples. Student distribution

With a large sample, the sample mean and, accordingly, its error are distributed approximately normally, and the correction for the bias of the sample variance is very close to unity and has no practical significance; under these conditions the sampling error only very rarely exceeds this value. The situation is different with a small sample size: with small samples the sample variance is significantly biased, so it would be inappropriate to use the normal distribution function for probabilistic conclusions about the possible magnitude of the error. When the sample size is small, an unbiased variance estimator should always be used:

s² = Σ(xᵢ − x̄)² / (n − 1).

Therefore, to obtain an unbiased estimate of the variance from small-sample data, the sum of squared deviations must be divided by (n − 1) rather than by n. The quantity (n − 1) is called the number of degrees of freedom of variation. In what follows, for brevity, the number of degrees of freedom of variation will be denoted by the Greek letter ν (nu).
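A quick numerical illustration of this correction (a minimal sketch with made-up values; numpy's ddof argument controls the divisor):

```python
import numpy as np

# Hypothetical small sample (n = 6)
x = np.array([10.2, 9.8, 11.1, 10.5, 9.6, 10.9])
n = x.size

biased = np.var(x)            # divides the sum of squared deviations by n
unbiased = np.var(x, ddof=1)  # divides by n - 1 (one degree of freedom is lost to the mean)

print(biased, unbiased, biased * n / (n - 1))  # the last value equals `unbiased`
```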

The problem of estimating sample characteristics based on small samples was first studied by the English mathematician and statistician W. Gosset, who published his work under the pseudonym Student (1908).

Assuming that the characteristic is normally distributed in the general population, and considering, instead of absolute deviations, their ratio to an independently distributed standard deviation, Student derived a distribution that depends only on the sample size. Later (1925) R. Fisher gave a more rigorous derivation of this distribution, which came to be called the Student distribution.

The Student's t value is expressed as the following ratio:

The numerator of this expression is the deviation of the sample mean from the general mean; it reflects the possible deviations of sample means from the general mean and is normally distributed with center equal to zero and variance equal to σ²/n.

It should be especially emphasized that the denominator of this expression cannot be regarded as the average error of the numerator. The denominator is treated here as a random variable distributed independently of the numerator; it is the mean square (standard) deviation of the given sample and is not an estimate of the population value, since the Student distribution does not depend on any parameter of the population. It is determined directly from the sample data.

The distributions of the numerator and the denominator are independent of each other. Only under this condition, and for samples drawn from normal populations, does the Student distribution arise.
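For orientation, a minimal sketch using the modern form of the ratio, t = (x̄ − μ)/(s/√n), with s computed with n − 1 in the denominator; the sample values and the hypothesized general mean are made up for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 11.1, 10.5, 9.6, 10.9])  # hypothetical small sample
mu = 10.0                                          # hypothesized general mean

n = x.size
s = np.std(x, ddof=1)                   # sample standard deviation (n - 1 in the denominator)
t = (x.mean() - mu) / (s / np.sqrt(n))  # Student's t ratio

# scipy's one-sample t test computes the same ratio
t_check, p_value = stats.ttest_1samp(x, mu)
print(t, t_check, p_value)
```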

The main advantage of the Student distribution is that it does not depend on population parameters and deals only with quantities obtained directly from the sample.

The differential law of the Student distribution (its probability density) has the form:

where n is the sample size;

the constant factor corresponds to the maximum ordinate of the distribution curve at t = 0.

Accordingly, the Student distribution function is expressed as:

In other words,

where t_f is the standardized (normalized) difference calculated from the results of a small sample.

The quantities Γ(·) appearing in the density are gamma functions. For a number a > 0 the gamma function is expressed by the improper integral:

Γ(a) = ∫₀^∞ x^(a−1) · e^(−x) dx.

In small samples the sample size n is always a positive integer.

In this case the gamma function always has a finite value and is expressed through factorials:

Γ(n) = (n − 1)!,

hence Γ(n + 1) = n!.

When calculating the gamma function, it is useful to know the following properties:

1) Γ(a + 1) = a · Γ(a);

2) Γ(1) = 1;

3) Γ(1/2) = √π; for example, Γ(3/2) = (1/2)√π. Using this property, one can easily calculate the values of Γ(·) for the half-integer arguments that appear in the distribution density;

4) the function Γ(a) reaches its minimum at a fractional value of the argument (approximately a = 1.46).

Fig 3.1

The general view of the gamma function is shown in Fig. 3.1.

Of the properties of the Student distribution, usually considered in a probability theory course, attention is drawn to the following:

1) The Student distribution is remarkable in that it depends on only one parameter, the sample size, and does not depend on the mean and variance of the population (unlike the normal distribution, which depends on these two parameters).

2) The Student distribution is exact for any sample size, and therefore for small samples, which allows probabilistic conclusions to be made from a small number of observations.

3) As the sample size increases, the value of t approaches the value of z, and the Student distribution approaches the normal distribution; in the limit of an infinitely large sample it becomes normal. In practice, a sample of about 30 units is considered sufficient for the normal approximation.

Figure 3.2

Figure 3.2 shows the relationship between the Student distribution and the normal distribution.

As can be seen from Fig. 3.2, in the tails of the Student distribution curve there is a significantly larger part of the area than under the normal distribution curve for the same values of the normalized deviation. This means that with a small sample size the likelihood of large errors increases markedly: for values of the normalized deviation that are large in absolute value, the area under the Student distribution curve is much larger than under the normal distribution curve.

The magnitude of the discrepancies between the values of the Student distribution function for different sample sizes and the values of the normal distribution function can be judged from Table 3.2, which gives the areas under the distribution curve for different sample sizes at fixed values of the normalized deviation.

Table 3.1

Normal distribution function value

Table 3.2

Probability values ​​for different sample sizes

(Columns: normalized deviation; probability values for small samples of various sizes; probability values for large samples.)

Table 3.2 shows that as the sample size increases, the small-sample distribution quickly approaches the normal one. At the same time, for a very small sample size, the discrepancies between the two distribution functions at a given value of the normalized deviation are very significant.

Research has established that the Student distribution is practically applicable not only when the characteristic is normally distributed in the general population. It turned out to lead to practically acceptable conclusions even when the distribution of the characteristic in the general population is not normal but merely symmetric, or even somewhat asymmetric, provided the sample size is not too small.

The values of the Student distribution function are tabulated for different sample sizes. Therefore, when assessing sample characteristics, ready-made tables are used:

Table 3.3

Function Value Table

The values of the Student distribution function can be used in various ways, depending on the nature of the problem, when determining the probability of deviation of the sample mean from the general mean. The most common tasks are the following:

1) Determining the probability that the difference between the sample mean and the general mean will be less than a certain specified amount. In normalized deviations the task reduces to determining the probability that t will be less than a value t₀ specified by the conditions of the problem, i.e. to finding P(t < t₀).

Figure 3.3

This is the probability of large negative deviations; in Fig. 3.3 it corresponds to the shaded area.

2) Determining the probability that the difference between the sample mean and the general mean will be no less than a certain specified value; in other words, one should find P(t ≥ t₀).

Figure 3.4

This is the probability of large positive deviations, shown as the shaded area in Fig. 3.4. This probability can easily be found from the tables.

3) Determining the probability that the normalized deviation will be less than t₀ in absolute value, i.e. finding P(|t| < t₀).

This is the probability of deviations that are small in absolute value. It can be determined using the tables. Since in practice it is this probability that most often has to be determined, a special table of its values has been compiled (Table 3.3).

A graphical illustration of the probability of deviations of smaller absolute value is given in Fig. 3.5

Figure 3.5

4) Determining the probability that the sampling error will be no less than a certain specified value in absolute terms. In normalized units this is the probability that |t| will be no less than t₀, i.e. P(|t| ≥ t₀).

This is the probability of deviations that are large in absolute value. It is illustrated graphically in Fig. 3.6.

Figure 3.6

To find the probability of deviations that are large in absolute value, there are special tables (Appendix 3); it can also be obtained from the tables already mentioned, since P(|t| ≥ t₀) = 1 − P(|t| < t₀).
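A minimal sketch of how these four probabilities can be computed numerically; scipy's t distribution is parameterized by the number of degrees of freedom, and the values n = 10 and t₀ = 2.0 are chosen arbitrarily:

```python
from scipy import stats

n = 10        # hypothetical small sample size
df = n - 1    # degrees of freedom
t0 = 2.0      # normalized deviation specified by the conditions of the problem

p1 = stats.t.cdf(-t0, df)                        # 1) large negative deviations, P(t < -t0)
p2 = stats.t.sf(t0, df)                          # 2) large positive deviations, P(t >= t0)
p3 = stats.t.cdf(t0, df) - stats.t.cdf(-t0, df)  # 3) deviations small in absolute value, P(|t| < t0)
p4 = 1 - p3                                      # 4) deviations large in absolute value, P(|t| >= t0)

print(p1, p2, p3, p4)
```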

In quality control of goods and in economic research, an experiment is often conducted on the basis of a small sample.

A small sample is understood as a non-continuous statistical survey in which the sample population is formed from a relatively small number of units of the general population. The volume of a small sample usually does not exceed 30 units and may be as small as 4-5 units.

In trade, a small sample is used when a large sample is either impossible or impractical (for example, if the research involves damage or destruction of the items being examined).

The magnitude of the error of a small sample is determined by formulas different from those of sample observation with a relatively large sample size (n > 100). The average error of a small sample μ_m.v. is calculated by the formula:

μ_m.v. = √(σ²_m.v. / n),

where σ²_m.v. is the variance of the small sample.

According to the relation between the corrected and uncorrected variance, we have:

σ₀² = σ² · n / (n − 1).

But since with a small sample the factor n/(n − 1) is significant, the variance of a small sample is calculated taking into account the so-called number of degrees of freedom. The number of degrees of freedom is the number of variants that can take arbitrary values without changing the value of the average. When determining the variance σ²_m.v., the number of degrees of freedom is n − 1:

σ²_m.v. = Σ(xᵢ − x̃)² / (n − 1).

The limiting error of a small sample Δ_m.v. is determined by the formula:

Δ_m.v. = t · μ_m.v.

In this case, the value of the confidence coefficient t depends not only on the given confidence probability but also on the number of sampling units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables, which give the distribution of the standardized deviation

t = (x̃ − x̄) / μ_m.v.

Student's tables are given in textbooks on mathematical statistics. Here are some values ​​from these tables that characterize the probability that the marginal error of a small sample will not exceed t times the average error:

S_t = P[ |x̃ − x̄| ≤ t · μ_m.v. ].

As the sample size increases, the Student distribution approaches the normal one, and at n = 20 it no longer differs much from the normal distribution.

When conducting small sample surveys, it is important to keep in mind that the smaller the sample size, the greater the difference between the Student distribution and the normal distribution. With a minimum sample size (n=4), this difference is quite significant, indicating a decrease in the accuracy of the results of a small sample.

A small sample is used in trade to solve a number of practical problems, above all establishing the limits within which the general average of the characteristic being studied lies.

Since in practice a value of 0.95 or 0.99 is accepted as the confidence probability when conducting a small sample, the corresponding values of the Student distribution are used to determine the maximum sampling error Δ_m.v.
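A minimal sketch of this calculation (the sample values are hypothetical; scipy's t.ppf plays the role of the printed Student table):

```python
import numpy as np
from scipy import stats

x = np.array([14.2, 15.1, 13.8, 14.9, 15.4, 14.5])  # hypothetical small sample
n = x.size

mean = x.mean()
var_mv = np.var(x, ddof=1)        # small-sample variance with n - 1 degrees of freedom
mu_mv = np.sqrt(var_mv / n)       # average error of the small sample

t = stats.t.ppf(0.975, df=n - 1)  # confidence coefficient for confidence probability 0.95
delta_mv = t * mu_mv              # limiting error of the small sample

print(mean - delta_mv, mean + delta_mv)  # limits within which the general average lies
```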

Small-sample statistics

It is generally accepted that small-sample statistics, or, as it is often called, "small n" statistics, originated in the first decade of the 20th century with the publication of the work of W. Gosset, in which he postulated the t-distribution that gained world fame a little later. At the time, Gosset was working as a statistician at the Guinness breweries. One of his duties was to analyze successive batches of barrels of freshly brewed porter. For a reason he never really explained, Gosset experimented with the idea of significantly reducing the number of samples taken from the very large number of barrels in the brewery's warehouses for random quality control of the porter. This led him to postulate the t-distribution. Because the Guinness breweries' bylaws prohibited their employees from publishing research results, Gosset published the results of his experiment, comparing quality-control sampling based on the t-distribution for small samples with sampling based on the traditional z-distribution (normal distribution), anonymously, under the pseudonym "Student"; hence the name Student's t-distribution.

The t-distribution. The theory of the t-distribution, like the theory of the z-distribution, is used to test the null hypothesis that two samples are simply random samples from the same population and that, therefore, the calculated statistics (e.g., the mean and standard deviation) are unbiased estimates of the population parameters. However, unlike the theory of the normal distribution, the theory of the t-distribution for small samples does not require a priori knowledge or exact estimates of the mathematical expectation and variance of the population. Moreover, although testing a difference between the means of two large samples for statistical significance requires the fundamental assumption that the characteristic is normally distributed in the population, the theory of the t-distribution does not require assumptions about the parameters.

It is well known that normally distributed characteristics are described by a single curve, the Gaussian curve, which satisfies the following equation:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)).

With the t-distribution, a whole family of curves is represented by the following formula:

f(t) = [Γ((n + 1)/2) / (√(nπ) · Γ(n/2))] · (1 + t²/n)^(−(n + 1)/2).

This is why the equation for t includes the gamma function, which in mathematics means that as n changes, a different curve will satisfy this equation.

Degrees of freedom

In the equation for t, the letter n denotes the number of degrees of freedom (df) associated with the estimate of the population variance (S²), which represents the second moment of any moment-generating function, such as the equation for the t-distribution. In statistics, the number of degrees of freedom indicates how many characteristics remain free to vary after part of them have been used in a particular form of analysis. In a t-distribution, one of the deviations from the sample mean is always fixed, since the sum of all such deviations must equal zero. This affects the sum of squares when calculating the sample variance as an unbiased estimate of the parameter S² and leads to df being equal to the number of measurements minus one for each sample. Hence, in the formulas and procedures for calculating the t-statistic for testing the null hypothesis, df = n − 2.

The F-distribution. The null hypothesis tested by a t-test is that two samples were randomly drawn from the same population, or were randomly drawn from two different populations with the same variance. What should be done if more than two groups need to be analyzed? The answer to this question was sought for twenty years after Gosset discovered the t-distribution. Two of the most eminent statisticians of the 20th century were directly involved in obtaining it. One is the great English statistician R. A. Fisher, who proposed the first theoretical formulations, the development of which led to the F-distribution; his work on small-sample theory, developing Gosset's ideas, was published in the mid-1920s (Fisher, 1925). The other is George Snedecor, one of a galaxy of early American statisticians, who developed a way to compare two independent samples of any size by calculating the ratio of two estimates of variance. He called this ratio the F-ratio, in honor of Fisher. Snedecor's results led to the F-distribution being specified as the distribution of the ratio of two χ² statistics, each with its own number of degrees of freedom:

From this came Fisher's classic work on analysis of variance, a statistical method explicitly focused on the analysis of small samples.

The sampling distribution of F (where n₁ and n₂ denote the degrees of freedom of the numerator and the denominator) is represented by the following equation:

f(F) = [Γ((n₁ + n₂)/2) / (Γ(n₁/2) · Γ(n₂/2))] · (n₁/n₂)^(n₁/2) · F^((n₁ − 2)/2) · (1 + n₁F/n₂)^(−(n₁ + n₂)/2),  F > 0.

As with the t-distribution, the gamma function indicates that there is a family of distributions that satisfy the equation for F. In this case, however, the analysis involves two df quantities: the number of degrees of freedom for the numerator and for the denominator of the F-ratio.

Tables for estimating t- and F-statistics. When testing the null hypothesis using statistics based on large-sample theory, usually only one lookup table is required, the table of normal deviations (z), which allows one to determine the area under the normal curve between any two z values on the x-axis. However, the tables for the t- and F-distributions are necessarily presented as a set of tables, since these tables are based on a multitude of distributions resulting from varying the number of degrees of freedom. Although the t- and F-distributions are probability density distributions, like the normal distribution for large samples, they differ from the latter in four respects that are used to describe them. The t-distribution, for example, is symmetric (note the t² in its equation) for all df, but becomes increasingly peaked as the sample size decreases. Peaked curves (those with kurtosis greater than normal) tend to be less asymptotic (that is, they lie farther from the x-axis in the tails) than curves with normal kurtosis, such as the Gaussian curve. This difference results in noticeable discrepancies between the points on the x-axis corresponding to the t and z values. With df = 5 and a two-tailed α level of 0.05, t = 2.57, whereas the corresponding z = 1.96. Therefore, t = 2.57 indicates statistical significance at the 5% level. However, in the case of the normal curve, z = 2.57 (more precisely 2.58) already indicates the 1% level of statistical significance. Similar comparisons can be made with the F-distribution, since F is equal to t² when only two groups are compared.

What constitutes a “small” sample?

At one time, the question was raised of how large a sample must be in order to be considered small. There is simply no definite answer to this question. However, the conventional boundary between a small and a large sample is considered to be df = 30. The basis for this somewhat arbitrary decision is the result of comparing the t-distribution with the normal distribution. As noted above, the discrepancy between t and z values tends to increase as df decreases and to decrease as df increases. In fact, t begins to approach z closely long before the limiting case t = z at df = ∞. A simple visual examination of the tabulated values of t shows that this approximation becomes quite close starting from df = 30 and above. The comparative values of t (at df = 30) and z are, respectively: 2.04 and 1.96 for p = 0.05; 2.75 and 2.58 for p = 0.01; 3.65 and 3.29 for p = 0.001.
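These tabulated values can be reproduced, for instance, with scipy (a quick sketch; ppf returns the two-tailed critical value at 1 − α/2):

```python
from scipy import stats

df = 30
for alpha in (0.05, 0.01, 0.001):
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value of t at df = 30
    z_crit = stats.norm.ppf(1 - alpha / 2)   # corresponding critical value of z
    print(f"p = {alpha}: t = {t_crit:.2f}, z = {z_crit:.2f}")
```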

Other statistics for “small” samples

Although statistics such as t and F are specifically designed for use with small samples, they are equally applicable to large samples. There are, however, many other statistical methods intended for the analysis of small samples and often used for this purpose. These are the so-called nonparametric, or distribution-free, methods. Basically, the statistics appearing in these methods are intended to be applied to measurements obtained on scales that do not satisfy the definition of ratio or interval scales; most often these are ordinal (rank) or nominal measurements. Nonparametric statistics do not require assumptions regarding distribution parameters, in particular regarding estimates of dispersion, because ordinal and nominal scales eliminate the very concept of dispersion. For this reason, nonparametric methods are also used for measurements obtained on interval and ratio scales when small samples are analyzed and the basic assumptions required for the use of parametric methods are likely to be violated. These tests, which can reasonably be applied to small samples, include: Fisher's exact probability test, Friedman's two-way nonparametric (rank) analysis of variance, Kendall's rank correlation coefficient τ, Kendall's coefficient of concordance (W), the Kruskal-Wallis H test for nonparametric (rank) one-way analysis of variance, the Mann-Whitney U test, the median test, the sign test, Spearman's rank correlation coefficient r, and the Wilcoxon T test.


Bootstrap, small samples, application in data analysis

Main idea

The bootstrap method was proposed by B. Efron as a development of the jackknife method in 1979.

Let us describe the main idea of ​​bootstrap.

The purpose of data analysis is to obtain the most accurate sample estimates and to generalize the results to the entire population.

The technical term for numerical characteristics computed from a sample is sample statistics.

Basic descriptive statistics are the sample mean, the median, the standard deviation, etc.

Summary statistics such as sample mean, median, correlation will vary from sample to sample.

The researcher needs to know the size of these variations as a function of the population. Based on this, the margin of error is calculated.

The initial picture of all possible values ​​of a sample statistic in the form of a probability distribution is called a sampling distribution.

The key issue is the sample size. What if the sample size is small? One reasonable approach is to randomly re-extract data from the existing sample.

The idea of the bootstrap is to use the sample itself as a "fictitious population" for determining the sampling distribution of a statistic. In effect, a large number of "phantom" samples, called bootstrap samples, are analyzed.

Usually several thousand samples are randomly generated; from this set one can find the bootstrap distribution of the statistic of interest.

So, suppose we have a sample; at the first step we randomly select one of its elements, return this element to the sample, again randomly select an element, and so on.

Let us repeat the described random selection procedure n times.

In the bootstrap, random selection is made with replacement: a selected element of the original sample is returned to the pool and can then be selected again.

Formally, at each step we select an element of the original sample with probability 1/n.

In total we have n elements of the original sample; the probability of obtaining a sample with occurrence counts (N₁, ..., Nₙ), where each Nᵢ varies from 0 to n, is described by a multinomial distribution.

Several thousand such samples are generated, which is quite achievable for modern computers.

For each sample, an estimate of the quantity of interest is constructed, and then the estimates are averaged.

Since there are many samples, we can build the empirical distribution function of the estimates, then calculate quantiles and construct a confidence interval.
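A minimal sketch of this resampling loop (the data are hypothetical; B = 2000 bootstrap samples are drawn and a 95% percentile interval for the mean is reported):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([8.1, 7.4, 9.3, 6.8, 10.2, 7.9, 8.8, 9.6, 7.2, 8.4])  # hypothetical sample
n, B = x.size, 2000

boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=n, replace=True)  # selection with replacement
    boot_means[b] = resample.mean()

se_boot = boot_means.std(ddof=1)                          # bootstrap estimate of the standard error
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # percentile confidence interval
print(se_boot, ci_low, ci_high)
```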

It is clear that the bootstrap method is a modification of the Monte Carlo method.

If samples are generated without replacement of elements, the well-known jackknife method is obtained.

Question: why do this and when is it reasonable to use the method in real data analysis?

In bootstrapping, we do not obtain new information, but we use the available data wisely, based on the task at hand.

For example, the bootstrap can be used for small samples, for estimating medians and correlations, for constructing confidence intervals, and in other situations.

Efron's original work looked at pairwise correlation estimates for a sample size of n = 15.

B = 1000 bootstrap samples are generated (bootstrap replication).

Based on the obtained coefficients ρ₁*, ..., ρ_B*, an overall estimate of the correlation coefficient and an estimate of its standard deviation are constructed.

The standard error of the sample correlation coefficient, calculated using the normal approximation, is:

where the correlation coefficient is 0.776 and the original sample size is n = 15.

The bootstrap estimate of the standard error is 0.127; see Efron and Gail Gong, 1982.

Theoretical background

Let θ be the target parameter of the study, for example the average income in the society under study.

Using an arbitrary sample of size n, we obtain a data set X₁, ..., Xₙ; let the corresponding sample statistic be θ̂.

For most sample statistics, for large n (> 30) the sampling distribution is a normal curve with center θ and standard deviation σ/√n, where the positive parameter σ depends on the population and on the type of statistic.

This classic result is known as the central limit theorem.

There are often serious technical difficulties in estimating the required standard deviation from data.

For example, this is the case when the statistic is the median or the sample correlation.

The bootstrap method overcomes these difficulties.

The idea is simple: let us denote by θ̂* the same statistic calculated from a bootstrap sample, which is obtained by resampling the original sample X₁, ..., Xₙ.

What can be said about the sampling distribution of θ̂* if the "initial" sample is fixed?

In the limit, this distribution is also bell-shaped, with center θ̂ and standard deviation of the same form σ/√n.

Thus, the bootstrap distribution of θ̂* approximates the sampling distribution of θ̂ well.

Note that when we move from one sample to another, only θ̂ changes in this expression, since it is calculated from the sample X₁, ..., Xₙ.

This is essentially a bootstrap version of the central limit theorem.

It has also been found that if the limiting sampling distribution of a statistical function does not involve unknown population quantities, the bootstrap distribution provides a better approximation of the sampling distribution than the central limit theorem.

In particular, when the statistical function has the form t = (θ̂ − θ)/ŝe, where ŝe denotes the true or sample estimate of the standard error, the limiting sampling distribution is usually standard normal.

This effect is called second-order correction using bootstrapping.

Let θ = μ, i.e. the population mean, and θ̂ = x̄, the sample mean; let σ be the population standard deviation, s the sample standard deviation calculated from the original data, and s* the one calculated from the bootstrap sample.

Then the sampling distribution of the value t = (x̄ − μ)/(s/√n) will be approximated by the bootstrap distribution of t* = (x̄* − x̄)/(s*/√n), where x̄* is the mean of the bootstrap sample.

Similarly, the sampling distribution of (x̄ − μ) will be approximated by the bootstrap distribution of (x̄* − x̄).

The first results on second order correction were published by Babu and Singh in 1981-83.

Bootstrap Applications

Approximation of the standard error of a sample estimate

Let us assume that a parameter θ is associated with the population.

Let θ̂ be an estimate made on the basis of a random sample of size n, i.e. θ̂ is a function of X₁, ..., Xₙ. Since θ̂ varies across the set of all possible samples, the following approach is used to estimate its standard error:

Let us calculate θ̂*, using the same formula that was used for θ̂, but this time on the basis of B different bootstrap samples, each of size n. Roughly speaking, B can be taken equal to the number of possible resamples unless this is very large; in that case it can be reduced to about n·ln n. The standard error is then determined in accordance with the essence of the bootstrap method: the population (sample) is replaced by the empirical population (the sample itself).

Bias correction using the bootstrap method

The mean of the sampling distribution of θ̂ often differs from θ, usually by a quantity of order 1/n for large n; that is, the estimate is biased. The bootstrap approximation of the bias is:

bias* = (1/B) · Σ θ̂*_b − θ̂, where θ̂*_b are the bootstrap replications. Then the adjusted value will be θ̂_adj = θ̂ − bias* = 2θ̂ − (1/B) · Σ θ̂*_b.
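A short sketch of this correction, assuming the standard bootstrap bias estimate; the data and the statistic (the plug-in standard deviation) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([3.1, 4.7, 2.9, 5.6, 4.2, 3.8, 5.1, 2.5])  # hypothetical sample
B = 2000

theta_hat = np.std(x)  # plug-in (biased) estimate of the standard deviation
boot = np.array([np.std(rng.choice(x, size=x.size, replace=True)) for _ in range(B)])

bias = boot.mean() - theta_hat  # bootstrap estimate of the bias
theta_adj = theta_hat - bias    # equivalently 2 * theta_hat - boot.mean()
print(theta_hat, bias, theta_adj)
```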

It is worth noting that the previous resampling method, called the jackknife method, is more popular.

Confidence intervals

Confidence intervals (CIs) for a given parameter are sample-based ranges.

This range has the property that the parameter value belongs to it with a very high (predetermined) probability, called the confidence level. Of course, this probability must hold for any possible sample, because each sample contributes to the determination of the confidence interval. The two most commonly used confidence levels are 95% and 99%; here we will limit ourselves to 95%.

Traditionally, CIs are based on the sampling distribution of the statistic, more precisely on its limiting form. There are two main types of confidence intervals that can be constructed using the bootstrap.

Percentile method

This method has already been mentioned above; it is very popular due to its simplicity and naturalness. Suppose we have 1000 bootstrap replications θ̂*₁, ..., θ̂*₁₀₀₀. Then the 95% confidence interval includes the values between the 2.5th and the 97.5th percentiles of this set. Returning to the theoretical justification of the method, it is worth noting that it requires symmetry of the sampling distribution around θ̂. The reason is that the method approximates the sampling distribution of the deviation by the bootstrap distribution directly, although logically it should be approximated by the value opposite in sign.

Centered bootstrap percentile method

Let us assume that the sampling distribution of (θ̂ − θ) is approximated by the bootstrap distribution of (θ̂* − θ̂), that is, as was originally assumed in bootstrapping. Let us denote the 100·q-th percentile of the bootstrap replications by θ̂*_[q]. Then the assumption that the value (θ̂ − θ) lies in the range from (θ̂*_[0.025] − θ̂) to (θ̂*_[0.975] − θ̂) will be correct with a probability of 95%. The same expression can easily be converted into a range for θ itself: from (2θ̂ − θ̂*_[0.975]) to (2θ̂ − θ̂*_[0.025]). This interval is called the centered confidence interval based on bootstrap percentiles (at the 95% confidence level).
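A sketch contrasting the plain percentile interval with the centered (basic) interval described above, using hypothetical data and the sample median as the statistic:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([8.1, 7.4, 9.3, 6.8, 10.2, 7.9, 8.8, 9.6, 7.2, 8.4])  # hypothetical sample
theta_hat = np.median(x)
B = 2000

boot = np.array([np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)])
q_low, q_high = np.percentile(boot, [2.5, 97.5])

percentile_ci = (q_low, q_high)                                # plain percentile interval
centered_ci = (2 * theta_hat - q_high, 2 * theta_hat - q_low)  # centered (basic) interval
print(percentile_ci, centered_ci)
```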

Bootstrap-t test

As already noted, the bootstrap-t approach uses a function of the form t = (θ̂ − θ)/ŝe, where ŝe is a sample estimate of the standard error.

This gives additional accuracy.

As a basic example, let us take the standard t-statistic (hence the name of the method): t = (x̄ − μ)/(s/√n), that is, the special case when θ = μ (the population mean), θ̂ = x̄ (the sample mean), and s is the sample standard deviation. The bootstrap analogue of this function is t* = (x̄* − x̄)/(s*/√n), where x̄* and s* are calculated in the same way, using only the bootstrap sample.

Let us denote the 100·q-th bootstrap percentile of t* by t*_[q] and assume that the value t lies in the interval from t*_[0.025] to t*_[0.975].

Using the equality t = (x̄ − μ)/(s/√n), the previous statement can be rewritten: μ lies in the interval from x̄ − t*_[0.975]·s/√n to x̄ − t*_[0.025]·s/√n.

This interval is called the bootstrap-t confidence interval for μ at the 95% level.

In the literature it is noted that this approach achieves greater accuracy than the previous one.
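A sketch of the bootstrap-t interval for the mean under these definitions (hypothetical data, B = 2000 replications):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([8.1, 7.4, 9.3, 6.8, 10.2, 7.9, 8.8, 9.6, 7.2, 8.4])  # hypothetical sample
n, B = x.size, 2000
x_bar, s = x.mean(), x.std(ddof=1)

t_star = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)
    t_star[b] = (xb.mean() - x_bar) / (xb.std(ddof=1) / np.sqrt(n))  # studentized replicate

q_low, q_high = np.percentile(t_star, [2.5, 97.5])
ci = (x_bar - q_high * s / np.sqrt(n), x_bar - q_low * s / np.sqrt(n))  # bootstrap-t interval
print(ci)
```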

Example of real data

As a first example, take data from Hollander and Wolfe 1999, page 63, which presents the effect of light on chick hatching rates.

A standard boxplot suggests a lack of normality in the population data. A bootstrap analysis of the median and the mean was performed.

Separately, it is worth noting the lack of symmetry in the bootstrap t-histogram, which differs from the standard limit curve. The 95% confidence intervals for the median and mean (calculated using the bootstrap percentile method) roughly cover the range

This range represents the overall difference (increase) in chick hatching rate results as a function of lighting.

As a second example, consider data from Devore 2003, p. 553, which examined the correlation between biochemical oxygen demand (BOD) and hydrostatic weighing (HW) results of professional football players.

Two-dimensional data consist of pairs, and whole pairs are randomly selected during bootstrap resampling: one pair is drawn first, then another, and so on.

In the figure, the box-whisker plot shows the lack of normality for the underlying populations. Correlation histograms calculated from bootstrap bivariate data are asymmetric (shifted to the left).

For this reason, the centered percentile bootstrap method is more suitable in this case.

The analysis revealed that the correlation between the two measurements is at least 0.78 for the population.

Data for example 1:

8.5 -4.6 -1.8 -0.8 1.9 3.9 4.7 7.1 7.5 8.5 14.8 16.7 17.6 19.7 20.6 21.9 23.8 24.7 24.7 25.0 40.7 46.9 48.3 52.8 54.0

Data for example 2:

2.5 4.0 4.1 6.2 7.1 7.0 8.3 9.2 9.3 12.0 12.2 12.6 14.2 14.4 15.1 15.2 16.3 17.1 17.9 17.9

8.0 6.2 9.2 6.4 8.6 12.2 7.2 12.0 14.9 12.1 15.3 14.8 14.3 16.3 17.9 19.5 17.5 14.3 18.3 16.2

The literature often proposes different bootstrapping schemes that could give reliable results in different statistical situations.

What was discussed above covers only the most basic elements; in fact there are many other variants of the scheme. For example, which method is better to use in the case of two-stage or stratified sampling?

It is not difficult to come up with a natural scheme in this case. Bootstrapping for data described by regression models generally attracts a lot of attention. There are two main methods: in the first, covariates and response variables are resampled together (pairwise bootstrapping); in the second, bootstrapping is performed on the residuals (residual bootstrapping).

The pairwise method remains correct (in terms of its limiting results) even if the error variances in the model are not equal. The second method is incorrect in this case. This disadvantage is compensated by the fact that the residual scheme provides additional accuracy in estimating the standard error.
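A compact sketch of the two regression schemes on synthetic data (a simple straight-line model; np.polyfit is used for the fit, and the bootstrap standard error of the slope is computed both ways):

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 30, 1000
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)  # synthetic data for illustration

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

pair_slopes, resid_slopes = np.empty(B), np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                      # pairwise scheme: resample (x, y) pairs
    pair_slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    y_star = intercept + slope * x + rng.choice(resid, n, replace=True)  # residual scheme
    resid_slopes[b] = np.polyfit(x, y_star, 1)[0]

print(pair_slopes.std(ddof=1), resid_slopes.std(ddof=1))  # two bootstrap SEs of the slope
```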

It is much more difficult to apply bootstrapping to time series data.

Time series analysis, however, is one of the key areas of econometrics. There are two main difficulties here. First, time series data tend to be sequentially dependent: the current observation depends on the previous one, and so on.

Secondly, the statistical population changes over time, that is, non-stationarity appears.

For this purpose, methods have been developed that transfer the dependence present in the source data to the bootstrap samples, in particular the block bootstrap.

Instead of resampling individual observations, the bootstrap sample is assembled from blocks of data, which preserves the dependence present in the original series.
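A minimal sketch of a moving-block bootstrap; the series, the block length, and the use of cumulative sums to create dependence are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
series = rng.normal(size=100).cumsum()  # illustrative serially dependent series
n, block_len = series.size, 5

n_blocks = int(np.ceil(n / block_len))
starts = rng.integers(0, n - block_len + 1, size=n_blocks)  # random starting positions of blocks
blocks = [series[s:s + block_len] for s in starts]
boot_series = np.concatenate(blocks)[:n]  # one block-bootstrap replicate of the series

print(series.mean(), boot_series.mean())
```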

Quite a lot of research is currently being carried out on applying the bootstrap in econometrics; in general, the method is developing actively.

In addition to the strictly random sample with its clear probabilistic justification, there are other samples that are not completely random but are widely used. It should be noted that strict application of purely random selection of units from the general population is not always possible in practice. Such samples include mechanical, typical, serial (or nested), multiphase sampling, and a number of others.

It is rare for a population to be homogeneous; that is the exception rather than the rule. Therefore, if the population contains phenomena of various types, it is often desirable to ensure a more even representation of these types in the sample population. This goal is successfully achieved by typical sampling. The main difficulty is that additional information about the entire population is required, which in some cases is hard to obtain.

A typical sample is also called a stratified (layered) sample; it is also used for more uniform representation of different regions in the sample, in which case the sample is called regionalized.

So, a typical sample is understood as a sample in which the general population is first divided into typical subgroups formed according to one or more essential characteristics (for example, the population is divided into 3-4 subgroups according to average per capita income, or by level of education: primary, secondary, higher, etc.). Then, from all the typical groups, units can be selected for the sample in several ways, forming:

a) a typical sample with uniform placement, where an equal number of units is taken from the different types (strata). This scheme works well if the strata (types) in the population do not differ very much from each other in the number of units;

b) a typical sample with proportional placement, when it is required (in contrast to uniform placement) that the proportion (%) of selection be the same for all strata (for example, 5 or 10%); a sketch of this scheme is given after the list;

c) a typical sample with optimal placement, when the degree of variation of characteristics in different groups of the general population is taken into account. With this placement, the proportion of selection for groups with large variability of the trait increases, which ultimately leads to a decrease in random error.
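A minimal sketch of proportional placement (scheme b), assuming a hypothetical sampling frame in which every unit carries a stratum label and the selection proportion is 10% in each stratum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Hypothetical sampling frame: each unit carries a stratum label (e.g. education level)
frame = pd.DataFrame({
    "unit_id": np.arange(1000),
    "stratum": rng.choice(["primary", "secondary", "higher"], size=1000, p=[0.3, 0.5, 0.2]),
})

fraction = 0.10  # the same selection proportion in every stratum
sample = frame.groupby("stratum").sample(frac=fraction, random_state=0)

print(sample["stratum"].value_counts())  # stratum shares in the sample mirror the population shares
```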

The formula for the average error in a typical selection is similar to the usual sampling error for a purely random sample, with the only difference being that instead of the total variance, the average of the particular within-group variances is entered, which naturally leads to a decrease in error compared to a purely random sample. However, its use is not always possible (for many reasons). If there is no need for great precision, it is easier and cheaper to use serial sampling.

Serial (cluster) sampling consists in selecting for the sample not individual units of the population (for example, students) but whole series, or nests (for example, study groups). In other words, with serial (cluster) sampling the observation unit and the sampling unit do not coincide: certain groups of adjacent units (nests) are selected, and the units included in these nests are then examined. For example, when conducting a sample survey of housing conditions, we can randomly select a certain number of houses (the sampling units) and then find out the living conditions of the families living in these houses (the observation units).

Series (nests) consist of units connected to each other territorially (districts, cities, etc.), organizationally (enterprises, workshops, etc.), or in time (for example, the set of units produced during a given segment of production time).

Serial selection can be organized in the form of single-stage, two-stage or multi-stage selection.

Randomly selected series are subjected to continuous examination. Thus, serial sampling consists of two stages: random selection of series and continuous study of these series. Serial selection provides significant savings in manpower and resources and is therefore often used in practice. The error of serial selection differs from the error of purely random selection in that, instead of the total variance, the interseries (intergroup) variance is used, and instead of the sample size, the number of series is used. The accuracy is usually not very high, but in some cases it is acceptable. A serial sample can be repeated or non-repetitive, and series can be of equal or unequal size.

Serial sampling can be organized according to different schemes. For example, the sample population can be formed in two stages: first, the series to be surveyed are selected in random order; then, from each selected series, a certain number of units are also selected in random order for direct observation (measurement, weighing, etc.). The error of such a sample will depend on the error of serial selection and on the error of individual selection. Multi-stage selection, as a rule, gives less accurate results than single-stage selection, which is explained by the occurrence of representativeness errors at each sampling stage. In this case, the sampling error formula for combined sampling must be used.

Another form of selection is multiphase selection (1, 2, 3 phases or stages). This selection differs in structure from multi-stage selection, since with multiphase selection the same selection units are used in each phase. Errors in multiphase sampling are calculated for each phase separately. The main feature of a two-phase sample is that the samples differ from each other according to three criteria: 1) the proportion of units studied in the first phase of the sample that are again included in the second and subsequent phases; 2) whether equal chances are maintained for each sample unit of the first phase to again become an object of study; 3) the size of the interval separating the phases from each other.

Let us dwell on one more type of selection, namely mechanical (or systematic) selection. This selection is probably the most common, apparently because, of all the selection techniques, it is the simplest. In particular, it is much simpler than random selection, which requires the ability to use tables of random numbers, and it does not require additional information about the population and its structure. In addition, mechanical selection is closely intertwined with proportional stratified selection, which leads to a reduction in sampling error.

For example, mechanical selection of members of a housing cooperative from a list compiled in the order of admission to the cooperative will ensure proportional representation of members with different lengths of membership. Using the same technique to select respondents from an alphabetical list of individuals ensures equal chances for names starting with different letters, and so on. The use of time sheets or other lists at enterprises, educational institutions, etc., can ensure the necessary proportionality in the representation of workers with different lengths of service. Note that mechanical selection is widely used in sociology, in the study of public opinion, etc.

In order to reduce the magnitude of the error and especially the costs of conducting a sampling study, various combinations of individual types of selection (mechanical, serial, individual, multiphase, etc.) are widely used. In such cases, more complex sampling errors should be calculated, which consist of errors that occur at different stages of the study.

A small sample is a collection of fewer than 30 units. Small samples occur quite often in practice, for example when studying rare diseases or units possessing a rare trait; in addition, a small sample is resorted to when the research is expensive or involves the destruction of the products or specimens examined. Small samples are widely used in product quality surveys. The theoretical basis for determining the errors of a small sample was laid down by the English scientist W. Gosset (pseudonym Student).

It must be remembered that when determining the error of a small sample, instead of the sample size n one should take the value (n - 1), or, before determining the average sampling error, calculate the so-called corrected sample variance (with (n - 1) instead of n in the denominator). Note that this correction is made only once: either when calculating the sample variance or when determining the error. The magnitude (n - 1) is called the number of degrees of freedom. In addition, the normal distribution is replaced by the t-distribution (Student distribution), which is tabulated and depends on the number of degrees of freedom. The only parameter of the Student distribution is the value (n - 1). Let us emphasize once again that the correction (n - 1) is important and significant only for small sample populations; for n > 30 and above the difference approaches zero and disappears.

So far we have been talking about random samples, i.e. those in which the selection of units from the population is random (or almost random) and all units have an equal (or almost equal) probability of being included in the sample. However, selection can also be based on the principle of non-random selection, when accessibility and purposefulness come to the forefront. In such cases it is impossible to speak of the representativeness of the resulting sample, and errors of representativeness can be calculated only if information about the general population is available.

There are several known schemes for forming a non-random sample; they have become widespread and are used mainly in sociological research: selection of available observation units, selection using the Nuremberg method, targeted sampling when identifying experts, etc. Quota sampling is also important; it is formed by the researcher according to a small number of significant parameters and gives a very close match with the general population. In other words, quota selection should provide the researcher with an almost complete coincidence of the sample and the general population with respect to the parameters he has chosen. Such purposeful closeness of the two populations in a limited range of indicators is achieved, as a rule, with a sample of significantly smaller size than with random selection. It is this circumstance that makes quota selection attractive for a researcher who cannot rely on a large self-weighting random sample. It should be added that the reduction in sample size is most often combined with a reduction in monetary costs and research time, which increases the advantages of this selection method. Note also that quota sampling requires quite substantial preliminary information about the structure of the population. The main advantage here is that the sample size is significantly smaller than with random sampling. The selected characteristics (most often socio-demographic: gender, age, education) should correlate closely with the studied characteristics of the general population, i.e. with the object of research.

As already indicated, the sampling method makes it possible to obtain information about the general population with much less money, time and effort than with continuous observation. It is also clear that a complete study of the entire population is impossible in some cases, for example, when checking the quality of products, samples of which are destroyed.

At the same time, however, it should be pointed out that the population is not a completely "black box" and we still have some information about it. Conducting, for example, a sample study concerning the life, everyday conditions, property status, income and expenses of students, their opinions, interests, etc., we still have information about their total number and their grouping by gender, age, marital status, place of residence, course of study and other characteristics. This information is always used in sample research.

There are several ways of extending sample characteristics to the general population: the method of direct recalculation and the method of correction factors. Recalculation of sample characteristics is carried out, as a rule, taking into account confidence intervals and can be expressed in absolute and relative values.

It is quite appropriate to emphasize here that most of the statistical information relating to the economic life of society in its most diverse manifestations and types is based on sample data. Of course, they are supplemented by complete registration data and information obtained as a result of censuses (of population, enterprises, etc.). For example, all budget statistics (on income and expenses of the population) provided by Rosstat are based on data from a sample study. Information on prices, production volumes, and trade volumes, expressed in the corresponding indices, is also largely based on sample data.

Statistical hypotheses and statistical tests. Basic Concepts

The concepts of statistical test and statistical hypothesis are closely related to sampling. A statistical hypothesis (as opposed to other scientific hypotheses) is an assumption about some property of the population that can be tested using data from a random sample. It should be remembered that the result obtained is probabilistic in nature. Consequently, a result confirming the validity of the hypothesis put forward can almost never serve as a basis for its final acceptance, whereas a result inconsistent with it is quite sufficient to reject the hypothesis as erroneous or false. This is so because the result obtained can be consistent with other hypotheses, and not just with the one put forward.

A statistical criterion (test) is understood as a set of rules that answer the question of under which observation results the hypothesis is rejected and under which it is not. In other words, a statistical criterion is a kind of decision rule that ensures, with a high degree of probability, the acceptance of a true (correct) hypothesis and the rejection of a false one. Statistical tests can be one-sided or two-sided, parametric or non-parametric, more or less powerful. Some criteria are used frequently, others less so; some are intended for special problems, while others can be applied to a wide class of problems. These criteria have become widespread in sociology, economics, psychology, the natural sciences, etc.

Let us introduce some basic concepts of statistical hypothesis testing. Hypothesis testing begins with a null hypothesis H₀, i.e. some assumption of the researcher, and a competing, alternative hypothesis H₁, which contradicts the main one. For example: H₀: x̄ = A, H₁: x̄ ≠ A, or H₀: x̄ = A, H₁: x̄ > A (where A is the general average).

The main goal of the researcher when testing a hypothesis is to reject the hypothesis he puts forward. As R. Fisher wrote, the purpose of testing any hypothesis is to reject it. Hypothesis testing is based on contradiction. Therefore, if we believe, for example, that the average wage of workers obtained from a particular sample, equal to 186 monetary units per month, does not coincide with the actual wage for the entire population, then the null hypothesis put forward is that these wages are equal.

The competing hypothesis H₁ can be formulated in different ways:

H₁: x̄ ≠ A, H₁: x̄ > A, H₁: x̄ < A.

Next, the Type I error (α) is set: the probability that a true hypothesis will be rejected. Obviously, this probability should be small (usually from 0.01 to 0.1, most often 0.05 by default, the so-called 5% significance level). These levels arise from the sampling method, according to which a twofold or threefold error marks the limits beyond which random variation in sample characteristics most often does not extend. The Type II error (β) is the probability that an incorrect hypothesis will be accepted. As a rule, a Type I error is the more "dangerous" one, and it is precisely this error that the statistician fixes in advance. If at the beginning of the study we want to fix α and β simultaneously (for example, α = 0.05, β = 0.1), then we must first calculate the required sample size.

The critical zone (or region) is the set of values of the criterion for which H₀ is rejected. The critical point T_cr is the point separating the region of acceptance of the hypothesis from the region of rejection, i.e. the critical zone.

As already mentioned, a Type I error (α) is the probability of rejecting a correct hypothesis. The smaller α is, the less likely it is that a Type I error will be made. But at the same time, when α decreases (for example, from 0.05 to 0.01), it becomes more difficult to reject the null hypothesis, which is in fact what the researcher sets out to do. Let us emphasize again that excessive reduction of α will in practice result in all hypotheses, true and false, falling within the region of acceptance of the null hypothesis and will make it impossible to distinguish between them.

A Type II error (β) occurs when H₀ is accepted but in fact the alternative hypothesis H₁ is true. The value γ = 1 − β is called the power of the criterion. The Type II error (i.e., incorrectly accepting a false hypothesis) decreases with increasing sample size and with an increasing significance level. It follows that it is impossible to reduce α and β simultaneously at a fixed sample size; this can only be achieved by increasing the sample size (which is not always possible).

Most often, hypothesis-testing tasks come down to comparing two sample means or proportions; comparing the general mean (or share) with the sample one; comparing empirical and theoretical distributions (goodness-of-fit criteria); comparing two sample variances (the χ²-criterion); comparing two sample correlation or regression coefficients; and some other comparisons.

The decision to accept or reject the null hypothesis consists of comparing the actual value of the criterion with the tabulated (theoretical) value. If the actual value is less than the tabulated value, then it is concluded that the discrepancy is random and insignificant and the null hypothesis cannot be rejected. The opposite situation (the actual value is greater than the tabulated value) leads to the rejection of the null hypothesis.
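A minimal sketch of this decision rule for comparing a sample mean with a hypothesized general average (the data are hypothetical; the tabulated value is taken from the Student distribution at the 5% significance level):

```python
import numpy as np
from scipy import stats

x = np.array([182, 191, 179, 188, 195, 184, 190, 187])  # hypothetical sample of wages
A = 186                                                 # hypothesized general average

n = x.size
t_actual = (x.mean() - A) / (x.std(ddof=1) / np.sqrt(n))  # actual value of the criterion
t_table = stats.t.ppf(0.975, df=n - 1)                    # tabulated (critical) value, alpha = 0.05

if abs(t_actual) < t_table:
    print("The discrepancy is random and insignificant; H0 cannot be rejected.")
else:
    print("H0 is rejected.")
```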

When testing statistical hypotheses, tables of the normal distribution, the χ² distribution (read: chi-square), the t-distribution (Student distribution), and the F-distribution (Fisher distribution) are used.
