Empirical distribution function and its properties. Variation series. Polygon and histogram

Let X_1, X_2, ..., X_n be a sample of size n from a population with distribution function F(x). If the sample data are arranged in non-decreasing order, the resulting sequence X_(1), X_(2), ..., X_(n) is called the variation series.

Example 1. If a sample of size 4 is 4, -2, 3, 1, then the variation series looks like this: -2, 1, 3, 4.

Definition 1. The empirical distribution function F_n(x) is the distribution function of a discrete random variable that takes each of the sample values X_1, X_2, ..., X_n with probability 1/n.

As shown in 2.2.1, the distribution function of such a discrete random variable has the form

F_n(x) = v/n,

where v is the number of sample values X_i that are less than x.

As can be seen from the graph, the function F_n(x) is a step function with discontinuities at the points X_(i); the jump equals 1/n if the value X_i does not coincide with any other sample value. If k values X_(i) coincide, the jump at that point equals k/n.
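As a minimal sketch, the definition F_n(x) = v/n can be computed directly (the function name is ours; the sample is the one from Example 1):

```python
# Empirical distribution function F_n(x) = v/n, where v counts sample
# values strictly less than x.

def empirical_cdf(sample, x):
    """Return F_n(x) for the given sample."""
    n = len(sample)
    v = sum(1 for value in sample if value < x)
    return v / n

sample = [4, -2, 3, 1]            # sample from Example 1
print(empirical_cdf(sample, 1))    # 0.25: only -2 lies below 1
print(empirical_cdf(sample, 3.5))  # 0.75: -2, 1, 3 lie below 3.5
```

At a point where k sample values coincide, the function jumps by k/n, exactly as described above.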

The limiting behavior of F_n(x) as n → ∞ is of interest.

Theorem 1. Let X_1, X_2, ..., X_n be a sample of size n from a population with distribution function F(x). Then, as n → ∞, for any x,

F_n(x) → F(x) in probability,

or, in other words, for any ε > 0,

P(|F_n(x) - F(x)| ≥ ε) → 0 as n → ∞.

Proof. Let

ξ_i = 1 if X_i < x, and ξ_i = 0 otherwise, i = 1, 2, ..., n,

be discrete random variables such that P(ξ_i = 0) = q = 1 - F(x) and P(ξ_i = 1) = p = F(x). It is easy to see that Mξ_i = F(x). Then, by the law of large numbers (see 2.7.2), for the empirical distribution function F_n(x) = (1/n) Σ_{i=1}^n ξ_i we get, as n → ∞,

F_n(x) → F(x) in probability.

Before formulating another theorem, we present the following definition.

Definition 2. A sequence of random variables ξ_1, ξ_2, ..., ξ_n, ... converges to a random variable ξ with probability 1 (or almost surely) if

P( lim_{n→∞} ξ_n = ξ ) = 1.

Now let us formulate (without proof; it can be found in the literature) the following theorem.

Theorem 2 (Glivenko - Cantelli). Under the conditions of the previous theorem,

sup_x |F_n(x) - F(x)| → 0 as n → ∞ with probability 1.

These results show that for large n the empirical distribution function gives a good approximation to the theoretical distribution function F(x).

In practice, samples of size n from a population with a continuous distribution F(x) are often grouped. In this case, it is not the individual sample values that are recorded, but the numbers of sample values falling into the intervals of some partition of the set of possible values of the random variable. As a rule, the intervals are taken of the same length, say h. If n_i denotes the number of sample values falling into the i-th interval, then over this interval, taken as a base, a rectangle of height n_i/(nh) is erected. The resulting figure is called a sample histogram. The area of each histogram rectangle equals the relative frequency n_i/n of the corresponding group. For large n this area is approximately equal to the probability of falling into the corresponding interval, i.e. approximately equal to the integral of the distribution density p(t) over this interval. Thus, the upper contour of the histogram gives a good approximation to the distribution density.
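A minimal sketch of the height calculation n_i/(nh) (the function name and the sample are illustrative, not from the text):

```python
# Normalized histogram heights z_i = n_i / (n*h); the total area of all
# rectangles then equals 1, so the upper contour approximates a density.

def histogram_heights(sample, num_bins):
    lo, hi = min(sample), max(sample)
    h = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in sample:
        i = min(int((v - lo) / h), num_bins - 1)  # put the maximum in the last bin
        counts[i] += 1
    n = len(sample)
    heights = [c / (n * h) for c in counts]
    return h, counts, heights

h, counts, heights = histogram_heights([0.1, 0.2, 0.4, 0.5, 0.8, 0.9], 3)
print(counts)                           # [2, 2, 2]
print(sum(z * h for z in heights))      # total area: 1.0
```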

Example 2. The sensitivity of the first channel of n = 40 television sets was tested. The test data are shown in the following table, where the first row gives the sensitivity intervals in microvolts and the second the number of sets whose sensitivity fell in that interval:

Here the interval length is h = 50. Let us build the histogram.

Methods of processing experimental data (ED) are based on the basic concepts of probability theory and mathematical statistics. These include the concepts of the general population, the sample, and the empirical distribution function.

The general population is understood as the set of all parameter values that could be recorded during unlimited observation of an object. Such a set consists of an infinite number of elements. Observation of an object yields a set of parameter values x_1, x_2, ..., x_n limited in size. From a formal point of view, such data represent a sample from the general population.

We will assume that the sample contains complete observations of the system's events (there is no censoring). The observed values x_i are called variants, and their number is the sample size n. For any conclusions to be drawn from the observation results, the sample must be representative, i.e. correctly reflect the proportions of the general population. This requirement is met if the sample size is large enough and each element of the population has the same probability of being included in the sample.

Let the value x_1 be observed n_1 times in the resulting sample, the value x_2 be observed n_2 times, ..., and the value x_k be observed n_k times, where n_1 + n_2 + ... + n_k = n.

The set of values written in ascending order is called the variation series, the quantities n_i are called frequencies, and their ratios to the sample size, w_i = n_i/n, relative frequencies. Obviously, the sum of the relative frequencies equals one.

A distribution is the correspondence between the observed variants and their frequencies or relative frequencies. Let n_x be the number of observations in which the random value of the parameter X is less than x. The frequency of the event X < x equals n_x/n. This ratio is a function of x and of the sample size: F_n(x) = n_x/n. The quantity F_n(x) has all the properties of a distribution function: F_n(x) is a non-decreasing function, and its values belong to the segment [0, 1];

if x_1 is the smallest value of the parameter and x_k the largest, then F_n(x) = 0 for x < x_1 and F_n(x) = 1 for x >= x_k.

The function F_n(x) is determined from the ED, which is why it is called the empirical distribution function. Unlike the empirical function F_n(x), the distribution function F(x) of the population is called the theoretical distribution function; it characterizes not the frequency but the probability of the event X < x. From Bernoulli's theorem it follows that the frequency F_n(x) tends in probability to the probability F(x) as n increases without bound. Consequently, for a large number of observations the theoretical distribution function F(x) can be replaced by the empirical function F_n(x).

The graph of the empirical function F_n(x) is a step line. Between adjacent members of the variation series F_n(x) remains constant. When passing through a point on the x axis equal to a sample member, F_n(x) undergoes a jump of 1/n, or of l/n if l observations coincide at that point.

Example 2.1. Construct the variation series and the graph of the empirical distribution function based on the observation results in Table 2.1.

Table 2.1

The desired empirical function is shown in Fig. 2.1.

Fig. 2.1. Empirical distribution function

With a large sample size (the notion of "large" depends on the goals and processing methods; here we will consider n large if n > 40), for convenience of processing and storing information the ED are grouped into intervals. The number of intervals should be chosen so that the variety of parameter values is reflected to the required extent, while the distribution pattern is not distorted by random frequency fluctuations in individual categories. There are loose guidelines for choosing the number y and the size h of such intervals, in particular:

each interval must contain at least 5 to 7 elements; in the extreme intervals, just two elements are allowed;

the number of intervals should be neither very large nor very small. The minimum value of y must be at least 6 or 7. With a sample size not exceeding a few hundred elements, y is set in the range from 10 to 20. For a very large sample size (n > 1000) the number of intervals may exceed these values. Some researchers recommend the relation y = 1.441*ln(n) + 1;

when the unevenness of the distribution is relatively small, it is convenient to choose the intervals identical and equal in length to

h = (x_max - x_min)/y,

where x_max is the maximum and x_min the minimum value of the parameter. If the distribution law is significantly uneven, the intervals can be made shorter in the region of rapid change of the distribution density;

if there is significant unevenness, it is better to assign approximately the same number of sample elements to each category. Then the length of a particular interval is determined by the extreme sample values grouped into it, i.e. it will differ between intervals (in this case, when constructing a histogram, normalization by interval length is required; otherwise the height of every histogram element would be the same).
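The guidelines above can be sketched as follows (the function name is ours, and rounding the recommended y to the nearest integer is an assumption):

```python
import math

# Rule-of-thumb partition: y = 1.441*ln(n) + 1 intervals (from the text),
# rounded to an integer and kept at least 6; common interval length
# h = (x_max - x_min)/y.

def choose_partition(sample):
    n = len(sample)
    y = max(6, round(1.441 * math.log(n) + 1))
    h = (max(sample) - min(sample)) / y
    return y, h

print(round(1.441 * math.log(44) + 1))       # 6, as used in Example 2.2 below
y, h = choose_partition(list(range(100)))
print(y)                                     # 8 for n = 100
```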

Grouping the observation results by intervals involves: determining the range of variation of the parameter X; choosing the number of intervals and their size; counting, for each i-th interval [x_i, x_{i+1}), the frequency n_i or the relative frequency w_i of variants falling into the interval. As a result, the ED are represented as an interval, or statistical, series.

Graphically, a statistical series is displayed as a histogram, a polygon, or a step line. The histogram is often drawn as a figure of rectangles whose bases are intervals of length h and whose heights equal the corresponding frequency. This approach, however, is inaccurate: the height z_i of the i-th rectangle should be chosen equal to n_i/(nh). Such a histogram can be interpreted as a graphical representation of the empirical distribution density f_n(x); the total area of all its rectangles equals one. The histogram helps to select the type of theoretical distribution function for approximating the ED.

A polygon is a broken line whose segments connect the points whose abscissas are the midpoints of the intervals and whose ordinates are the corresponding frequencies. The empirical distribution function is displayed as a step line: over each interval a horizontal segment is drawn at a height proportional to the accumulated frequency. The accumulated frequency equals the sum of all frequencies from the first interval up to and including the given one.

Example 2.2. We have the recorded values x_i of signal attenuation at a frequency of 1000 Hz for a switched telephone-network channel. The values, measured in dB, are presented as a variation series in Table 2.3. It is necessary to construct the statistical series.

Table 2.3

i      1      2      3      4      5      6      7      8      9      10     11
x_i    25.79  25.98  25.98  26.12  26.13  26.49  26.52  26.60  26.66  26.69  26.74

i      12     13     14     15     16     17     18     19     20     21     22
x_i    26.85  26.90  26.91  26.96  27.02  27.11  27.19  27.21  27.28  27.30  27.38

i      23     24     25     26     27     28     29     30     31     32     33
x_i    27.40  27.49  27.64  27.66  27.71  27.78  27.89  27.89  28.01  28.10  28.11

i      34     35     36     37     38     39     40     41     42     43     44
x_i    28.37  28.38  28.50  28.63  28.67  28.90  28.99  28.99  29.03  29.12  29.28

Solution. The number of intervals of the statistical series should be chosen as small as possible while ensuring a sufficient number of hits in each; let us take y = 6. Let us determine the interval size:

h = (x_max - x_min)/y = (29.28 - 25.79)/6 ≈ 0.58.

Let us group the observations by interval (Table 2.4).

Table 2.4

i            1      2      3      4      5      6
x_i      25.79  26.37  26.95  27.53  28.12  28.70
n_i          5      9     10      9      5      6
w_i=n_i/n 0.114  0.205  0.227  0.205  0.114  0.136
z_i=w_i/h 0.196  0.353  0.392  0.353  0.196  0.235

Here x_i is the left endpoint of the i-th interval.
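The grouping of Table 2.4 can be reproduced with a short sketch (variable names are ours; the data are the 44 values of Table 2.3):

```python
# Grouping the variation series of Table 2.3 into y = 6 intervals of
# length h = (x_max - x_min)/y, as in Example 2.2.

data = [
    25.79, 25.98, 25.98, 26.12, 26.13, 26.49, 26.52, 26.60, 26.66, 26.69, 26.74,
    26.85, 26.90, 26.91, 26.96, 27.02, 27.11, 27.19, 27.21, 27.28, 27.30, 27.38,
    27.40, 27.49, 27.64, 27.66, 27.71, 27.78, 27.89, 27.89, 28.01, 28.10, 28.11,
    28.37, 28.38, 28.50, 28.63, 28.67, 28.90, 28.99, 28.99, 29.03, 29.12, 29.28,
]
n, y = len(data), 6
lo, hi = min(data), max(data)
h = (hi - lo) / y
counts = [0] * y
for v in data:
    counts[min(int((v - lo) / h), y - 1)] += 1  # the maximum goes to the last bin
freqs = [c / n for c in counts]                 # relative frequencies w_i = n_i/n
heights = [w / h for w in freqs]                # histogram heights z_i = w_i/h
print(counts)                                   # [5, 9, 10, 9, 5, 6]
print([round(w, 3) for w in freqs])             # [0.114, 0.205, 0.227, 0.205, 0.114, 0.136]
```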

Based on the statistical series, we will construct a histogram, Fig. 2.2, and the graph of the empirical distribution function, Fig. 2.3.

The graph of the empirical distribution function in Fig. 2.3 differs from the one in Fig. 2.1 in the step of the variants and in the size of the function's increment step: when built from the variation series, the increment step is a multiple of 1/n, while for the statistical series it depends on the frequency within a particular interval.

The considered ED representations are the initial ones for subsequent processing and calculation of various parameters.

As is known, the distribution law of a random variable can be specified in various ways. A discrete random variable can be specified by a distribution series or by the integral function, and a continuous random variable by either the integral or the differential function. Let us consider sample analogues of these two functions.

Let there be a sample of size n of values of some random variable X, with each variant associated with its frequency. Let x be some real number and n_x the number of sample values of the random variable X that are less than x. Then the number n_x/n is the frequency of the sample values less than x, i.e. the frequency of occurrence of the event X < x. As x changes, the value n_x/n in general changes as well. This means that the relative frequency n_x/n is a function of the argument x. Since this function is found from sample data obtained as a result of experiments, it is called a sample, or empirical, function.

Definition 10.15. The empirical distribution function (sampling distribution function) is the function F*(x) defining for each value x the relative frequency of the event X < x:

F*(x) = n_x/n.    (10.19)

In contrast to the empirical sampling distribution function, the distribution function F(x) of the general population is called the theoretical distribution function. The difference between them is that the theoretical function F(x) determines the probability of the event X < x, while the empirical one determines the relative frequency of the same event. From Bernoulli's theorem it follows that

P(|F*(x) - F(x)| < ε) → 1 as n → ∞, for any ε > 0,    (10.20)

i.e. for large n the probability F(x) and the relative frequency F*(x) differ little from one another. From this it follows that it is advisable to use the empirical distribution function of the sample to approximate the theoretical (integral) distribution function of the general population.

The functions F*(x) and F(x) have the same properties. This follows from the definition of the function.

Properties of F*(x):

1) the values of F*(x) belong to the segment [0, 1];

2) F*(x) is a non-decreasing function;

3) if x_1 is the smallest variant and x_k the largest, then F*(x) = 0 for x <= x_1 and F*(x) = 1 for x > x_k.
Example 10.4. Construct an empirical function based on the given sample distribution:

Variants x_i:    2   6   10
Frequencies n_i: 12  18  30

Solution: Let us find the sample size: n = 12 + 18 + 30 = 60. The smallest variant is 2, hence F*(x) = 0 for x <= 2. The value x_1 = 2, i.e. the only value less than 6, was observed 12 times; therefore F*(x) = 12/60 = 0.2 for 2 < x <= 6. The values less than 10, namely x_1 = 2 and x_2 = 6, were observed 12 + 18 = 30 times; therefore F*(x) = 30/60 = 0.5 for 6 < x <= 10. For x > 10 we have F*(x) = 1.

The required empirical distribution function:

F*(x) = 0 for x <= 2; 0.2 for 2 < x <= 6; 0.5 for 6 < x <= 10; 1 for x > 10.

The graph of F*(x) is shown in Fig. 10.2.

Fig. 10.2
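A short sketch of this computation, assuming the variants are 2, 6, 10 with frequencies 12, 18, 30 (the frequencies appear in the text; the variant values 2 and 6 are our assumption):

```python
# Empirical distribution function built from a frequency table:
# F*(x) = (sum of frequencies of variants less than x) / n.

def ecdf_from_table(variants, freqs):
    n = sum(freqs)
    def F(x):
        return sum(f for v, f in zip(variants, freqs) if v < x) / n
    return F

F = ecdf_from_table([2, 6, 10], [12, 18, 30])
print(F(2), F(4), F(8), F(11))   # 0.0 0.2 0.5 1.0
```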

Control questions

1. What main problems does mathematical statistics solve?
2. What are the general population and the sample population?
3. Define sample size.
4. What samples are called representative?
5. What are errors of representativeness?
6. What are the basic methods of sampling?
7. Explain the concepts of frequency and relative frequency.
8. What is a statistical series?
9. Write down the Sturges formula.
10. Formulate the concepts of sample range, median and mode.
11. What are a frequency polygon and a histogram?
12. What is a point estimate of a sample population?
13. What are biased and unbiased point estimates?
14. Formulate the concept of the sample mean.
15. Formulate the concept of the sample variance.
16. Formulate the concept of the sample standard deviation.
17. Formulate the concept of the sample coefficient of variation.
18. Formulate the concept of the sample geometric mean.

Sample average.

Let a sample of size n be extracted to study the general population regarding a quantitative characteristic X.

The sample mean is the arithmetic mean of a characteristic in a sample population.

Sample variance.

To characterize the scatter of a quantitative characteristic's sample values around its average value, a summary characteristic is introduced: the sample variance.

The sample variance is the arithmetic mean of the squared deviations of the observed values of the characteristic from their mean value.

If all values of the sample characteristic are different, then

D_v = (1/n) Σ_{i=1}^n (x_i - x̄)².

Corrected variance.

The sample variance is a biased estimate of the population variance, i.e. the mathematical expectation of the sample variance is not equal to the estimated population variance D but is equal to

M(D_v) = ((n - 1)/n) D.

To correct the sample variance, it suffices to multiply it by the fraction n/(n - 1):

s² = (n/(n - 1)) D_v.
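A sketch of both estimates (the data are illustrative):

```python
# Biased sample variance D_v and corrected variance s^2 = n/(n-1) * D_v.

def sample_variances(xs):
    n = len(xs)
    mean = sum(xs) / n
    d_v = sum((x - mean) ** 2 for x in xs) / n   # biased estimate
    s2 = d_v * n / (n - 1)                        # corrected (unbiased) estimate
    return d_v, s2

d_v, s2 = sample_variances([1.0, 2.0, 3.0, 4.0])
print(d_v, s2)   # 1.25 and ≈ 1.667
```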

The sample correlation coefficient is found by the formula

r_v = Σ (x_i - x̄)(y_i - ȳ) / (n s_x s_y),

where s_x and s_y are the sample standard deviations of the values X and Y.

The sample correlation coefficient shows the closeness of the linear relationship between X and Y: the closer |r_v| is to unity, the stronger the linear relationship between X and Y.
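A sketch of the formula (function names are ours; the biased standard deviations match the normalization by n):

```python
import math

# Sample correlation coefficient r = sum((x_i - x̄)(y_i - ȳ)) / (n * s_x * s_y),
# with s_x, s_y the (biased, /n) sample standard deviations.

def sample_corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy)

# A perfect linear relation y = 2x + 1 gives r ≈ 1
print(sample_corr([1, 2, 3, 4], [3, 5, 7, 9]))
```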

23. A frequency polygon is a broken line whose segments connect the points (x_i, n_i). To construct a frequency polygon, the variants are plotted on the abscissa axis and the corresponding frequencies on the ordinate axis, and the points are connected by line segments.

The relative frequency polygon is constructed in a similar way, except that relative frequencies are plotted on the ordinate axis.

A frequency histogram is a step figure consisting of rectangles whose bases are partial intervals of length h and whose heights are equal to the ratio n_i/h. To construct a frequency histogram, the partial intervals are laid out on the abscissa axis, and segments parallel to it are drawn above them at height n_i/h. The area of the i-th rectangle equals n_i, the sum of the frequencies of the i-th interval; therefore the area of the frequency histogram equals the sum of all frequencies, i.e. the sample size.

The empirical distribution function is

F*(x) = n_x/n,

where n_x is the number of sample values less than x and n is the sample size.

22. Let us define the basic concepts of mathematical statistics: population and sample; variation series and statistical series; grouped sample and grouped statistical series; frequency polygon; sample distribution function and histogram.

Population– the entire set of available objects.

Sample– a set of objects randomly selected from the general population.

A sequence of variants written in ascending order is called a variation series, and a list of the variants and their corresponding frequencies or relative frequencies is called a statistical series.

A frequency polygon is a broken line whose segments connect the points (x_i, n_i).

A frequency histogram is a step figure consisting of rectangles whose bases are partial intervals of length h and whose heights are equal to the ratio n_i/h.

The sample (empirical) distribution function is the function F*(x) defining for each value x the relative frequency of the event X < x.

If a continuous characteristic is being studied, the variation series can consist of a very large number of values. In this case it is more convenient to use a grouped sample. To obtain it, the interval containing all observed values of the characteristic is divided into several equal partial intervals of length h, and then for each partial interval the sum n_i of the frequencies of the variants falling into the i-th interval is found.

20. The law of large numbers should not be understood as any one general law associated with large numbers. The law of large numbers is a generalized name for several theorems, from which it follows that with an unlimited increase in the number of trials, average values ​​tend to certain constants.

These include the theorems of Chebyshev and Bernoulli. Chebyshev's theorem is the most general law of large numbers.

The proof of the theorems united by the term "law of large numbers" is based on Chebyshev's inequality, which bounds the probability of a random variable's deviation from its mathematical expectation:

P(|X - M(X)| >= ε) <= D(X)/ε².

19. The Pearson (chi-square) distribution is the distribution of the random variable

χ² = X_1² + X_2² + ... + X_n²,

where the random variables X_1, X_2, ..., X_n are independent and have the same distribution N(0, 1). The number of terms, i.e. n, is called the "number of degrees of freedom" of the chi-square distribution.

The chi-square distribution is used when estimating the variance (via a confidence interval) and when testing hypotheses of goodness of fit, homogeneity, and independence.

The Student t distribution is the distribution of the random variable

t = U / √(X/n),

where the random variables U and X are independent, U has the standard normal distribution N(0, 1), and X has a chi-square distribution with n degrees of freedom. Here n is called the "number of degrees of freedom" of the Student distribution.

It is used when estimating the mathematical expectation, a forecast value, and other characteristics via confidence intervals, and when testing hypotheses about the values of mathematical expectations, regression coefficients, and other quantities.

The Fisher distribution is the distribution of the random variable

F = (X_1/k_1) / (X_2/k_2),

where X_1 and X_2 are independent chi-square random variables with k_1 and k_2 degrees of freedom, respectively.

The Fisher distribution is used when testing hypotheses about the adequacy of a model in regression analysis, about equality of variances, and in other problems of applied statistics.

18. Linear regression is a statistical tool used to predict future prices from past data; it is usually applied to determine when prices are overheated. The least squares method is used to fit a "best fit" straight line through a series of price points. The price points used as input can be any of the following: open, close, high, or low.

17. A two-dimensional random variable is an ordered pair of two random variables (X, Y).

Example: two dice are tossed; X and Y are the numbers of points rolled on the first and second die, respectively.

A universal method of specifying the distribution law of a two-dimensional random variable is the distribution function.

15. The mathematical expectation of a discrete random variable is M(X) = Σ x_i p_i.

Properties:

1) M(C) = C, where C is a constant;

2) M(CX) = C·M(X);

3) M(X_1 + X_2) = M(X_1) + M(X_2);

4) M(X_1·X_2) = M(X_1)·M(X_2), where X_1, X_2 are independent random variables.

The mathematical expectation of the sum of random variables is equal to the sum of their mathematical expectations, i.e. M(X + Y) = M(X) + M(Y).

The mathematical expectation of the difference between random variables is equal to the difference of their mathematical expectations, i.e.

The mathematical expectation of the product of independent random variables is equal to the product of their mathematical expectations, i.e. M(XY) = M(X)·M(Y).

If all values ​​of a random variable are increased (decreased) by the same number C, then its mathematical expectation will increase (decrease) by the same number
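A quick numerical check of properties 3) and 4) for two independent discrete random variables (the distribution tables are illustrative):

```python
# Checking M(X1 + X2) = M(X1) + M(X2) and, for independent variables,
# M(X1 * X2) = M(X1) * M(X2), by enumerating the joint distribution
# P(x1, x2) = P(x1) * P(x2).

X1 = {0: 0.3, 1: 0.7}   # value -> probability
X2 = {1: 0.5, 2: 0.5}

def m(dist):
    return sum(v * p for v, p in dist.items())

m_sum = sum((v1 + v2) * p1 * p2 for v1, p1 in X1.items() for v2, p2 in X2.items())
m_prod = sum(v1 * v2 * p1 * p2 for v1, p1 in X1.items() for v2, p2 in X2.items())

print(m_sum, m(X1) + m(X2))    # both ≈ 2.2
print(m_prod, m(X1) * m(X2))   # both ≈ 1.05
```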

14. Exponential distribution law. A random variable X has an exponential distribution law with parameter λ > 0 if its probability density has the form

f(x) = λe^(-λx) for x >= 0, f(x) = 0 for x < 0.

Expected value: M(X) = 1/λ.

Variance: D(X) = 1/λ².

The exponential distribution law plays a large role in queuing theory and reliability theory.

13. The normal distribution law is characterized by a failure frequency a(t), or failure probability density f(t), of the form

f(t) = (1/(σ√(2π))) exp(-(t - m_x)²/(2σ²)),    (5.36)

where σ is the standard deviation of the random variable x, and m_x is the mathematical expectation of the random variable x. The latter parameter is often called the center of dispersion or the most probable value of the random variable X.

x is a random variable, which can be taken as time, current, electric voltage, or another argument.

The normal law is a two-parameter law; to write it down, one needs to know m_x and σ.

The normal distribution (Gaussian distribution) is used to assess the reliability of products affected by a number of random factors, each of which has only a slight effect on the result.

12. Uniform distribution law. A continuous random variable X has a uniform distribution law on the segment [a, b] if its probability density is constant on this segment and equal to zero outside it, i.e.

f(x) = 1/(b - a) for x in [a, b], f(x) = 0 otherwise.

Notation: X ~ R(a, b).

Expected value: M(X) = (a + b)/2.

Variance: D(X) = (b - a)²/12.

A random variable X distributed according to the uniform law on the segment [0, 1] is called a random number from 0 to 1. It serves as the starting material for obtaining random variables with any distribution law. The uniform distribution law is used in the analysis of rounding errors in numerical calculations, in a number of queuing problems, and in statistical modeling of observations subject to a given distribution.

11. Definition. The probability distribution density of a continuous random variable X is the function f(x), the first derivative of the distribution function F(x).

The distribution density is also called the differential function. For a discrete random variable the distribution density is not applicable.

The meaning of the distribution density is that it shows how often the random variable X appears in a certain neighborhood of the point x when the experiments are repeated.

After introducing the distribution functions and distribution density, the following definition of a continuous random variable can be given.

10. The probability density (probability distribution density) of a random variable x is a function p(x) such that p(x) >= 0 and ∫_{-∞}^{+∞} p(x) dx = 1, and for any a < b the probability of the event a < x < b equals

∫_a^b p(x) dx.

If p(x) is continuous, then for sufficiently small ∆x the probability of the inequality x < X < x + ∆x is approximately equal to p(x)∆x (up to terms of higher order of smallness). The distribution function F(x) of the random variable x is related to the distribution density by

F(x) = ∫_{-∞}^x p(t) dt,

and if F(x) is differentiable, then

p(x) = F'(x).

Determination of the empirical distribution function

Let $X$ be a random variable and $F(x)$ the distribution function of this random variable. We carry out $n$ experiments on this random variable under the same conditions, independent of each other. In this case, we obtain a sequence of values $x_1, x_2, \dots, x_n$, which is called a sample.

Definition 1

Each value $x_i$ ($i=1, 2, \dots, n$) is called a variant.

One estimate of the theoretical distribution function is the empirical distribution function.

Definition 3

An empirical distribution function $F_n(x)$ is a function that determines for each value $x$ the relative frequency of the event $X < x$:

\[F_n\left(x\right)=\frac{n_x}{n},\]

where $n_x$ is the number of variants less than $x$ and $n$ is the sample size.

The difference between the empirical function and the theoretical one is that the theoretical function determines the probability of the event $X < x$, while the empirical function determines the relative frequency of this same event.

Properties of the empirical distribution function

Let us now consider several basic properties of the distribution function.

    The range of the function $F_n\left(x\right)$ is the segment $[0,\ 1]$.

    $F_n\left(x\right)$ is a non-decreasing function.

    $F_n\left(x\right)$ is a left continuous function.

    $F_n\left(x\right)$ is a piecewise constant function and increases only at points of values ​​of the random variable $X$

    Let $X_1$ be the smallest and $X_n$ the largest variant. Then $F_n\left(x\right)=0$ for $x\le X_1$ and $F_n\left(x\right)=1$ for $x\ge X_n$.

Let us introduce a theorem that connects the theoretical and empirical functions.

Theorem 1

Let $F_n\left(x\right)$ be the empirical distribution function, and $F\left(x\right)$ be the theoretical distribution function of the general population. Then, with probability 1, the equality holds:

\[\mathop{\lim }_{n\to \infty }{\left|F_n\left(x\right)-F\left(x\right)\right|=0.}\]

Examples of problems on finding the empirical distribution function

Example 1

Let the sampling distribution have the following data recorded using a table:

Figure 1.

Find the sample size, create an empirical distribution function and plot it.

Sample size: $n=5+10+15+20=50$.

By property 5, we have that for $x\le 1$ $F_n\left(x\right)=0$, and for $x>4$ $F_n\left(x\right)=1$.

For $1 < x\le 2$: $F_n\left(x\right)=\frac{5}{50}=0.1$.

For $2 < x\le 3$: $F_n\left(x\right)=\frac{5+10}{50}=0.3$.

For $3 < x\le 4$: $F_n\left(x\right)=\frac{5+10+15}{50}=0.6$.

Thus we get:

$F_n\left(x\right)=\begin{cases} 0, & x\le 1 \\ 0.1, & 1<x\le 2 \\ 0.3, & 2<x\le 3 \\ 0.6, & 3<x\le 4 \\ 1, & x>4 \end{cases}$

Figure 3.

Example 2

20 cities were randomly selected from the cities of the central part of Russia, for which the following data on public transport fares were obtained: 14, 15, 12, 12, 13, 15, 15, 13, 15, 12, 15, 14, 15, 13 , 13, 12, 12, 15, 14, 14.

Create an empirical distribution function for this sample and plot it.

Let's write down the sample values ​​in ascending order and calculate the frequency of each value. We get the following table:

Variants $x_i$:    12  13  14  15
Frequencies $n_i$:  5   4   4   7

Sample size: $n=20$.

By property 5, we have that for $x\le 12$ $F_n\left(x\right)=0$, and for $x>15$ $F_n\left(x\right)=1$.

For $12 < x\le 13$: $F_n\left(x\right)=\frac{5}{20}=0.25$.

For $13 < x\le 14$: $F_n\left(x\right)=\frac{5+4}{20}=0.45$.

For $14 < x\le 15$: $F_n\left(x\right)=\frac{5+4+4}{20}=0.65$.

Thus we get:

$F_n\left(x\right)=\begin{cases} 0, & x\le 12 \\ 0.25, & 12<x\le 13 \\ 0.45, & 13<x\le 14 \\ 0.65, & 14<x\le 15 \\ 1, & x>15 \end{cases}$

Let's plot the empirical distribution:

Figure 6.
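The computation of Example 2 can be verified with a short sketch over the raw fare data:

```python
# Empirical distribution function for the fare sample of Example 2
# (20 cities): F(x) = (number of fares less than x) / n.

fares = [14, 15, 12, 12, 13, 15, 15, 13, 15, 12,
         15, 14, 15, 13, 13, 12, 12, 15, 14, 14]
n = len(fares)

def F(x):
    return sum(1 for v in fares if v < x) / n

print(F(12), F(13), F(14), F(15), F(16))   # 0.0 0.25 0.45 0.65 1.0
```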

