Empirical process
Revision as of 21:22, 24 January 2006

The study of empirical processes is a branch of mathematics and a sub-area of probability theory.

The motivation for studying empirical processes is that it is often impossible to know the true underlying probability measure P. We collect observations X_1, X_2, ..., X_n and compute relative frequencies. We can estimate P, or a related distribution function F, by means of the empirical measure or the empirical distribution function, respectively. Theorems in the area of empirical processes confirm that these are uniformly good estimates or determine the accuracy of the estimation.

Suppose X is a sample space of observations. X can be quite general; for example: the real line, some Euclidean space, a space of functions, a Riemannian manifold, or whatever might be of interest. Let X_1, X_2, ..., X_n be independent identically distributed (iid) random variables (rv's) with probability measure P on X. For a measurable set A, define

P_n(A) = (1/n) card(X_j ∈ A, j = 1, 2, ..., n).
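The empirical measure of a fixed set can be sketched numerically. The following is a hypothetical illustration, not from the article: it takes the uniform distribution on [0, 1] and the set A = [0.2, 0.5), for which the true value P(A) = 0.3 is known.

```python
import random

# Illustrative sketch (assumed setup): X_1, ..., X_n iid Uniform(0, 1),
# A = [0.2, 0.5), so the true measure is P(A) = 0.3.
random.seed(0)
n = 10_000
xs = [random.random() for _ in range(n)]

# P_n(A) = (1/n) * card(X_j in A, j = 1, ..., n)
p_n = sum(1 for x in xs if 0.2 <= x < 0.5) / n

print(round(p_n, 2))  # close to the true value 0.3
```

The relative frequency p_n is exactly the empirical measure P_n(A) of the article's definition, evaluated on one set A.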

If C is a collection of subsets of X, then the collection

{P_n(c) : c ∈ C}

is the empirical measure indexed by C. The empirical process B_n is defined as

B_n = √n (P_n − P).

and

{B_n(c) : c ∈ C}

is the empirical process indexed by C.

A special case is the empirical process G_n associated with empirical distribution functions F_n:

G_n(x) = √n (F_n(x) − F(x)),

where X_1, X_2, ..., X_n are real-valued random variables with distribution function F, and F_n is defined by

F_n(x) = (1/n) card(X_j ≤ x, j = 1, 2, ..., n). In this case,
C = {(−∞, x] : x ∈ R}.
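The empirical distribution function above can be implemented directly; this is a minimal sketch (the `ecdf` helper is my naming, not the article's) that counts X_j ≤ x via a sorted copy of the sample.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F_n as a function: F_n(x) = (1/n) * card(X_j <= x, j = 1, ..., n)."""
    data = sorted(sample)
    n = len(data)
    # bisect_right counts exactly how many sorted values are <= x
    return lambda x: bisect_right(data, x) / n

F_n = ecdf([3.0, 1.0, 4.0, 1.0, 5.0])
print(F_n(1.0))  # 2 of 5 observations are <= 1.0, so 0.4
print(F_n(4.5))  # 4 of 5 observations are <= 4.5, so 0.8
```

F_n is a right-continuous step function that jumps by 1/n at each observation (by more at ties), matching the card(X_j ≤ x) definition.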

Major results for this special case include Kolmogorov-Smirnov statistics, the Glivenko-Cantelli theorem, and Donsker's theorem. Moreover, the empirical distribution function F_n of a finite sequence of realizations of a random variable is the very essence of statistical inference.

Glivenko-Cantelli theorem

By the strong law of large numbers, for every fixed x we know that

F_n(x) → F(x)   almost surely.

However, Glivenko and Cantelli strengthened this result.

The Glivenko-Cantelli theorem (1933):

‖F_n − F‖_∞ = sup_{x ∈ R} |F_n(x) − F(x)| → 0   almost surely.

Another way to state this is as follows: the sample paths of F_n get uniformly closer to F as n increases; hence F_n, which we observe, is almost surely a good approximation for F, and the approximation becomes better as we collect more observations.
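The uniform convergence can be watched numerically. This is an assumed Monte Carlo setup, not from the article: for Uniform(0, 1) data, F(x) = x, and the supremum of |F_n − F| is attained at the order statistics, where F_n jumps.

```python
import random

# Sketch of the Glivenko-Cantelli theorem for Uniform(0, 1), where F(x) = x.
random.seed(1)

def sup_deviation(n):
    data = sorted(random.random() for _ in range(n))
    # At the i-th order statistic (0-indexed), F_n jumps from i/n to (i+1)/n,
    # so the sup of |F_n(x) - x| is attained at one of these corners.
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(data))

d_small = sup_deviation(100)
d_large = sup_deviation(10_000)
print(d_small, d_large)  # the sup distance shrinks as n grows
```

A single run only illustrates the trend; the theorem asserts that the sup distance tends to 0 almost surely along the whole sequence of samples.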

Donsker's theorem

By the classical central limit theorem, it follows that

G n ( x ) d i s t G ( x ) {\displaystyle G_{n}(x){\longrightarrow }_{dist}G(x)} ,

that is, G_n(x) converges in distribution to a Gaussian (normal) random variable G(x) with mean 0 and variance F(x)[1 − F(x)]. Donsker (1952) showed that the sample paths of G_n(x), as functions on the real line R, converge in distribution to a stochastic process G in the space ℓ^∞ of all bounded functions f : R → R. The function space ℓ^∞ is used in this context to remind us that we are concerned with distributional convergence in terms of sample paths. The limit process G is a Gaussian process with zero mean and covariance given by

cov[G(s), G(t)] = E[G(s)G(t)] = F(min(s, t)) − F(s)F(t).

The process G(x) can be written as B(F(x)), where B is a standard Brownian bridge on the unit interval.
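The pointwise CLT statement above can be checked by simulation. This is a hypothetical setup of my own choosing: for Uniform(0, 1) data and a fixed x, the indicator of X_j ≤ x is Bernoulli(x), so G_n(x) should have variance close to x(1 − x).

```python
import random

# Assumed illustration: sample variance of G_n(x) for Uniform(0, 1) data,
# where F(x) = x, so the limiting variance is F(x)[1 - F(x)] = x(1 - x).
random.seed(2)
n, reps, x = 200, 2_000, 0.3

vals = []
for _ in range(reps):
    # F_n(x) = fraction of the n observations that are <= x
    f_n = sum(1 for _ in range(n) if random.random() <= x) / n
    vals.append(n ** 0.5 * (f_n - x))  # one realization of G_n(x)

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(round(var, 2))  # near x * (1 - x) = 0.21
```

This checks only one coordinate x; Donsker's theorem is the much stronger statement that the whole sample path of G_n converges in distribution to the Brownian-bridge process B(F(·)).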

If the observations X_1, X_2, ..., X_n are in a more general sample space X, we seek generalizations of the Glivenko-Cantelli theorem and Donsker's theorem. Also, we seek other theorems to determine rates of convergence and accuracy of estimation.

The classical empirical distribution function for real-valued random variables is a special case of the general theory, with X = R and the class of sets C = {(−∞, x] : x ∈ R}.

References

  • P. Billingsley, Probability and Measure, John Wiley and Sons, New York, second edition, 1986.
  • M.D. Donsker, Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems, Annals of Mathematical Statistics, 23:277-281, 1952.
  • R.M. Dudley, Central limit theorems for empirical measures, Annals of Probability, 6(6):899-929, 1978.
  • R.M. Dudley, Uniform Central Limit Theorems, Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, UK, 1999.
  • J. Wolfowitz, Generalization of the theorem of Glivenko-Cantelli, Annals of Mathematical Statistics, 25:131-138, 1954.
