High-dimensional statistics


In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger (relative to the number of datapoints) than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

There are several notions of high-dimensional analysis of statistical methods, including:

  • Non-asymptotic results which apply for finite $n, p$ (number of data points and dimension size, respectively).
  • Kolmogorov asymptotics, which studies the asymptotic behaviour where the ratio $n/p$ converges to a specific finite value.

Examples

Parameter estimation in linear models

Illustration of the linear model in high dimensions: a data set consists of a response vector $Y \in \mathbb{R}^n$ and a design matrix $X \in \mathbb{R}^{n \times p}$ with $p \gg n$. Our goal is to estimate the unknown vector $\beta = (\beta_1, \dots, \beta_p) \in \mathbb{R}^p$ of regression coefficients, where $\beta$ is often assumed to be sparse, in the sense that the cardinality of the set $S := \{ j : \beta_j \neq 0 \}$ is small by comparison with $p$.

The most basic statistical model for the relationship between a covariate vector $x \in \mathbb{R}^p$ and a response variable $y \in \mathbb{R}$ is the linear model

$$y = x^\top \beta + \epsilon,$$

where $\beta \in \mathbb{R}^p$ is an unknown parameter vector, and $\epsilon$ is random noise with mean zero and variance $\sigma^2$. Given independent responses $Y_1, \ldots, Y_n$, with corresponding covariates $x_1, \ldots, x_n$, from this model, we can form the response vector $Y = (Y_1, \ldots, Y_n)^\top$ and design matrix $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$. When $n \geq p$ and the design matrix has full column rank (i.e. its columns are linearly independent), the ordinary least squares estimator of $\beta$ is

$$\hat{\beta} := (X^\top X)^{-1} X^\top Y.$$

When $\epsilon \sim N(0, \sigma^2)$, it is known that $\hat{\beta} \sim N_p\bigl(\beta, \sigma^2 (X^\top X)^{-1}\bigr)$. Thus, $\hat{\beta}$ is an unbiased estimator of $\beta$, and the Gauss–Markov theorem tells us that it is the Best Linear Unbiased Estimator.
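As a concrete illustration, the following minimal sketch (in Python with NumPy; the Gaussian design, the seed and the values of $n$, $p$ and $\sigma$ are illustrative assumptions) simulates the linear model in the classical regime $n \geq p$ and computes the ordinary least squares estimator.

```python
# Minimal sketch: ordinary least squares in the classical regime n >= p.
# The Gaussian design and the chosen n, p, sigma are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 1.0            # n >= p, so X has full column rank almost surely
beta = rng.normal(size=p)            # true coefficient vector
X = rng.normal(size=(n, p))          # design matrix with i.i.d. standard Gaussian entries
Y = X @ beta + sigma * rng.normal(size=n)

# OLS estimator beta_hat = (X^T X)^{-1} X^T Y, computed via a least-squares solver
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("estimation error ||beta_hat - beta||:", np.linalg.norm(beta_hat - beta))
```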

However, overfitting is a concern when $p$ is of comparable magnitude to $n$: the matrix $X^\top X$ in the definition of $\hat{\beta}$ may become ill-conditioned, with a small minimum eigenvalue. In such circumstances $\mathbb{E}(\|\hat{\beta} - \beta\|^2) = \sigma^2 \, \mathrm{tr}\bigl((X^\top X)^{-1}\bigr)$ will be large, since the trace of $(X^\top X)^{-1}$ is the sum of its eigenvalues, and a small eigenvalue of $X^\top X$ corresponds to a large eigenvalue of its inverse. Even worse, when $p > n$, the matrix $X^\top X$ is singular. (See Section 1.2 and Exercise 1.2 in .)
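The ill-conditioning described above is easy to observe numerically. The following sketch (again with an assumed Gaussian design and illustrative dimensions) tracks the smallest eigenvalue of $X^\top X$ as $p$ approaches and then exceeds $n$.

```python
# Minimal sketch: the smallest eigenvalue of X^T X shrinks as p approaches n,
# and X^T X becomes singular once p > n (illustrative dimensions).
import numpy as np

rng = np.random.default_rng(1)
n = 100
for p in (10, 50, 90, 100, 150):
    X = rng.normal(size=(n, p))
    eigvals = np.linalg.eigvalsh(X.T @ X)        # eigenvalues of the Gram matrix
    print(f"p = {p:3d}: smallest eigenvalue of X^T X = {eigvals.min():.3g}")
# For p > n, the rank of X^T X is at most n, so at least p - n of its eigenvalues
# are (numerically) zero and (X^T X)^{-1} does not exist.
```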

The deterioration in estimation performance in high dimensions observed in the previous paragraph is not limited to the ordinary least squares estimator. In fact, statistical inference in high dimensions is intrinsically hard, a phenomenon known as the curse of dimensionality, and it can be shown that no estimator can do better in a worst-case sense without additional information (see Example 15.10). Nevertheless, the situation in high-dimensional statistics may not be hopeless when the data possess some low-dimensional structure. One common assumption for high-dimensional linear regression is that the vector of regression coefficients is sparse, in the sense that most coordinates of $\beta$ are zero. Many statistical procedures, including the Lasso, have been proposed to fit high-dimensional linear models under such sparsity assumptions.
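As a rough sketch of how such a procedure might be applied, the Lasso can be fitted with scikit-learn along the following lines; the sparse coefficient vector, the penalty level `alpha` and the problem sizes are illustrative assumptions rather than recommended choices.

```python
# Minimal sketch: Lasso for a sparse high-dimensional linear model with p >> n.
# The simulated data and the penalty level alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, s = 100, 500, 5                 # p >> n, with only s non-zero coefficients
beta = np.zeros(p)
beta[:s] = 3.0                        # sparse truth: support S = {0, ..., s-1}
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.3).fit(X, Y)    # l1-penalised least squares
print("estimated support:", np.flatnonzero(lasso.coef_))
```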

Covariance matrix estimation

Another example of a high-dimensional statistical phenomenon can be found in the problem of covariance matrix estimation. Suppose that we observe $X_1, \ldots, X_n \in \mathbb{R}^p$, which are i.i.d. draws from some zero-mean distribution with an unknown covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. A natural unbiased estimator of $\Sigma$ is the sample covariance matrix

$$\widehat{\Sigma} := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top.$$

In the low-dimensional setting where $n$ increases and $p$ is held fixed, $\widehat{\Sigma}$ is a consistent estimator of $\Sigma$ in any matrix norm. When $p$ grows with $n$, on the other hand, this consistency result may fail to hold. As an illustration, suppose that each $X_i \sim N_p(0, I)$ and $p/n \rightarrow \alpha \in (0, 1)$. If $\widehat{\Sigma}$ were to consistently estimate $\Sigma = I$, then the eigenvalues of $\widehat{\Sigma}$ should approach one as $n$ increases. It turns out that this is not the case in this high-dimensional setting. Indeed, the largest and smallest eigenvalues of $\widehat{\Sigma}$ concentrate around $(1 + \sqrt{\alpha})^2$ and $(1 - \sqrt{\alpha})^2$, respectively, according to the limiting distribution derived by Tracy and Widom, and these clearly deviate from the unit eigenvalues of $\Sigma$. Further information on the asymptotic behaviour of the eigenvalues of $\widehat{\Sigma}$ can be obtained from the Marchenko–Pastur law. From a non-asymptotic point of view, the maximum eigenvalue $\lambda_{\max}(\widehat{\Sigma})$ of $\widehat{\Sigma}$ satisfies

$$\mathbb{P}\left(\lambda_{\max}(\widehat{\Sigma}) \geq \bigl(1 + \sqrt{p/n} + \delta\bigr)^2\right) \leq e^{-n\delta^2/2},$$

for any $\delta \geq 0$ and all choices of pairs of $n, p$.
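A short simulation (assuming standard Gaussian data and illustrative values of $n$ and $p$) shows the extreme sample eigenvalues concentrating near $(1 \pm \sqrt{\alpha})^2$ rather than near one.

```python
# Minimal sketch: extreme eigenvalues of the sample covariance matrix when
# X_i ~ N_p(0, I) and p/n = alpha; the dimensions below are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p = 2000, 500                       # alpha = p/n = 0.25
X = rng.normal(size=(n, p))
Sigma_hat = X.T @ X / n                # sample covariance (mean known to be zero)
eigvals = np.linalg.eigvalsh(Sigma_hat)
alpha = p / n
print("largest eigenvalue :", eigvals.max(), " vs (1 + sqrt(alpha))^2 =", (1 + np.sqrt(alpha)) ** 2)
print("smallest eigenvalue:", eigvals.min(), " vs (1 - sqrt(alpha))^2 =", (1 - np.sqrt(alpha)) ** 2)
```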

Again, additional low-dimensional structure is needed for successful covariance matrix estimation in high dimensions. Examples of such structures include sparsity, low-rankness and bandedness. Similar remarks apply when estimating an inverse covariance matrix (precision matrix).
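As one simple example of exploiting such structure, a banding estimator zeroes out sample covariance entries far from the diagonal; the sketch below is a minimal illustration, with the bandwidth `k` an assumed tuning parameter rather than a data-driven choice.

```python
# Minimal sketch: a banding estimator for a (presumed) banded covariance matrix.
# The bandwidth k is an assumed tuning parameter; in practice it would be chosen
# by cross-validation or from theory.
import numpy as np

def banded_covariance(X, k):
    """Sample covariance with entries |i - j| > k set to zero."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n                     # zero-mean sample covariance
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, Sigma_hat, 0.0)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 200))                  # p = 200 > n = 50
Sigma_banded = banded_covariance(X, k=2)
```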

History

From an applied perspective, research in high-dimensional statistics was motivated by the realisation that advances in computing technology had dramatically increased the ability to collect and store data, and that traditional statistical techniques such as those described in the examples above were often ill-equipped to handle the resulting challenges. Theoretical advances in the area can be traced back to the remarkable result of Charles Stein in 1956, where he proved that the usual estimator of a multivariate normal mean was inadmissible with respect to squared error loss in three or more dimensions. Indeed, the James–Stein estimator provided the insight that in high-dimensional settings, one may obtain improved estimation performance through shrinkage, which reduces variance at the expense of introducing a small amount of bias. This bias-variance tradeoff was further exploited in the context of high-dimensional linear models by Hoerl and Kennard in 1970 with the introduction of ridge regression. Another major impetus for the field was provided by Robert Tibshirani's work on the Lasso in 1996, which used $\ell_1$ regularisation to achieve simultaneous model selection and parameter estimation in high-dimensional sparse linear regression. Since then, a large number of other shrinkage estimators have been proposed to exploit different low-dimensional structures in a wide range of high-dimensional statistical problems.

Topics in high-dimensional statistics

The following are examples of topics that have received considerable attention in the high-dimensional statistics literature in recent years:

  • Linear models in high dimensions. Linear models are one of the most widely used tools in statistics and its applications. As such, sparse linear regression is one of the most well-studied topics in high-dimensional statistical research. Building upon earlier works on ridge regression and the Lasso, several other shrinkage estimators have been proposed and studied in this and related problems. They include
    • The Dantzig selector, which minimises the maximum covariate-residual correlation, instead of the residual sum of squares as in the Lasso, subject to an $\ell_1$ constraint on the coefficients.
    • Elastic net, which combines the $\ell_1$ regularisation of the Lasso with the $\ell_2$ regularisation of ridge regression to allow highly correlated covariates to be simultaneously selected with similar regression coefficients (see the sketch after this list).
    • The Group Lasso, which allows predefined groups of covariates to be selected jointly.
    • The Fused lasso, which regularises the difference between nearby coefficients when the regression coefficients reflect spatial or temporal relationships, so as to enforce a piecewise constant structure.
  • High-dimensional variable selection. In addition to estimating the underlying parameter in regression models, another important topic is to identify the non-zero coefficients, as these correspond to variables that are needed in a final model. Each of the techniques listed under the previous heading can be used for this purpose, and is sometimes combined with ideas such as subsampling through stability selection.
  • High-dimensional covariance and precision matrix estimation. These problems were introduced above; see also shrinkage estimation. Methods include tapering estimators and the constrained $\ell_1$ minimisation estimator.
  • Sparse principal component analysis. Principal Component Analysis is another technique that breaks down in high dimensions; more precisely, under appropriate conditions, the leading eigenvector of the sample covariance matrix is an inconsistent estimator of its population counterpart when the ratio of the number of variables $p$ to the number of observations $n$ is bounded away from zero. Under the assumption that this leading eigenvector is sparse (which can aid interpretability), consistency can be restored.
  • Matrix completion. This topic, which concerns the task of filling in the missing entries of a partially observed matrix, became popular owing in large part to the Netflix prize for predicting user ratings for films.
  • High-dimensional classification. Linear discriminant analysis cannot be used when $p > n$, because the sample covariance matrix is singular. Alternative approaches have been proposed based on naive Bayes, feature selection and random projections.
  • Graphical models for high-dimensional data. Graphical models are used to encode the conditional dependence structure between different variables. Under a Gaussianity assumption, the problem reduces to that of estimating a sparse precision matrix, discussed above.
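As a brief illustration of one of the methods listed above, the elastic net can be fitted with scikit-learn roughly as follows; the tuning parameters `alpha` and `l1_ratio`, and the simulated block of correlated covariates, are illustrative assumptions.

```python
# Minimal sketch: elastic net on a design with a block of highly correlated,
# active covariates. All tuning parameters and dimensions are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
n, p = 100, 300
beta = np.zeros(p)
beta[:3] = 2.0                                   # three active, correlated covariates
Z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, p))
X[:, :3] = Z + 0.1 * rng.normal(size=(n, 3))     # columns 0-2 are nearly identical
Y = X @ beta + rng.normal(size=n)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, Y)   # mix of l1 and l2 penalties
print("selected covariates:", np.flatnonzero(enet.coef_))
```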

Notes

  1. ^ Lederer, Johannes (2022). Fundamentals of High-Dimensional Statistics: With Exercises and R Labs. Springer Texts in Statistics. Cham: Springer.
  2. ^ Wainwright, Martin J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. doi:10.1017/9781108627771. ISBN 9781108498029. S2CID 128095693.
  4. Stein, C. (1956), "Inadmissibility of the usual estimator for the mean of a multivariate distribution", Proc. Third Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 197–206, MR 0084922, Zbl 0073.35602
  5. James, W.; Stein, C. (1961), "Estimation with quadratic loss", Proc. Fourth Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 361–379, MR 0133191
  6. Hoerl, Arthur E.; Kennard, Robert W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems". Technometrics. 12 (1): 55–67. Accessed 13 March 2021.
  7. Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the lasso". Journal of the Royal Statistical Society. Series B (methodological). 58 (1). Wiley: 267–88. JSTOR 2346178.
  8. Candes, Emmanuel; Tao, Terence (2007). "The Dantzig selector: Statistical estimation when p is much larger than n". Annals of Statistics. 35 (6): 2313–2351. arXiv:math/0506081. doi:10.1214/009053606000001523. MR 2382644. S2CID 88524200.
  9. Zou, Hui; Hastie, Trevor (2005). "Regularization and Variable Selection via the Elastic Net". Journal of the Royal Statistical Society. Series B (statistical Methodology). 67 (2). Wiley: 301–20. doi:10.1111/j.1467-9868.2005.00503.x. JSTOR 3647580.
  10. Yuan, Ming; Lin, Yi (2006). "Model Selection and Estimation in Regression with Grouped Variables". Journal of the Royal Statistical Society. Series B (statistical Methodology). 68 (1). Wiley: 49–67. doi:10.1111/j.1467-9868.2005.00532.x. JSTOR 3647556. S2CID 6162124.
  11. Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused lasso”. Journal of the Royal Statistical Society. Series B (statistical Methodology) 67 (1). Wiley: 91–108. https://www.jstor.org/stable/3647602.
  12. Meinshausen, Nicolai; Bühlmann, Peter (2010). "Stability selection". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 72 (4): 417–473. doi:10.1111/j.1467-9868.2010.00740.x. ISSN 1467-9868. S2CID 1231300.
  13. Shah, Rajen D.; Samworth, Richard J. (2013). "Variable selection with error control: another look at stability selection". Journal of the Royal Statistical Society. Series B (Statistical Methodology). 75 (1): 55–80. arXiv:1105.5578. doi:10.1111/j.1467-9868.2011.01034.x. ISSN 1369-7412. JSTOR 23361014. S2CID 18211609.
  14. Cai, T. Tony; Zhang, Cun-Hui; Zhou, Harrison H. (August 2010). "Optimal rates of convergence for covariance matrix estimation". The Annals of Statistics. 38 (4): 2118–2144. arXiv:1010.3866. doi:10.1214/09-AOS752. ISSN 0090-5364. S2CID 14038500. Retrieved 2021-04-06.
  15. Cai, Tony; Liu, Weidong; Luo, Xi (2011-06-01). "A Constrained $\ell_1$ Minimization Approach to Sparse Precision Matrix Estimation". Journal of the American Statistical Association. 106 (494): 594–607. arXiv:1102.2233. doi:10.1198/jasa.2011.tm10155. ISSN 0162-1459. S2CID 15900101. Retrieved 2021-04-06.
  16. Johnstone, Iain M.; Lu, Arthur Yu (2009-06-01). "On Consistency and Sparsity for Principal Components Analysis in High Dimensions". Journal of the American Statistical Association. 104 (486): 682–693. doi:10.1198/jasa.2009.0121. ISSN 0162-1459. PMC 2898454. PMID 20617121.
  17. Vu, Vincent Q.; Lei, Jing (December 2013). "Minimax sparse principal subspace estimation in high dimensions". The Annals of Statistics. 41 (6): 2905–2947. arXiv:1211.0373. doi:10.1214/13-AOS1151. ISSN 0090-5364. S2CID 562591.
  18. Bickel, Peter J.; Levina, Elizaveta (2004). "Some theory for Fisher's linear discriminant function, naive Bayes', and some alternatives when there are many more variables than observations". Bernoulli. 10 (6): 989–1010. doi:10.3150/bj/1106314847.
  19. Fan, Jianqing; Fan, Yingying (December 2008). "High-dimensional classification using features annealed independence rules". The Annals of Statistics. 36 (6): 2605–2637. arXiv:math/0701108. doi:10.1214/07-AOS504. PMC 2630123. PMID 19169416. S2CID 2982392.
  20. Cannings, Timothy I.; Samworth, Richard J. (2017). "Random-projection ensemble classification". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 79 (4): 959–1035. arXiv:1504.04595. doi:10.1111/rssb.12228. S2CID 88520328.

References

  • Lederer, Johannes (2022). Fundamentals of High-Dimensional Statistics. Cham: Springer.
  • Giraud, Christophe (2015). Introduction to High-Dimensional Statistics. Philadelphia: Chapman and Hall/CRC.
  • Cai, T. Tony; Shen, Xiaotong, eds. (2011). High-dimensional data analysis. Frontiers of Statistics. Singapore: World Scientific.
  • Bühlmann, Peter; van de Geer, Sara (2011). Statistics for high-dimensional data: methods, theory and applications. Heidelberg; New York: Springer.
  • Wainwright, Martin J. (2019). High-dimensional Statistics: A non-asymptotic viewpoint. Cambridge, UK: Cambridge University Press.