Linear trend estimation

Statistical technique to aid interpretation of data. For broader coverage of this topic, see Curve fitting.

Linear trend estimation is a statistical technique used to analyze data patterns. Data patterns, or trends, occur when the information gathered tends to increase or decrease over time or is influenced by changes in an external factor. Linear trend estimation essentially creates a straight line on a graph of data that models the general direction that the data is heading.

Fitting a trend: Least-squares

Given a set of data, there are a variety of functions that can be chosen to fit the data. The simplest function is a straight line with the dependent variable (typically the measured data) on the vertical axis and the independent variable (often time) on the horizontal axis.

The least-squares fit is a common method to fit a straight line through the data. This method minimizes the sum of the squared errors in the data series $y$ . Given a set of points in time $t$ and data values $y_{t}$ observed for those points in time, values of ${\hat {a}}$ and ${\hat {b}}$ are chosen to minimize the sum of squared errors

\sum _{t}\left[y_{t}-\left({\hat {a}}t+{\hat {b}}\right)\right]^{2}

This formula first calculates the difference between the observed data $y_{t}$ and the estimate $({\hat {a}}t+{\hat {b}})$ . The difference at each data point is then squared, and the squares are added together, giving the "sum of squares" measurement of error. The values of ${\hat {a}}$ and ${\hat {b}}$ derived from the data parameterize the simple linear estimator ${\hat {y}}={\hat {a}}t+{\hat {b}}$ . The term "trend" refers to the slope ${\hat {a}}$ in the least-squares estimator.
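The minimization has a well-known closed-form solution for a straight line. The following is a minimal sketch in plain Python; the function names and the sample data are illustrative assumptions, not part of any particular library.

```python
def fit_linear_trend(t, y):
    """Return (a_hat, b_hat) minimizing sum((y_t - (a*t + b))**2)."""
    n = len(t)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    # Centered sums of squares and cross-products.
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    if s_tt == 0:
        raise ValueError("all time points are identical")
    a_hat = s_ty / s_tt               # slope, i.e. the "trend"
    b_hat = y_mean - a_hat * t_mean   # intercept
    return a_hat, b_hat


def sum_squared_errors(t, y, a, b):
    """Sum of squared differences between the data and the line a*t + b."""
    return sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, y))


# Illustrative (made-up) data lying roughly on the line y = 2t + 1.
t = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
a_hat, b_hat = fit_linear_trend(t, y)
print(a_hat, b_hat, sum_squared_errors(t, y, a_hat, b_hat))
```

Any other choice of slope and intercept gives a larger sum of squared errors for the same data, which is a quick way to sanity-check the fit.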

Data as trend and noise

To analyze a (time) series of data, it can be assumed that it may be represented as trend plus noise:

y_{t}=at+b+e_{t}\,

where $a$ and $b$ are unknown constants and the $e$ 's are randomly distributed errors. If one can reject the null hypothesis that the errors are non-stationary, then the non-stationary series $\{y_{t}\}$ is called trend-stationary. The least-squares method assumes the errors are independently distributed with a normal distribution. If this is not the case, hypothesis tests about the unknown parameters $a$ and $b$ may be inaccurate. It is simplest if the $e$ 's all have the same distribution, but if not (if some have higher variance, meaning that those data points are effectively less certain), then this can be taken into account during the least-squares fitting by weighting each point by the inverse of the variance of that point.
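The inverse-variance weighting mentioned above can be sketched as follows. This is an illustrative plain-Python version that assumes the variance of each point is already known; the function name and the sample data are assumptions made for the example.

```python
def fit_weighted_trend(t, y, variances):
    """Weighted least-squares line; each point is weighted by 1/variance."""
    if not (len(t) == len(y) == len(variances)) or len(t) < 2:
        raise ValueError("need three equal-length sequences with at least 2 points")
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    # Weighted means replace the ordinary means of the unweighted fit.
    t_bar = sum(w * ti for w, ti in zip(weights, t)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    s_tt = sum(w * (ti - t_bar) ** 2 for w, ti in zip(weights, t))
    s_ty = sum(w * (ti - t_bar) * (yi - y_bar)
               for w, ti, yi in zip(weights, t, y))
    a_hat = s_ty / s_tt
    b_hat = y_bar - a_hat * t_bar
    return a_hat, b_hat


# Made-up example: the last point is far off the line y = 2t + 1 but is
# declared very uncertain, so it barely influences the fitted slope.
t = [0, 1, 2, 3]
y = [1.0, 3.0, 5.0, 20.0]
a_hat, b_hat = fit_weighted_trend(t, y, variances=[1.0, 1.0, 1.0, 1e6])
print(a_hat, b_hat)
```

With equal variances the result coincides with the ordinary least-squares fit.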

Commonly, where only a single time series exists to be analyzed, the variance of the $e$ 's is estimated by fitting a trend to obtain the estimated parameter values ${\hat {a}}$ and ${\hat {b}},$ thus allowing the predicted values

{\hat {y}}={\hat {a}}t+{\hat {b}}

to be subtracted from the data $y_{t}$ (thus detrending the data). The residuals ${\hat {e}}_{t}$ are the detrended data, and the variance of the $e_{t}$ 's is estimated from them; this is often the only way of estimating the variance of the $e_{t}$ 's.
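A minimal sketch of this detrending step is shown below. It assumes the common convention of dividing the residual sum of squares by n − 2, because two parameters were estimated; the function names and data are illustrative.

```python
def fit_linear_trend(t, y):
    """Ordinary least-squares slope and intercept for y against t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a_hat = s_ty / s_tt
    return a_hat, y_mean - a_hat * t_mean


def detrend(t, y):
    """Return the residuals y_t - (a_hat*t + b_hat)."""
    a_hat, b_hat = fit_linear_trend(t, y)
    return [yi - (a_hat * ti + b_hat) for ti, yi in zip(t, y)]


def residual_variance(t, y):
    """Estimate the noise variance from the residuals (n - 2 denominator)."""
    residuals = detrend(t, y)
    if len(residuals) < 3:
        raise ValueError("need at least 3 points to estimate the variance")
    return sum(r ** 2 for r in residuals) / (len(residuals) - 2)


# Illustrative (made-up) data.
t = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
residuals = detrend(t, y)
noise_variance = residual_variance(t, y)
print(residuals, noise_variance)
```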

Once the "noise" of the series is known, the significance of the trend can be assessed by making the null hypothesis that the trend, $a$ , is not different from 0. Given the variance of the noise, one can determine the distribution of trends that would be calculated from random (trendless) data. If the magnitude of the estimated trend, ${\hat {a}}$ , is larger than the critical value for a certain significance level, then the estimated trend is deemed significantly different from zero at that significance level, and the null hypothesis of a zero underlying trend is rejected.
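In practice this test is commonly carried out with the t-statistic of the slope, that is, the estimated trend divided by its standard error. The sketch below assumes independent errors of equal variance; the function name and data are illustrative, and the critical value must be supplied by the user from a Student's t table with n − 2 degrees of freedom.

```python
import math


def slope_t_statistic(t, y):
    """Return (a_hat, standard_error, t_statistic) for the fitted slope."""
    n = len(t)
    if n < 3:
        raise ValueError("need at least 3 points")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a_hat = s_ty / s_tt
    b_hat = y_mean - a_hat * t_mean
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - (a_hat * ti + b_hat)) ** 2 for ti, yi in zip(t, y))
    variance = sse / (n - 2)
    standard_error = math.sqrt(variance / s_tt)
    if standard_error == 0:
        raise ValueError("perfect fit: the t-statistic is undefined")
    return a_hat, standard_error, a_hat / standard_error


# Illustrative (made-up) data with 5 points, hence 3 degrees of freedom.
t = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
a_hat, se, t_stat = slope_t_statistic(t, y)

# Two-sided 5% critical value of Student's t with 3 degrees of freedom.
CRITICAL_VALUE = 3.182
print(t_stat, abs(t_stat) > CRITICAL_VALUE)
```

If the absolute t-statistic exceeds the critical value, the null hypothesis of zero trend is rejected at that significance level.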

The use of a linear trend line has been the subject of criticism, leading to a search for alternative approaches to avoid its use in model estimation. One of the alternative approaches involves unit root tests and the cointegration technique in econometric studies.

The estimated coefficient associated with a linear trend variable such as time is interpreted as a measure of the impact of a number of unknown or known but immeasurable factors on the dependent variable over one unit of time. Strictly speaking, this interpretation is applicable for the estimation time frame only. Outside of this time frame, it cannot be determined how these immeasurable factors behave both qualitatively and quantitatively.

Mathematicians, statisticians, econometricians, and economists have published research addressing these issues. For example, detailed notes on the meaning of linear time trends in the regression model are given in Cameron (2005); Granger, Engle, and many other econometricians have written on stationarity, unit root testing, co-integration, and related issues (a summary of some of the works in this area can be found in an information paper by the Royal Swedish Academy of Sciences (2003)); and Ho-Trieu & Tucker (1990) have written on logarithmic time trends with results indicating linear time trends are special cases of cycles.

Noisy time series

It is harder to see a trend in a noisy time series. For example, if the true series is 0, 1, 2, 3, ..., all plus some independent normally distributed "noise" e of standard deviation E, and a sample series of length 50 is given, then if E = 0.1, the trend will be obvious; if E = 100, the trend will probably be visible; but if E = 10000, the trend will be buried in the noise.
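The effect of the noise level can be explored with a small simulation. The sketch below generates the series 0, 1, 2, ... of length 50 with Gaussian noise of standard deviation E and fits a line to it; the seed and function names are arbitrary choices for illustration. It also computes the theoretical standard error of the fitted slope, E divided by the square root of the centered sum of squares of the time points, which gives a rough sense of how well a true slope of 1 can be resolved.

```python
import random


def fit_slope(t, y):
    """Ordinary least-squares slope of y against t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    return s_ty / s_tt


def simulate_slopes(noise_levels, length=50, seed=0):
    """Fit a trend to t + noise for each noise standard deviation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    t = list(range(length))
    slopes = {}
    for sd in noise_levels:
        y = [ti + rng.gauss(0.0, sd) for ti in t]
        slopes[sd] = fit_slope(t, y)
    return slopes


def slope_standard_error(sd, length=50):
    """Theoretical standard error of the slope for i.i.d. noise of s.d. sd."""
    t_mean = (length - 1) / 2
    s_tt = sum((ti - t_mean) ** 2 for ti in range(length))
    return sd / s_tt ** 0.5


slopes = simulate_slopes([0.1, 100, 10000])
for sd, slope in slopes.items():
    print(f"E={sd}: fitted slope={slope:.4f}, "
          f"standard error={slope_standard_error(sd):.4f}")
```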

Consider a concrete example, such as the global surface temperature record of the past 140 years as presented by the IPCC. The interannual variation is about 0.2 °C, and the trend is about 0.6 °C over 140 years, with 95% confidence limits of 0.2 °C (by coincidence, about the same value as the interannual variation). Hence, the trend is statistically different from 0. However, as noted elsewhere, this time series does not conform to the assumptions necessary for least-squares to be valid.

Goodness of fit (r-squared) and trend

Illustration of the effect of filtering on $r^{2}$ . Black = unfiltered data; red = data averaged every 10 points; blue = data averaged every 100 points. All have the same trend, but more filtering leads to higher $r^{2}$ of fitted trend line.

The least-squares fitting process produces a value, r-squared ( $r^{2}$ ), which is 1 minus the ratio of the variance of the residuals to the variance of the dependent variable. It says what fraction of the variance of the data is explained by the fitted trend line. It does not relate to the statistical significance of the trend line (see graph); the statistical significance of the trend is determined by its t-statistic. Often, filtering a series increases $r^{2}$ while making little difference to the fitted trend.
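The effect of filtering can be reproduced with simulated data. The sketch below computes r-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compares a raw series with block averages over 10 and 100 points. The trend, noise level, seed, and function names are arbitrary assumptions for the illustration.

```python
import random


def fit_linear_trend(t, y):
    """Ordinary least-squares slope and intercept for y against t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    a_hat = s_ty / s_tt
    return a_hat, y_mean - a_hat * t_mean


def r_squared(t, y):
    """1 - (residual sum of squares) / (total sum of squares)."""
    a_hat, b_hat = fit_linear_trend(t, y)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (a_hat * ti + b_hat)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def block_average(values, size):
    """Average consecutive, non-overlapping blocks of the given size."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]


rng = random.Random(1)  # fixed seed so the run is repeatable
t = [float(i) for i in range(1000)]
y = [0.01 * ti + rng.gauss(0.0, 5.0) for ti in t]

results = {}
for size in (1, 10, 100):
    t_avg, y_avg = block_average(t, size), block_average(y, size)
    slope, _ = fit_linear_trend(t_avg, y_avg)
    results[size] = (slope, r_squared(t_avg, y_avg))
    print(f"block size {size}: slope={slope:.4f}, r^2={results[size][1]:.3f}")
```

The fitted slopes stay close to one another while r-squared rises with the block size, because averaging reduces the noise variance without changing the underlying trend.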

Advanced models

Thus far, the data have been assumed to consist of the trend plus noise, with the noise at each data point being independent and identically distributed random variables with a normal distribution. Real data (for example, climate data) may not fulfill these criteria. This is important, as it makes an enormous difference to the ease with which the statistics can be analyzed so as to extract maximum information from the data series. If there are other non-linear effects that are correlated with the independent variable (such as cyclic influences), the model is mathematically misspecified and the use of least-squares estimation of the trend is not valid. Also, where the variations are significantly larger than the resulting straight-line trend, the choice of start and end points can significantly change the result. Statistical inferences (tests for the presence of a trend, confidence intervals for the trend, etc.) are invalid unless departures from the standard assumptions are properly accounted for, for example, as follows:

Dependence: autocorrelated time series might be modelled using autoregressive moving average models.
Non-constant variance: in the simplest cases, weighted least squares might be used.
Non-normal distribution for errors: in the simplest cases, a generalized linear model might be applicable.
Unit root: taking first (or occasionally second) differences of the data, with the level of differencing being identified through various unit root tests.
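As an illustration of the differencing step in the last item, the following minimal sketch computes first and second differences of a series. It does not implement any unit root test; the function name and the sample series are assumptions made for the example.

```python
def first_differences(y):
    """Return the series of successive differences y[t+1] - y[t]."""
    return [later - earlier for earlier, later in zip(y, y[1:])]


# A series with a constant step of 2 has constant first differences.
linear = [1, 3, 5, 7]
print(first_differences(linear))

# A quadratic series needs second differences to become constant.
quadratic = [0, 1, 4, 9, 16]
second = first_differences(first_differences(quadratic))
print(second)
```

For a series that behaves like a random walk with drift, the mean of the first differences is a simple estimate of the drift per time step.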

In R, the linear trend in data can be estimated by using the 'tslm' function of the 'forecast' package.

Trends in clinical data

Medical and biomedical studies often seek to determine a link between sets of data, such as a clinical or scientific metric in three different diseases. But data may also be linked in time (such as the change in the effect of a drug from baseline, to month 1, to month 2), or by an external factor that may or may not be determined by the researcher and/or their subject (such as no pain, mild pain, moderate pain, or severe pain). In these cases, one would expect the effect test statistic (e.g., the influence of a statin on levels of cholesterol, of an analgesic on the degree of pain, or of increasing doses of different strengths of a drug on a measurable index, i.e. a dose-response effect) to change in direct order as the effect develops.

Suppose the mean level of cholesterol falls from 5.6 mmol/L at baseline, before the prescription of a statin, to 3.4 mmol/L at one month and to 3.7 mmol/L at two months. Given sufficient power, an ANOVA (analysis of variance) would most likely find a significant fall at one and two months, but the fall is not linear. Furthermore, a post-hoc test may be required. An alternative test may be a repeated measures (two-way) ANOVA or Friedman test, depending on the nature of the data. Nevertheless, because the groups are ordered, a standard ANOVA is inappropriate. Should the cholesterol fall from 5.4 to 4.1 to 3.7, there is a clear linear trend. The same principle may be applied to the effects of allele/genotype frequency, where it could be argued that the genotypes XX, XY, and YY of a single-nucleotide polymorphism in fact form a trend of no Y's, one Y, and then two Y's.

The mathematics of linear trend estimation is a variant of the standard ANOVA, giving different information, and would be the most appropriate test if the researchers hypothesize a trend effect in their test statistic. One example is levels of serum trypsin in six groups of subjects ordered by age decade (10–19 years up to 60–69 years). Levels of trypsin (ng/mL) rise in a direct linear trend of 128, 152, 194, 207, 215, 218 (data from Altman). Unsurprisingly, a 'standard' ANOVA gives p < 0.0001, whereas linear trend estimation gives p = 0.00006. Incidentally, it could be reasonably argued that, as age is a naturally continuous variable, it should not be categorized into decades, and that the relationship between age and serum trypsin should instead be sought by correlation (assuming the raw data are available). A further example is of a substance measured at four time points in different groups:


Time point	Mean	SD
1	1.6	0.56
2	1.94	0.75
3	2.22	0.66
4	2.40	0.79

This is a clear trend. ANOVA gives p = 0.091, because the variation within the groups is large relative to the differences between the means, whereas linear trend estimation gives p = 0.012. However, had the data been collected at four time points in the same individuals, linear trend estimation would be inappropriate, and a two-way (repeated measures) ANOVA should be applied instead.
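One common way to test for a linear trend across ordered groups is a linear contrast with equally spaced coefficients. The sketch below applies the coefficients (−3, −1, 1, 3) to the four means in the table. The source does not state the group sizes or the exact test used, so the group size below is a purely hypothetical parameter and the result is not expected to reproduce the p-values quoted above; equal group sizes and a pooled variance are further assumptions.

```python
import math

# Means and standard deviations from the table above.
means = [1.6, 1.94, 2.22, 2.40]
sds = [0.56, 0.75, 0.66, 0.79]

# Linear contrast coefficients for four equally spaced groups.
COEFFICIENTS = [-3, -1, 1, 3]


def linear_contrast(group_means, coefficients=COEFFICIENTS):
    """Weighted sum of the group means; zero when there is no linear trend."""
    return sum(c * m for c, m in zip(coefficients, group_means))


def contrast_t_statistic(group_means, group_sds, n_per_group,
                         coefficients=COEFFICIENTS):
    """Contrast divided by its standard error.

    Assumes equal group sizes, so the pooled variance is the mean of the
    squared standard deviations.
    """
    pooled_variance = sum(sd ** 2 for sd in group_sds) / len(group_sds)
    standard_error = math.sqrt(
        pooled_variance * sum(c ** 2 for c in coefficients) / n_per_group)
    return linear_contrast(group_means, coefficients) / standard_error


contrast = linear_contrast(means)
print(contrast)

# HYPOTHETICAL group size, chosen only to make the example runnable.
n_per_group = 10
print(contrast_t_statistic(means, sds, n_per_group))
```

The resulting statistic would be compared with a Student's t distribution whose degrees of freedom equal the total number of subjects minus the number of groups.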

Notes

"Making Regression More Useful II: Dummies and Trends" (PDF). Retrieved June 17, 2012.
"The Royal Swedish Academy of Sciences" (PDF). 8 October 2003. Retrieved June 17, 2012.
"IPCC Third Assessment Report – Climate Change 2001 – Complete online versions". Archived from the original on November 20, 2009. Retrieved June 17, 2012.
Forecasting: principles and practice. 20 September 2014. Retrieved May 17, 2015.

References

Bianchi, M.; Boyle, M.; Hollingsworth, D. (1999). "A comparison of methods for trend estimation". Applied Economics Letters. 6 (2): 103–109. doi:10.1080/135048599353726.
Cameron, S. (2005). "Making Regression Analysis More Useful, II". Econometrics. Maidenhead: McGraw Hill Higher Education. pp. 171–198. ISBN 0077104285.
Chatfield, C. (1993). "Calculating Interval Forecasts". Journal of Business and Economic Statistics. 11 (2): 121–135. doi:10.1080/07350015.1993.10509938.
Ho-Trieu, N. L.; Tucker, J. (1990). "Another note on the use of a logarithmic time trend". Review of Marketing and Agricultural Economics. 58 (1): 89–90. doi:10.22004/ag.econ.12288.
Kungl. Vetenskapsakademien (2003). "Time-series econometrics: Cointegration and autoregressive conditional heteroskedasticity". Advanced Information on the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel. The Royal Swedish Academy of Sciences.
Arianos, S.; Carbone, A.; Turk, C. (2011). "Self-similarity of high-order moving averages". Physical Review E. 84 (4): 046113. Bibcode:2011PhRvE..84d6113A. doi:10.1103/physreve.84.046113. PMID 22181233.
Altman, D.G. (1991). Practical Statistics for Medical Research. London: Chapman and Hall. pp. 212–220. ISBN 041227630-5.
