
Jackknife variance estimates for random forest


In statistics, jackknife variance estimates for random forest are a way to estimate the variance of random forest predictions while eliminating the Monte Carlo effects of the bootstrap.

Jackknife variance estimates

The sampling variance of bagged learners is:

{\displaystyle V(x)=\operatorname {Var} [{\hat {\theta }}(x)]}

Jackknife estimates can be used to eliminate these bootstrap effects. The jackknife variance estimator is defined as:

{\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\hat {\theta }}_{(-i)}-{\overline {\theta }})^{2}}
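As a concrete illustration, here is a minimal NumPy sketch of this delete-one computation (the function name and the choice of the sample mean as the statistic are only for illustration):

import numpy as np

def jackknife_variance(x, estimator=np.mean):
    """Delete-one jackknife variance of estimator(x)."""
    n = len(x)
    # theta_(-i): the statistic recomputed without the i-th observation
    theta_loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    theta_bar = theta_loo.mean()
    return (n - 1) / n * np.sum((theta_loo - theta_bar) ** 2)

For the sample mean, this reduces to the familiar estimate s^2/n of the variance of the mean.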

In classification problems where a random forest is used to fit the model, the jackknife estimate of the variance of its prediction at a point {\displaystyle x} is defined as:

{\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\overline {t}}_{(-i)}^{\star }(x)-{\overline {t}}^{\star }(x))^{2}}

Here, {\displaystyle t^{\star }} denotes a single decision tree after training, and {\displaystyle {\overline {t}}_{(-i)}^{\star }(x)} denotes the average prediction at {\displaystyle x} of the trees grown on bootstrap samples that do not contain the {\displaystyle i}th observation.
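A sketch of this computation for a hand-rolled bagged ensemble of scikit-learn regression trees follows. Bootstrap membership is tracked explicitly so that the leave-one-out averages can be formed from the trees whose bootstrap samples exclude observation i; the function and variable names are illustrative, not from any library:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rf_jackknife_variance(X, y, x0, B=1000, seed=0):
    """Jackknife-after-bootstrap variance of a bagged ensemble's
    prediction at a single test point x0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty(B)              # t*_b(x0) for each tree
    in_bag = np.zeros((B, n), bool)  # which observations tree b was trained on
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the data
        tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
        preds[b] = tree.predict(x0.reshape(1, -1))[0]
        in_bag[b, idx] = True
    t_bar = preds.mean()
    # t_bar_(-i)(x0): average over the trees whose bootstrap sample happened
    # to exclude observation i (about 37% of the trees, so B must be large)
    t_loo = np.array([preds[~in_bag[:, i]].mean() for i in range(n)])
    return (n - 1) / n * np.sum((t_loo - t_bar) ** 2)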

Examples

The e-mail spam problem is a common classification problem: 57 features are used to classify e-mail as spam or non-spam. The IJ-U variance formula can be applied to evaluate the accuracy of random forests with m = 5, 19, and 57, where m is the number of features randomly chosen at each split. The results reported in the paper "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" show that the m = 57 random forest appears to be quite unstable, while predictions made by the m = 5 random forest appear to be quite stable. This agrees with the evaluation by error rate, in which the accuracy of the model with m = 5 is high and that of the model with m = 57 is low.

Here, accuracy is measured by the error rate, which is defined as:

{\displaystyle \mathrm {ErrorRate} ={\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij}\,\mathbf {1} \{{\hat {y}}_{i}\neq j\},}

Here N is the number of samples, M is the number of classes, {\displaystyle y_{ij}} is the indicator that equals 1 when the {\displaystyle i}th observation is in class {\displaystyle j} and 0 otherwise, and {\displaystyle {\hat {y}}_{i}} is the class predicted for the {\displaystyle i}th observation. No predicted probabilities are involved here. Another, closely related way to measure accuracy is the log loss:

{\displaystyle \mathrm {logloss} =-{\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij}\log(p_{ij})}

Here N is again the number of samples, M is the number of classes, {\displaystyle y_{ij}} is the same indicator as above, and {\displaystyle p_{ij}} is the predicted probability that the {\displaystyle i}th observation is in class {\displaystyle j}. This metric is used, for example, in the Kaggle Otto Group Product Classification Challenge. The two measures are closely related.
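Both measures are straightforward to compute from a matrix of one-hot true labels and a matrix of predicted probabilities; a NumPy sketch (argument names and shapes are assumptions for illustration):

import numpy as np

def error_rate(y_true, y_pred_class):
    """Fraction of observations assigned to the wrong class.
    y_true: (N, M) one-hot indicators; y_pred_class: (N,) predicted class indices."""
    return np.mean(y_true.argmax(axis=1) != y_pred_class)

def logloss(y_true, p, eps=1e-15):
    """Multiclass log loss. Probabilities are clipped away from 0 and 1,
    as Kaggle's scorers typically do, to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))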

Modification for bias

When Monte Carlo estimates {\displaystyle {\hat {V}}_{IJ}^{B}} and {\displaystyle {\hat {V}}_{J}^{B}}, based on a finite number of trees B, are used to approximate {\displaystyle V_{IJ}^{\infty }} and {\displaystyle V_{J}^{\infty }}, a Monte Carlo bias has to be taken into account; it becomes large when n is large relative to B:

{\displaystyle E[{\hat {V}}_{IJ}^{B}]-{\hat {V}}_{IJ}^{\infty }\approx {\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B^{2}}}}

To eliminate this influence, bias-corrected modifications are suggested:

{\displaystyle {\hat {V}}_{IJ-U}^{B}={\hat {V}}_{IJ}^{B}-{\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B^{2}}}}
{\displaystyle {\hat {V}}_{J-U}^{B}={\hat {V}}_{J}^{B}-(e-1){\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B^{2}}}}
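The article does not write out {\displaystyle {\hat {V}}_{IJ}^{B}} itself; the sketch below uses the covariance form from the Wager–Hastie–Efron paper and applies both corrections. It assumes the per-tree predictions and the bootstrap count matrix were recorded while growing the forest; all names are illustrative:

import numpy as np

def v_ij_u(preds, counts):
    """Bias-corrected infinitesimal jackknife V_IJ-U.
    preds:  (B,) per-tree predictions t*_b(x)
    counts: (B, n) N_bi = times observation i appears in tree b's bootstrap sample"""
    B, n = counts.shape
    t_c = preds - preds.mean()
    # Cov_b(N_bi, t*_b) for each training observation i
    cov = (counts - counts.mean(axis=0)).T @ t_c / B
    return np.sum(cov ** 2) - n * np.sum(t_c ** 2) / B ** 2

def v_j_u(preds, counts):
    """Bias-corrected jackknife-after-bootstrap V_J-U."""
    B, n = counts.shape
    t_c = preds - preds.mean()
    out_of_bag = counts == 0           # tree b never saw observation i
    t_loo = np.array([preds[out_of_bag[:, i]].mean() for i in range(n)])
    v_j = (n - 1) / n * np.sum((t_loo - preds.mean()) ** 2)
    return v_j - (np.e - 1) * n * np.sum(t_c ** 2) / B ** 2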

References

  1. Wager, Stefan; Hastie, Trevor; Efron, Bradley (2014-05-14). "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife". Journal of Machine Learning Research. 15 (1): 1625–1651. arXiv:1311.4555. Bibcode:2013arXiv1311.4555W. PMC 4286302. PMID 25580094.
  2. "Otto Group Product Classification Challenge". Kaggle.