Generalized Degrees of Freedom

Posted on May 03, 2021

Degrees of freedom have several different interpretations:

• the number of variables in the model, and hence a measure of model complexity in various model selection criteria (AIC, BIC, GCV)
• the cost of the estimation process, which can thus be used to obtain an unbiased estimate of the error variance
• the trace of the hat matrix, i.e., the sum of the sensitivities of each fitted value w.r.t. the corresponding observed value

For tree-based regression models,

• S-PLUS software uses the number of terminal nodes as the degrees of freedom for calculating the error variance (Venables and Ripley, 1994).
• Friedman (1991) and many of the discussants of the paper proposed various ad hoc definitions of degrees of freedom.
• Owen (1991) suggested that searching for a knot costs roughly 3 df.

A major difficulty in handling complex modeling procedures is that the fitted values are often complex, nondifferentiable, or even discontinuous functions of the observed values, such as

• linear model with variable selection: a small change in $Y$ may lead to a different selected model, resulting in a discontinuity in the fitted values
• tree-based model, which has discontinuous boundaries (?).
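The discontinuity caused by variable selection can be seen in a toy case. The sketch below is only an illustration under assumed settings: the helper `best_single_fit` and the two-point data are hypothetical and not taken from the paper. It selects the single best predictor out of two orthogonal ones and shows that a tiny change in $Y$ flips the selection and makes the fitted values jump.

```python
import numpy as np

def best_single_fit(X, y):
    """Regress y on each column of X separately and keep the column with the smallest RSS."""
    best_rss, best_fit, best_j = np.inf, None, -1
    for j in range(X.shape[1]):
        x = X[:, j]
        fit = x * (x @ y) / (x @ x)        # least-squares fit on column j alone
        rss = np.sum((y - fit) ** 2)
        if rss < best_rss:
            best_rss, best_fit, best_j = rss, fit, j
    return best_j, best_fit

X = np.array([[1.0, 0.0], [0.0, 1.0]])     # two orthogonal predictors (hypothetical)
y_a = np.array([1.0, 1.0 - 1e-6])          # second response slightly below the first
y_b = np.array([1.0, 1.0 + 1e-6])          # second response slightly above the first

j_a, fit_a = best_single_fit(X, y_a)
j_b, fit_b = best_single_fit(X, y_b)
jump = np.max(np.abs(fit_a - fit_b))       # size of the change in the fitted values
print(j_a, j_b, jump)
```

A perturbation of order $10^{-6}$ in one observation changes the selected variable, so the fitted values move by an amount of order 1.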

Main goal of the paper: develop a concept of generalized degrees of freedom (GDF) that is applicable for evaluation of the final models or fits.

• define GDF as the sum of the sensitivity of each fitted value to perturbations in the corresponding observed value
• nonasymptotic in nature and thus free of the sample-size constraint
• can be used as a measure of the complexity of a general modeling procedure
• also view the GDF as the cost of the modeling process, so that under suitable conditions, one can obtain an unbiased estimate of the error variance.

Difference between GDF and the traditional degrees of freedom,

• the GDF of a parameter may be substantially larger than 1
• no longer an exact correspondence between the degrees of freedom and the number of parameters
• GDF depends on both the modeling procedure and the underlying true model

In model selection, inferences about the selected model are often made from the same data that were used for selection, as if the selected model had been given a priori. The goodness-of-fit statistics obtained this way are often too optimistic.

The half-normal plot (HNP) is a graphical tool in the analysis of orthogonal experiments. It is used for selection of orthogonal variables while taking into account the selection effect. There is a connection between the proposed theory and the HNP.

Generalized Degrees of Freedom

Motivation from linear regression,

$\hat\mu = X(X'X)^{-1}X'Y \triangleq HY$

the df can be expressed as

$p = \tr(H) = \sum_i h_{ii} = \sum_i\frac{\partial \hat\mu_i}{\partial y_i}\,.$
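The identity between the trace of the hat matrix and the summed sensitivities is easy to check numerically. The following is a minimal sketch with a randomly generated design matrix (the sizes `n`, `p` and the step `eps` are arbitrary choices for illustration); it compares $\tr(H)$ with a finite-difference approximation of $\sum_i \partial\hat\mu_i/\partial y_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))                  # hypothetical full-rank design matrix
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X'X)^{-1} X'

y = rng.normal(size=n)
eps = 1e-6
sens = np.empty(n)
for i in range(n):
    y_pert = y.copy()
    y_pert[i] += eps                         # perturb only the i-th observation
    # finite-difference sensitivity of the i-th fitted value to y_i
    sens[i] = ((H @ y_pert)[i] - (H @ y)[i]) / eps

trace_H = np.trace(H)
sum_sens = sens.sum()
print(trace_H, sum_sens)                     # both should be close to p
```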

Define a modeling procedure $\cM: Y\rightarrow \hat\mu$ as a mapping from the observed values to the fitted values.

The GDF for a modeling procedure $\cM$ are given by

$D(\cM) = \sum_{i=1}^nh_i^\cM(\mu)=\frac{1}{\sigma^2}\sum_{i=1}^n\Cov(\hat\mu_i, y_i)\,,$

where

$h_i^\cM(\mu) = \frac{\partial E_\mu[\hat\mu_i(Y)]}{\partial \mu_i} = \lim_{\delta\rightarrow 0}E_\mu\left[\frac{\hat\mu_i(y+\delta e_i)-\hat\mu_i(y)}{\delta}\right] = \frac{1}{\sigma^2}\E[\hat\mu_i(y)(y_i-\mu_i)]\,.$

The last equality follows from Stein's lemma, which requires Gaussian errors.

The GDF measure the flexibility of the modeling procedure $\cM$. If $\cM$ is highly flexible, the fitted values tend to be close to the observed values, so the sensitivity of the fitted values to the observed values is high and the GDF are large.

• Efron (1986): expected optimism by using the covariance form
• Girard (1989): uses a randomized trace definition for degrees of freedom.
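When the true mean is known, as in a simulation, the covariance form of the definition can be estimated directly by Monte Carlo. The sketch below assumes Gaussian errors with known $\sigma$. The function `gdf_covariance` and the two toy procedures are hypothetical illustrations, not the algorithm from the paper. It compares ordinary least squares on $q$ orthonormal columns with a procedure that keeps only the best $k$ of them.

```python
import numpy as np

def gdf_covariance(fit, mu, sigma, n_rep=2000, seed=0):
    """Monte Carlo estimate of D(M) = sum_i Cov(mu_hat_i, y_i) / sigma^2.

    `fit` maps an observed vector y to fitted values mu_hat(y).
    The true mean `mu` and `sigma` must be known, so this only works in simulations.
    """
    rng = np.random.default_rng(seed)
    n = mu.shape[0]
    total = 0.0
    for _ in range(n_rep):
        y = mu + sigma * rng.normal(size=n)
        total += fit(y) @ (y - mu)           # sum_i mu_hat_i(y) * (y_i - mu_i)
    return total / (n_rep * sigma**2)

n, q, k = 40, 10, 1
# Orthonormal design: Q has q orthonormal columns (reduced QR of a random matrix)
Q, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(n, q)))

def ols_fit(y):
    return Q @ (Q.T @ y)                     # projection onto all q columns

def best_subset_fit(y):
    coef = Q.T @ y
    keep = np.argsort(np.abs(coef))[-k:]     # the k largest coefficients in absolute value
    return Q[:, keep] @ coef[keep]

mu = np.zeros(n)                             # pure-noise truth
gdf_ols = gdf_covariance(ols_fit, mu, sigma=1.0)
gdf_sel = gdf_covariance(best_subset_fit, mu, sigma=1.0)
print(gdf_ols, gdf_sel)
```

For the linear fit the estimate should be close to $q$, while the selected one-variable model should cost clearly more than one degree of freedom, since the search over variables is part of the procedure.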

Evaluations and Comparisons of Complex Modeling Procedures

Given a modeling procedure $\cM: Y\rightarrow \hat\mu$ and a statistic $$A(\cM, P) = \RSS(\cM) - n\sigma^2 + 2P\sigma^2\,,$$ where $P$ is a constant, the quantity $$E[(\hat\mu - \mu)'(\hat\mu-\mu) - A(\cM, P)]^2$$ is minimized when $P = D(\cM)$. Consequently, $A_e(\cM)\triangleq A(\cM, D(\cM))$ is an unbiased estimate of the quadratic loss $(\hat\mu - \mu)'(\hat\mu-\mu)$.

Similar to GCV, define

$\mathrm{GCV}(\cM_k) = \RSS(\cM_k) / (n-D(\cM_k))^2\,,$

which has the advantage of not requiring a known $\sigma^2$.

An estimate of $\sigma^2$:

$\def\cor{\mathrm{cor}}$ $s^2(\cor) = (y-\hat\mu)'(y-\hat\mu) / (n-D(\cM))$

A goodness-of-fit measure can be defined as

$R^2(\cor) = 1 - \frac{(y-\hat\mu)'(y-\hat\mu) / (n-D(\cM))}{y'y/n}$

$E(s^2(\cor)) = \sigma^2$ iff $$E(y-\hat\mu)'(\hat\mu-\mu)=0\,.$$

If there is a shrinkage effect in the modeling procedure, the estimated variance when GDF are used may underestimate the true variance.
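The quantities in this section are simple functions of the residual sum of squares and the GDF. The following sketch collects them in one hypothetical helper, `gdf_summaries` (the name and the dictionary layout are assumptions for illustration). It takes the GDF as given, and it uses the uncentered $y'y/n$ in $R^2$ exactly as written above.

```python
import numpy as np

def gdf_summaries(y, mu_hat, gdf, sigma2=None):
    """GDF-corrected summaries for observed y, fitted values mu_hat, and a given GDF."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    n = y.shape[0]
    rss = np.sum((y - mu_hat) ** 2)
    out = {
        "rss": rss,
        "gcv": rss / (n - gdf) ** 2,                      # GCV-type criterion
        "s2": rss / (n - gdf),                            # corrected variance estimate
        "r2": 1.0 - (rss / (n - gdf)) / (y @ y / n),      # corrected goodness of fit
    }
    if sigma2 is not None:
        # A(M, P) evaluated at P = D(M); needs a known error variance
        out["a_e"] = rss - n * sigma2 + 2.0 * gdf * sigma2
    return out

# Small hand-checkable example (made-up numbers)
res = gdf_summaries(y=[1, 2, 3, 4], mu_hat=[1.5, 1.5, 3.5, 3.5], gdf=2, sigma2=0.5)
print(res)
```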

Applications to Nonparametric Regression

Potential applications:

• EAIC or GCV can be used to compare the performance of a nonparametric regression to a linear model as a way of diagnosing the adequacy of the linear model.
• Selection of the most suitable nonparametric regression procedure among various alternatives, such as classification and regression trees (CART), projection pursuit regression (PPR), and artificial neural networks (ANN).
• Selection of variables in nonparametric regression settings by treating variable selection as a special case of selecting modeling procedures.
• Evaluation of the effect induced by various selection procedures, such as variable selection in linear models and bandwidth selection in nonparametric regression.

The experiments are similar to Ex. 9.5 of ESL.

Observations:

• The estimate $\hat\sigma_o^2$, which uses the number of terminal nodes as the degrees of freedom, is substantially biased downward, while the new variance estimate $\hat\sigma^2$ is almost unbiased.
• For the first several nodes, fitting a tree to a pure-noise dataset costs substantially more GDF than fitting to a dataset with a clear structure.
• If the fitted tree is large enough, the GDF is stable across simulations, with negligible variance.

Correcting Bias from Model Selection

When evaluating the goodness of fit of a model $M_k$ obtained from variable selection, $M_k$ is often assumed to be given a priori (??).

To correct for the bias, the selection process must be taken into account.

$\cM_k: Y\overset{\text{selection}}{\longrightarrow} M_k(Y)\overset{\text{fitting}}{\longrightarrow} \hat Y_{M_k}\,.$

The fitting method does not seem to be clearly specified, but the discussion presumably concerns best subset selection in linear regression. A recent related work is Ryan Tibshirani (2015).

Assume that $\beta=0$ and $n\ge q$. If $X$ is orthogonal, then $$D(k)=\sum_{i=1}^k\chi^2_{(i)}\,,$$ where $\chi^2_{(1)} > \cdots > \chi^2_{(q)}$ are the expected values of the order statistics of $q$ independent observations from a $\chi^2(1)$ distribution.
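The expected order statistics in this formula have no simple closed form, but they can be approximated by simulation. The sketch below is a Monte Carlo approximation under the stated assumptions ($\beta=0$, orthogonal $X$). The function name `expected_chi2_order_stats` and the replication count are hypothetical choices.

```python
import numpy as np

def expected_chi2_order_stats(q, n_rep=20000, seed=0):
    """Approximate E[chi2_(1)] > ... > E[chi2_(q)] for q independent chi2(1) draws."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(df=1, size=(n_rep, q))
    draws = np.sort(draws, axis=1)[:, ::-1]   # sort each replicate in descending order
    return draws.mean(axis=0)                 # average each rank over replicates

q = 10
expected = expected_chi2_order_stats(q)
D = np.cumsum(expected)                       # D[k-1] approximates D(k), k = 1, ..., q
for k in (1, 2, 5, q):
    print(k, round(D[k - 1], 2))
```

Selecting the single best of $q$ variables already costs several degrees of freedom, and $D(q)$ returns to roughly $q$ because no selection takes place when all variables are kept.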

GDF and HNP

The connection provides two important points:

• it establishes GDF and the associated framework as a formalization and extension of the HNP to very general settings.
• the widespread use of HNP in orthogonal experiments provides a successful case for GDF.
