Select Prior by Formal Rules

Posted on Mar 04, 2019 (Update: Mar 05, 2019)
Tags: Prior, Jeffreys

Larry Wasserman wrote that “Noninformative priors are a lost cause” in his post, LOST CAUSES IN STATISTICS II: Noninformative Priors, where he mentioned his review paper on noninformative priors, Kass and Wasserman (1996). This note summarizes that paper.

The paper reviews the plethora of techniques for constructing so-called “noninformative” priors and discusses some of the practical and philosophical issues that arise when they are used.

Introduction

determination of priors

• personal prior “elicitation”: has not been extensively developed and has had relatively little impact on statistical inference.
• find structural rules that determine priors: various schemes have been investigated, and these are the topic of this review paper.

Jeffreys

1. the fundamental ideas and methods originate with Jeffreys.
2. Jeffreys’ viewpoint evolved toward seeing priors as chosen by convention, rather than as unique representations of ignorance.
3. various different arguments lead to the priors suggested by Jeffreys or to modified versions of Jeffreys’ priors

Issues

1. many of these issues are raised only when the priors derived by formal rules are improper.
2. the paper argues that impropriety per se is not the practically important source of difficulties. When improper priors lead to badly behaved posteriors, it is a signal that the problem itself may be hard; in this situation diffuse proper priors are likely to lead to similar difficulties.
3. the paper claims that reference priors are primarily useful when the data analyst is unsure whether a sample is “large”.
4. some important outstanding problems are highlighted.

A Concrete Example

Consider the multivariate normal distribution with mean $\mu$ and variance matrix $\Sigma$, where $\mu$ and $\Sigma$ may depend on some lower-dimensional parameter vector $\theta$. With $\mu=\mu(\theta)$ and $\Sigma=\sigma^2\I$ (a multiple of the identity) we obtain the standard nonlinear regression models, and the structure $\Sigma=\Sigma(\theta)$ includes “components of variance”, hierarchical, and time series models.

Jeffreys’ Method

The concept of selecting a prior by convention, as a “standard of reference” analogous to choosing a standard of reference in other scientific settings, is due to Jeffreys.

Philosophy

Jeffreys believed in the existence of states of ignorance and subscribed to the “principle of insufficient reason”. This principle requires the distribution on finitely many events to be uniform unless there is some definite reason to consider one event more probable than another; the contentious point is whether it is meaningful to speak of a “definite reason” that does not involve subjective judgment. In his reliance on convention, however, he allowed ignorance to remain a vague concept, that is, one that may be made definite in many ways, rather than requiring a unique definition.

Jeffreys can be labeled a “necessarist” (someone who holds that there is one and only one opinion justified by any body of evidence, so that probability is an objective logical relationship between an event A and the evidence B):

• This was Jeffreys’s viewpoint in the early versions of his book.
• A test for adherence to the necessarist viewpoint is whether, in the case of finitely many events, a uniform distribution is advocated.
• He believed in the existence of an “initial” stage of knowledge, and thought it was important to be able to make inferences based on data collected at this stage

Jeffreys did not insist on unique representations of ignorance.

Rules for Priors in Problems of Estimation

Jeffreys considered several scenarios in formulating his rules, and treated each separately.

• the case of a finite parameter space: he adhered to the principle of insufficient reason in advocating the assignment of equal probabilities to each of the parameter values.
• the case of a bounded interval or the interval $(-\infty, \infty)$: Jeffreys took the prior density to be constant. In the second case this of course entails that the prior is improper, that is, it does not integrate to a finite value.
• the case of the interval $(0,\infty)$, most commonly associated with an unknown standard deviation $\sigma$: he used the prior $\pi_\sigma(\sigma)=1/\sigma$. His chief justification for this choice was its invariance under power transformations of the parameter: if $\gamma=\sigma^2$ and the change-of-variables formula is applied to $\pi_\sigma$, then one obtains $\pi_\gamma(\gamma)=\pi_\sigma(\sigma)\,|d\sigma/d\gamma|=1/(2\gamma)\propto 1/\gamma$; thus applications of the rule to $\sigma$ and $\gamma$ lead to the same formal prior.
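The invariance of $1/\sigma$ under power transformations can be checked numerically. The sketch below pushes the prior through $\gamma=\sigma^k$ with the change-of-variables formula and looks at $\gamma\,\pi_\gamma(\gamma)$, which is constant exactly when $\pi_\gamma\propto 1/\gamma$. The function names and the finite-difference step are illustrative choices, not taken from the paper.

```python
def prior_sigma(sigma):
    """Jeffreys's prior for a scale parameter, pi(sigma) = 1/sigma (unnormalized)."""
    return 1.0 / sigma


def induced_prior_gamma(gamma, power, eps=1e-6):
    """Density induced on gamma = sigma**power by the change-of-variables formula."""
    def sigma_of(g):
        return g ** (1.0 / power)

    # |d sigma / d gamma| approximated by a central finite difference
    jacobian = abs(sigma_of(gamma + eps) - sigma_of(gamma - eps)) / (2 * eps)
    return prior_sigma(sigma_of(gamma)) * jacobian


def scaled_density(gamma, power):
    """gamma * pi_gamma(gamma); constant in gamma iff pi_gamma is proportional to 1/gamma."""
    return gamma * induced_prior_gamma(gamma, power)
```

For `power=2` the product equals $1/2$ at every $\gamma$, so the induced prior is $1/(2\gamma)\propto 1/\gamma$.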

In a 1946 paper, Jeffreys proposed his “general rule”, which takes the prior to be

$\pi_\theta(\theta)\propto \det(\I(\theta))^{1/2}\,.$

This rule has the invariance property that for any other parameterization $\gamma$ for which it is applicable,

$\pi_\theta(\theta) = \pi_\gamma(\gamma(\theta))\cdot\left|\det\left(\frac{\partial \gamma}{\partial \theta}\right)\right|\,.$

The invariance can be illustrated as follows:

For the Fisher information of a scalar parameter,

$\tilde I(\xi) = [h'(\xi)]^2I(\theta)$

where $\theta=h(\xi)$. Similarly, for the information matrix of a vector parameter,

$\tilde I(\xi) = [Dh(\xi)]'I(\theta)[Dh(\xi)]\,,$

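where $[Dh(\xi)]_{ij}=\partial h_i(\xi)/\partial \xi_j$. Taking determinants and square roots gives $\det(\tilde I(\xi))^{1/2}=|\det Dh(\xi)|\,\det(I(\theta))^{1/2}$, which is exactly the change-of-variables formula for a density.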
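As a small illustration of the scalar transformation rule, the sketch below uses a Bernoulli model (an example chosen here, not one from the paper) with $\theta=h(\xi)$ the inverse logit. It computes the Fisher information in $\xi$ directly from the definition and lets it be compared with $[h'(\xi)]^2I(\theta)$. All function names are hypothetical.

```python
import math


def h(xi):
    """theta = h(xi): inverse logit."""
    return 1.0 / (1.0 + math.exp(-xi))


def h_prime(xi):
    """Derivative of the inverse logit, h'(xi) = theta (1 - theta)."""
    t = h(xi)
    return t * (1.0 - t)


def fisher_info_theta(theta):
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def fisher_info_xi(xi, eps=1e-5):
    """Fisher information in xi, computed as E[(d/dxi log p(x | xi))^2]."""
    def log_p(x, z):
        t = h(z)
        return math.log(t) if x == 1 else math.log(1.0 - t)

    total = 0.0
    for x in (0, 1):
        # score by central finite difference, weighted by p(x | xi)
        score = (log_p(x, xi + eps) - log_p(x, xi - eps)) / (2 * eps)
        total += math.exp(log_p(x, xi)) * score ** 2
    return total
```

Evaluating `fisher_info_xi(xi)` and `h_prime(xi) ** 2 * fisher_info_theta(h(xi))` at the same `xi` should give matching values, so the square roots (the Jeffreys densities) differ by exactly the Jacobian factor $|h'(\xi)|$.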

Jeffreys noted that this rule may conflict with the rules previously stated. For example, in the case of data that follow a $N(\mu,\sigma^2)$ distribution, the previous rules give $\pi(\mu,\sigma)\propto 1/\sigma$, whereas the general rule gives $\pi(\mu,\sigma)\propto 1/\sigma^2$. He solved this problem by stating that $\mu$ and $\sigma$ should be judged independent a priori and so should be treated separately: when the general rule is applied while holding $\sigma$ fixed, it gives the uniform prior on $\mu$, and when it is applied while holding $\mu$ fixed, it gives the prior $\pi(\sigma)\propto 1/\sigma$. Together these lead back to the more desirable $\pi(\mu,\sigma)\propto 1/\sigma$.
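The conflict can be seen numerically. The following sketch estimates the Fisher information matrix of $N(\mu,\sigma^2)$ in the parameterization $(\mu,\sigma)$ by integrating products of finite-difference scores, then applies the general rule. The grid size, integration width, and function names are assumptions made for illustration.

```python
import math


def log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return (-math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - (x - mu) ** 2 / (2 * sigma ** 2))


def fisher_information(mu, sigma, n_grid=4000, width=10.0, eps=1e-5):
    """2x2 Fisher information in (mu, sigma), by midpoint-rule integration."""
    lo = mu - width * sigma
    step = 2 * width * sigma / n_grid
    info = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n_grid):
        x = lo + (k + 0.5) * step
        # scores via central finite differences of the log density
        d_mu = (log_pdf(x, mu + eps, sigma) - log_pdf(x, mu - eps, sigma)) / (2 * eps)
        d_sigma = (log_pdf(x, mu, sigma + eps) - log_pdf(x, mu, sigma - eps)) / (2 * eps)
        weight = math.exp(log_pdf(x, mu, sigma)) * step
        score = (d_mu, d_sigma)
        for i in range(2):
            for j in range(2):
                info[i][j] += weight * score[i] * score[j]
    return info


def jeffreys_general_rule(mu, sigma):
    """Square root of the determinant of the Fisher information matrix."""
    info = fisher_information(mu, sigma)
    return math.sqrt(info[0][0] * info[1][1] - info[0][1] * info[1][0])


def jeffreys_sigma_only(mu, sigma):
    """General rule applied to sigma alone, holding mu fixed."""
    return math.sqrt(fisher_information(mu, sigma)[1][1])
```

Here `jeffreys_general_rule(mu, sigma) * sigma**2` stays at $\sqrt{2}$ for every $\sigma$ (the joint rule gives $1/\sigma^2$), while `jeffreys_sigma_only(mu, sigma) * sigma` stays at $\sqrt{2}$ (the rule with $\mu$ fixed gives $1/\sigma$).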

Jeffreys suggested this modification for general location-scale problems. If there are location parameters $\mu_1,\ldots,\mu_k$ and an additional multidimensional parameter $\theta$, then the prior he recommended becomes

$\pi(\mu_1,\ldots,\mu_k,\theta)\propto \det(\I(\theta))^{1/2}\,,$

where $\I(\theta)$ is calculated holding $\mu_1,\ldots,\mu_k$ fixed.

Bayes Factors

Jeffreys emphasized the distinction between problems of estimation and problems of testing.

Suppose that $Y=(Y_1,\ldots,Y_n)$ follows a distribution in a family parameterized by $(\beta,\psi)$ having a density $p(y\mid \beta,\psi)$. To test $H_0:\psi=\psi_0$ versus $H_A:\psi\in \Psi$, Jeffreys’s method is based on what is now usually called the “Bayes factor”,

$B = \frac{\int p(y\mid \beta,\psi_0)\pi_0(\beta)d\beta}{\int\int p(y\mid \beta,\psi)\pi(\beta,\psi)d\beta d\psi}$

where $\pi_0(\beta)$ and $\pi(\beta,\psi)$ are priors under $H_0$ and $H_A$.
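A minimal sketch of a Bayes factor in the simplest setting: $Y_i\sim N(\psi,1)$ with no nuisance parameter $\beta$, $H_0:\psi=0$, and a proper $N(0,\tau^2)$ prior on $\psi$ under $H_A$. These modelling choices and the function names are assumptions made for illustration; they are not the priors Jeffreys recommended. Both marginal likelihoods depend on the data only through the sample mean, so $B$ reduces to a ratio of two normal densities.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def bayes_factor_normal_mean(ybar, n, tau2=1.0):
    """B = p(y | H0) / p(y | HA) for Y_i ~ N(psi, 1), H0: psi = 0, HA: psi ~ N(0, tau2).

    Under H0 the sample mean is N(0, 1/n); under HA, after integrating psi out,
    it is N(0, tau2 + 1/n). The remaining factors of the likelihood cancel.
    """
    return normal_pdf(ybar, 0.0, 1.0 / n) / normal_pdf(ybar, 0.0, tau2 + 1.0 / n)
```

With `n=25`, a sample mean of exactly zero gives $B=\sqrt{1+n\tau^2}=\sqrt{26}$ in favour of $H_0$, while a sample mean of 1 gives $B$ far below 1.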

Methods for Constructing Reference Priors

Sometimes the parameter $\theta$ can be written in the form $\theta=(\omega,\lambda)$, where $\omega$ is a parameter of interest and $\lambda$ is a nuisance parameter. In this case reference priors that are considered satisfactory for making inference about $\theta$ may not be satisfactory for making inferences about $\omega$.

Laplace and the Principle of Insufficient Reason

If the parameter space is finite, Laplace’s rule, or the principle of insufficient reason, is to use a uniform prior that assigns equal probability to each point in the parameter space.

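This rule is appealing but is subject to a partitioning paradox: the uniform assignment depends on how the possibilities are enumerated, so splitting one outcome into several sub-outcomes changes the probabilities assigned to the others.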

A natural generalization is to apply the principle of insufficient reason when the parameter space is continuous, and thereby obtain a flat prior (a prior that is equal to a positive constant). A problem with this rule is that it is not parameterization invariant.
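The lack of invariance is easy to see in a toy case. Assume, purely for illustration, that $\theta$ is given a flat prior on $(0,1)$ and consider the reparameterization $\phi=\theta^2$. The sketch below gives the induced distribution of $\phi$, which is not flat.

```python
import math


def prob_phi_below(c):
    """P(phi < c) for phi = theta**2 when theta is uniform on (0, 1)."""
    return math.sqrt(c)  # equals P(theta < sqrt(c))


def induced_density_phi(phi):
    """Density of phi = theta**2 induced by the flat prior on theta."""
    return 1.0 / (2.0 * math.sqrt(phi))
```

A flat prior on $\phi$ itself would give $P(\phi<1/4)=1/4$, whereas the flat prior on $\theta$ implies $P(\phi<1/4)=1/2$, so "flat" means different things in the two parameterizations.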

Invariance

The simplest example of invariance involves the permutation group on a finite set. It is clear that the uniform probability distribution is the only distribution that is invariant under permutations of a finite set.

Suppose $X$ has a $N(\theta,1)$ distribution with a prior $\pi_1(\theta)$. Let $Y=X+a$, with $a$ being a fixed constant. Then $Y$ has a $N(\phi,1)$ distribution, where $\phi=\theta+a$; let the prior on $\phi$ be $\pi_2(\phi)$. Since we are dealing with the same formal model, we require that $\pi_1=\pi_2$. On the other hand, because $\phi=\theta+a$, $\pi_1$ and $\pi_2$ can be related by the usual change-of-variables formula. The relationships between $\pi_1$ and $\pi_2$ should hold for every $a$, and this implies that they must both be uniform distributions.

The normal location model can be re-expressed in terms of group invariance. Each real number $a$ determines a transformation $h_a:\bbR\rightarrow\bbR$ defined by $h_a(x)=x+a$. The set of all such transformations $H=\{h_a:a\in \bbR\}$ forms a group if we define $h_ah_b=h_{a+b}$. The model is invariant under the action of the group, because $X\sim N(\theta,1)$ and $Y=h_a(X)$ imply that $Y\sim N(h_a(\theta), 1)$. The uniform prior $\mu$ is the unique prior (unique up to a multiplicative constant) that is invariant under the action of the group; that is, $\mu(h_aA)=\mu(A)$ for every $A$ and every $a$, where $h_aA=\{h_a(\theta);\theta\in A\}$.

Now suppose that $X\sim N(\theta,\sigma^2)$. Let $H=\{h_{a,b};a\in\bbR,b\in\bbR^+\}$, where $h_{a,b}:\bbR\rightarrow \bbR$ is defined by $h_{a,b}(x)=a+bx$. Again, $H$ is a group under composition, with $h_{a,b}h_{c,d}=h_{a+bc,bd}$ because $h_{a,b}(h_{c,d}(x))=a+b(c+dx)$. Define another group $G=\{g_{a,b};a\in\bbR,b\in\bbR^+\}$, where $g_{a,b}:\bbR\times \bbR^+\rightarrow \bbR\times\bbR^+$ is defined by $g_{a,b}(\theta,\sigma)=(a+b\theta,b\sigma)$. The prior $P$ that is invariant to left multiplication, that is, $P(g_{a,b}A)=P(A)$ for all $A$ and all $(a,b)\in \bbR\times \bbR^+$, has density $\pi(\theta,\sigma)\propto 1/\sigma^2$. Jeffreys preferred the prior $Q$ that is invariant to right multiplication, meaning that $Q(Ag_{a,b})=Q(A)$ for all $A$ and all $(a,b)\in \bbR\times \bbR^+$; it has density $\pi(\theta,\sigma)\propto 1/\sigma$. The priors $P$ and $Q$ are called the left Haar measure and the right Haar measure, respectively.

Generally, right Haar measure is preferred in practice.
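The two invariance properties can be checked numerically on a rectangle $A$ in the $(\theta,\sigma)$ half-plane. The sketch below identifies the point $(\theta,\sigma)$ with the group element $g_{\theta,\sigma}$, so that left multiplication by $g_{a,b}$ sends it to $(a+b\theta,b\sigma)$ and right multiplication sends it to $(\theta+\sigma a,\sigma b)$. The measure of the transformed set is computed by a change of variables with a midpoint rule. The helper names, the rectangle, and the grid size are illustrative assumptions.

```python
def measure(density, transform, jacobian, box, n=200):
    """Integral of `density` over transform(box), via change of variables."""
    (t0, t1), (s0, s1) = box
    dt, ds = (t1 - t0) / n, (s1 - s0) / n
    total = 0.0
    for i in range(n):
        theta = t0 + (i + 0.5) * dt
        for j in range(n):
            sigma = s0 + (j + 0.5) * ds
            new_theta, new_sigma = transform(theta, sigma)
            total += density(new_theta, new_sigma) * jacobian * dt * ds
    return total


def left_haar(theta, sigma):
    return 1.0 / sigma ** 2


def right_haar(theta, sigma):
    return 1.0 / sigma


def identity(theta, sigma):
    return theta, sigma


def left_action(a, b):
    """(theta, sigma) -> (a + b*theta, b*sigma); Jacobian determinant b**2."""
    return (lambda theta, sigma: (a + b * theta, b * sigma)), b * b


def right_action(a, b):
    """(theta, sigma) -> (theta + sigma*a, sigma*b); Jacobian determinant b."""
    return (lambda theta, sigma: (theta + sigma * a, sigma * b)), b
```

Passing the pair returned by `left_action(a, b)` to `measure` with `left_haar` reproduces the measure of the original box, and likewise for `right_action` with `right_haar`. Swapping them rescales the measure by a factor depending on `b`, which is why the two Haar measures differ for this group.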

Data-Translated Likelihoods

Let $\y$ be a vector of observations and let $L_\y(\cdot)$ be a likelihood function on a real one-dimensional parameter space $\Phi$. The likelihood function is data-translated if it may be written in the form

$L_\y(\phi) = f\{\phi-t(\y)\}$

for some real-valued functions $f(\cdot)$ and $t(\cdot)$, with the definition of $f(\cdot)$ not depending on $\y$. When this condition is satisfied, Box and Tiao recommended using the uniform prior on $\Phi$, because two different samples $\y$ and $\y^*$ will then produce posteriors that differ only with respect to location.

Kass showed that in one dimension, likelihoods become approximately data translated to order $O(n^{-1})$, which is stronger than the order $O(n^{-1/2})$ implied by the data translatedness of the limiting normal distributions.

For the multidimensional case: Likelihoods may be considered approximately data translated along information-metric geodesics in any given direction, but it generally is not possible to find a parameterization in which they become jointly approximately data translated.

Maximum Entropy

If $\Theta = \{\theta_1,\ldots,\theta_n\}$ is finite and $\pi$ is a probability mass function on $\Theta$, then the entropy of $\pi$, which is meant to capture the amount of uncertainty implied by $\pi$, is defined by $\calE(\pi)=-\sum_{i=1}^n \pi(\theta_i)\log \pi(\theta_i)$.

Priors with larger entropy are regarded as being less informative, and the method of maximum entropy is to select the prior that maximizes $\calE(\pi)$.

If no further constraints are imposed on the problem, then the prior with maximum entropy is the uniform prior.

Suppose now that partial information is available in the form of specified expectations for a set of random variables, $\{\E(X_1)=m_1,\ldots,\E(X_r)=m_r\}$. Maximum entropy prescribes choosing the prior that maximizes entropy subject to the given moment constraints. The solution is the prior

$\pi(\theta_i) \propto \exp\left\{\sum_j\lambda_jX_j(\theta_i)\right\}\,,$

where the constants $\lambda_j$ are determined by the constraints.
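As a sketch of how the constrained solution is computed, the code below handles a single moment constraint on a finite set. The maximum entropy solution has the exponential form $\pi(\theta_i)\propto\exp\{\lambda X(\theta_i)\}$, and $\lambda$ is found by bisection, using the fact that the implied mean increases with $\lambda$. The die-like example values, the search bracket, and the function name are assumptions for illustration.

```python
import math


def maxent_pmf(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy pmf on `values` subject to E[X] = target_mean."""
    def pmf(lam):
        shift = max(lam * v for v in values)  # subtract the max to avoid overflow
        weights = [math.exp(lam * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * v for p, v in zip(pmf(lam), values))

    # bisection on lambda: mean(lam) is increasing in lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return pmf(0.5 * (lo + hi))
```

With `values = [1, 2, 3, 4, 5, 6]` and `target_mean = 3.5` the result is the uniform distribution (no information beyond the default), while `target_mean = 4.5` tilts the probabilities toward the larger values.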

Main points of maximum entropy:

1. there is a conflict between the maximum entropy paradigm and Bayesian updating.
2. it is also subject to the same partitioning paradox that afflicts the principle of insufficient reason.

The Berger-Bernardo Method

two innovations:

1. define a notion of missing information
2. develop a stepwise procedure for handling nuisance parameters.

Missing Information

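Let $K_n(\pi(\theta\mid x_1^n), \pi(\theta))$ denote the Kullback-Leibler divergence between the posterior and the prior after observing $x_1^n=(x_1,\ldots,x_n)$, and let

$K_n^\pi = \E\{K_n(\pi(\theta\mid x_1^n), \pi(\theta))\}$

be the expected gain in information, where the expectation is with respect to the marginal density of $x_1^n$. Bernardo’s idea was to think of $K_n^\pi$ for large $n$ as a measure of the missing information in the experiment. He suggested finding the prior that maximizes $K_\infty^\pi = \lim_{n\rightarrow\infty}K_n^\pi$.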
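To make the expected gain in information concrete, the sketch below computes $K_n^\pi$ for $n$ Bernoulli trials when $\theta$ is restricted to a finite grid. The discretization is an assumption made here so that the integrals become sums; it is not how the reference prior is derived in the paper. The data enter only through the number of successes, which is sufficient.

```python
import math


def expected_information_gain(thetas, prior, n):
    """Expected KL divergence from prior to posterior for n Bernoulli trials."""
    gain = 0.0
    for x in range(n + 1):  # x = number of successes
        lik = [math.comb(n, x) * t ** x * (1 - t) ** (n - x) for t in thetas]
        marginal = sum(l * p for l, p in zip(lik, prior))
        for l, p in zip(lik, prior):
            post = l * p / marginal
            if post > 0:
                # contribution m(x) * pi(theta | x) * log(pi(theta | x) / pi(theta))
                gain += marginal * post * math.log(post / p)
    return gain
```

On a grid such as `thetas = [0.1, 0.2, ..., 0.9]` with a uniform prior, the gain grows with `n` and is bounded above by the entropy of the prior, $\log 9$, which is the most that can be learned about a nine-point parameter.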

Nuisance Parameters

1. define $\pi(\lambda\mid \omega)$ to be the Berger-Bernardo prior for $\lambda$ with $\omega$ fixed.
2. take $\pi(\omega)$ to be the Berger-Bernardo prior based on the marginal model $p(x\mid \omega)=\int p(x\mid\omega,\lambda)\pi(\lambda\mid\omega)d\lambda$.
3. the recommended prior is then $\pi(\omega)\pi(\lambda\mid \omega)$.

Other priors based on Bernardo’s missing information idea have also been proposed.

Geometry

Jeffreys noted that the Kullback-Leibler number behaves like the square of a distance function determined by a Riemannian metric; the natural volume element of this metric is $\det(\I(\theta))^{1/2}$, and natural volume elements of Riemannian metrics are automatically invariant to reparameterization.

Coverage Matching Methods

One way to try to characterize “noninformative” priors is through the notion that they ought to “let the data speak for themselves”. It may be considered desirable to have posterior probabilities agree with sampling probabilities.

Suppose that $\theta$ is a scalar parameter and $l(x)$ and $u(x)$ satisfy $\Pr(l(x)\le\theta\le u(x)\mid x)=1-\alpha$, so that $A_x=[l(x), u(x)]$ is a set with posterior probability content $1-\alpha$. In general, the frequentist coverage probability of $A_x$ will not be $1-\alpha$. But there are some examples where coverage and posterior probability do agree. For example, if $X_1,\ldots,X_n\sim N(\theta,1)$ with sample mean $\bar x$ and $\theta$ is given a uniform prior, then $A_x=[\bar x-n^{-1/2}z_{\alpha/2}, \bar x+n^{-1/2}z_{\alpha/2}]$ has posterior probability $1-\alpha$ and also has coverage $1-\alpha$.
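The agreement in the normal example can be checked by simulation. Under the uniform prior the posterior is $N(\bar x, 1/n)$, so the interval below is the $1-\alpha$ posterior interval; the sketch estimates how often it covers a fixed true $\theta$ over repeated samples. The seed, number of replications, and function name are arbitrary choices.

```python
import random
from statistics import NormalDist


def coverage(theta, n, alpha=0.05, reps=4000, seed=0):
    """Monte Carlo estimate of the frequentist coverage of the flat-prior interval."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    half_width = z / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps
```

For any `theta` and `n`, the estimate should be close to `1 - alpha`, up to Monte Carlo error.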

Zellner’s method

Let $Z(\theta)=\int p(x\mid \theta)\log p(x\mid \theta)dx$ be the information about $X$ in the sampling density (the negative of its entropy). Zellner and Min (1993) suggested choosing the prior $\pi$ that maximizes the difference

$G = \int Z(\theta)\pi(\theta)d\theta -\int \pi(\theta)\log(\pi(\theta))d\theta.$

The solution is $\pi(\theta)\propto \exp\{Z(\theta)\}$, which is called the maximal data information prior (MDIP).

Zellner’s method is not parameterization invariant, but the invariance under specific classes of reparameterizations can be obtained by adding the appropriate constraints.
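A small numerical sketch of the MDIP for a normal scale parameter, taking $Z(\sigma)=\int p\log p\,dx$ (the negative entropy of the sampling density) for $N(0,\sigma^2)$. The integral is approximated with a midpoint rule; the grid settings and function names are assumptions for illustration.

```python
import math


def negative_entropy_normal(sigma, n_grid=4000, width=12.0):
    """Z(sigma) = integral of p(x) log p(x) dx for N(0, sigma^2), midpoint rule."""
    step = 2 * width * sigma / n_grid
    total = 0.0
    for k in range(n_grid):
        x = -width * sigma + (k + 0.5) * step
        log_p = (-math.log(sigma) - 0.5 * math.log(2 * math.pi)
                 - x * x / (2 * sigma ** 2))
        total += math.exp(log_p) * log_p * step
    return total


def mdip_unnormalized(sigma):
    """Unnormalized MDIP density exp{Z(sigma)}."""
    return math.exp(negative_entropy_normal(sigma))
```

Since $Z(\sigma)=-\log\sigma-\tfrac12\log(2\pi e)$, the product `mdip_unnormalized(sigma) * sigma` is the same constant for every $\sigma$, so in this case the MDIP is proportional to $1/\sigma$.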

Decision-Theoretic Methods

Several authors have used decision-theoretic arguments to select priors.

Rissanen’s Method

The motivation is of a much different nature than the other methods considered here.

For a prefix code, the goal is to assign codewords so that the code lengths are as short as possible, that is, to minimize the inverse of the code efficiency, where this inverse is the ratio of the mean code length to the entropy.

Other Methods

1. an “indifference prior”, obtained by identifying a conjugate class of priors and then selecting a prior from this class that satisfies two properties: that the prior should be improper, and that a “minimum necessary sample” should induce a proper posterior.
2. define the similarity of events $E$ and $F$ by $S(E,F)=P(E\cap F)/(P(E)P(F))$
3. representative point of the probability $P$
4. using finitely additive priors for an exponential family
5. define a least informative prior by finding the prior that maximizes expected gain in Shannon information.

References

Kass, Robert E and Wasserman, Larry. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343-1370.
