
Subgradient

Tags: Subgradient, Nuclear Norm, KKT Conditions

This post is mainly based on Hastie et al. (2015), and incorporates some material from Watson (1992).

Consider the convex optimization problem

$$
\min_{\beta\in\bbR^p} f(\beta)\quad\text{subject to } g_j(\beta)\le 0,\quad j=1,\ldots,m\,,
$$

where $f$ and the constraint functions $g_j$ are convex.

The Lagrangian $L$ is defined as

$$
L(\beta;\lambda) = f(\beta) + \sum_{j=1}^m\lambda_jg_j(\beta)\,,
$$

where the nonnegative weights $\lambda\ge 0$ are known as Lagrange multipliers.

For differentiable problems, the famous KKT (Karush-Kuhn-Tucker) conditions relate the optimal Lagrange multiplier vector $\lambda^*\ge 0$, also known as the dual vector, to the optimal primal vector $\beta^*\in\bbR^p$:

  • Primal feasibility: $g_j(\beta^*)\le 0$ for all $j=1,\ldots,m$
  • Complementary slackness: $\lambda_j^*g_j(\beta^*)=0$ for all $j=1,\ldots,m$
  • Lagrangian conditions: $0=\nabla_\beta L(\beta^*;\lambda^*) = \nabla f(\beta^*)+\sum_{j=1}^m\lambda_j^*\nabla g_j(\beta^*)$.
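
For a quick sanity check, consider the toy problem (my own example, not from the references) of minimizing $f(\beta)=(\beta-2)^2$ subject to $g_1(\beta)=\beta-1\le 0$. The Lagrangian condition reads $2(\beta^*-2)+\lambda^*=0$; taking the constraint to be active gives $\beta^*=1$ and $\lambda^*=2\ge 0$, and complementary slackness $\lambda_1^*g_1(\beta^*)=2\cdot 0=0$ holds, so all three conditions certify that $\beta^*=1$ is optimal.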

For nondifferentiable functions, there is a natural generalization of the notion of gradient that allows for a more general optimality theory.

Given a convex function $f:\bbR^p\rightarrow \bbR$, a vector $z\in\bbR^p$ is said to be a subgradient of $f$ at $\beta$ if $$ f(\beta') \ge f(\beta) + \langle z,\beta'-\beta\rangle\quad\text{for all }\beta'\in \bbR^p\,. $$ The set of all subgradients of $f$ at $\beta$ is called the subdifferential, denoted by $\partial f(\beta)$.
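
As a small numerical sketch (mine, not from the references), the snippet below checks the subgradient inequality for $f(\beta)=\Vert\beta\Vert_1$ at a point with a zero coordinate, taking $z_j=\sign(\beta_j)$ on the nonzero coordinates and a random $z_j\in[-1,1]$ elsewhere:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(beta):
    # f(beta) = ||beta||_1
    return np.abs(beta).sum()

beta = np.array([1.5, 0.0, -2.0])
# candidate subgradient: sign on nonzero coordinates, anything in [-1, 1] where beta_j = 0
z = np.where(beta != 0, np.sign(beta), rng.uniform(-1, 1, size=beta.shape))

# check f(beta') >= f(beta) + <z, beta' - beta> on many random beta'
gaps = [
    f(bp) - f(beta) - z @ (bp - beta)
    for bp in rng.standard_normal((1000, beta.size)) * 3
]
print(min(gaps) >= -1e-12)   # True: the inequality holds up to rounding
```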

Under mild conditions, the generalized KKT theory can still be applied using the modified condition

$$
0\in \partial f(\beta^*) + \sum_{j=1}^m\lambda_j^*\partial g_j(\beta^*)\,,
$$

in which the gradients in the Lagrangian condition are replaced by subdifferentials.
The remaining parts will give some examples of subgradients.

Subgradients for Lasso

The necessary and sufficient conditions for a solution to the lasso problem

$$
\min_{\beta\in\bbR^p}\left\{\frac{1}{2N}\Vert y-X\beta\Vert_2^2 + \lambda\Vert\beta\Vert_1\right\}
$$

take the form

$$
-\frac 1N\langle x_j, y-X\beta\rangle + \lambda s_j = 0\,,\quad j=1,\ldots,p\,.
$$

Here each $s_j$ is an unknown quantity equal to $\sign(\beta_j)$ if $\beta_j\neq 0$ and some value lying in $[-1,+1]$ otherwise; that is, it is a subgradient of the absolute value function $f(\beta)=\vert \beta\vert$.
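
As a hedged numerical illustration (my own sketch with a naive cyclic coordinate descent, not code from the book), one can fit the lasso and then check these conditions directly: $s_j=\sign(\beta_j)$ on the active set and $\vert s_j\vert\le 1$ on the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 10, 0.1
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(N)

def soft(z, t):
    # soft-thresholding operator
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# naive cyclic coordinate descent for (1/(2N)) ||y - X beta||^2 + lam * ||beta||_1
beta = np.zeros(p)
col_sq = (X ** 2).sum(axis=0) / N
for _ in range(500):
    for j in range(p):
        r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual
        beta[j] = soft(X[:, j] @ r_j / N, lam) / col_sq[j]

# subgradient conditions: (1/N) <x_j, y - X beta> = lam * s_j
s = X.T @ (y - X @ beta) / (N * lam)
active = beta != 0
print(np.abs(s[active] - np.sign(beta[active])).max(initial=0.0))  # ~0 on the active set
print(np.abs(s[~active]).max(initial=0.0))                         # <= 1 off the active set
```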

Subgradients for Nuclear Norm

Given a matrix $\Theta\in \bbR^{m\times n}$ (assume $m\le n$), by the SVD decomposition,

$$
\Theta = \U\Sigma\V^T = \sum_{j=1}^m\sigma_j(\Theta)\,u_jv_j^T\,,
$$

where $\sigma_1(\Theta)\ge\cdots\ge\sigma_m(\Theta)\ge 0$. The nuclear norm is the sum of the singular values,

$$
\Vert\Theta\Vert_* = \sum_{j=1}^m\sigma_j(\Theta)\,.
$$

The subdifferential $\partial \Vert\Theta\Vert_*$ of the nuclear norm at $\Theta$ consists of all matrices of the form $\Z = \sum_{j=1}^mz_ju_jv_j^T$, where, for each $j=1,\ldots,m$, the scalar $z_j\in \sign(\sigma_j(\Theta))$, with $\sign(0)$ interpreted as the interval $[-1,+1]$.

Note that we need to verify that

$$
\Vert\Theta'\Vert_* \ge \Vert\Theta\Vert_* + \langle \Z, \Theta'-\Theta\rangle
$$

for all $\Theta'\in\bbR^{m\times n}$, where the inner product of matrices $\A$ and $\B$ is defined by $\langle \A, \B\rangle=\tr(\A^H\B)$.

For simplicity, consider the case where all singular values are larger than 0; then $\Z = \U\V^T$ and

$$
\langle \Z,\Theta\rangle = \tr(\V\U^T\U\Sigma\V^T) = \tr(\Sigma) = \Vert\Theta\Vert_*\,,
$$

so it suffices to show $\langle \Z,\Theta'\rangle \le \Vert\Theta'\Vert_*$. Writing $\Theta' = \U_*\Sigma_*\V_*^T$, we have

$$
\langle \Z,\Theta'\rangle = \tr(\V\U^T\U_*\Sigma_*\V_*^T) \le \sum_{j=1}^m\sigma_j(\Theta') = \Vert\Theta'\Vert_*\,,
$$

since $\V\U^T\U_*$ can be viewed as the new left singular vectors of $\Theta'$, and the trace of a square matrix is bounded by its nuclear norm.
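
A small numerical check (my own sketch): for a random $\Theta$, whose singular values are all positive almost surely, $\Z=\U\V^T$ should satisfy the subgradient inequality against arbitrary $\Theta'$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
Theta = rng.standard_normal((m, n))   # all singular values positive almost surely

def nuc(A):
    # nuclear norm = sum of singular values
    return np.linalg.svd(A, compute_uv=False).sum()

# Z = U V^T, the subgradient when every singular value is positive
U, sig, Vt = np.linalg.svd(Theta, full_matrices=False)
Z = U @ Vt

# check ||Theta'||_* >= ||Theta||_* + <Z, Theta' - Theta> on random Theta'
gaps = [
    nuc(Tp) - nuc(Theta) - np.sum(Z * (Tp - Theta))
    for Tp in rng.standard_normal((1000, m, n)) * 2
]
print(min(gaps) >= -1e-10)   # True: the inequality holds up to rounding
```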

Subgradient for Orthogonally Invariant Norms

As commented by @Lepidopterist, Watson (1992) thoroughly discussed the derivation of subgradients of (nuclear) matrix norms.

Generally, the subdifferential (or set of subgradients) of $\Vert A\Vert$ is defined by

$$
\partial \Vert A\Vert = \{G\in\bbR^{m\times n}: \Vert B\Vert \ge \Vert A\Vert + \tr(G^T(B-A))\ \text{ for all } B\in\bbR^{m\times n}\}\,.
$$
It is readily established that $G\in \partial \Vert A\Vert$ is equivalent to the statements

  1. $\Vert A\Vert = \tr(G^TA)$
  2. $\Vert G\Vert^*\le 1$

where

$$
\Vert G\Vert^* = \max_{\Vert B\Vert\le 1}\tr(G^TB)\,,
$$

and $\Vert \cdot\Vert^*$ is the polar or dual norm to $\Vert \cdot \Vert$.
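
For the nuclear norm the dual norm is the spectral norm (largest singular value), so both statements can be checked numerically; here is a minimal sketch (mine, assuming a full-rank $A$):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

U, sig, Vt = np.linalg.svd(A, full_matrices=False)
G = U @ Vt   # candidate subgradient of the nuclear norm at (full-rank) A

# condition 1: ||A||_* = tr(G^T A)
print(np.isclose(sig.sum(), np.trace(G.T @ A)))   # True
# condition 2: ||G||^* <= 1, the dual of the nuclear norm being the spectral norm
print(np.linalg.norm(G, 2) <= 1 + 1e-12)          # True: all singular values of G equal 1
```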

The class of Orthogonally Invariant Norms consists of norms such that

$$
\Vert UAV\Vert = \Vert A\Vert
$$

for any orthogonal matrices $U$ and $V$ of orders $m$ and $n$ respectively. All such norms can be defined by

$$
\Vert A\Vert = \phi(\bsigma)\,,
$$

where $\bsigma = (\sigma_1,\ldots,\sigma_n)^T$ is the vector of singular values of $A$, and $\phi$ is a symmetric gauge function. For the nuclear norm, $\phi(\bsigma) = \sum_{j=1}^n\sigma_j$.

Then the subdifferential is given by $$ \partial \Vert A\Vert = \mathrm{conv}\{UDV^T: A=U\Sigma V^T,\ \bfd\in \partial \phi(\bsigma)\}\,, $$ where $\mathrm{conv}\{\cdot\}$ signifies the convex hull of a set (the smallest convex set that contains the set). Here $U$ and $V$ are orthogonal matrices, while $D$ is the $m\times n$ matrix that is zero except down the main diagonal, which contains the elements of $\bfd$.

I think it does not make much difference which form of the SVD decomposition is used.
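
As a check on consistency with the earlier section: for the nuclear norm, $\phi(\bsigma)=\sum_j\sigma_j$, whose subdifferential is $$ \partial\phi(\bsigma) = \{\bfd: d_j = 1\text{ if }\sigma_j>0,\ d_j\in[-1,1]\text{ if }\sigma_j=0\}\,, $$ so $UDV^T = \sum_jd_ju_jv_j^T$ recovers exactly the form of the subgradients $\Z$ given above for the nuclear norm.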

References

  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
  • Watson, G. A. (1992). Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170, 33–45.