# Subgradient

##### Posted on

This post is mainly based on Hastie et al. (2015), and incorporated with some materials from Watson (1992).

Consider the convex optimization problem

where $g_j$ are convex functions.

The Lagrangian $L$ is defined as

where the nonnegative weights $\lambda\ge 0$ are known as **Lagrange multipliers**.

For differentiable problems, the famous **KKT (Karush-Kuhn-Tucker)** conditions relate the optimal Lagrange multiplier vector $\lambda^*\ge 0$, also known as the dual vector, to the optimal primal vector $\beta^*\in\bbR^p$:

**Primal feasibility:**$g_j(\beta^*)\le 0$ for all $j=1,\ldots,m$**Complementary slackness:**$\lambda_j^*g_j(\beta^*)=0$ for all $j=1,\ldots,m$**Lagrangian conditions:**$0=\nabla_\beta L(\beta^*;\lambda^*) = \nabla f(\beta^*)+\sum_{j=1}^m\lambda_j^*\nabla g_j(\beta^*)$.

For nondifferentiable functions, there is a natural generalization of the notion of gradient that allows for a more general optimality theory.

**subgradient**of $f$ at $\beta$ if $$ f(\beta') \ge f(\beta) + \langle z,\beta'-\beta\rangle\quad\text{for all }\beta'\in \bbR^p\,. $$ The set of all subgradients of $f$ at $\beta$ is called the subdifferential, denoted by $\partial f(\beta)$.

Under mild conditions, the generalized KKT theory can still be applied using the modified condition

The remaining parts will give some examples of subgradients.

## Subgradients for Lasso

The necessary and sufficient conditions for a solution to

take the form

Here each $s_j$ is an unknown quantity equal to $\sign(\beta_j)$ if $\beta_j\neq 0$ and some value lying in $[-1,+1]$ otherwise, that is, it is a subgradient for the absolute value function $f(\beta)=\vert \beta\vert$,

## Subgradients for Nuclear norm

Given a matrix $\Theta\in \bbR^{m\times n}$ (assume $m\le n$), by SVD decomposition,

The nuclear norm is the sum of the singular values,

The subdifferential $\partial \Vert\Theta\Vert_*$ of the nuclear norm at $\Theta$ consists of all matrices of the form $\Z = \sum_{j=1}^mz_ju_jv_j^T$, where each for $j=1,\ldots,m$, the scalar $z_j\in \sign(\sigma_j(\Theta))$.

Note that we need to verify that

for all $\Theta’\in\bbR^{m\times n}$. *The inner product for matrix $\A$ and $\B$ is defined by $\langle \A, \B\rangle=\tr(\A^H\B)$*

For simplicity, consider the case that all singular values are larger than 0, then

since $\V\U^T\U_*$ can be viewed as the new left singular vectors.

## Subgradient for Orthogonally Invariant Norms

As commented by @Lepidopterist, Watson (1992) completely and thoroughly discussed the derivation of subgradient of (nuclear) matrix norms.

Generally, the subdifferential (or set of subgradients) of $\Vert A\Vert$ is defined by

It is readily established that $G\in \partial \Vert A\Vert$ is equivalent to the statements

- $\Vert A\Vert = \tr(G^TA)$
- $\Vert G\Vert^*\le 1$

where

and $\Vert \cdot\Vert^*$ is the polar or dual norm to $\Vert \cdot \Vert$.

The class of **Orthogonally Invariant Norms** consists of norms such that

for any orthogonal matrices $U$ and $V$ of orders $m$ and $n$ respectively. All such norms can be defined by

where $\bsigma = (\sigma_1,\ldots,\sigma_n)^T$ is the singular values of $A$, and $\phi$ is a symmetric gauge function. For nuclear norm, $\phi(\bsigma_n) = \sum_{j=1}^n\sigma_j$.

*I think it does not make much differences for different forms of SVD decomposition.*