# WeiYa's Work Yard

##### Posted on Apr 08, 2019

This post is mainly based on Hastie et al. (2015), and incorporated with some materials from Watson (1992).

Consider the convex optimization problem

where $g_j$ are convex functions.

The Lagrangian $L$ is defined as

where the nonnegative weights $\lambda\ge 0$ are known as Lagrange multipliers.

For differentiable problems, the famous KKT (Karush-Kuhn-Tucker) conditions relate the optimal Lagrange multiplier vector $\lambda^*\ge 0$, also known as the dual vector, to the optimal primal vector $\beta^*\in\bbR^p$:

• Primal feasibility: $g_j(\beta^*)\le 0$ for all $j=1,\ldots,m$
• Complementary slackness: $\lambda_j^*g_j(\beta^*)=0$ for all $j=1,\ldots,m$
• Lagrangian conditions: $0=\nabla_\beta L(\beta^*;\lambda^*) = \nabla f(\beta^*)+\sum_{j=1}^m\lambda_j^*\nabla g_j(\beta^*)$.

For nondifferentiable functions, there is a natural generalization of the notion of gradient that allows for a more general optimality theory.

Given a convex function $f:\bbR^p\rightarrow \bbR$, a vector $z\in\bbR^p$ is said to be a subgradient of $f$ at $\beta$ if $$f(\beta') \ge f(\beta) + \langle z,\beta'-\beta\rangle\quad\text{for all }\beta'\in \bbR^p\,.$$ The set of all subgradients of $f$ at $\beta$ is called the subdifferential, denoted by $\partial f(\beta)$.

Under mild conditions, the generalized KKT theory can still be applied using the modified condition

The remaining parts will give some examples of subgradients.

The necessary and sufficient conditions for a solution to

take the form

Here each $s_j$ is an unknown quantity equal to $\sign(\beta_j)$ if $\beta_j\neq 0$ and some value lying in $[-1,+1]$ otherwise, that is, it is a subgradient for the absolute value function $f(\beta)=\vert \beta\vert$,

Given a matrix $\Theta\in \bbR^{m\times n}$ (assume $m\le n$), by SVD decomposition,

The nuclear norm is the sum of the singular values,

The subdifferential $\partial \Vert\Theta\Vert_*$ of the nuclear norm at $\Theta$ consists of all matrices of the form $\Z = \sum_{j=1}^mz_ju_jv_j^T$, where each for $j=1,\ldots,m$, the scalar $z_j\in \sign(\sigma_j(\Theta))$.

Note that we need to verify that

for all $\Theta’\in\bbR^{m\times n}$. The inner product for matrix $\A$ and $\B$ is defined by $\langle \A, \B\rangle=\tr(\A^H\B)$

For simplicity, consider the case that all singular values are larger than 0, then

since $\V\U^T\U_*$ can be viewed as the new left singular vectors.

## Subgradient for Orthogonally Invariant Norms

As commented by @Lepidopterist, Watson (1992) completely and thoroughly discussed the derivation of subgradient of (nuclear) matrix norms.

Generally, the subdifferential (or set of subgradients) of $\Vert A\Vert$ is defined by

It is readily established that $G\in \partial \Vert A\Vert$ is equivalent to the statements

1. $\Vert A\Vert = \tr(G^TA)$
2. $\Vert G\Vert^*\le 1$

where

and $\Vert \cdot\Vert^*$ is the polar or dual norm to $\Vert \cdot \Vert$.

The class of Orthogonally Invariant Norms consists of norms such that

for any orthogonal matrices $U$ and $V$ of orders $m$ and $n$ respectively. All such norms can be defined by

where $\bsigma = (\sigma_1,\ldots,\sigma_n)^T$ is the singular values of $A$, and $\phi$ is a symmetric gauge function. For nuclear norm, $\phi(\bsigma_n) = \sum_{j=1}^n\sigma_j$.

$$\partial \Vert A\Vert = \mathrm{conv}\{UDV^T, A=U\Sigma V^T,\bfd\in \partial \phi(\bsigma)\}\,,$$ where $\mathrm{conv}\{\}$ signifies the convex hull of a set (the smallest convex set that contains the set). Here $U$ and $V$ are orthogonal matrices, while $D$ is $m\times n$ matrix except down the main diagonal.

I think it does not make much differences for different forms of SVD decomposition.

## References

Published in categories Note