Suitably encouraged by friend and colleague [Jacob Steinhardt](http://jsteinhardt.wordpress.com) to write a technical post on recent work, I am going to talk about a simple derivation that I had some difficulty finding online: the subgradient, or subderivative, of a matrix norm, and more precisely that of the nuclear norm.

There are a number of scenarios where you are looking for the right matrix to solve a problem while satisfying some constraints. To take a concrete example, albeit one that’s been beaten to death, the popular Netflix competition provided competitors with the ratings each user gave to the small set of movies he/she watched, with the task of predicting the user’s ratings for the remaining movies. This problem could be represented as an \(N \times M\) matrix, \(R\), where \(N\) and \(M\) are respectively the number of users and movies. Each entry describes the rating a user gave for a particular movie. There are infinitely many matrices that could satisfy the given data; one possible scheme would be to fix the observed entries to their given values and fill the rest of the matrix with randomness.
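To make the setup concrete, here’s a minimal sketch in NumPy; the ratings are made up for illustration, with `np.nan` marking the entries we were never given, and the “fill the rest with randomness” scheme spelled out.

```python
import numpy as np

# Toy example: 4 users x 5 movies; np.nan marks unobserved ratings.
R = np.array([
    [5.0, 3.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, np.nan, 1.0, 2.0],
    [np.nan, 1.0, 4.0, np.nan, 5.0],
    [1.0, np.nan, 5.0, 4.0, np.nan],
])

observed = ~np.isnan(R)  # boolean mask of the entries we were given

# One (bad but data-consistent) completion: keep the observed entries,
# fill the rest with random ratings in [0, 5).
R_filled = np.where(observed, R, np.random.rand(*R.shape) * 5)
```

Any such `R_filled` agrees with the data on the observed entries, which is exactly why we need further assumptions to pick a sensible one.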

Without having any information whatsoever on the unspecified entries of \(R\), we need to make some assumptions about the way users behave. A common assumption made is that the number of factors, \(F\), that differentiates which users like which movies is fairly small, maybe on the order of 100s. This sort of assumption allows you to *decompose* your \(N \times M\) matrix into an \(N \times F\) matrix, \(U\), describing these \(F\) factors that each person has, and an \(F \times M\) matrix, \(V\), describing how those factors influence ratings of movies. Of course, you might recognise this to look suspiciously similar to the [singular value decomposition](http://youtu.be/R9UoFyqJca8), \(R = U \Sigma V^T\), where \(\Sigma\) is a diagonal matrix whose entries scale how the features in \(U\) relate to those in \(V\). This is no doubt a naive model but the approach has worked extremely well in the Netflix challenge and many similar problems.
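A quick numerical illustration of the low-rank picture, assuming NumPy (the sizes here are arbitrary toy values): if \(R\) is built from \(F\) factors, its SVD has exactly \(F\) non-zero singular values, and truncating the SVD to \(F\) components recovers \(R\) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, F = 6, 8, 2  # toy sizes; F plays the role of the number of factors

# Build a rank-F "ratings" matrix from user factors U and movie factors V.
U = rng.standard_normal((N, F))
V = rng.standard_normal((F, M))
R = U @ V

# The SVD recovers this structure: only F singular values are non-zero
# (up to floating-point noise).
Uo, s, Vt = np.linalg.svd(R)
print(np.sum(s > 1e-10))  # → 2
```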

The optimization problem we’re looking at here looks for a matrix that is very similar to ratings we’ve seen, with an additional constraint that \(F\) be small, which, now that we’ve seen the SVD picture, translates to requiring the rank of the matrix to be small. This turns out to be a non-smooth constraint which makes our problem a bit more interesting.

Now, let’s get into the details of how we’d actually optimize this problem. The answer usually is gradient descent. However, when you use a non-smooth objective function, there are points where the gradient isn’t defined. This is a problem; gradients are clearly an integral enough part of gradient descent to figure in the name. In order to work around this obstacle, [people use a generalisation of gradients](http://en.wikipedia.org/wiki/Subgradient_method), called [subgradients](http://en.wikipedia.org/wiki/Subderivative), which I will briefly describe in the sequel.
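As a toy sketch of the subgradient method (not the matrix problem itself), consider the simplest non-smooth function, \(f(x) = |x|\). At \(x = 0\) the gradient is undefined, but any \(g \in [-1, 1]\) is a valid subgradient there, and the method still converges with a diminishing step size:

```python
import numpy as np

# Subgradient method on the non-smooth f(x) = |x|.
def subgradient(x):
    # sign(x) is the gradient away from 0; sign(0) = 0 is one
    # legitimate choice of subgradient at the kink.
    return np.sign(x)

x = 5.0
for t in range(1, 1001):
    # Diminishing step size 1/t, standard for subgradient methods.
    x -= (1.0 / t) * subgradient(x)

print(abs(x))  # ends up close to the minimizer x* = 0
```

Unlike plain gradient descent, the iterates oscillate around the kink rather than settle on it, which is why the shrinking step size matters.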

The second half of my title mentions matrix norms. A common part of any objective function in machine learning is a “regularization” term. Regularization plays the role of preventing the solution you get from your optimisation problem from “[overfitting](http://en.wikipedia.org/wiki/Overfitting)” your data^{1}. One of the most effective ways to regularize your solution is by placing a penalty on it for being “too large”. Using a norm helps you quantify that statement in a more mathematically precise way; you penalize your solution for having too large a norm. In our case, we’d like to penalize having too large a rank, which is equivalent to placing a limit on the number of non-zero singular values of \(R\), i.e. placing the \(L_0\) norm on the singular values. \(L_0\) norms require you to search the combinatorial space of solutions, so a common relaxation that works is the \(L_1\) norm, or the sum of the absolute values of the singular values. This is the nuclear norm, or trace norm, \(\|A\|_{*} = \sum \sigma_i(A)\), where \(\sigma_i(A)\) is the i-th singular value of \(A\).
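The definition translates directly into code: the nuclear norm is just the sum of singular values, which NumPy also exposes directly via `np.linalg.norm` with `ord='nuc'`. A quick check on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))

# Nuclear norm = sum of singular values.
sigma = np.linalg.svd(A, compute_uv=False)
print(np.isclose(sigma.sum(), np.linalg.norm(A, 'nuc')))  # → True
```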

In the rest of the article, I’ll describe the properties of the subgradients of a matrix norm, and the exact form for the nuclear norm.

I will usually use capitalised variables for matrices, e.g. \(A\), boldface lower case variables for vectors, \(\mathbf{v}\), and normal lower case variables for scalar constants. To keep the discussion simple, I’m going to make the assumption that all my matrices are real-valued, though most of this discussion readily extends to any field. I will assume that if you know what a field is, you know how to make the mapping yourself.

[Matrix norms](http://en.wikipedia.org/wiki/Matrix_norm), \(\|A\|\), have the following properties, most of which are analogous to the conventional vector norms you would have likely come across, for example the Euclidean norm, \(\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + v_3^2}\), or the Manhattan norm, \(\|\mathbf{v}\| = |v_1| + |v_2| + |v_3|\),

- \(\|A\| \ge 0\) for any \(A\) and \(\|A\| = 0\) iff \(A = 0\).
- \(\|c A\| = |c| \, \|A\|\) for any scalar \(c\).
- \(\|A + B\| \le \|A\| + \|B\|\) for any \(A\) and \(B\). This is the triangle inequality that makes \(\|\cdot\|\) a norm.
- Some norms have the further “sub-multiplicative” property, \(\|AB\| \le \|A\|\|B\|\), for all \(A\), \(B\).
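These properties are easy to sanity-check numerically. Here’s a sketch using the Frobenius norm (NumPy’s default matrix norm, which happens to be sub-multiplicative as well):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
c = -2.5

fro = np.linalg.norm  # Frobenius norm by default for matrices

print(np.isclose(fro(c * A), abs(c) * fro(A)))  # absolute homogeneity
print(fro(A + B) <= fro(A) + fro(B))            # triangle inequality
print(fro(A @ B) <= fro(A) * fro(B))            # sub-multiplicativity
```

Note the \(|c|\) in the homogeneity check; with a negative \(c\) the absolute value is essential, since norms are never negative.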

The entrywise inner product of two matrices, \(\langle A, B \rangle = \sum_{i,j} A_{ij} B_{ij}\), can be succinctly written down as \(\operatorname{tr}(A^T B)\); this is an easy fact that I’ll leave to you to verify. The plus side is that it lets us express the subgradient property in the following nice way.
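If you’d rather check the trace identity numerically than verify it by hand, a two-line NumPy experiment does it:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

entrywise = np.sum(A * B)        # <A, B> = sum_ij A_ij * B_ij
via_trace = np.trace(A.T @ B)    # tr(A^T B)
print(np.isclose(entrywise, via_trace))  # → True
```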

This sort of vague definition has, of course, led anything and everything to be called regularization.↩