A Random Walk With Drift.

Paper: Operator Shifting for General Noisy Linear Operators

Philip A. Etter

Published in SIAM Journal on Mathematics of Data Science

Paper Link

Introduction

This paper is a continuation of research from my previous paper on elliptic operator shifting, so I would recommend reading that blog post first. Assuming that you are already familiar with the basics of operator shifting, allow me to quickly reintroduce the basic premise. First, we assume that we want to solve a linear system of the form

$$\hat{A} x = b,$$

where $\hat{A}$ is a noisy version of a ground truth operator $A$, sampled from a distribution parameterized by a set of variables $\theta$. Unlike the previous setting, where we assumed that $\hat{A}$ was symmetric positive definite, in this paper we make no such assumption. Operator shifting works by trying to find a good matrix shift $\hat{B}$ and replacing the naive estimator $\hat{A}^{-1}$ of the matrix inverse $A^{-1}$ with a "shifted" version:

$$\hat{A}^{-1}_{\mathrm{shifted}} = \hat{A}^{-1} - \beta \hat{B}.$$
In the case of elliptic operators, this shift has the effect of reducing both the bias and the variance of the estimator. This is due to an analogue of Jensen's inequality for matrix functions. However, the matrix variant of Jensen's inequality no longer holds for general matrix distributions, making the theoretical framework for operator shifting considerably weaker -- unless we make additional assumptions. A core goal of the paper, therefore, is to find a minimal set of additional assumptions that must be made before one can make mathematical guarantees about the efficacy of operator shifting.
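
To make the premise concrete, here is a minimal numerical sketch of the shift, assuming the simple "shrinkage" choice of shift matrix $\hat{B} = \hat{A}^{-1}$ (so that the shifted estimator is just $(1 - \beta)\hat{A}^{-1}$). The matrices, noise level, and value of $\beta$ below are arbitrary placeholders for illustration and are not taken from the paper.

```python
import numpy as np

# Ground truth operator A and a single noisy sample A_hat = A + E_hat.
# These specific matrices are illustrative placeholders only.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.3],
              [0.1, 1.5]])
E_hat = 0.2 * rng.standard_normal((2, 2))
A_hat = A + E_hat

b = np.array([1.0, -1.0])

# Naive estimator: solve the noisy system directly.
x_naive = np.linalg.solve(A_hat, b)

# Shifted estimator with the simple shrinkage shift B_hat = A_hat^{-1}:
# x_shifted = (1 - beta) * A_hat^{-1} b.
beta = 0.1  # placeholder; the paper is about how to reason about this factor
x_shifted = (1.0 - beta) * x_naive

# Whether the shift helps on a single draw is random; the theory concerns
# the error averaged over the noise distribution.
x_true = np.linalg.solve(A, b)
print("naive error:  ", np.linalg.norm(x_naive - x_true))
print("shifted error:", np.linalg.norm(x_shifted - x_true))
```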

In this post, I'll quickly summarize the three key counterexamples we present in the paper that show how the theory can fail. Each of these three counterexamples elucidates a different mode of failure, and sheds some light onto the critical mathematical structure in elliptic matrices that is lost when moving to general matrices. Throughout the rest of this exposition, we will measure error as

$$\mathbb{E}\left[\big\|\hat{A}^{-1}_{\mathrm{shifted}} - A^{-1}\big\|^2\right],$$

where $\|\cdot\|$ is an arbitrary matrix inner product norm.

The Elliptic Case

Recall that for an appropriate choice of $\hat{B}$ (in particular, the choice of $\hat{B}$ that is the analogue of the James-Stein estimator for matrices), the optimal shift factor $\beta^*$ satisfies

$$\beta^* > 0,$$
as well as

$$\frac{\mathbb{E}\big\|\hat{A}^{-1} - A^{-1}\big\|^2 - \mathbb{E}\big\|\hat{A}^{-1}_{\mathrm{shifted}} - A^{-1}\big\|^2}{\mathbb{E}\big\|\hat{A}^{-1} - A^{-1}\big\|^2} \;\geq\; \frac{\mathbb{E}\big\|\hat{A}^{-1} - A^{-1}\big\|^2}{\mathbb{E}\big\|\hat{A}^{-1}\big\|^2}.$$
This means, roughly, that shrinking the operator towards the origin is always a sensible move for elliptic matrices. Moreover, the amount of error reduction one can achieve is at least the ratio of the error to the second moment. We will see in the subsequent pathological case studies that this is (surprisingly) not true when one moves to general matrices.
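
For reference (and because it underlies each of the counterexamples below), here is the elementary calculation of the optimal shift factor for the simple shrinkage shift $\hat{B} = \hat{A}^{-1}$. This is just a quadratic minimization written in the notation of this post, not a result copied from the paper:

$$f(\beta) = \mathbb{E}\big\|(1-\beta)\hat{A}^{-1} - A^{-1}\big\|^2 = (1-\beta)^2\,\mathbb{E}\big\|\hat{A}^{-1}\big\|^2 - 2(1-\beta)\,\mathbb{E}\big\langle \hat{A}^{-1},\, A^{-1}\big\rangle + \big\|A^{-1}\big\|^2,$$

$$\beta^* = 1 - \frac{\mathbb{E}\big\langle \hat{A}^{-1},\, A^{-1}\big\rangle}{\mathbb{E}\big\|\hat{A}^{-1}\big\|^2}.$$

In particular, the sign of $\beta^*$ is determined by comparing $\mathbb{E}\langle \hat{A}^{-1}, A^{-1}\rangle$ against the second moment $\mathbb{E}\|\hat{A}^{-1}\|^2$; each of the pathologies below is a way of pushing the inner product term above the second moment.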

A Note on Matrix Inner Product Norms

As it turns out, the choice of norm becomes significantly more restricted for general matrices than for elliptic ones. In general, a matrix inner product norm will have the form

$$\|M\|^2 = \mathbb{E}_{x \sim \mu}\left[\|Mx\|_W^2\right],$$

where $\mu$ is a vector distribution with second moment matrix $\Sigma = \mathbb{E}_{x \sim \mu}[xx^\top]$ and $\|\cdot\|_W$ is a vector norm induced by a positive definite matrix $W$ (i.e., $\|v\|_W^2 = v^\top W v$). Thus, the matrix norm measures the average length, in the vector norm $\|\cdot\|_W$, of the image under $M$ of a vector sampled from $\mu$. Note that the expectation can be written in closed form as

$$\|M\|^2 = \operatorname{tr}\!\left(W M \Sigma M^\top\right).$$

Note that when $W = \Sigma = I$, this reduces to the Frobenius norm,

$$\|M\|^2 = \|M\|_F^2 = \operatorname{tr}\!\left(M M^\top\right).$$
It is therefore possible for the distribution $\mu$ to favor some directions over others, in which case the matrix norm will place more weight on those directions. We call such situations anisotropic. It turns out that substantial anisotropy can destroy lower bounds on the optimal shift factor $\beta^*$.
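
As a quick sanity check on the closed form, the sketch below compares a Monte Carlo estimate of $\mathbb{E}_{x\sim\mu}\|Mx\|_W^2$ against $\operatorname{tr}(W M \Sigma M^\top)$ for an arbitrary Gaussian choice of $\mu$; every specific matrix here is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary matrix M, positive definite weight W, and second moment Sigma for mu.
M = rng.standard_normal((3, 3))
W_root = rng.standard_normal((3, 3))
W = W_root @ W_root.T + np.eye(3)          # positive definite weight matrix
S_root = rng.standard_normal((3, 3))
Sigma = S_root @ S_root.T + np.eye(3)      # second moment matrix of mu (mean-zero Gaussian)

# Monte Carlo estimate of E_{x ~ mu} || M x ||_W^2 with mu = N(0, Sigma).
L = np.linalg.cholesky(Sigma)
xs = (L @ rng.standard_normal((3, 200000))).T
y = xs @ M.T                               # rows are (M x)^T
vals = np.einsum('ni,ij,nj->n', y, W, y)   # (Mx)^T W (Mx) for each sample
print("Monte Carlo :", vals.mean())

# Closed form: tr(W M Sigma M^T).
print("Closed form :", np.trace(W @ M @ Sigma @ M.T))
```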

Pathology #1: Anisotropy

For our first example, which illustrates the importance of anisotropy, we take the ground truth to be a fixed matrix $A$, and we let the noisy version of the operator be $\hat{A} = A + \hat{E}$, where $\hat{E}$ has two outcomes with equal probability:

One can run the numbers to see that the inverse will always have the form:

If we take $\mu$ to be a singleton distribution with one atom, then we happen to have that

For the simple shift $\hat{B} = \hat{A}^{-1}$ and using the L2 error ($W = I$), the estimator error is given by

the minimum error is achieved at a negative value of $\beta$, showing that a lack of isotropy can generate a negative optimal shift factor. This means that in this situation, it is actually optimal to grow the estimate rather than shrink it! In particular, one should note that without a requirement of positive-definiteness of $\hat{A}$, it is possible to create a situation where $\hat{A}^{-1}$ is always "larger" than its expectation in a specific direction. The anisotropic error measurement can then focus only on this direction and tell us that the optimal strategy is to grow the estimator rather than shrink it. This runs counter to the intuition behind operator shifting in the symmetric positive-definite case, where such a situation is not possible.
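
Since the paper's specific matrices are not reproduced above, here is a hypothetical example in the same spirit (not the one from the paper): the ground truth is the identity, the noise flips sign between two outcomes, and the error is measured only along a single direction. Numerically sweeping over $\beta$ confirms that the optimum can indeed be negative.

```python
import numpy as np

# Hypothetical anisotropic example (not the one from the paper).
A = np.eye(2)
E = np.array([[0.0, -1.0],
              [0.5,  0.0]])
outcomes = [A + E, A - E]          # mean-zero noise, two equally likely outcomes

x0 = np.array([1.0, 0.0])          # mu is a singleton: we only measure along x0
target = np.linalg.solve(A, x0)    # A^{-1} x0

def error(beta):
    # E || (1 - beta) * A_hat^{-1} x0 - A^{-1} x0 ||_2^2 over the two outcomes
    errs = [np.linalg.norm((1 - beta) * np.linalg.solve(Ah, x0) - target) ** 2
            for Ah in outcomes]
    return np.mean(errs)

betas = np.linspace(-1.0, 1.0, 2001)
best = betas[np.argmin([error(t) for t in betas])]
print("optimal beta ~", best)      # comes out negative (about -0.2 for this choice)
```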

Pathology #2: Outlier-Masking

The second pathology has to do with the lack of symmetry in the noise distribution. This can lead to an effect where outliers in the distribution hide imbalances closer to the mean, and these imbalances can be drastically amplified when the noisy matrix is inverted. For this example, consider a new ground truth $A$ and let the noisy version of the operator be $\hat{A} = A + \hat{E}$, where $\hat{E}$ has three outcomes with equal probability:

The corresponding atoms of $\hat{A}^{-1}$ are

For the simple shift $\hat{B} = \hat{A}^{-1}$ and using the L2 error ($W = I$), the estimator error is given by

One can verify by taking the derivative that this reaches its minimum at a negative value of $\beta$. The message of this example is that outliers in the matrix noise can mask distribution imbalances in the region near the mean, and these imbalances can cause $\mathbb{E}[\hat{A}^{-1}]$ to lie in the direction of $A^{-1}$ while at the same time being dominated by $A^{-1}$ in magnitude. Indeed, this is exactly what happens in the example above (it bears repeating that such a feat is impossible in the symmetric positive-definite setting, where the matrix Jensen inequality gives $\mathbb{E}[\hat{A}^{-1}] \succeq A^{-1}$). The importance of noise symmetry is that it forces the distribution of $\hat{A}$ to be balanced in the region around its mean, even if the distribution contains large outliers.
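
Again, since the paper's concrete numbers are not shown above, here is a hypothetical one-dimensional stand-in that exhibits the same mechanism: a large negative outlier keeps the noise mean-zero while the other two outcomes sit above the mean, and the squared error is minimized at a negative shift factor.

```python
import numpy as np

# Hypothetical scalar example (not the paper's): ground truth a = 1,
# noise outcomes {-11, 5, 6} with equal probability (mean zero).
a = 1.0
noise = np.array([-11.0, 5.0, 6.0])
a_hats = a + noise                 # {-10, 6, 7}: the outlier -10 hides that
inv_hats = 1.0 / a_hats            # the other two inverses both sit below 1/a

def error(beta):
    # E ((1 - beta) / a_hat - 1 / a)^2
    return np.mean(((1 - beta) * inv_hats - 1.0 / a) ** 2)

# Closed-form minimizer of the quadratic in beta.
beta_star = 1.0 - np.mean(inv_hats / a) / np.mean(inv_hats ** 2)
print("optimal beta:", beta_star)                 # strongly negative (about -2.6)
print("E[1/a_hat]  :", inv_hats.mean(), "< 1/a =", 1.0 / a)
```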

Pathology #3: Ill-Conditioning

The third and final pathology deals with the conditioning of the matrix $A$ and the measurement of the estimator error. It fundamentally arises from the fact that if $A$ is very ill-conditioned, then the error measurement

$$\mathbb{E}\left[\big\|\hat{A}^{-1} - A^{-1}\big\|_F^2\right]$$

can place an incredible amount of weight on the direction of $A$'s smallest singular vector. To see this play out, we consider $A$ to be the matrix:

where . We take and for the noise , we reuse the distribution from pathology #1:

The inverses then become approximately

Whereas the ground truth is

The error is given by:

which means that the optimal shift factor becomes a large negative number as the matrix becomes very ill-conditioned.

This example therefore demonstrates the importance of conditioning in the ground truth matrix $A$. It is perfectly possible that $A$ lies close to a singular matrix, while the outcomes of $\hat{A}$ are moved away from singularity by the noise imparted by $\hat{E}$. If this is the case, $\hat{A}^{-1}$ will be small in magnitude compared to $A^{-1}$, and shrinking the operator further will not reduce average error in the Frobenius sense.

Note that the symmetric positive-definite setting avoids this issue, since if $\hat{A}$ is symmetric positive-definite everywhere, it is impossible for $A$ to be close to the origin without a significant chunk of the probability distribution of $\hat{A}$ also lying close to the origin. This ensures that $\mathbb{E}[\hat{A}^{-1}]$ will always spectrally dominate $A^{-1}$, and shifting the operator towards the origin will reduce error.
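
The matrices used in the paper's example are not reproduced above, so the sketch below uses a hypothetical stand-in with the same flavor: a nearly singular ground truth, together with antisymmetric noise that pushes every outcome of the noisy operator away from singularity. Under the Frobenius norm, the optimal shift factor then comes out huge and negative.

```python
import numpy as np

delta, eps = 1e-3, 0.5
A = np.diag([1.0, delta])          # nearly singular ground truth (hypothetical)
E = np.array([[0.0,  eps],
              [-eps, 0.0]])
outcomes = [A + E, A - E]          # both outcomes are well-conditioned: det = delta + eps^2

A_inv = np.linalg.inv(A)
inv_outcomes = [np.linalg.inv(Ah) for Ah in outcomes]

# Optimal beta for the simple shift under the Frobenius norm:
# beta* = 1 - E<A_hat^{-1}, A^{-1}>_F / E||A_hat^{-1}||_F^2.
num = np.mean([np.sum(M * A_inv) for M in inv_outcomes])
den = np.mean([np.sum(M * M) for M in inv_outcomes])
print("optimal beta (Frobenius):", 1.0 - num / den)
# grows like -1/delta as delta -> 0 (about -166 for this choice)
```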

Therefore, any theory regarding the optimality of shrinkage must somehow rule out this possibility. The solution is to counteract this ill-conditioning in the measurement norm. In contrast to the Frobenius norm ($W = I$), note that the residual norm matrix for this problem is given by

$$W = A^\top A.$$

Using the residual norm matrix changes the error measurement from the domain of $A$ to the codomain,

$$\mathbb{E}\left[\big\|A\hat{x} - b\big\|_2^2\right],$$
where $\hat{x}$ is our estimate for the solution of the linear system. The above quantity is often called the "residual" in linear algebra, hence the name of the norm, and it avoids the pitfalls discussed in this section. We remark that the residual norm is often used as an objective in nonsymmetric iterative methods.
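
Continuing the hypothetical ill-conditioned example from above, switching the weight matrix from $W = I$ to the residual norm matrix $W = A^\top A$ (equivalently, measuring $\|A\hat{x} - b\|_2$ rather than $\|\hat{x} - x\|_2$) flips the conclusion: the optimal shift factor for the simple shrinkage shift becomes positive.

```python
import numpy as np

delta, eps = 1e-3, 0.5
A = np.diag([1.0, delta])                  # same hypothetical matrices as before
E = np.array([[0.0,  eps],
              [-eps, 0.0]])
inv_outcomes = [np.linalg.inv(A + s * E) for s in (+1.0, -1.0)]

# Residual norm with Sigma = I: ||M||^2 = ||A M||_F^2, and the inner product
# with A^{-1} becomes <A M, A A^{-1}>_F = tr(A M).
num = np.mean([np.trace(A @ M) for M in inv_outcomes])
den = np.mean([np.sum((A @ M) * (A @ M)) for M in inv_outcomes])
print("optimal beta (residual norm):", 1.0 - num / den)   # positive (about 1 - 2*delta here)
```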

Main Theorem

The main theorem of our paper proves the positivity of the shift factor under the following assumptions:

  1. Mean-Zero Noise: We assume that the noise matrix $\hat{E}$ is mean-zero, i.e., $\mathbb{E}[\hat{E}] = 0$.
  2. Isotropy (Ruling out Pathology #1): We assume that the second moment matrix of the measurement distribution $\mu$ is proportional to the identity, i.e., $\Sigma \propto I$. This means that there is no preferred direction in which we care about the accuracy of the estimation of the inverse.
  3. Noise Symmetry (Ruling out Pathology #2): We assume that the distribution of the matrix $\hat{A}$ is symmetric about its mean, namely that $\hat{E}$ has the same distribution as $-\hat{E}$.
  4. Residual Norm (Ruling out Pathology #3): We specifically choose our vector norm of interest to be the residual norm, i.e., the norm induced by $W = A^\top A$.

Main Theorem: Under the assumptions (1) through (4) above, we always have $\beta^* \geq 0$.
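
As an informal numerical check of this statement (under the simple shrinkage shift, an isotropic sampling distribution $\Sigma = I$, and symmetric two-point noise $\pm \hat{E}$), the sketch below draws random ensembles and verifies that the optimal shift factor under the residual norm never comes out negative. This is only a simulation that is consistent with the theorem, not a substitute for it.

```python
import numpy as np

rng = np.random.default_rng(2)

def beta_star_residual(A, E):
    # Optimal shift factor for the simple shift under the residual norm
    # (W = A^T A, Sigma = I), with symmetric two-point noise +/- E.
    Ms = [A @ np.linalg.inv(A + s * E) for s in (+1.0, -1.0)]
    num = np.mean([np.trace(M) for M in Ms])
    den = np.mean([np.sum(M * M) for M in Ms])
    return 1.0 - num / den

worst = np.inf
for _ in range(10000):
    n = rng.integers(2, 5)
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # keep A comfortably invertible
    E = 0.5 * rng.standard_normal((n, n))
    worst = min(worst, beta_star_residual(A, E))
print("smallest beta* observed:", worst)   # stays >= 0 (up to roundoff)
```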

Note that this result is significantly weaker than the corresponding result for elliptic methods. Moreover, it is entirely possible to force $\beta^* = 0$ even under these constraints. The key to this would be to construct a matrix ensemble with the property that

However, if one rules out this weaker pathology, so that the probability of this event is not one, then we have a stronger version of the main theorem:

Main Theorem (Alternate): Under the assumptions (1) through (4) above, if additionally the noise is not equal to its negative transpose almost surely, then we always have $\beta^* > 0$.

However, even this stronger version doesn't provide error reduction guarantees in the same way that elliptic operator shifting does. This is because the asymmetry of the noise matrix is directly related to how well operator shifting works -- the farther away the noise is from its negative transpose, the more it is possible to de-bias the naive estimator $\hat{A}^{-1}$. The same pathology is again not possible in the elliptic case.
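
To see the degenerate case concretely, here is one last hypothetical check (with $A = I$, so the noise and its normalized version coincide): when the noise is exactly equal to its negative transpose, the optimal shift factor under the residual norm collapses to zero, and it moves strictly above zero once a symmetric component is mixed in.

```python
import numpy as np

def beta_star(A, E):
    # Optimal simple-shift factor under the residual norm, symmetric +/- E noise.
    Ms = [A @ np.linalg.inv(A + s * E) for s in (+1.0, -1.0)]
    return 1.0 - np.mean([np.trace(M) for M in Ms]) / np.mean([np.sum(M * M) for M in Ms])

A = np.eye(2)
skew = np.array([[0.0, 0.3], [-0.3, 0.0]])   # equal to its negative transpose
sym  = np.array([[0.2, 0.1], [0.1, -0.2]])   # symmetric component

print("purely antisymmetric noise:", beta_star(A, skew))        # ~ 0
print("with a symmetric component:", beta_star(A, skew + sym))  # > 0
```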