A Random Walk With Drift.

Paper: Operator Shifting for Noisy Elliptic Operators

Philip A. Etter

Published in Research in the Mathematical Sciences (Springer)

Paper Link

Introduction

In this paper, we examine linear systems corrupted by noise and consider the problem of how to improve the accuracy of linear solves using statistical techniques. In essence, we suppose that we have a linear system of the form

$$\hat{A} \hat{x} = b,$$

where $\hat{A} = A(\hat{\theta})$ is a symmetric positive definite matrix sampled from a distribution parameterized by a set of variables $\hat{\theta}$. For example, $\hat{A}$ may be the Hessian of a mini-batch of a machine learning optimization problem. Or perhaps, $\hat{A}$ might be a scattering operator across an inhomogeneous background medium, where $\hat{\theta}$ represents the background. Our work assumes that while the true $A = A(\theta)$ may be unknown (i.e., we only have access to a noisy sample $\hat{A}$ of $A$), we have a model for how the sample $\hat{A}$ may be produced from the ground truth $A$. The goal is then to try to approximate the solution of the "ground truth" linear system

$$A x = b,$$

using only our noisy sample $\hat{A}$ and the corresponding noisy parameters $\hat{\theta}$.
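To make the setup concrete, here is a small numerical sketch in Python. The noise model below (an empirical second-moment matrix with $\mathbb{E}[\hat{A}] = A$) is purely illustrative and chosen by us; the paper itself is agnostic to the specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 200  # dimension and samples per noisy draw (both hypothetical)

# Hypothetical ground-truth SPD operator A and right-hand side b.
Q = rng.standard_normal((n, n))
A = Q @ Q.T / n + np.eye(n)
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)

# Illustrative noise model (an assumption, not from the paper):
# A_hat is an empirical second-moment matrix with E[A_hat] = A,
# symmetric positive definite whenever m >= n.
L = np.linalg.cholesky(A)
W = L @ rng.standard_normal((n, m))  # columns distributed N(0, A)
A_hat = W @ W.T / m

x_naive = np.linalg.solve(A_hat, b)
print("naive relative error:",
      np.linalg.norm(x_naive - x_true) / np.linalg.norm(x_true))
```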

In the literature there are a number of existing techniques for producing better solutions via regularization, i.e., instead of solving the linear problem

$$\hat{A} x = b,$$

we might instead solve the linear problem with regularization

$$\min_x \; \|\hat{A} x - b\|^2 + R(x),$$

where $R$ is an appropriate regularizer. It is common to have $R$ be the result of a prior placed on $x$, such that the result of the optimization problem above coincides with the maximum a posteriori (MAP) estimate of $x$ -- see, for example, Tikhonov regularization, where $R(x) = \gamma \|x\|^2$ corresponds to an isotropic Gaussian prior on $x$.
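For concreteness, a minimal sketch of a Tikhonov-style regularized solve via the normal equations; the regularization weight `gamma` is a hypothetical tuning parameter.

```python
import numpy as np

def tikhonov_solve(A_hat, b, gamma):
    """Solve min_x ||A_hat x - b||^2 + gamma ||x||^2 via the normal
    equations (A_hat^T A_hat + gamma I) x = A_hat^T b."""
    n = A_hat.shape[0]
    return np.linalg.solve(A_hat.T @ A_hat + gamma * np.eye(n), A_hat.T @ b)
```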

Our Approach

In contrast, our paper considers the above problem from the point of view of statistical estimation of the solution operator $A^{-1}$. That is, instead of using the naive estimate $\hat{A}^{-1}$ as our estimate of the solution operator $A^{-1}$, we want to build a correction to the naive estimator which produces a uniformly better estimation error,

$$\mathbb{E}\left[\|(\hat{A}^{-1} - \lambda \hat{K}) - A^{-1}\|^2\right] \leq \mathbb{E}\left[\|\hat{A}^{-1} - A^{-1}\|^2\right],$$

where $\|\cdot\|$ is an appropriate matrix norm, the shift operator $\hat{K}$ is a matrix function of $\hat{A}$, and the shift factor $\lambda$ is a scalar quantity that determines the magnitude of the correction. In fact, this formulation supersedes the problem of estimating $x$ itself, because the estimation error of $\hat{x} = \hat{A}^{-1} b$ is just the estimation error of $\hat{A}^{-1}$ under a correct choice of matrix norm. A scale-invariance analysis suggests that $\hat{K}$ should be linear in $\hat{A}^{-1}$, and hence, we explicitly consider corrections of the form

$$\hat{K} = \sum_i M_i \hat{A}^{-1} N_i,$$

where $M_i, N_i$ are matrices. Of course, when $M_i = N_i = I$, we have the very simple form of isotropic scaling of $\hat{A}^{-1}$,

$$\hat{A}^{-1} - \lambda \hat{A}^{-1} = (1 - \lambda)\, \hat{A}^{-1}.$$
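In the isotropic case the corrected solve is essentially a one-liner. A minimal sketch (the function name is ours; choosing the shift factor is the subject of the rest of the post):

```python
import numpy as np

def shifted_solve(A_hat, b, lam):
    """Isotropic operator shifting: estimate x = A^{-1} b
    by (1 - lam) * A_hat^{-1} b, shrinking the naive solve."""
    return (1.0 - lam) * np.linalg.solve(A_hat, b)
```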

In this respect, the approach is reminiscent of James-Stein estimation, except applied to matrix inverses instead of vector quantities. Like Stein estimation, the key to the approach is to try to estimate the shift factor $\lambda^*$ which produces the largest reduction in estimation error,

$$\lambda^* = \arg\min_{\lambda} \; \mathbb{E}\left[\|(\hat{A}^{-1} - \lambda \hat{K}) - A^{-1}\|^2\right],$$

where $\|\cdot\|$ is an arbitrary matrix inner product norm. Why do we expect this procedure to be effective? There are two fundamental reasons: one is related to James-Stein estimation, and the other is a special property of the matrix inverse.

[Figure: Jensen's inequality applied to inversion -- the convexity of $x \mapsto 1/x$ means noisy estimates $\hat{a}^{-1}$ overshoot $\mathbb{E}[\hat{a}]^{-1}$ on average.]

  1. First, note that inversion is a convex operation. If we have a positive random variable $\hat{a}$ and we use $\hat{a}^{-1}$ to attempt to estimate the quantity $\mathbb{E}[\hat{a}]^{-1}$, we may actually overshoot by a very substantial amount, as seen in the figure above. Indeed, strict Jensen's inequality tells us that $\mathbb{E}[\hat{a}^{-1}] > \mathbb{E}[\hat{a}]^{-1}$. The same fact also applies to matrices: if $\hat{A}$ is always symmetric positive definite, then it can be shown that $\mathbb{E}[\hat{A}^{-1}] \succeq \mathbb{E}[\hat{A}]^{-1}$, and hence the naive estimator will overshoot on average (see the numerical sketch after this list). We therefore interpret operator shifting as an attempt to correct this bias.

  2. The second reason is the same reason why James-Stein estimation works. Recall that the error of an estimator can be decomposed into error from bias and error from variance (see Bias-Variance Tradeoff). Therefore, not only can we expect operator shifting to reduce the bias of the naive estimator (as noted in (1)), but we can also expect it to reduce the variance as well! Furthermore, unlike the James-Stein setting, the reduction of bias and the reduction of variance are not even in conflict with each other. In this sense, operator shifting is basically a free lunch, as it works to reduce both bias and variance simultaneously.
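As promised in (1), here is a quick scalar illustration of the Jensen overshoot:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.5, size=1_000_000)  # positive random variable, E[a] = 1
print("E[1/a] =", (1.0 / a).mean())  # ~1.099 (= ln 3): overshoots ...
print("1/E[a] =", 1.0 / a.mean())    # ~1.000: ... the target quantity
```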

Indeed, one of the fundamental results (Theorem 4.2) we show in the paper is that under mild assumptions one can expect an optimal shift factor $\lambda^*$ to always fall within the range

$$\rho \leq \lambda^* \leq 2\rho, \qquad \rho \equiv \frac{\mathbb{E}\left[\|\hat{A}^{-1} - A^{-1}\|^2\right]}{\mathbb{E}\left[\|\hat{A}^{-1}\|^2\right]}.$$

Note that the quantity $\rho$ is the ratio of the error of the naive estimator to its second moment. Furthermore, this range means that optimal operator shifts give a relative error reduction of at least

$$\rho = \frac{\mathbb{E}\left[\|\hat{A}^{-1} - A^{-1}\|^2\right]}{\mathbb{E}\left[\|\hat{A}^{-1}\|^2\right]}.$$
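As a sanity check, both $\rho$ and the optimal isotropic shift factor $\lambda^*$ can be estimated by Monte Carlo. The sketch below uses the Frobenius norm and the same illustrative Wishart-style ensemble as before; the theorem's precise norm and moment assumptions are glossed over here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 20, 100, 400
Q = rng.standard_normal((n, n))
A = Q @ Q.T / n + np.eye(n)          # hypothetical ground truth
L = np.linalg.cholesky(A)
A_inv = np.linalg.inv(A)

# Monte Carlo estimates (Frobenius norm, isotropic shift K_hat = A_hat^{-1}).
err2 = mom2 = cross = 0.0
for _ in range(trials):
    W = L @ rng.standard_normal((n, m))
    Ah_inv = np.linalg.inv(W @ W.T / m)         # one noisy sample's inverse
    err2  += np.sum((Ah_inv - A_inv) ** 2)      # ||A_hat^-1 - A^-1||_F^2
    mom2  += np.sum(Ah_inv ** 2)                # ||A_hat^-1||_F^2
    cross += np.sum((Ah_inv - A_inv) * Ah_inv)  # <A_hat^-1 - A^-1, A_hat^-1>

rho = err2 / mom2        # error-to-second-moment ratio
lam_star = cross / mom2  # optimal isotropic shift factor for this ensemble
print(f"rho = {rho:.4f}, 2*rho = {2*rho:.4f}, lambda* = {lam_star:.4f}")
```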

Estimating the Optimal Shift Factor

Note that solving the fundamental optimization problem

$$\min_{\lambda} \; \mathbb{E}\left[\|(\hat{A}^{-1} - \lambda \hat{K}) - A^{-1}\|^2\right]$$

is unfortunately not possible, because we do not know $A$. Our approach to produce an estimate $\hat{\lambda}$ of $\lambda^*$ is to perform a bootstrap procedure: we treat $\hat{A}$ as the ground truth and then draw synthetic samples $\hat{A}'$ of $\hat{A}$ to obtain a Monte Carlo approximation of the above objective function. Afterwards, optimization boils down to solving a quadratic in $\lambda$, which can be done in closed form,

$$\hat{\lambda} = \frac{\mathbb{E}'\left[\langle \hat{A}'^{-1} - \hat{A}^{-1}, \hat{K}' \rangle\right]}{\mathbb{E}'\left[\|\hat{K}'\|^2\right]},$$

where $\mathbb{E}'$ denotes expectation over the synthetic samples $\hat{A}'$ and $\hat{K}'$ denotes the shift operator evaluated at $\hat{A}'$.
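A minimal sketch of this bootstrap procedure for the isotropic case ($\hat{K} = \hat{A}^{-1}$, Frobenius norm). The interface `sample_like`, which encodes the assumed noise model, is our own hypothetical construction, not the paper's API.

```python
import numpy as np

def bootstrap_shift_factor(A_hat, sample_like, n_boot=100, rng=None):
    """Bootstrap estimate of the isotropic shift factor.

    `sample_like(M, rng)` is a user-supplied function encoding the noise
    model: it returns a synthetic noisy sample drawn as if M were the
    ground truth.  Minimizes the Monte Carlo estimate of
    E'||A'^{-1} - lam * A'^{-1} - A_hat^{-1}||_F^2, a quadratic in lam.
    """
    rng = rng if rng is not None else np.random.default_rng()
    A_hat_inv = np.linalg.inv(A_hat)
    num = den = 0.0
    for _ in range(n_boot):
        Ap_inv = np.linalg.inv(sample_like(A_hat, rng))   # synthetic sample
        num += np.sum((Ap_inv - A_hat_inv) * Ap_inv)  # <A'^-1 - A_hat^-1, K'>
        den += np.sum(Ap_inv ** 2)                    # ||K'||_F^2
    return num / den
```

For the Wishart-style model from earlier, `sample_like` would simply redraw an empirical second-moment matrix treating its argument as the true covariance.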

However, this quantity involves $\hat{A}'^{-1}$, which means that every Monte Carlo sample will require a full matrix solve with a new matrix. For many applications this can be prohibitively expensive. In the latter half of the paper we show that it is possible to build a series of monotonically increasing approximations $\hat{\lambda}_k$ to $\hat{\lambda}$ such that

$$\hat{\lambda}_1 \leq \hat{\lambda}_2 \leq \ldots \leq \hat{\lambda}_k \leq \ldots \leq \hat{\lambda},$$

and such that each $\hat{\lambda}_k$ involves only the term $\hat{A}^{-1}$ and a polynomial of degree at most $k$ in $\hat{A}'$. This means that one can precompute a matrix inverse (or factorization) of $\hat{A}$ once and reuse it for every Monte Carlo sample, instead of recomputing an inverse from scratch for each sample. This offers a substantial amount of compute time savings in estimating the optimal operator shift.
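The paper constructs the $\hat{\lambda}_k$ with care to preserve monotonicity; the sketch below only illustrates the enabling mechanism -- a truncated Neumann-type expansion of $\hat{A}'^{-1}$ around the precomputed $\hat{A}^{-1}$ -- and is our own simplification.

```python
import numpy as np

def neumann_inverse(A_hat_inv, A_prime, k):
    """Order-k Neumann approximation of A_prime^{-1}:
        A'^{-1} ~= sum_{j=0..k} (I - A_hat^{-1} A')^j A_hat^{-1},
    which reuses the precomputed A_hat^{-1} instead of a fresh solve.
    Valid when the spectral radius of (I - A_hat^{-1} A') is below 1."""
    M = np.eye(A_hat_inv.shape[0]) - A_hat_inv @ A_prime
    term = A_hat_inv.copy()   # j = 0 term
    total = A_hat_inv.copy()
    for _ in range(k):
        term = M @ term       # next power of the Neumann series
        total = total + term
    return total
```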

Conclusion

We show in the paper that using this technique can substantially reduce error in a number of computations involving noisy graph Laplacians. However, the scope of the paper is limited to elliptic (i.e., symmetric positive definite) operators. We also study the case of general matrices in a separate paper. Needless to say, the case of general matrices admits substantially less mathematical theory, due to a number of interesting pathological cases that can cause strange behavior in the estimation problem. If the reader is interested, we recommend reading our blog post on the follow-up paper.