4 min read

Approximation goes both ways

An interesting thing about approximation is that it goes both ways. That is: If \(f(x)\) approximates \(g(x)\) under conditions \(Z\), then it's also true that \(g(x)\) approximates \(f(x)\) under conditions \(Z\).

The most interesting example to me is approximating a discrete thing with a continuous thing. For example, a 2020 arXiv theory paper on machine learning found that "Every Model Learned by Gradient Descent is Approximately a Kernel Machine" (Pedro Domingos), and the author proves their result by approximating a discrete process (gradient descent) with a continuous one. Another example: In his 2016 (?) book The Art of Doing Science and Engineering, Richard Hamming illustrates roughly one way of deriving Stirling's approximation for factorials, which works by approximating a discrete sum with a continuous integral. These examples are interesting to me because they approximate in the opposite direction from what I'm used to.

I'm going to get into some detail about both examples.

1.a. Gradient descent

(Disclaimer: I am not an expert. I am not a machine learning theory person. I never took a real analysis course. Take that as you will.)

Here is my rough understanding of how the discrete-by-continuous approximation works in the gradient descent paper. To prove their result they use what is, per the paper, a standard framework of approximating the discrete process of gradient descent with a continuous process. The actual proof of the kernel machines result is complicated and I'm not going to try to explain it because I don't understand it myself. But I will try to explain the framework in detail.

To explain what that means, let's look at a slightly more specific example. Suppose you train a 3 billion parameter model for some integer number of steps \(N\), say \(N = 1000\), with some small learning rate \(\epsilon > 0\), say \(\epsilon = 10^{-7}\). You can think of this training process as producing a discrete curve of \(N + 1\) points in "weight space" \(R^{3 \times 10^9}\): The initial weights, then the weights after step 1, then the weights after step 2, et cetera. Let's now suppose that we hold \(N \cdot \epsilon\) constant so that our discrete curve's length in weight space remains roughly constant: As we increase \(N\) we decrease \(\epsilon\) to compensate so that we take \(N\) steps of displacement \(\epsilon\).

(This isn't quite true because it ignores the fact that we're using the gradients at each step. At each step we calculate a new gradient based on the current weights. So, the total curve length depends on each of the gradients, not just on the number of steps and the learning rate. However, ignoring that for now...)

This lets us take the limit and get a continuous thing. We can take the limit as \(\epsilon\) goes to 0, which means \(N\) goes to infinity, so that we can think of our weights as tracing out a continuous curve in weight-space under this infinitesimally small learning rate. Instead of discrete steps \(j \in \mathbb{N}_0\), we have continuous real-valued times \(t \in [0, 1]\). Instead of having a discrete curve (also known as a list) of weights at each training step, \(w_0, w_1, w_2, \dots, w_N\), which maps each index 0 through \(N\) to points in \(R^{3 \times 10^9}\), we have a continuous curve \(c(t)\), which maps \([0, 1]\) to points in \(R^{3 \times 10^9}\). These should "stay close to each other" so long as the learning rate \(\epsilon\) is sufficiently small.

We can then use this continuous curve to prove things that are approximately true about the discrete process. This should work to the extent that the continuous curve is approximately the same as the discrete curve we started out with. For example, one might try to prove that the result of the learning process is the same as a certain theoretical kernel machine.

2.b. Stirling's approximation

Here's my rough recollection of the discrete-by-continuous part of the argument Hamming gave. First you take the logarithm of \(n!\) which breaks it up into a sum \(\ln n! = \sum_{j=1}^n \ln j\). Next, you recall that integrals are related to sums, so you could approximate this by using a smooth integral \(\int_1^n \ln x \;\mathrm{d}x = [1/x]_1^n = n \ln n - 1\). Next, to relate this back to the sum of logs, we can try using the trapezoid rule, which when you expand it out tells us that:

\[\int_1^n \ln x \;\mathrm{d}x = n \ln n - n + 1 \approx \frac{1}{2}\ln 1 + \ln 2 + \ln 3 + \dots + \ln (n - 1) + \frac{1}{2} \ln n\]

Note that the \(\ln 1\) term is zero, since \(\exp(0) = 1\), so the coefficient of \(\frac{1}{2}\) on that term doesn't matter and we can change it freely. So the above equation implies that:

\[\ln 1 + \ln 2 + \ln 3 + \dots + \ln (n - 1) + \ln n \approx n \ln n - n + 1 + \frac{1}{2}n\]

Finally, you can take the exponential to get:

\[n! \approx (n/e)^n \cdot e \sqrt{n}\]

(For reasons I haven't tried especially hard to understand, Hamming ends up using a factor of \(C\) close to \(e\) here rather than using \(e\) itself, and \(C\) ends up being \(\sqrt{2\pi n}\).)

So there you go. You approximate the discrete thing by a continuous thing. The discrete thing is the product which we convert to a sum of logs. The continuous thing is an integral.

2.

This is interesting because it's the opposite of my canonical examples of what approximation looks like. In my calculus classes, usually we were interested in approximating continuous things with discrete things. For example, approximating an integral with a Riemann sum, with the trapezoid rule, with the midpoint rule, or with Simpson's rule. Or: Approximating \(f'(x)\) by looking at for example \((f(x) - f(x + h))/h\) for some small \(h\). Similarly, approximating \(\lim_{x \to x_0} f(x)\) by evaluating \(f(x_0 + \delta)\) for some small displacement \(\delta\). Approximating things in the opposite way seems unusual and therefore interesting.

It is slightly boring that these are both integrals. Are there other examples of discrete-by-continuous approximations that don't involve integrals? I'd be interested to know.