From Solomonoff Induction to the Relationship between World, Human and Machine (Part 2)
A thorough illustration of the beauty of Solomonoff induction and its optimality, why such an optimal prediction machine cannot be built to achieve AGI, and why it nonetheless has a profound impact.
In part 1 of the series, we showed that the essence of successful inductive inference (making predictions based on past experiences) lies in making reasonable assumptions, in the right amount, about the environment that generates the data we observe. The assumption made in Solomonoff’s induction theory is that the observed data is generated by an unknown, potentially stochastic Turing machine (which is equivalent to a computer program with access to a random number generator).
To more formally state the prediction problem Solomonoff tried to solve, let’s denote this unknown program as 𝜇, which generates an endless bit stream, say 011011100001010… Before the next bit is generated, we can predict the bit and its probability, given the bits that have been observed so far.
Solomonoff’s inductive inference theory answers the following question - how can we minimize the accumulated error between our predicted probability and the true probability?
The High Level Idea
Solomonoff’s induction technique starts by assigning a probability to any prefix X of the endless sequence. Let’s call this probability Prm[X], the probability we assign to X according to our model m. If we can derive such a probability for an arbitrary prefix, we can then calculate the probability of the next bit being 1 as

Prm[xn+1 = 1 | X] = Prm[X1] / Prm[X]

where X = x1x2…xn and Prm[X1] is the probability that the sequence has the prefix “X followed by 1”.
But how can we assign a reasonable probability to an arbitrary prefix X? Under the frequentist definition of probability, we would have to run the program many, many times. But in our setting, we only get to observe X once.
Since the only information we know about X is that it is the prefix of an endless sequence generated by some program, to derive the probability of observing X, we would need to consider the distribution of all the possible programs out there, and each program’s probability of generating X.
That’s the high level idea of Solomonoff’s “Algorithmic Probability”, which I will explain in detail in the next section.
Algorithmic Probability
To derive Prm[X] for an arbitrary prefix X of an endless sequence generated by some unknown program, we start by considering all the computer programs of some programming language. The applicable programming languages here are very similar to our regular programming languages, with two small constraints:
Programs written in this language cannot go back and modify what they have already output.
Programs written in this language can start producing output before the whole program has been read, and they read more of the program text only when it is needed to generate further output.
You can think of these constraints as Python running in interactive command-line mode: it produces output as you enter code line by line, and you can never go back and overwrite what it has already printed. The reason we need these two constraints will become clear when we discuss the exact formula for Prm[X].
This is called a prefix Turing machine, which has the same computing power as a regular Turing machine.
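To make the two constraints concrete, here is a minimal toy sketch. The instruction semantics are made up for illustration (this is not a universal machine): the interpreter reads program bits only when it needs them, and it can only ever append to its output.

def run_monotone(program_bits, max_output):
    # Toy semantics, invented for illustration: read two program bits at a
    # time; "00" prints "0", "01" prints "1", "10" prints "01", and "11"
    # repeats the last output bit forever (truncated at max_output here).
    # Crucially, more of the program is read only on demand, and output
    # that has been produced is never revised.
    out = ""
    bits = iter(program_bits)
    while len(out) < max_output:
        try:
            op = next(bits) + next(bits)  # fetch program bits only when needed
        except StopIteration:
            break                         # program exhausted; output is final
        if op == "00":
            out += "0"
        elif op == "01":
            out += "1"
        elif op == "10":
            out += "01"
        else:  # "11"
            out += (out[-1] if out else "0") * (max_output - len(out))
    return out

print(run_monotone("011000", max_output=8))  # -> "1010"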
As we observe the generated prefix X, we keep only those programs whose output so far is consistent with X, and discard the rest. Among the remaining programs, some will have a prefix that is itself a compatible program; we discard those as well and keep only the “minimal” programs. For example, say X is “0101001001”. Consider the following 3 programs:
Program A:
print("01"*2, end="")
Program B:
print("01"*2, end="")
print("001"*2, end="")
Program C:
print("01"*2, end="")
print("001"*2, end="")
while True:
    print("0001"*2, end="")
Program A halts after printing only “0101”, so its output can never start with X; it will be discarded. Both Program B’s and Program C’s outputs are compatible with X, but because B is a prefix of C, we will only consider Program B for now.
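Here is a small sketch of this filtering step, with the three programs’ outputs written out by hand. The halted flag and the compatible helper are my own illustration, not part of Solomonoff’s formalism:

X = "0101001001"  # the observed prefix

def compatible(output, halted, x):
    # Keep a program if its output can still start with x: either it has
    # already produced at least len(x) bits and they start with x, or it
    # has not halted yet and its output so far is a prefix of x.
    if len(output) >= len(x):
        return output.startswith(x)
    return (not halted) and x.startswith(output)

print(compatible("01" * 2, halted=True, x=X))                        # Program A: False (halted after "0101")
print(compatible("01" * 2 + "001" * 2, halted=True, x=X))            # Program B: True
print(compatible("01" * 2 + "001" * 2 + "0001", halted=False, x=X))  # Program C: True, but B is kept instead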
The idea of keeping track of all compatible hypotheses is deeply connected to the philosophical principle of plenitude.
So with the observed prefix X, we arrive at a set of compatible programs. Let’s denote this set as P(X); as mentioned above, no program in P(X) is a prefix of another program in P(X). Now we can assign a probability to the observed sequence X using the following equation:

Prm[X] = Σ p∈P(X) 2^(-|p|)

where |p| is the length of the program p in bits.
The simplest way to understand this equation is as follows. If we represent all programs as sequences of bits, we can randomly generate a program by repeatedly flipping an unbiased coin. For a program p of length |p|, the probability of generating p after |p| coin flips is 2^(-|p|). Since our programming language runs on a prefix Turing machine, every program with p as a prefix will also generate output that starts with X, and thus 2^(-|p|) is the probability of X being generated by any program with prefix p. By summing over all p ∈ P(X), Prm[X] is essentially the probability that X is a prefix of the output of a randomly generated program in the chosen programming language. This probability is called algorithmic probability, as it assigns a probability to every string by considering all the possible algorithms that could generate it. As you can see, requiring that no program in P(X) be a prefix of another program in P(X) ensures that we are not double counting probabilities in Prm[X].
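As a sketch, suppose we had somehow found a few minimal compatible programs and their lengths in bits (the numbers below are hypothetical). The formula is then just a sum of 2^(-|p|) terms, and any finite enumeration gives a lower bound on the true Prm[X]:

def algorithmic_probability_lower_bound(minimal_program_lengths):
    # Sum 2^(-|p|) over a prefix-free set of minimal compatible programs.
    # We can only ever enumerate finitely many of them, so this is a lower
    # bound on the true (incomputable) Prm[X].
    return sum(2.0 ** (-length) for length in minimal_program_lengths)

# Hypothetical lengths of three minimal programs compatible with X.
print(algorithmic_probability_lower_bound([12, 15, 20]))  # 2^-12 + 2^-15 + 2^-20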
In the formula, shorter programs are assigned higher probability. This can be seen as a formalization of Occam’s razor, which states that, all else being equal, shorter explanations should be preferred over longer ones.
Algorithmic probability naturally combines the power of the Turing machine, the principle of plenitude and Occam’s razor. As we will see in the next section, this combination also gives us strong theoretical results. Algorithmic probability is an elegant and powerful concept.
Making Predictions with Algorithmic Probability
Now that we have a way to assign a probability to an arbitrary prefix X, we can derive the conditional probability

Prm[xn+1 | x1x2…xn] = Prm[x1x2…xnxn+1] / Prm[x1x2…xn]

and use it to make predictions. For example, whenever Prm[xn+1 = 1 | x1x2…xn] is larger than ½, we predict xn+1 to be 1; otherwise we predict it to be 0.
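The prediction rule itself is a one-liner once we have an approximation of Prm. The sketch below plugs in a toy stand-in model of my own rather than the real, incomputable algorithmic probability:

def predict_next_bit(pr, x):
    # pr(prefix) stands in for Prm[prefix]. Predict 1 if the conditional
    # probability of the next bit being 1 exceeds 1/2, otherwise predict 0.
    p1 = pr(x + "1") / pr(x)
    return (1 if p1 > 0.5 else 0), p1

def toy_pr(x):
    # Toy stand-in for Prm: a model that expects the alternating pattern
    # 0101... and gives each expected bit probability 0.9.
    p = 1.0
    for i, bit in enumerate(x):
        expected = "0" if i % 2 == 0 else "1"
        p *= 0.9 if bit == expected else 0.1
    return p

print(predict_next_bit(toy_pr, "0101"))  # (0, ~0.1): the model expects the next bit to be 0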
Let’s further study the prediction’s effectiveness in the specific setting where the unknown program is deterministic. It is easy to prove that if we follow the above rule to keep predicting the next bit of the endless sequence, we will make no more than K mistakes in total, where K is the length of the shortest program that generates the sequence, written in the same language that the prediction algorithm uses.
K is also called the Kolmogorov complexity of the sequence, a notion Kolmogorov discovered independently a few years after Solomonoff’s initial discovery.
To show why this is true, remember that there is at least one program of length K that generates the endless sequence, which means the algorithmic probability of the sequence is no less than 2^(-K). More precisely, for arbitrarily large n, we have:

Prm[x1x2…xn] ≥ 2^(-K)

Rewriting this using the chain rule of probability, we get:

Prm[x1] · Prm[x2|x1] · … · Prm[xn|x1x2…xn-1] ≥ 2^(-K)
Notice that each factor on the left is at most 1, so the number of factors whose value is at most ½ can be no larger than K; otherwise the product would fall below 2^(-K). Because our prediction rule makes a mistake only when the conditional probability it assigned to the true next bit is at most ½, the total number of mistakes we make over the entire sequence is at most K.
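A quick numeric sanity check of this counting argument, with made-up conditional probabilities assigned to the true bits of some sequence:

import math

# Hypothetical values of Prm[x_{i+1} | x_1...x_i] for the true bits.
conditionals = [0.6, 0.3, 0.9, 0.4, 0.95, 0.99, 0.45, 0.97]

product = math.prod(conditionals)   # this must be >= 2^(-K)
implied_K = -math.log2(product)     # so K >= -log2(product)
mistakes = sum(1 for p in conditionals if p <= 0.5)

# Every factor <= 1/2 at least halves the product, so the number of such
# factors (and hence of mistakes) can never exceed -log2(product) <= K.
print(f"K >= {implied_K:.2f}, mistakes so far: {mistakes}")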
The General Case
Solomonoff’s induction can also be applied to the general case where the data generator 𝜇 may be stochastic, i.e., a program that can use a random number generator anywhere in its code. The proof of the general theory is beyond the scope of this post, but Solomonoff showed that

Σn E[(Prm[xn+1 = 1 | x1x2…xn] − Pr𝜇[xn+1 = 1 | x1x2…xn])²] ≤ ln(2)/2 · K(𝜇)

which is to say, using Solomonoff’s induction technique, the expected accumulated squared error of the predicted probability of the next bit is bounded by the Kolmogorov complexity of the data generator 𝜇 multiplied by ln(2)/2. This is called the completeness of Solomonoff induction, and Solomonoff induction is the only known induction technique that achieves such a tight bound.
The Incomputability of Solomonoff Induction
If Solomonoff induction is the optimal predictor, why haven’t we built it and used it to discover new science, materials and drugs, and to solve unsolved math problems? That would be the true AGI or ASI that people have been searching for.
If you think about implementing algorithmic probability, you will start to notice the problem. To calculate the algorithmic probability of a prefix X up to some precision, one has to search through all valid programs in decreasing order of probability, run each program as a subroutine to generate up to |X| bits, and check whether its output starts with X. But since you cannot determine whether a program will ever halt (Turing’s halting problem), your implementation will either get stuck forever (if you let a subroutine run indefinitely) or fail to guarantee the required precision (if you kill a subroutine before it finishes). In other words, algorithmic probability, and thus Solomonoff induction, is incomputable.
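A naive attempt makes the problem concrete. The sketch below assumes a hypothetical interpreter run(program, max_steps) that returns the output produced within a step budget plus a halted flag; that budget is exactly where the halting problem bites, because a killed program might still have gone on to produce X.

import itertools

def approx_prefix_probability(run, X, max_len, max_steps):
    # Enumerate every bit-program up to max_len bits, shortest first, and
    # sum 2^(-|p|) over minimal programs whose (truncated) output starts
    # with X. Both cut-offs make this only an approximation with no
    # guaranteed precision.
    total, kept = 0.0, []
    for length in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=length):
            p = "".join(bits)
            if any(p.startswith(q) for q in kept):
                continue  # a shorter kept program is a prefix of p: skip to avoid double counting
            output, _halted = run(p, max_steps)
            if output.startswith(X):
                kept.append(p)
                total += 2.0 ** (-length)
    return total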
Interestingly, it turns out that any prediction technique that is complete must also be incomputable. We run into yet another theoretical impossibility when trying to solve something too general (here, being too general means having a very weak inductive bias)!
One can prove this by construction. For every computable predictor, you can construct a computable data generator that takes the predictor as a subroutine. When generating the next bit, the data generator runs the subroutine to get the predictor’s probability for the next bit and always outputs the bit to which the predictor assigns the lower probability. Against this data generator, the predictor’s accumulated error is unbounded, so the predictor cannot be complete.
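Here is a sketch of that construction, where the predictor is assumed to be a function returning its predicted probability that the next bit is 1 given the prefix seen so far:

def adversarial_generator(predictor, n_bits):
    # At every step, emit the bit to which the predictor assigns the lower
    # probability, so the predictor is wrong (or maximally unsure) every time.
    prefix = ""
    for _ in range(n_bits):
        p1 = predictor(prefix)
        prefix += "0" if p1 > 0.5 else "1"
    return prefix

# Example: a predictor that always bets on repeating the previous bit
# is forced into an error on every single step.
always_repeat = lambda prefix: 0.9 if prefix.endswith("1") else 0.1
print(adversarial_generator(always_repeat, 8))  # "10101010": every prediction was wrong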
The proof highlights that prediction is a game between the predictor and the data generator, and unfortunately, the data generator will have the upper hand.
Solomonoff Induction’s Evolution, Applications & Implications
Solomonoff invented algorithmic probability and his induction technique between 1960 and 1964. Kolmogorov (in 1965) and Chaitin (in 1966) independently published their discoveries of algorithmic complexity. Their work became the foundation of Algorithmic Information Theory, which has had a profound impact on machine learning, cryptography and many other fields.
Within machine learning, their pioneering work has grown into a general direction of AI/ML based on discrete program search and synthesis. Much effort has been devoted to applying it in reinforcement learning settings and to making it tractable and practically useful, by restricting the complexity of the data generator and shrinking the program search space, at the cost of theoretical completeness.
Though discrete program search is not the mainstream ML approach at the moment, largely because it is still too inefficient and compute intensive, the idea has been fundamental to some of the most advanced AI systems, including Google DeepMind’s FunSearch system, which made new discoveries on long-standing open math problems. Many people believe this direction will give AI the ability to reason and generalize, which is what current statistical ML models lack.
And to me, what is equally important are the philosophical implications of Solomonoff’s theory of inductive inference: the relationship between the world as the data generator and us as the predictors, what has driven us to where we are today, and what our technology can and cannot do. That discussion will be the topic of the last part of the series.