From Solomonoff Induction to the Relationship between World, Human and Machine (Part 1)
In a series of posts, I will introduce Solomonoff's theory of inductive inference, an idealized model of universal prediction, and take a deep dive into the philosophical implications of the theory.
One of the brain's primary functions and strengths is its ability to make predictions. When interacting with the real world, our brain anticipates what it will perceive next based on our surroundings and past experiences. It also forecasts the consequences of the body's actions so it can prepare for them. This predictive approach is similarly applied when solving abstract problems: we anticipate the best route to take based on the current issue and our past experiences, continually updating our predictions as we work through the problem.
The process of making predictions based on past experiences, known as "inductive inference," has been a topic of debate among philosophers for centuries. It evidently works in practice, but attempts to justify it inevitably rely on assumptions derived from inductive inference itself, resulting in circular reasoning. What's worse, there is a "no free lunch" theorem (NFL) suggesting that no kind of inductive inference can do better than random guessing when averaged over all possible scenarios.
To illustrate the concept of NFL, let's consider a sequence of 100 characters, consisting only of 0s and 1s, generated by an unknown procedure F. You are given the first 50 characters of the sequence, and your task is to predict the remaining 50. If the first 50 characters are all 0s, how would you predict the next 50? If you assume the characters are generated by repeatedly flipping a possibly biased coin, i.e., all the characters are independent and identically distributed (i.i.d.), you should predict the next 50 characters as all 0s, and you would get a very high success rate if your assumption is correct. But what if the unknown procedure F generates random balanced brackets, where 0 stands for "(" and 1 stands for ")"? In that case, the next 50 characters must be all 1s, and your success rate will be zero.
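The toy example can be sketched in a few lines of Python. The generator names and the majority-vote rule below are illustrative choices, not part of any formal statement of NFL; the point is only that the same observation (fifty 0s) is consistent with two generators, and the i.i.d.-style predictor wins on one and loses on the other.

```python
def iid_continuation(prefix, n):
    # Under the i.i.d. coin assumption, the best per-character guess for the
    # continuation is the majority symbol observed so far.
    majority = "0" if prefix.count("0") >= prefix.count("1") else "1"
    return majority * n

prefix = "0" * 50
prediction = iid_continuation(prefix, 50)

# Scenario 1: F is a coin heavily biased toward 0, so the true continuation is all 0s.
true_iid = "0" * 50
# Scenario 2: F emits a balanced bracket string ("0" = "(", "1" = ")");
# a string opening with fifty "(" must close with fifty ")".
true_brackets = "1" * 50

accuracy_iid = sum(p == t for p, t in zip(prediction, true_iid)) / 50
accuracy_brackets = sum(p == t for p, t in zip(prediction, true_brackets)) / 50
print(accuracy_iid, accuracy_brackets)  # 1.0 0.0
```

The predictor itself is identical in both scenarios; only the hidden generator changed, which is exactly the dependence NFL points at.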

NFL shows that inductive inference (including machine learning) is a game between the predictor and the data generator (the environment that produces the data we observe), and that the right assumptions (a.k.a. inductive bias) about the generator must be made to achieve successful predictions. As we can see from this toy example, the i.i.d. assumption, which underlies the vast majority of machine learning methods, is not always beneficial to the predictor. The assumption is artificially injected by humans to approximate the real world and to make machine learning scalable, which essentially pushes the machine to learn something that may differ from reality.
To elaborate on this point, let's take the training of LLMs as an example, where each document from the internet is treated as an independent generation of text. This is not true: concepts, knowledge, and information about the world are all connected and build on top of each other. LLMs thus see a very different world from ours, and hence they hallucinate in "our world." At the same time, each document is an independent application of the rules of language, so an LLM's view of language does match ours, and LLMs are able to produce language structures that perfectly match expectations in "our world."
While the i.i.d. assumption is too strong an assumption about the data generator for building a general inductive inference model, the condition in NFL, where all possible generators are considered equally likely, seems implausible as well: the world as we experience it follows rules that stay consistent across space and over time. In conclusion, we need a model of the data generator that is not as strong as i.i.d., but not as weak as what NFL assumes.
In the toy example above, the data generator only produces 100 characters, which is not very interesting because the problem is gone once the last character is generated. Let's consider a more interesting case where the data generator keeps producing new bits as time passes. Examples are the text generated on the Internet, or the change of the Earth's atmospheric temperature over time, either of which can be represented as an unbounded sequence of 0s and 1s produced by a stochastic generator. From computability theory, we know that the vast majority of probability distributions over such unbounded sequences cannot be produced by a computable data generator (i.e., a Turing machine, or equivalently, a computer program). So, if we believe the world is a giant Turing machine, or that our computing capability is limited to Turing machines, we should probably consider only computable data generators.
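To make "computable stochastic generator" concrete, here is a hypothetical example: a program that, given access to a random number source, deterministically computes each output bit. The two-state Markov chain and all names below are my illustrative choices, not anything from Solomonoff's papers.

```python
import random

def markov_generator(n_bits, p_stay=0.9, seed=0):
    # A computable stochastic generator: a short program plus a random
    # number source. The program emits the current state as a bit and
    # flips the state with probability 1 - p_stay.
    rng = random.Random(seed)  # stands in for the generator's random bits
    state = 0
    out = []
    for _ in range(n_bits):
        out.append(state)
        if rng.random() >= p_stay:
            state = 1 - state
    return out

bits = markov_generator(20)
print("".join(map(str, bits)))
```

Everything about this generator fits in a few lines of code, which is what makes it computable; a "typical" distribution over infinite bit sequences admits no such finite description.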
With this background in place, we can now introduce the setup of Solomonoff's theory of inductive inference ([1], [2], [3], [4]), which was established more than half a century ago. Suppose an unknown computer program, aided by a random number generator for stochastic generation, is continuously producing a sequence of 0s and 1s, denoted x1x2x3… Each time a new bit xn is produced, you use your predictor to estimate P(xn+1=0|x1x2…xn). As you can see, the setup of the problem is very general, without assuming any particular relationship between different parts of the sequence. The only assumption is that the sequence comes from a computable distribution.
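The prediction protocol above can be sketched with a deliberately simple predictor plugged in. I use Laplace's rule of succession, which estimates P(next bit = 0) as (#zeros + 1) / (n + 2); this is not Solomonoff's universal predictor, just a placeholder showing the interface the theory assumes.

```python
def laplace_prob_zero(history):
    # Laplace's rule of succession: a simple computable estimate of
    # P(x_{n+1} = 0 | x_1 ... x_n) from the observed history.
    zeros = history.count(0)
    return (zeros + 1) / (len(history) + 2)

# Bits revealed one at a time by the unknown program.
stream = [0, 0, 0, 1, 0, 0]
history = []
for bit in stream:
    p0 = laplace_prob_zero(history)  # predict before seeing the next bit
    history.append(bit)              # then observe it and update

print(laplace_prob_zero(history))  # (5 + 1) / (6 + 2) = 0.75
```

Solomonoff's universal predictor slots into exactly this loop; the question his theory answers is what to put in place of `laplace_prob_zero` so that the estimates converge to the unknown program's true conditional probabilities.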
Solomonoff's theory seeks to answer the following questions: does a universal predictor exist that ultimately converges to the unknown program? If so, how quickly can it converge? Solomonoff's answer to these questions is yes, there is such a universal predictor, and it converges quite fast to the unknown program. There is only one caveat, albeit a big one.
That concludes part 1 of this series of blog posts and in part 2, we will delve into the details of Solomonoff induction.
References
[1] Solomonoff, Ray J. "A formal theory of inductive inference. Part I." Information and Control 7.1 (1964): 1-22.
[2] Solomonoff, Ray J. "A formal theory of inductive inference. Part II." Information and Control 7.2 (1964): 224-254.
[3] Solomonoff, Ray J. "Complexity-based induction systems: comparisons and convergence theorems." IEEE Transactions on Information Theory 24.4 (1978): 422-432.
[4] Solomonoff, Ray J. "Algorithmic probability: Theory and applications." Information Theory and Statistical Learning (2009): 1-23.