On the Reasoning Limitations of LLMs & Machine Learning
LLMs seem to know something about everything. But beyond their vast knowledge, can they reason as well as humans?
Never Underestimate Prompt Engineers
I came across this hilarious exchange on Twitter the other day. One guy believed that transformer-based LLMs can't truly learn new problems outside of their training set and can't perform long-term reasoning. So he put up a challenge on Twitter to solve a problem by prompting an LLM, with a $10K prize for whoever could prove him wrong.
When I saw the challenge, it was already over, but I decided to give it a try anyway because it looked like a very straightforward simulation that even a 9-year-old kid could easily do. After simple prompts failed, the last thing I tried was asking the LLM to write a Python function, then simulate the execution of the program line by line and output the result. Given that the input is fairly short, the simulation should easily fit within the output token limit. The LLMs I tried successfully wrote the Python program but failed hilariously when simulating its execution: sometimes they skipped lines, and sometimes they hallucinated the outcome of a line of code. If a human behaved like this, I would suppose they had copied the code from somewhere. Anyway, as I mentioned earlier, the challenge was already over when I saw it, so I had no motivation to keep trying.
The outcome of the challenge: LLMs can solve the problem (the winner made Claude 3 Opus work, so at least some LLMs can). The $10K prize was claimed the day after the challenge was posted.
Does solving the challenge prove LLMs have “advanced reasoning capabilities”? Unfortunately, no. If you look at the winning prompt and its explanation, you will be more amazed by the creativity of the prompt’s author than by the reasoning capability of the LLM. Essentially, the author designed a set of machine instructions that was much easier for LLMs to follow than Python code, including very specific tricks like adding numeric indices to the symbols so that the LLM doesn’t get confused when the same symbol occurs multiple times in the sequence. Even combined with multiple demonstrations in the prompt, he still couldn’t get a 100% success rate (the winning criterion was a 90% success rate).
I guess the lesson of the story is: you can underestimate the capabilities of LLMs, but you should never underestimate the creativity of prompt engineers!
Theoretical Limitations of Transformers
It is no secret that transformer-based LLMs are great at approximate retrieval but perform very inconsistently at reasoning & planning. One extremely simple problem that has been shown to be hard for transformers is called PARITY: deciding whether a string of 0s and 1s contains an even or odd number of 1s.
To confirm this phenomenon myself, I gave the problem to Gemini Pro 1.5. I randomly generated strings of 0s and 1s of length 32 and asked the LLM for the parity. Across 20 runs, Gemini gave me the correct result 45% of the time, which means it was essentially guessing at random. I tried the same thing with Claude 3 Sonnet, with similar results.
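For reference, here is a minimal sketch of the kind of experiment I ran. `ask_llm` is a placeholder for whatever model API you want to test; everything else is just bookkeeping.

```python
import random


def parity(bits: str) -> str:
    """Ground truth: is the number of 1s in the string even or odd?"""
    return "odd" if bits.count("1") % 2 else "even"


def ask_llm(prompt: str) -> str:
    """Placeholder for a real model call (Gemini, Claude, etc.)."""
    raise NotImplementedError


def run_parity_experiment(trials: int = 20, length: int = 32) -> float:
    """Fraction of trials where the model's answer matches the true parity."""
    correct = 0
    for _ in range(trials):
        bits = "".join(random.choice("01") for _ in range(length))
        prompt = (
            "Does the following string contain an even or odd number of 1s? "
            f"Answer with a single word.\n{bits}"
        )
        answer = ask_llm(prompt).strip().lower()
        correct += answer == parity(bits)
    return correct / trials
```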
Why is PARITY so hard for transformers? The best explanation I can find is from this recent paper. PARITY is a highly sensitive function: flipping any single bit completely changes the result. The paper proves that when a transformer is fit to such high-sensitivity functions, an arbitrarily small perturbation of the weights results in a large loss for sufficiently long inputs, making the function effectively impossible to learn. In the real world, LLMs are trained on many different kinds of problems rather than optimized for one particular problem, so the result is even worse than what the theory predicts.
Is PARITY a very special case? I am afraid not. You can change the problem to counting the number of 1s or summing the digits, and it remains a high-sensitivity function. Even functions that are less sensitive overall often have a region of inputs where they are highly sensitive, and transformers won't work well in those regions either - for example, deciding the connectivity of a graph when the graph is close to a tree, or taking the majority of a 0/1 string when the numbers of 0s and 1s are roughly equal.
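To make “sensitivity” concrete, here is a small illustration of my own (not taken from the paper): it counts how many single-bit flips change a function's output. PARITY is maximally sensitive on every input, while MAJORITY is only sensitive near balanced strings.

```python
def sensitivity(f, bits: str) -> int:
    """Number of positions whose single-bit flip changes f's output."""
    count = 0
    for i in range(len(bits)):
        flipped = bits[:i] + ("0" if bits[i] == "1" else "1") + bits[i + 1:]
        if f(flipped) != f(bits):
            count += 1
    return count


parity = lambda s: s.count("1") % 2             # depends on every single bit
majority = lambda s: s.count("1") * 2 > len(s)  # depends on bits only near the boundary

balanced = "01" * 16          # 16 ones, 16 zeros
lopsided = "1" * 28 + "0" * 4

print(sensitivity(parity, balanced))    # 32: every flip changes the parity
print(sensitivity(majority, balanced))  # 16: flipping any 0 tips the majority
print(sensitivity(majority, lopsided))  # 0: far from the boundary, no single flip matters
```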
One last thing I want to point out: the above limitation concerns the learnability of transformers. There are also theoretical limitations regarding the expressivity of transformers, i.e., what kinds of problems transformers can or cannot solve even if you could “handpick” the optimal weights. In reality, it is nearly impossible to reach the optimal weights, so expressivity-based analysis gives a looser bound on a transformer's real capabilities than learnability-based analysis. But even based on expressivity analysis, transformers are proved to have significant limitations (see this paper as an example).
Workarounds And Their Limitations
The limitations mentioned in the above section mostly apply when transformers are required to directly output the answer. There are workarounds if transformers are allowed to output intermediate steps or to use external tools.
The first workaround is process supervision: letting the transformer learn to use more tokens to imitate the human reasoning process before outputting the final answer. This can be done as part of training, or through in-context learning with few-shot examples, like the Twitter example in the first section. A good few-shot example decomposes the original problem into subproblems that are easier for a transformer to fit. And from a pure computation perspective, increasing the number of input and output tokens increases the compute spent on the problem, allowing the model to fit more complex functions.
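As an illustration of what such a decomposition could look like for the PARITY example above (my own toy prompt, not the winning one from the challenge), a few-shot example can ask the model to keep a running parity and process one bit per step:

```python
# A toy few-shot prompt that decomposes PARITY into a running-parity trace.
# This is an illustrative decomposition, not the winning prompt from the challenge.
FEW_SHOT_PARITY_PROMPT = """\
Task: decide whether a 0/1 string contains an even or odd number of 1s.
Go through the string one bit at a time and keep a running parity.

Example:
String: 1011
Step 1: bit=1, running parity=odd
Step 2: bit=0, running parity=odd
Step 3: bit=1, running parity=even
Step 4: bit=1, running parity=odd
Answer: odd

String: {bits}
"""


def build_parity_prompt(bits: str) -> str:
    """Fill the few-shot template with the string we actually want answered."""
    return FEW_SHOT_PARITY_PROMPT.format(bits=bits)
```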
The limitations of process supervision, however, are that 1) there is no universally applicable reasoning process to follow - in many cases (again, like the Twitter example above), coming up with a good prompt is much harder than writing a program yourself; and 2) unlike creative writing, reasoning is a rigid process where a small mistake in an earlier step leads to a totally wrong outcome, and a transformer's next-token prediction naturally accumulates mistakes as the reasoning chain grows long.
The second workaround is that instead of doing the reasoning themselves, transformers can write a program and hand it over to a computer to actually execute the logic, just like GPT-4 and Gemini do. The Twitter example would be easily solved with this technique.
Let transformers translate problems into programs, and let computer programs carry out the reasoning. That sounds like a great plan, doesn’t it?
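Here is a minimal sketch of that division of labor, reusing the PARITY example and the hypothetical `ask_llm` placeholder from earlier: the model only writes the program, and the Python interpreter does the reasoning.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError


def solve_parity_with_interpreter(bits: str) -> str:
    # 1) The model translates the problem into code instead of answering directly.
    code = ask_llm(
        "Write a Python function parity(bits) that returns 'even' or 'odd' "
        "depending on the number of 1s in the string. Return only the code."
    )
    # 2) The interpreter, not the model, executes the logic.
    #    (A real system would sandbox this instead of calling exec directly.)
    namespace = {}
    exec(code, namespace)
    return namespace["parity"](bits)
```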
The problem is, translating a problem into a buggy, inefficient program is easy. It is understanding the behavior of the code, debugging it, and optimizing it that requires the highest level of reasoning skill (in computability theory, debugging and analyzing the time complexity of an arbitrary program are undecidable problems). If a transformer cannot even “mentally” simulate the execution of code, can you trust it to debug or optimize that code?
General Limitations of Reasoning through Machine Learning
Fundamentally, the strength of machine learning models has been predicting appropriate outcomes for problems that take a similar amount of time to solve, without human intervention in the internal process. This works very well for things like self-driving, because humans drive by intuition and we spend roughly the same amount of time on each decision while driving. But general reasoning is different. It ranges from problems that take seconds to solve, to problems that take hours, years, or centuries, to problems that will remain unsolvable forever.
So should models learn to follow the human thinking process, which is not their strength, or should they learn to directly predict the outcome, which is much, much harder to train given the depth of inference and the enormous compute needed for deep reasoning? Either way, I believe a major breakthrough in machine learning will be needed to make significant progress.
Conclusions
For now, it appears that only human brains demonstrate both great intuition and advanced reasoning capabilities. What’s more, we somehow know where the limits of our brains’ raw capabilities are, and we build tools to extend those capabilities. We leverage ML models to scale out our understanding of language, images, and videos, and to help us understand things like weather and protein structures, which we don’t have intuition about. And we also need code to help us extend our reasoning & planning capabilities to a scale that’s impossible for a human brain to handle.
When GPT-3 came out, I wrote a similar article discussing the practical limitations of GPT-3 at that time. Those practical limitations could be resolved over time.
That article also discussed the theoretical limitations of regression-based solutions (any statistical machine learning solution that uses a loss to quantify its effectiveness), and reasoning/logical limitations (problems that not even human beings can solve).
I'm glad to see that most of the practical limitations have been overcome within a year. Unfortunately, many people are still so hyped about the current breakthroughs in AI that they think it can solve every problem.
This article can be seen at https://yexijiang.substack.com/p/back-to-untitled-1a93f4026086.