Learning: Fast & Slow
In this post, I conjecture that the biggest difference between LLMs and the human brain is not how slowly or deeply they think, as many people have conjectured, but how slowly or deeply they learn.
Early this year, “inspired” by the AGI hype, I started my journey on this “The Unscalable” blog, with a mission of discovering, understanding and promoting the value of humans’ unscalable efforts - the value of things we do that seem to be replaceable by cheaper, easily replicable technology. The inspiration came from many angles, one of which was a scientific and engineering perspective. As someone with a deep theoretical background who has worked on ML engineering for a long time, I know that tradeoff theorems exist, and I have made tradeoffs in every engineering decision. When LLMs suddenly emerged as a totally different, yet cheaper, faster, higher-quality alternative to human brains, which have evolved for millions of years for the environment we live in, something just didn’t feel right to me - some tradeoff must have been made in their design.
In my first post, The AGI Rush, I discussed the empirical evidence that all human creations, for scalability reasons, always assume an ideal, constrained environment and rely on unscalable humans to take care of the corner cases and the interactions with an unpredictable outer environment. In this blog post, I am sharing another tradeoff that I started to realize in recent months, one that is more technically interesting, one that points to a limitation of the general statistical ML approach.
Learning Fast vs Learning Slow
Perhaps the most striking thing about an LLM compared with a human brain is how fast and how scalably it absorbs so much information. It would take a human thousands of years to read all the text on the internet, but an LLM can go through all of it in a couple of days or hours during training, depending on how much you parallelize the training. That horizontal scalability is what makes so many people think silicon intelligence is far superior to our shabby carbon intelligence.
But, if we believe our brain has been well optimized through a long history of evolution, then it must be near optimal in some way. By making the learning process highly parallelizable, an LLM gains something, but it must have lost something as well, right? What is it?
Just reflecting on my own brain’s learning experience: growing up, I found myself to be a slow learner compared to lots of my classmates, or at least, much slower at the beginning. When I was presented with a brand new concept, I just didn’t know what to do with it. It felt disconnected from what I already knew. I couldn’t draw a picture of it in my mind, and my brain felt itchy when I thought about it. I had to work very hard to visualize it and connect it with the other things in my brain. However, after I succeeded in truly understanding the brand new concept, working on related concepts or problems just felt like a breeze.
By learning in parallel, what an LLM (and any other highly parallelizable ML approach) loses is the connections among all the information it absorbs. The cost that it saves at learning/training time has to be paid at inference time.
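To make the tradeoff concrete, here is a toy sketch in Python (with made-up facts; it is not a model of how any real system stores knowledge): a “fast” learner absorbs each document independently and parallelizes trivially, while a “slow” learner pays extra at learning time to connect each new fact to what it already knows, so connected questions become cheap at inference time.

```python
from collections import defaultdict

# Toy contrast, not a model of any real system. The documents and names are
# made up purely for illustration.
documents = [("child_of", "person X", f"name_{i}") for i in range(1, 6)]

# "Fast" learning: each document is stored on its own; nothing relates them.
fast_memory = list(documents)

# "Slow" learning: facts are merged into one connected structure as they arrive.
slow_memory = defaultdict(set)
for relation, entity, value in documents:
    slow_memory[(relation, entity)].add(value)  # linking work paid at "training" time

# At inference time, the slow learner can read an aggregate off directly,
# while the fast learner has to reconstruct the connections on the spot.
print(len(slow_memory[("child_of", "person X")]))  # -> 5
```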
LLM’s Cognitive Problems
Like me, you might have encountered many cases where an LLM first gave the wrong information and, when questioned, apologized and provided the correct information. You might have wondered why the LLM couldn’t just get it right from the beginning, since it actually had the right information in its “mind”. The reason is that the right and wrong information coexist independently in the LLM’s mind, and the LLM doesn’t actually know about their coexistence (superposition is the fancy word for this phenomenon). It is your questioning that prompts the retrieval of the flip side of the information.
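As a rough illustration of this superposition (purely illustrative; no real LLM stores answers in an explicit table like this), imagine the weights as encoding separate answer distributions for the same question, learned from different documents and never checked against each other:

```python
import random

# Purely illustrative: two conflicting answers to the same question coexist,
# and which one surfaces depends only on the prompt context. The numbers are
# made up.
answer_distribution = {
    "plain question":       {"wrong answer": 0.6, "right answer": 0.4},
    "question + challenge": {"wrong answer": 0.2, "right answer": 0.8},
}

def sample_answer(context: str) -> str:
    dist = answer_distribution[context]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_answer("plain question"))        # often the wrong answer
print(sample_answer("question + challenge"))  # the "apology + correction" case
```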
Humans absorb new information in a different way. When new information comes, we vet it against our existing knowledge system for consistency. We may accept or reject the new information, or change our existing beliefs, based on the vetting process. No matter what the outcome is, we form a connected, coherent knowledge system. Humans make mistakes too. For example, it was not until Russell discovered his paradox that we realized the inconsistency in naive set theory. But we discovered and learned from those mistakes: the ZFC axiomatic system was developed and the paradox was addressed.
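A minimal sketch of that vetting step, using a made-up fact and a deliberately trivial consistency check (real human belief revision is of course far richer than this):

```python
# Toy sketch of "slow" integration: every new claim is vetted against existing
# beliefs before it is stored. The fact and the check are made up for illustration.
knowledge = {"capital_of_france": "Paris"}

def integrate(key: str, claimed_value: str) -> None:
    existing = knowledge.get(key)
    if existing is None:
        knowledge[key] = claimed_value  # accept: nothing to conflict with
    elif existing == claimed_value:
        pass                            # consistent: the belief is reinforced
    else:
        # Conflict noticed: reject the claim or revise the old belief, but
        # either way the contradiction never just sits there unexamined.
        print(f"conflict on {key}: {existing!r} vs {claimed_value!r}")

integrate("capital_of_france", "Lyon")  # triggers the conflict branch
```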
An LLM’s “mind” is not just incoherent, but also “incomplete”, in the sense that it can’t retrieve all the needed information stored in its weights.
Imagine that on the internet there is not a single document talking about all the kids of Elon Musk or how many kids he has. Instead, each one of them is mentioned in a separate document. When you ask the LLM how many kids Musk has, can it give you the right answer? No. Its weights encode some probability distribution over Musk’s kid being name_1, name_2, …, name_n, but there is no representation of an aggregated view of them.
If you ask the LLM, “How many kids does Elon Musk have?”, it is going to hallucinate, likely based on people with similar names.
If you ask the LLM with a CoT prompt, something like “How many kids does Elon Musk have? List them all first and then count,” a well-finetuned LLM may be able to output things like “Elon Musk’s kids are name_1, name_2, …” but it has no idea when to stop generating names! Given that most people don’t have as many kids as Musk (12 as of this writing, by the way), it is most likely going to stop early and miss some names.
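A back-of-the-envelope simulation of that “stops too early” behavior (the per-name stop probability below is made up for illustration, not measured from any real model):

```python
import random

# After each name, the model emits a "that's all of them" signal with a
# probability shaped by how many kids a typical person has. 0.3 is made up.
TRUE_COUNT = 12
STOP_PROBABILITY = 0.3

def names_listed() -> int:
    listed = 0
    while listed < TRUE_COUNT:
        listed += 1
        if random.random() < STOP_PROBABILITY:  # "...and that's everyone."
            break
    return listed

runs = [names_listed() for _ in range(10_000)]
print(sum(runs) / len(runs))                           # averages well below 12
print(sum(r == TRUE_COUNT for r in runs) / len(runs))  # the full list is rare
```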
While I am using counting Elon Musk’s kids as an example, it highlights a common case where concepts that connect to each other are spread across different documents. Moreover, the problems that I highlight here are not just about LLMs but about all statistical models that learn in a horizontally scalable way.
Chain of Thought
“Chain of thought” (CoT) generation (either through prompting or fine-tuning) is an interesting technique that has the power to connect disconnected knowledge in LLMs - with limitations that I will show later.
When you ask an LLM “what is the altitude difference between mountain A and mountain B?”, it is likely going to hallucinate if the altitudes of A & B never appear together in internet documents and CoT is not used. With CoT, this is what the LLM can do at an abstract level:
It first retrieves a “CoT template” for solving this problem - I first need the altitude of A and the altitude of B, then do the subtraction.
It retrieves the altitude of A.
It then retrieves the altitude of B.
Finally, it retrieves its ability to do subtraction.
As you can see, CoT connects 3 pieces of disconnected information together, which allows the LLM to solve the problem.
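Conceptually, that run looks like the small pipeline below. The function names and altitude numbers are placeholders I made up; a real LLM does all of this implicitly in generated text rather than with explicit function calls:

```python
# Conceptual sketch of the CoT steps above, not how an LLM works internally:
# each piece of knowledge is retrieved separately, and only the chain of
# generated text connects them. Altitudes are placeholder numbers (in meters).
ALTITUDES_M = {"mountain A": 4810, "mountain B": 3776}

def retrieve_template(question: str) -> list[str]:
    # Step 1: recall a plan - get both altitudes, then subtract.
    return ["mountain A", "mountain B", "subtract"]

def retrieve_altitude(mountain: str) -> int:
    # Steps 2 and 3: each altitude is retrieved on its own.
    return ALTITUDES_M[mountain]

def answer(question: str) -> int:
    plan = retrieve_template(question)
    a = retrieve_altitude(plan[0])
    b = retrieve_altitude(plan[1])
    return a - b  # Step 4: the retrieved ability to subtract

print(answer("What is the altitude difference between mountain A and mountain B?"))
```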
OpenAI’s o1 series of LLMs takes this approach much further, by tuning the model to generate a long series of “thought steps” before finalizing the answer. In this way, the model is able to connect many pieces of disconnected knowledge together; it can even retrieve alternative knowledge if the knowledge retrieved so far doesn’t lead to a satisfying answer.
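Very roughly, my mental model of that behavior is the loop below. This is my own sketch, not OpenAI’s actual implementation, and the callables are hypothetical:

```python
def think(question, propose_step, answer_is_supported, max_steps=20):
    """Crude mental model of a long thinking chain, not OpenAI's actual method.

    propose_step sees the whole history, so when earlier retrievals didn't pan
    out it can propose an alternative retrieval instead of repeating itself.
    """
    thoughts = []
    for _ in range(max_steps):
        thoughts.append(propose_step(question, thoughts))
        if answer_is_supported(question, thoughts):
            break  # the retrieved pieces finally connect
    return thoughts

# Tiny demo with stub callables, purely to show the control flow.
steps = iter(["recall altitude of A", "recall altitude of B", "subtract them"])
print(think("altitude difference between A and B?",
            propose_step=lambda q, t: next(steps),
            answer_is_supported=lambda q, t: len(t) == 3))
```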
Does it solve the problems that I mentioned in the previous section?
To answer this question, let’s directly test o1 to see if we can gain any insights. Recently I asked o1:
In the early 20th century or late 19 century, was there an optimism that humans already knew pretty much we should know about the physics laws that governed the universe? Please list concrete examples.
O1 made a long, detailed list of examples, one of which was Lord Kelvin's "Two Clouds" Speech:
When I questioned the accuracy of the statement, o1 corrected its mistake and apologized:
While this is just one example, and it may be (luckily or expensively) fixed in future versions, it highlights the fundamental problem of LLMs’ disconnected representation of knowledge. CoT relies on clues from the problem description or the generated text to generate the next step; however, as shown in this example and the “counting Elon Musk’s kids” thought experiment, those clues are not always there. For humans, those clues lie in the connected knowledge representation that we get from slow learning, or in our experience of the real world that our knowledge representation connects to.
Of course, one can always mitigate some of these problems by having more training data (e.g. human-curated data that helps connect disconnected information) or longer CoT (e.g. generating more thoughts to check every statement it makes), but these mitigations will always be fragmented, unreliable, inefficient and lacking in novelty.
The fragmentation, unreliability, inefficiency and lack of novelty of thinking is the price an LLM (and other statistical ML models) pays for learning too fast and too cheaply.
The Superpower of the Unscalable Humans
Over the last two years, many parents of young kids (including myself) and people early in their careers have started to feel uncertain about their own or their kids’ future because of the rapid emergence of intelligent systems that appear to be on a trajectory to quickly surpass human capabilities and take over humans’ jobs. Lots of people jump onto the LLM train for fear of missing out on the last opportunity before AGI is achieved. LLMs have been promoted as the tutelary deity for individuals, companies and countries, without which one is deemed to become inferior.
I hope this post can provide people with a more grounded picture, or inspire people to think more deeply about, and research, the reality of the technology today. As we can see, nothing about humans’ unique strengths and individual value has fundamentally changed, for nothing that we have created so far can replace a human’s unscalable effort to deeply understand the world, the domain and the people around them, let alone our ability to make things happen.