What LLM Is and Is Not: A Philosophical and Practical Overview
This is the redacted version of a Google/YouTube internal tech talk I gave before a hackathon event.
Two Perspectives on LLMs
Since the release of GPT-4 in March 2023, there have been two different perspectives on what an LLM really is. The first perspective is that state-of-the-art LLMs are at, or close to, the level of so-called AGI, or artificial general intelligence: capable of performing most, if not all, human intellectual work. This perspective did not just come from users of GPT-4, but also from academic research in the field. For example, one week after the release of GPT-4, Microsoft Research published a paper titled “Sparks of Artificial General Intelligence: Early experiments with GPT-4”. In the paper, they tested an early version of GPT-4 on a wide range of tasks and concluded:
“In all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.”
The paper gained lots of attention, accumulating over 2,000 citations within just a year. But the most striking “evidence” that AGI is coming probably came from the worries and fears expressed by some prominent figures in AI and cognitive science.
Douglas Hofstadter, a notable cognitive scientist, said the following in a podcast in June 2023:
“It's like a tidal wave that is washing over us at unprecedented and unimagined speeds. And to me, it's quite terrifying because it suggests that everything that I used to believe was the case is being overturned… It's a very traumatic experience when some of your most core beliefs about the world start collapsing.”
Geoffrey Hinton, the “godfather” of deep learning, said in May 2023:
“...my confidence that this wasn’t coming for quite a while has been shaken by the realization that biological intelligence and digital intelligence are very different, and digital intelligence is probably much better.”
As I write this and re-read their words, I can deeply empathize with the emotion behind their traumatic experience.
But not everybody agreed that AGI is coming, or that LLMs are the way to get there. Yann LeCun, another godfather of deep learning, has consistently dismissed the idea that LLMs are the path to AGI. For example, this is what he said on X in February 2023:
“Before we reach Human-Level AI (HLAI), we will have to reach Cat-Level & Dog-Level AI. We are nowhere near that. We are still missing something big. LLM's linguistic abilities notwithstanding. A house cat has way more common sense and understanding of the world than any LLM.”
This perspective is shared by lots of other AI researchers as well, and it is not just opinion; it is backed up by plenty of evidence. While LLMs perform amazingly well on linguistic and knowledge-based benchmarks, they fail badly at planning tasks (e.g. PlanBench) and at tasks that require little knowledge but the ability to abstract and learn new patterns from a few examples (e.g. ARC-AGI). They are easily tricked by simple adversarial examples, and they readily change their “mind” when challenged even slightly.
So which perspective is closer to the truth? Are LLMs on the path to becoming AGI?
AGI Is a Great Aspirational Goal, but Building It Is Alchemy
The fact is, nobody in the world knows what human intelligence is, what humans are intellectually capable of, or how to measure it. The examples above, where prominent scientists in the field were shocked by GPT-4’s performance, highlight this fact. They were wrong either before or after they changed their minds, and they changed their minds not because of a new theoretical discovery or scientific experiment, but because of anecdotal evidence - underscoring our lack of scientific understanding of human intelligence.
Because we don’t know what human intelligence is or how to measure it, the status quo is “make the model perform at human level on as many benchmarks as possible”. However, nobody knows how many benchmarks an AI needs to crack before one can truly claim it is “AGI”. What’s worse, the test data of many benchmarks has already leaked into LLMs’ huge training corpora, which makes the statistics coming from those benchmarks even less meaningful.
So, AGI is a great aspirational goal for pushing AI research forward, but anyone working on LLMs who thinks they are building AGI is, in my view, essentially practicing a modern version of alchemy:
The ancient alchemists didn’t know what gold is made of (or what causes aging, in the case of Chinese alchemy); similarly, the modern alchemists don’t know what human intelligence is made of.
Each ancient alchemist had their own secret ingredients and processes; similarly, each modern alchemist has their own secretive training data and training recipes.
The ancient alchemists might have produced something that looked like gold, or obtained some real gold because they had dropped gold into the ingredients; similarly, the modern alchemists create an illusion of consciousness and generate human-level responses by letting the LLM memorize and interpolate human responses.
If LLMs are not on the path to AGI, what are they then?
Think of LLM as a Human-made, Unique Type of “Intelligence”
I put “intelligence” in quotation marks because lots of researchers believe LLMs don’t have true intelligence. I am more comfortable calling them a form of intelligence because nature already contains many different kinds of intelligence (check out this YouTube video where young chimpanzees demonstrate far superior visual working memory to ours), so it doesn’t really matter if we add one more to the bucket - intelligence is not well defined anyway.
LLMs’ uniqueness is rooted in how they are created, as illustrated by the picture below.
LLMs have a very different kind of “brain” than humans do - a very simple but highly parallelizable transformer architecture.
Humans learn by directly perceiving the world, but LLMs are trained on human-generated data, which lacks that context. At the same time, an LLM consumes virtually all text generated on the web - far more knowledge than any single human could consume.
Human learning is sequential and less scalable, but we see the connections among different things; LLM training assumes the training data (e.g. web documents) are i.i.d., so the model learns in parallel but doesn’t learn the connections across different documents.
This unique way of creating LLMs gives them characteristics unlike those of other kinds of intelligence. They have mastered linguistic tasks, because linguistic patterns are repeated over and over again in their training corpus. They are great at approximate retrieval from that huge corpus. They are good at hallucination, because during training they were forced to predict the next token with very little context. And they are bad at reasoning and planning, likely because of a combination of the architecture’s limitations, the i.i.d. assumption, the lack of context during training, and the next-token prediction objective.
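To make those last two points concrete, here is a minimal PyTorch-style sketch of the pretraining objective. The `model` is a hypothetical decoder-only transformer, and this is not any lab’s actual training code - just an illustration of the i.i.d. batching and next-token prediction described above:

```python
# Toy sketch (assumed PyTorch; `model` is a hypothetical decoder-only transformer).
# Two properties from the text: rows in a batch are independent document chunks
# (the i.i.d. assumption), and the only learning signal is next-token prediction.
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids: torch.Tensor) -> torch.Tensor:
    # token_ids: (batch, seq_len); each row comes from a different document,
    # sampled independently - nothing tells the model how documents relate.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift by one token
    logits = model(inputs)                                   # (batch, seq_len-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                 # flatten all positions
        targets.reshape(-1),                                 # next-token targets
    )
    return loss  # there is no other objective during pretraining
```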
But at the end of the day, just like any other human-made tool, they rely on humans to produce, update and integrate them. With the right integration, their mastery of language and huge knowledge base can be leveraged, and their hallucination can either be mitigated or even turned into a feature. They cannot plan by themselves, but their knowledge and ability to generate approximate solutions make them a great input to more rigid planning tools, as the sketch below illustrates. Without integration with human capabilities and needs, they are just junk.
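Here is a small, hypothetical sketch of that kind of integration - the LLM proposes, a rigid tool verifies. `llm_propose_plan` and the constraint objects are placeholders invented for illustration, not a real planning library:

```python
# Hedged sketch of "LLM proposes, verifier disposes".
# `llm_propose_plan` and the constraint objects are hypothetical placeholders.

def violations(plan: list[str], constraints: list) -> list[str]:
    """Deterministic, rigid check: describe every constraint the plan breaks."""
    return [c.describe() for c in constraints if not c.satisfied_by(plan)]

def solve(task: str, constraints: list, llm_propose_plan, max_rounds: int = 5):
    feedback = ""
    for _ in range(max_rounds):
        plan = llm_propose_plan(task, feedback)   # approximate solution from the LLM
        broken = violations(plan, constraints)    # exact verification outside the LLM
        if not broken:
            return plan                            # only verified plans are returned
        feedback = "Your previous plan violated: " + "; ".join(broken)
    return None  # give up: the LLM's guesses never passed the verifier
```

The division of labor is the point: the LLM supplies knowledge and candidate solutions, while the deterministic verifier supplies the rigor the LLM lacks.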
LLM Is a Great Toy to Play with
Through such integrations, LLMs are already great tools for many people. Every day, hundreds of millions of users interact with AI chatbots, and code and code-review copilots have greatly boosted our engineering velocity.
However, the best attitude towards LLMs might be to think of yourself as a curious 7-year-old kid who has just been given a brand new toy to play with. And this is no ordinary toy: as shown in the Google I/O demo, it can do “magic”, and it is very, very expensive to build. Moreover, compared to early last year, this toy now has a much nicer user interface and greatly improved capabilities:
It can now understand text, images and videos;
It can write code, execute code and use external tools;
The context window has grown from a few thousand tokens at the initial GPT-4 launch to 2M tokens in the latest Gemini models, which means the model can summarize thousand-page documents and analyze hour-long videos;
It is getting better and yet smaller - Gemma 2 9B outperforms GPT-3.5 on various benchmarks despite being only 1/40 the size of the latter, and it can easily run locally on a MacBook;
…
With so many capabilities, easily accessible APIs, and the option to run models in the cloud or locally, there are endless ways to play with it and have fun.
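As one concrete example of “playing with it”, here is a rough sketch that hands a long document to a long-context Gemini model via the google-generativeai Python SDK. The model name, file path, and API key are placeholders, and the exact SDK calls may have changed since this was written, so check the current docs:

```python
# Cloud-side sketch: summarize a long document with a long-context Gemini model.
# Assumes the google-generativeai SDK; names and paths below are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")              # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")      # long-context model

with open("thousand_page_report.txt") as f:          # placeholder document
    document = f.read()

response = model.generate_content(
    ["Summarize the key findings of this document:", document]
)
print(response.text)
```

The same kind of experiment also works locally: for instance, Gemma 2 9B can be run with Ollama (`ollama run gemma2`) or loaded from the Hugging Face checkpoint `google/gemma-2-9b-it`, so a MacBook is enough to start tinkering.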
:) This reminds me of a meme I shared recently: https://www.linkedin.com/feed/update/urn:li:activity:7215842374797656066/