Recently, Yann LeCun made an insightful point by comparing the amount of data a 4-year-old child has been trained on with the size of the training data of current LLMs:
- LLM: 1E13 tokens x 0.75 words/token x 2 bytes/token ≈ 2E13 bytes.
- 4-year-old child: 16k waking hours x 3600 s/hour x 1E6 optic nerve fibers x 2 eyes x 10 bytes/s ≈ 1E15 bytes.
In 4 years, a child has seen 50 times more data than the biggest LLMs.
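To make the arithmetic easy to check, here is a minimal Python sketch of the same back-of-the-envelope estimate. All figures (token count, bytes per token, waking hours, fiber count, per-fiber data rate) are the rough assumptions quoted above, not measurements:

```python
# Back-of-the-envelope comparison of the two estimates above.

llm_tokens = 1e13                 # tokens in a large LLM's training set
bytes_per_token = 2               # ~2 bytes of text per token
llm_bytes = llm_tokens * bytes_per_token         # ~2e13 bytes

wake_hours = 16_000               # waking hours in a child's first 4 years
optic_fibers_per_eye = 1e6        # optic nerve fibers per eye
bytes_per_fiber_per_sec = 10      # assumed data rate per fiber
child_bytes = (wake_hours * 3600                 # seconds awake
               * optic_fibers_per_eye * 2        # two eyes
               * bytes_per_fiber_per_sec)        # ~1.15e15 bytes

print(f"LLM:   {llm_bytes:.2e} bytes")
print(f"Child: {child_bytes:.2e} bytes")
print(f"Ratio: {child_bytes / llm_bytes:.0f}x")  # ~58x, i.e. roughly 50 times more
```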
Isn’t it insane that the largest LLMs today, which cost hundreds of millions of dollars to build & train, are trained on only a fraction of the data a 4-year-old has seen?
And let’s not forget that more than 100 million babies are born in the world every year, which means more than 100 million new instances start training every year, each with its own unique training data: the local environment and the interactions with its own parents, siblings, and friends. Today’s LLMs, by contrast, are all trained on very similar training data: the text you can find on the internet.
But that’s not all. Humans are lifelong learners, which means there are actually 8 billion unique training instances in progress right now! Moreover, humans are not trained from scratch: they have inherited lots of built-in optimizations from evolution!
It may just be a matter of scale, but that scale really matters. Or it may not be a matter of scale at all, because that scale is more than any single system can handle, so humans will keep finding new subproblems that can be reduced to a solvable complexity.
Let’s celebrate that there will be an insane number of interesting problems to solve in building AIs that are more & more capable. Let’s celebrate the diversity of human beings and the uniqueness of each of us. Let’s allow our kids & ourselves to explore, to experience, to break rules & conventional wisdom, because it is from that unique training data that true innovation comes.