Four Simple but Profound Lessons I Learned about ML and AI
The first two lessons are historically influential theses that have shaped today’s AI/ML research and industry, while the last two tell the flip side of the story.
Machine Learning as an Experimental Science
These days, when you read any published machine learning research that presents new or improved methods, it’s hard to miss the abundance of numbers comparing performance on standardized benchmarks against previous work. Most papers also include ablation studies, where the impact of different components on the performance is thoroughly tested.
This was not the case 30 to 40 years ago. Before the 1990s, it was common for new algorithms to be proposed with limited empirical evaluation, often relying on theoretical properties, small ad-hoc simulations, or anecdotal evidence.
In 1988, Pat Langley published a seminal meta-research paper titled “Machine Learning as an Experimental Science”, which marked a paradigm shift in machine learning research and development. In the paper, Langley insightfully pointed out that, “as a science of the artificial”, machine learning has “complete control over the learning algorithm and the environment”. Because of this, “machine learning occupies a fortunate position that makes systematic experimentation easy and profitable”. At the end of the paper, he presciently concluded:
Although experimental studies are not the only path to understanding, we feel they constitute one of machine learning's brightest hopes for rapid scientific progress, and we encourage other researchers to join in this evolution.
Today, we have all witnessed the rapid progress due to this evolution (though the outcome might not be as scientific as Langley hoped). “Machine learning is an experimental science” has become the motto of machine learning researchers and practitioners. Constructing test sets that represent the problem to be solved always comes first, and instead of focusing on rigorous theoretical proofs, one focuses on the robustness of those test sets and the ability to rapidly try out hypotheses.
Langley’s paper provided the methodology for conducting machine learning research and development, but it didn’t answer the question of how to develop machine learning algorithms. Seventy years of AI research, especially the last decade’s progress in deep learning, has provided a broad-stroke answer to that question, and that answer is probably best summarized by Rich Sutton’s “The Bitter Lesson”.
The Bitter Lesson
Sixteen years ago, I was an intern at Baidu, and my intern project was to classify different formats of web pages - news, forum, Q&A, etc. The structure of my three-month project looked roughly like this (a rough code sketch of steps 3 to 5 follows the list):
Build a different classifier for each type of web page. For each classifier:
1. Manually create a dataset and split it into training and test sets;
2. Call a hand-written library to extract the web page into different sections - URL, title, navigation breadcrumb, main content, etc.;
3. Segment the content in each section into a bag of words;
4. Use something like TF-IDF to identify and keep the top X most important words;
5. Build a linear classifier using those words as features;
6. Iterate.
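For readers less familiar with that era’s tooling, here is a minimal sketch of what steps 3 to 5 could look like with today’s scikit-learn; the actual project used hand-written internal libraries, and the example texts, labels, and parameters below are purely hypothetical.

```python
# Minimal sketch of steps 3-5: bag of words -> TF-IDF feature selection -> linear classifier.
# All data, labels, and parameters here are hypothetical illustrations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "breaking news: market rallies", "editorial: the week in politics",
    "forum reply: I totally agree", "forum post: anyone tried this?",
    "Q: how do I reset my password", "A: go to settings and click reset",
]
labels = [1, 1, 0, 0, 0, 0]  # 1 = news page, 0 = not news

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

clf = make_pipeline(
    TfidfVectorizer(max_features=5000),  # cap the vocabulary, a rough analogue of keeping the top X words
    LogisticRegression(max_iter=1000),   # linear classifier on the TF-IDF features
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```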
As you can see, there was a lot of human engineering involved in steps 3 to 5, and each step resulted in a loss of signal. That feature engineering was necessary at the time because of the extreme constraints on training and inference resources, and the lack of training algorithms that could learn from raw input. But we lacked those training algorithms because hardware resource constraints didn’t incentivize such research, so compute was the ultimate bottleneck. Today, such a project would likely end up taking a BERT model that has been pretrained on a massive text corpus, fine-tuning it or freezing its hidden layers to build a classifier layer on top, and classifying all classes of interest at once. The classifier would be much faster to build, with much higher quality.
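As a rough illustration (not what my team actually built), the frozen-encoder variant with the Hugging Face transformers library might look like the sketch below; the checkpoint name, label set, learning rate, and toy batch are all assumptions for illustration.

```python
# Sketch: reuse a pretrained BERT encoder, freeze its hidden layers, and train only a
# classification head that covers all page types at once. Names and data are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["news", "forum", "qa", "other"]  # all classes of interest, handled by one model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Freeze the pretrained encoder; only the classification head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# One hypothetical training step on a toy batch.
batch = tokenizer(
    ["Breaking: market rallies today", "Q: how do I reset my password?"],
    padding=True, truncation=True, return_tensors="pt",
)
targets = torch.tensor([0, 2])  # indices into `labels`
optimizer.zero_grad()
loss = model(**batch, labels=targets).loss
loss.backward()
optimizer.step()
```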
Moreover, the need to build these classifiers might simply disappear: the output of my intern project was likely the input to another system, and if enough compute is available, that system might well just learn end to end, eliminating another piece of human engineering that causes loss of signal.
In 2019, drawing on lessons from 70 years of AI research in speech recognition, computer vision, and superhuman systems like Deep Blue and AlphaGo Zero, Rich Sutton, one of the founders of modern reinforcement learning, famously concluded in The Bitter Lesson:
General methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. … Researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
The Bitter Lesson is not without controversy, but the idea of focusing on general methods that leverage computation has become the major force driving the rapid progress of machine learning models in cracking one benchmark after another, pointing towards an “AGI” future that was once imaginable only in science fiction.
The Better Lesson
While Sutton’s bitter lesson offers great insight, anyone in the field should be quick to notice that it only tells part of the story. While researchers’ domain knowledge now plays a much smaller role in feature engineering, their knowledge has been applied to the design of network architectures, training schedules, data mixtures, and distributed training infrastructure, all of which are central to the current wave of AI advances. In fact, there are so many secret recipes and human interventions in the months-long training runs of frontier large language models that it has become one of the most experience-demanding and labor-intensive jobs in the tech world.
Another illusion that stems from the viewpoint of “computational power solves everything” is that you can keep scaling up computation and eventually a single god-like model will solve all of humanity’s problems. The fact is that in a competitive environment, you will never have “enough” computational power. A general-purpose LLM won’t give better video recommendations than YouTube’s algorithm, and it can’t detect fraud, scams, or spam better than the algorithms deployed by banks or email service providers. The reason general-purpose LLMs look so promising today is that they are being applied to areas where no other non-manual solutions existed. Once a field has gathered enough domain-specific data and economic value, specialized solutions will win over general ones.
So I think the better lesson to be learned concerns the dynamics between human insight and computation cost, and between generalization and specialization:
Given a specific method, there will be a far better, more general version of that method in the future, once enough computation becomes available;
If humans continue to exponentially reduce the cost per unit of computation (which is not guaranteed), that far better general method will come much sooner than many people expect, because humans have poor intuition for exponential growth (a cost that halves every two years falls by roughly 1,000x in twenty years).
The advances in AI/ML have resulted from the interplay between the accumulation of modeling insights and the reduction of computation cost. Every AI researcher and practitioner should look at the solution space holistically when searching for the highest-ROI direction.
By the way, I borrowed the name “The Better Lesson” from Rodney Brooks’ blog post of the same name, which he wrote as a rebuttal to Sutton’s The Bitter Lesson.
AI is a Science of the Artificial
The name “Artificial Intelligence” perfectly captures the essence of the technology: it is optimized and tested in an artificial environment for an artificial goal. Being “a science of the artificial” is AI’s biggest advantage, as Pat Langley pointed out in 1988, but it turns out to be the source of its biggest problems as well.
One of the most important and challenging problems for leaders of large-scale ML/AI products is aligning their AI’s artificial goal with their product goals in the real world. They can try to optimize their AI for a longer-term objective that aligns better with product goals, but such objectives usually turn out to be too noisy to model well, because the real world is just too complex for an artificial environment to capture. They can instead optimize for instant rewards, which unfortunately do not align well with the real goals, and because of this, using instant rewards to achieve long-term goals becomes more of an art than a science. In either case, they need to deal with the “reward hacking” problem, where the AI improves the artificial metrics without making the intended product improvements.
In an artificial environment, the data distributions are constant, or at the very least changing in a predefined way. AI works reasonably well in such a predictable environment. However, the real world changes constantly and unpredictably, so a lot of human care must be taken to keep your AI fresh and robust to hacks and incidents from the external environment. Every ML practitioner who wants to apply ML/AI to a product or workflow should ask themselves this question - would I rather keep my algorithm less optimal but agile and simple to maintain, or better optimized but harder to maintain and change because of its heavy weight? Which one is the higher-ROI thing to do? The answer will vary case by case, but even if the answer is AI, one should always keep its artificial nature in mind.
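To make that “human care” concrete, here is one small, hypothetical sketch of a common safeguard - a drift check that compares a live feature’s distribution against the training-time distribution and flags large divergences; the data, feature, and alerting threshold are all assumed for illustration.

```python
# Sketch: detect data drift by comparing a live feature sample against the training-time
# distribution with a two-sample Kolmogorov-Smirnov test. Data and threshold are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for the training-time distribution
live_feature = rng.normal(loc=0.3, scale=1.2, size=10_000)   # stand-in for current production traffic

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # hypothetical alerting threshold
    print(f"Possible drift (KS statistic = {stat:.3f}): consider retraining or investigating upstream changes.")
```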
That AI is a science of the artificial should be a gentle reminder for all ML practitioners and AI researchers, especially for those who believe AI will automate humans away and supersede humanity. They might want to ask themselves - is the AI that I am creating, or panicking about, a superintelligence of the real world, or a superintelligence of the ivory tower?