Can Reasoning Models Think Visually?
Is thinking in visuals the next big thing for improving large reasoning models’ capabilities? What would it take for reasoning models to become visual thinkers?
In the past weeks, I’ve become quite obsessed with the DeepSeek app, asking the R1 model various questions while watching it dump its chain of thought. One thing that is particularly interesting is seeing how the model reasons in natural language about problems for which humans would normally rely on vision, either by forming a picture in their minds or by drawing something on paper to assist reasoning.
Reasoning Model’s Algebraic Mind
One thing I have found since the release of OpenAI’s O1 model is that these reasoning models like to think about problems using algebra. For example, when I asked DeepSeek R1 to prove that a triangle’s three altitudes are concurrent, it chose to use analytic geometry:
First, it assigned coordinates to the three vertices.
Then, it derived three equations representing the altitudes.
Finally, it showed that the solution of one pair of these equations equals the solution of another pair, thus proving the concurrency.
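Concretely, here is a minimal sketch of the kind of coordinate argument the model produced. The placement of the triangle and the symbols b, c, d are my own choices for illustration, not the model’s exact output:

```latex
% Place side AB on the x-axis: A = (0, 0), B = (b, 0), C = (c, d), with d \neq 0.
\begin{align*}
  \text{altitude from } C \;(\perp AB):\quad & x = c \\
  \text{altitude from } A \;(\perp BC):\quad & y = -\frac{c - b}{d}\, x \\
  \text{altitude from } B \;(\perp AC):\quad & y = -\frac{c}{d}\,(x - b)
\end{align*}
% Intersecting the first equation with the second gives the point
% (c, c(b - c)/d); intersecting the first with the third gives the same
% point, so all three altitudes pass through it.
```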
This phenomenon is not surprising, given that the purpose of algebraic methods was to provide an abstraction for a whole class of problems, allowing them to be solved mechanically. I remember when a teacher introduced me to competitive math in grade 4, I hadn’t learned to use equations yet. The words I heard most frequently from the teacher were “let’s diagram this!” and “suppose that …”. Every word problem seemed to require a bit of creativity or imagination to solve. Later on I learned to use equations, and suddenly many of these word problems became mechanical to solve. However, I still liked trying to solve them the old way, especially when the algebraic approach was tedious, because there are problems where diagrams and imagination yield much more elegant solutions.
If these reasoning models excel at solving problems with algebraic reasoning, how good are they at solving problems that require thinking in visuals? Can language imitate visuals? Or do these models perhaps have the visuals in hidden representations that we can’t see?
Reasoning Model’s Ability to Reason Visually
I created a few puzzles to test the reasoning models’ ability to reason visually and the following is one of them:
From left to right, draw a vertical line and a circle of the same height at the very bottom of a blank sheet of paper, next to each other, such that the bottoms of the line and the circle touch the edge. Stand the paper upright on a flat mirror with the shapes at the bottom. Look towards the area where the paper and the mirror meet. What number, formed by the shapes, would you see?
Following is a visualization of the problem (of course, it was not shown to the models).
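If you would like to reproduce a rough version of that picture yourself, a few lines of matplotlib are enough. The coordinates and sizes below are arbitrary choices of mine; nothing here was shown to or produced by the models:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 4))

# Shapes drawn on the paper, sitting on the mirror line y = 0.
ax.plot([-1.0, -1.0], [0.0, 1.0], color="black", linewidth=3)          # vertical line
ax.add_patch(plt.Circle((0.0, 0.5), 0.5, fill=False, linewidth=3))     # circle touching the edge

# Their reflections in the mirror (below y = 0): together the shapes
# and their reflections read as "18".
ax.plot([-1.0, -1.0], [0.0, -1.0], color="gray", linewidth=3)
ax.add_patch(plt.Circle((0.0, -0.5), 0.5, fill=False, edgecolor="gray", linewidth=3))

# The mirror surface.
ax.axhline(0.0, color="steelblue", linestyle="--", linewidth=1)

ax.set_aspect("equal")
ax.set_xlim(-2.0, 1.0)
ax.set_ylim(-1.5, 1.5)
ax.axis("off")
plt.show()
```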
All of the models that I tested gave the wrong answer. Both DeepSeek R1 and OpenAI O3-mini answered 10, and Gemini 2.0 Flash Thinking answered 1, which indicates that all of them have little or no visual thinking ability. I am pretty sure future LLMs will get the answer right, but what is interesting is that, in their reasoning tokens, the models repeatedly mentioned “visualize”, “imagine”, etc. Apparently, these words in their “minds” do not quite carry the same meaning as they do in ours.
As I read through their thinking tokens, I started to feel sympathy for these models. Gemini thought the line and the circle would switch sides in a mirror lying flat, and didn’t consider the shapes on the paper and in the mirror together. R1 spent lots of tokens reasoning about what number a vertical line and a circle next to each other would form, whether 6, 9, or 10. It was also trapped in a cycle of self-negation, switching back and forth between “yes, the reflection will complete the circle” and “no, there is already a complete circle on the paper”.
I wouldn't be surprised if LLMs in the near future solve this problem, given how simple it is, but if these “PhD level” models were able to draw a picture like the one above (which I did with the help of Claude), identifying the number 18 would be a simple task for them.
The Power of Visualization
Visualization not only gives us shorter, more elegant solutions than algebraic methods, or helps us solve tricky problems that require understanding humans’ perception of shapes; it is also a source of creativity.
Einstein was a famous example of a visual thinker. In a letter written in 1945, he said:
"The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be 'voluntarily' reproduced and combined. .... This combinatory play seems to be the essential feature in productive thought before there is any connection with logical construction in words or other kinds of signs which can be communicated to others"
A lot more anecdotal evidence from scientists on the importance of visual thinking for scientific discovery can be found here.
Not everyone is a good visual thinker, but we can always draw diagrams on paper or manipulate objects in front of us to gain insights. Drawing diagrams was how my teacher taught me to find insights, and if those insights are novel, they become innovations.
Existing Research
As I contemplated the idea of extending reasoning models’ capabilities with visual thinking, I searched online for prior research and found this very recent paper with a very similar idea, which they call “MVoT”. In one of the approaches, they prompted GPT4o with problem-specific instructions to generate a visualization based on its chosen action, observe the visualization, and decide whether to take the next action or draw a conclusion.
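As I understand it, the driving loop looks roughly like the sketch below. This is my own simplification; the helper functions are hypothetical stand-ins for the paper’s benchmark-specific prompts and the external driver, not real APIs:

```python
from dataclasses import dataclass
from typing import Optional

# A simplified, hypothetical sketch of the prompt-driven loop described above.
# The helpers are stand-ins for the benchmark-specific prompts and the external
# program that drives the model; they are not APIs from the paper or any library.

@dataclass
class Decision:
    is_final: bool
    answer: Optional[str] = None
    updated_state: Optional[str] = None

def propose_action(state: str) -> str:
    # In the real workflow this would be an LLM call choosing the next move.
    return f"next move given: {state}"

def render_visualization(state: str, action: str) -> str:
    # In the real workflow this would produce an image of the action's outcome.
    return f"[picture of '{action}']"

def decide(state: str, action: str, image: str) -> Decision:
    # In the real workflow the model would inspect the image and decide
    # whether to continue or to answer. This stub stops immediately.
    return Decision(is_final=True, answer="<answer>")

def solve_with_visual_thoughts(problem: str, max_steps: int = 10) -> Optional[str]:
    state = problem
    for _ in range(max_steps):
        action = propose_action(state)               # choose the next action
        image = render_visualization(state, action)  # visualize its outcome
        decision = decide(state, action, image)      # look at the picture
        if decision.is_final:
            return decision.answer
        state = decision.updated_state or state      # continue from the new state
    return None
```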
There are a few important limitations to this approach, though:
They used human-crafted prompts tailored to each benchmark, and an external program to drive the iterative process. In other words, it is a program-driven, agentic workflow built for particular benchmarks, rather than a model or system that can invoke visual thinking by itself.
The paper showed that MVoT prompting improves over GPT4o with other prompting techniques, which is not surprising. It didn’t compare against reasoning models, which would be a more informative comparison.
In another approach, they fine-tuned a model that can generate interleaved text and images on the target benchmark, and showed that MVoT tuning has some limited benefits compared with other model-tuning approaches.
As far as I can tell, the research in this direction is still pretty early.
Challenges Ahead
What would it take for reasoning models to become visual thinkers? What are the challenges to overcome?
For one, it is very challenging for today’s multi-modal models to reliably generate visualizations. For the “paper-mirror” example above, I was the one who crafted the prompt to focus on what matters for the problem, and I had to iterate on the prompt a couple of times to get the right visualization. Most importantly, I was able to drive this iterative process because, before I started, I already had the visualization in my mind. Someday, improvements in LLMs with multi-modal output may help these models “fake” a visual mind, but without major breakthroughs and paradigm shifts, it is clear that these LLMs don’t have a real understanding of the structure of the physical world, and they don’t understand the connections between the components they generate.
So, when can reasoning models truly think visually? I don’t have an answer. What I believe though, is that when that day comes, these models will become much more human-like.