LLMs and the Power of Review

At the most fundamental level, Large Language Models (LLMs) do not actively reason — they predict. LLMs use probability to produce what someone would have likely said in response to a prompt. They do not consider whether the response is correct as they are writing it. However the same statistical output means they can actually do a good job of reviewing their own output, based upon what someone probably would say having reviewed that work.

At first this seems counter-intuitive — that AI could be successful at discovering its own errors. Considering human reasoning, it might seem that if an LLM could review its own output, then it should have been able to have produced the correct output from the beginning. The clarification comes from realizing that LLMs are not reasoning at all, but merely producing probable output. The probable output of producing an answer is not the same as the probable output of reviewing that answer after it is produced.

Chain of Thought

In many ways the process of review is similar to the additional linguistic layers introduced by "Chain of Thought". One of the formative AI papers in 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, notes that by prompting a model to first introduce an explanation of its reasoning, the model will more often produce a correct answer.

Let's look at an example. Suppose you ask the model, "If there are four apples and five bananas, but someone eats two apples and one banana spoils, how many fruits in total are left?" Just producing the answer, the LLM might respond:

Answer: five

It may or may not produce the correct answer. But if you prompt the model to first explain how it arrives at the answer, this increases the chances of the correct answer being produced:

If someone eats two of the four apples, two are left. If one of the five bananas spoils, there are only four left. Two apples and four bananas are six fruits in total.
Answer: six

The authors of the "Chain of Thought" article explain these results in more rigorous technical terms, but here is how I think of it intuitively:

Going on probability, it may or may not be probable that a correct "bare" answer is produced in direct response to a question.
However, it is more probable that an explanation of how to produce the correct answer is correct (assuming it is within the "domain knowledge" of the model); put another way, there is a low probability of producing an incorrect procedure for finding the answer.
Given a correct process for finding an answer, it is unlikely that someone would follow a correct set of instructions with the wrong answer.

The point is that we are layering semantic indirection, which is already a fundamental concept of information engineering. (See the GlobalMentor, Inc. course lesson on indirection.) The more semantic layers we have — explaining how something will be planned, producing the actual plan, actually following the plan, etc. — makes it more likely that the end result will be correct or optimal.

Thus these "semantic layers" are substituting for reasoning. Or put another way, they are documenting what one would have reasoned had one actually reasoned. In an upcoming article, I'll explain how pragmatically the difference between "as if one had reasoned" and "did in fact reason" may not make much difference in practice — or may even arguably be a false distinction.

Semantic Layers and AI Primitives

Thus asking an LLM to review its work can be seen as a semantic layer of indirection. The actual "answer" or response is one layer of probabilistic output. The review of that output is another layer. From this view, it makes sense that the LLM could review its own output. In fact it should be able to review its own review of its own output!

What you may have noticed here is that we are producing more powerful "reasoning", or at least answers equivalent to additional reasoning, by adding these semantic layers. In upcoming articles I'll refer to these sorts of components as "primitives" and we'll discuss how we can combine these primitives in chains and harnesses to produce more useful responses.

In AI Engineering by Chip Huyen (O’Reilly, 2025, ISBN 978-1-098-16630-4), the author notes that planning is a fundamental component of creating an agent, and reflection (i.e. having a model review its own plan) is an important step in producing the best execution:

[S]olving a task typically involves the following processes. Note that reflection isn’t mandatory for an agent, but it’ll significantly boost the agent’s performance:

Plan generation: come up with a plan for accomplishing this task. A plan is a sequence of manageable actions, so this process is also called task decomposition.

Reflection and error correction: evaluate the generated plan. If it’s a bad plan, generate a new one.

Execution: take the actions outlined in the generated plan. This often involves calling specific functions.

Reflection and error correction: upon receiving the action outcomes, evaluate these outcomes and determine whether the goal has been accomplished. Identify and correct mistakes. If the goal is not completed, generate a new plan.

Here you can see that "reflection" has been added as a "primitive" operation in a larger pipeline of steps. This is a recurring theme we'll be discussing on this site.

And Such are Humans

Stepping back, these sorts of semantic layers already prove beneficial to humans as well. If you can remember having to "show your work" in school, you probably have also realized the benefits that "showing your work" brings in producing the right answer. High school courses don't just require report writing — they also teach procedures for writing an outline, collecting information, organizing the material, synthesizing the report, and finally reviewing the result before submitting it. These semantic layers of planning, organization, and review help humans produce quality output by "reifying" the reasoning process across several steps.

On more than one occasion I've encountered a perplexing software development problem, and upon taking the time to meticulously describe the problem in a draft question on Stack Overflow, discovered the answer myself through the process of explaining the issue (after which I canceled submitting the question). This phenomenon was made famous by the term "rubber ducking" which grew out of a story in The Pragmatic Programmer by Andrew Hunt and David Thomas (Addison-Wesley, 2000, ISBN 0-201-61622-X):

A very simple but particularly useful technique for finding the cause of a problem is simply to explain it to someone else. The other person should look over your shoulder at the screen, and nod his or her head constantly (like a rubber duck bobbing up and down in a bathtub). They do not need to say a word; the simple act of explaining, step by step, what the code is supposed to do often causes the problem to leap off the screen and announce itself.

I have a thesis that, because LLMs statistically generate responses from being trained on human language, LLMs can also exhibit human behavior, which is encoded in the language humans use. Thus knowledge of human behavior may have some bearing on understanding LLMs, and understanding LLM behavior can provide insight into human behavior. The articles on this site will thus investigate to what extent AI and humans can be understood in light of each other. In addition, as AI moves from mere chat-partners to full-blown agents, it is likely that not only will developers on your team be using AI — it may be that soon some of your team members are actually AI agents! This blog takes the view that leadership in the AI age must consider not only how humans and AI behave, but how they influence each other and work together.

Try it Yourself

In these articles I'll continually tie concepts to real-world experiments you can try and see benefit in your own work. Let's start with something I use myself: I've started submitting a follow-up prompt to the LLM immediately after every major implementation:

Reflect on the implementation you just completed. Did you learn anything new from the implementation? Did you run into any unexpected hurdles? Did you have to rely on hacks, kludges, or inferior designs to get around roadblocks? If refactoring was involved, was the resulting implementation equivalent to the elegant implementation you would have produced if you had implemented this from scratch; if not, how does the result differ?

Try it yourself! You'll be amazed to discover the corners the LLM cut, the antipatterns it implemented, the bad practices it followed, and the kludges it threw together to get itself out of a bind it found itself in. Now that you have an appreciation for how LLMs (and humans) benefit from semantic layers of review, it might make more sense that an LLM can realize its own failings by checking its own work, even when it couldn't avoid those failings from the start.