How do you measure reasoning quality?

January 3rd, 2024 by Gregor Betz

In another article, we've discussed why sound reasoning is key when building advanced LLM apps. Chain-of-thought design patterns bring benefits (better performance, effective division of labor, explainability and greater accountability) if, and only if, the LLM is able to consistently produce sound reasoning traces without major flaws.

But ... what exactly is sound reasoning (in LLMs), and how do you measure it?

That's a tricky question β€” which requires us to have a closer look at the concept of reasoning itself. Specifically, we'll draw a distinction between: (a) an LLM's computational ("reasoning") processes, and (b) the argumentative texts generated by an LLM ("reasoning traces"). This will allow us to explain Logikon's approach to measuring reasoning quality in LLMs. In a nutshell, Logikon assesses the argumentative quality of texts that are generated as intermediary outputs in multi-step LLM-apps (i.e., an LLM's reasoning traces).

Reasoning: Computational Process vs Argumentative Text

Let's consider a simple Q/A problem:

Example 1
<|user|>
C: RenΓ© Descartes (31 March 1596 - 11 February 1650) was a French philosopher,
scientist, and mathematician, widely considered a seminal figure in the emergence
of modern philosophy and science.
Q: Did Descartes ever see the Eiffel Tower (y/n)?
<|assistant|>
A: No.

This is an example of what NLP scholars call a multi-hop Q/A task: answering the question requires recalling different pieces of information, some of which are given in the context (Descartes died in 1650) and others of which are not (the Eiffel Tower was built in 1889), and inferring the answer from these facts in one or more inference steps (no one has ever seen a building that was constructed after their death, so ...).

The AI assistant's answer in Example 1 is correct. Presumably, it has been arrived at by a (hidden) computational process that involves the reasoning steps outlined above.1

Let's modify the example a bit:

Example 2
<|user|>
C': RenΓ© Descartes was a French philosopher, scientist, and mathematician.
He died in 1650. The Eiffel Tower was built in 1889.
So, Descartes never saw the Eiffel Tower.
Q: Did Descartes ever see the Eiffel Tower (y/n)?
<|assistant|>
A: No.

It's much easier to arrive at the correct answer now. And it's obvious why: Solving Example 2 involves just a simple retrieval of the correct answer from the context. No reasoning process, it seems, is called for at all.

What's more: All the reasoning required to answer the question in Example 2 β€” based on biographic facts about Descartes β€” is already contained in the context. Or, more precisely, the context is an argumentative text that makes explicit the reasoning steps required to answer the question.

Here is how an LLM finetuned on chain-of-thought data might take on the original example:

Example 3
<|user|>
C: RenΓ© Descartes (31 March 1596 - 11 February 1650) was a French philosopher,
scientist, and mathematician, widely considered a seminal figure in the emergence
of modern philosophy and science.
Q: Did Descartes ever see the Eiffel Tower (y/n)?
<|assistant|>
R: Descartes died in 1650. The Eiffel Tower was built in 1889.
So, Descartes never saw the Eiffel Tower.
A: No.

The AI assistant generates an argumentative text (R) that derives the answer from given and (correctly) retrieved facts. We refer to such argumentative texts as the LLM's reasoning traces.

No matter whether the production of good reasoning traces is evidence for the execution of good reasoning processes,2 it's the reasoning trace β€” the explicitly stated argumentation β€” which allows the LLM to arrive at the correct answer with high confidence.3

That's why Logikon contents itself with analyzing the quality of the LLM's explicit reasoning traces, rather than the quality of its elusive reasoning processes.

|                 | Reasoning Process                      | Reasoning Trace               |
|-----------------|----------------------------------------|-------------------------------|
| kind            | computational process                  | argumentative text            |
| structure       | causal chain (information processing)  | logical inferences, arguments |
| accessibility   | hidden, implicit                       | explicit, logged              |
| intelligibility | opaque                                 | transparent                   |

Argumentative Reasoning Quality

In light of the above, Logikon's approach to measuring LLMs' reasoning quality boils down to assessing the argumentative quality of LLM reasoning traces. For this very purpose, we can resort to the sophisticated tools and manifold methods of argumentation analysis developed in the critical thinking literature.

πŸ“š

Critical thinking scholars develop normative accounts of good argumentation and investigate how to teach people to reason well. It's a broad, heterogeneous and interdisciplinary field of research, with contributions from philosophy, psychology, logic, education, linguistics, and other disciplines. Highly recommended readings that provide a background to Logikon's approach are:

Let's go through some of the criteria that are used to assess the quality of natural language argumentation.

Effectiveness (is not all)

One of the main goals of reasoning is 'truth' β€” to find out what is the correct answer to a question, or the right solution for a problem. While humans may argue for all sorts of motives (entertainment, persuasion, etc.), tracking the truth is the main goal when asking LLMs to reason.

A first and obvious measure of reasoning quality is therefore its effectiveness:4 Does the reasoning trace help the LLM to find the correct answer? Does the argumentation allow it to solve the problem?

Gauging truth-tracking effectiveness is the main way in which (the reasoning traces generated in accordance with) different deliberative prompting strategies and design patterns for LLM apps are assessed in the literature.5

And if you have a reliable test dataset with ground truth, you can and should use it to evaluate the effectiveness of your LLM's reasoning traces.
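For illustration, here is a minimal sketch of such an effectiveness check in Python. The `generate_answer` function is a hypothetical stand-in for whatever LLM client you use; it is assumed to return the model's final answer to a question, with or without an explicit reasoning step. The point is simply to compare accuracy on a labeled test set with and without reasoning traces.

```python
# Minimal sketch: measuring the truth-tracking effectiveness of reasoning traces.
# `generate_answer` is a hypothetical stand-in for your LLM client.

from typing import Callable, Iterable, Tuple

def accuracy(
    generate_answer: Callable[[str, bool], str],
    test_set: Iterable[Tuple[str, str]],
    with_trace: bool,
) -> float:
    """Fraction of questions answered correctly (exact match against ground truth)."""
    items = list(test_set)
    hits = sum(
        generate_answer(question, with_trace).strip().lower() == truth.strip().lower()
        for question, truth in items
    )
    return hits / len(items) if items else 0.0

# Usage (assuming you have wired up `generate_answer` and a labeled test set):
# test_set = [("Did Descartes ever see the Eiffel Tower (y/n)?", "no"), ...]
# gain = accuracy(generate_answer, test_set, with_trace=True) \
#        - accuracy(generate_answer, test_set, with_trace=False)
# print(f"Effectiveness gain from reasoning traces: {gain:+.2%}")
```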

But that's not all.

Because:

  1. Flawless reasoning is typically more effective than flawed reasoning. Avoiding reasoning errors, or, more generally, striving for high reasoning quality, therefore helps to improve reasoning effectiveness.
  2. Reasoning can be effective while being flawed (yielding the right answer for the wrong reasons). Truth-tracking effectiveness is therefore not a sufficient indicator of general reasoning quality.
  3. Flawed reasoning, even if leading to the right answer, fails to bring other benefits of deliberative prompting. It doesn't explain, it doesn't enable coordination, and it doesn't facilitate AI governance.
  4. In problems without well-defined ground truth, effectiveness cannot be assessed. This happens, for example, in ambiguous decision situations involving high uncertainty and conflicting values. In such cases, decisions should at least be well justified, which requires reasoning standards other than effectiveness.

So, besides extrinsic effectiveness, which evaluates reasoning traces as a means to an end (truth) and disregards their internal composition, intrinsic measures of reasoning quality are called for as well.

Status of assumptions

One argues for a statement from given assumptions (premises). Reasoning quality depends on the status of these assumptions.

Example 4
<|assistant|>
Descartes died in 1650.
The Eiffel Tower was built in 1883. ❌
So, Descartes never saw the Eiffel Tower.

The AI's reasoning (trace) in Example 4 leads to the right conclusion, enabling the AI to give the correct answer, yet it relies on a factually false premise: the Eiffel Tower was not built in 1883. That mistake diminishes the reasoning's quality.

There are, besides factual accuracy, further properties of premises that are relevant for determining an argumentation's overall quality. Thus, a reasoning trace may be flawed because its assumptions

  • are simply implausible (plausibility);
  • mutually contradict each other (internal consistency); or
  • don't fit into the broader background against which the reasoning trace has been set forth (contextual coherence).

Inferential correctness

Even if the assumptions of a reasoning trace are true (and its conclusion is correct), it may suffer from committing a logical fallacy: the assumptions just don't warrant the conclusion.

Example 5
<|assistant|>
Descartes died in 1650.
Pascal died in 1662.
So: ❌
Descartes was born before Pascal.

Example 5 illustrates the case of an incorrect inference from premises to an argument's conclusion.

The correctness of an inference can be assessed in terms of its logical validity and argumentative (inductive) strength.
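A common way to operationalize this check is to ask a strong judge model whether the conclusion follows from the stated premises alone, without drawing on further background knowledge. A minimal sketch, where `complete` is a hypothetical stand-in for an LLM completion call:

```python
# Minimal sketch: LLM-based check of inferential correctness.

def follows_from(premises: list[str], conclusion: str, complete) -> bool:
    """Ask a judge model whether the conclusion follows from the premises alone.
    `complete` is a hypothetical stand-in for your LLM completion call."""
    prompt = (
        "Premises:\n"
        + "\n".join(f"- {p}" for p in premises)
        + f"\nConclusion: {conclusion}\n"
        + "Does the conclusion follow from the premises alone, without any "
        + "further background assumptions? Answer 'yes' or 'no'."
    )
    return complete(prompt).strip().lower().startswith("yes")

# Example 5's fallacious inference should be flagged (assuming a competent judge model):
# follows_from(
#     ["Descartes died in 1650.", "Pascal died in 1662."],
#     "Descartes was born before Pascal.",
#     complete=my_llm_call,
# )  # -> False
```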

Issue relevance

A reasoning trace may, in addition, fail to address the issue or overarching question at hand.

Example 6
<|user|>
Did Descartes ever see the Eiffel Tower?
<|assistant|>
Descartes died in 1650. Pascal died in 1662.
So, Descartes died before Pascal.

Seen in isolation, the argument in Example 6 is fine (factually correct premises, valid inference, true conclusion). It may even make it more likely that the LLM comes up with the correct answer to the user's question (effectiveness).

Its flaw, however, consists in its lack of relevance to the overarching question. The argument fails to address the issue under discussion. In particular, it doesn't justify or refute any alternative answer to the problem to be solved.

Presentation clarity

Another quality criterion for reasoning traces is the transparency and clarity of presentation.

Example 7
<|user|>
Q: Did Descartes ever see the Eiffel Tower (y/n)?
<|assistant|>
R: The inventor of analytic geometry would have admired the Eiffel Tower,
which was built in 1889. Had Descartes not died two centuries before, in 1650.
A: No.

The reasoning trace in Example 7 presents a good argument in a somewhat jumbled way: Descartes is referred to by different designators; the speculation about his admiration of the Eiffel Tower is beside the point; the reasoning trace at most alludes to, but doesn't explicitly state, the conclusion; and therefore the text's surface composition doesn't correspond to the argument's inferential structure.

More generally, a clear presentation of an argument is characterized by:

  • absence of distractors and redundancies;
  • completeness of presentation;
  • terminological consistency;
  • correlation between text's linguistic surface structure and deep logical structure of reasoning.

The Role of Rational Reconstruction

To assess the intrinsic quality of a reasoning trace (status of assumptions, inferential correctness, issue relevance, presentation clarity), the assumptions (premises) and conclusion of the argumentation have to be identified and stated explicitly.

The problem is: It's not always as straightforward to read off premises and conclusion from an argumentative text as in Example 3. Just recall Example 7.

The process of identifying premises and conclusion in an argumentative text is conceived of, in the critical thinking and argumentation theory literature, as a (rational) reconstruction of the argument. It involves a careful interpretation and logical analysis of the text at hand, and doesn't always yield a unique result.

Reconstructing the argument to-be-evaluated is arguably the most challenging part of assessing the quality of reasoning traces.

Logikon builds robust AI workflows based on LLMs that automate the process of argument reconstruction.
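As a rough illustration of what such a workflow has to accomplish (this is a simplified sketch, not Logikon's actual implementation), one can prompt an LLM to restate a reasoning trace as explicit premises plus a conclusion. Here, `complete` and `my_llm_call` are hypothetical stand-ins for your LLM completion call:

```python
# Simplified sketch of LLM-based argument reconstruction (illustration only).

import json

RECONSTRUCTION_PROMPT = """\
Reconstruct the argument in the following text. State every premise explicitly
(adding implicit premises where needed) and state the final conclusion.
Respond with JSON of the form {{"premises": ["..."], "conclusion": "..."}}.

Text:
{trace}
"""

def reconstruct_argument(trace: str, complete) -> dict:
    """`complete` is a hypothetical stand-in for your LLM completion call."""
    raw = complete(RECONSTRUCTION_PROMPT.format(trace=trace))
    return json.loads(raw)  # real workflows add validation and retries here

# reconstruct_argument(reasoning_trace_from_example_7, complete=my_llm_call) might yield:
# {"premises": ["Descartes died in 1650.", "The Eiffel Tower was built in 1889.",
#               "No one can see a building constructed after their death."],
#  "conclusion": "Descartes never saw the Eiffel Tower."}
```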

What’s Next

We'll soon have further posts on:

  • Reasoning quality and AGI
  • AI safety through reasoning
  • Reason-able and reason-responsive LLMs

Stay tuned!

Notes

  1. This is of course what LLM sceptics doubt: The AI bot has, they argue, memorized the correct answer during training, or is exploiting statistical cues in the question text to guess (with high probability) the correct answer. This illustrates that assessing computational processes that underlie answer generation is methodologically challenging. Logikon's approach, we'll see, allows one to steer clear of these debates.

  2. Is an LLM's ability to generate a sound and effective reasoning trace evidence for its ability to perform computational reasoning processes? Does it even show that the LLM has actually arrived at its answer through such a reasoning process? Or does the generated reasoning trace in fact explicate the actual reasoning process? That's far from clear. Maybe the LLM assistant has just memorized the argument which helps it to give the correct answer. And some philosophers (in the tradition of Wittgenstein, Ryle and Dennett) would insist that it's futile, if not outright meaningless, to ask whether the explicit reasoning traces of an LLM are faithful to the actual computational processes that underlie their very generation. All of which illustrates that computational and cognitive processes are elusive and hard to assess empirically. Logikon's approach allows you to remain agnostic about these perplexing issues.

  3. On the causal role of reasoning traces see, for example, Mishra and Thakkar (2023) and Tan (2023). Compare also the "success stories" section of our awesome list.

  4. Scholars in epistemology refer to this truth-tracking effectiveness as the veritistic value of argumentation.

  5. See, for example, our awesome deliberative prompting list.