πŸ’‘ Why should I care about reasoning quality when building an LLM app?

Advanced LLM-based agents are widely seen as capable of "reasoning". For example, they can

  • ask questions to clarify a given problem,
  • generate text that explores possible solutions,
  • repeat the key evidence statements from the context,
  • deliberate about multiple decision options,
  • present and weigh pros and cons,
  • justify a final answer, and
  • explain a decision that has been made.

If you build an LLM app, the potential benefits of including such reasoning steps are huge and manifold. But their realization depends crucially on the actual quality of the reasoning produced.

In this post, we'll look at different design patterns for advanced LLM apps that try to unlock the reasoning potential of LLMs. Furthermore, we'll discuss how this might translate into tangible benefits, emphasizing that the quality of reasoning is paramount. After all, flawed reasoning might prove more detrimental than having no reasoning in your LLM app at all.

Reasoning in LLMs

Instructing an LLM to reason explicitly about the task it's supposed to solve, or the decisions it must take, can have multiple advantages.

  • Performance
    Reasoning increases accuracy.
  • Coordination
    Reasoning allows multiple agents to collaborate efficiently on a task (division of labor).
  • Explainability
    Reasoning makes GenAI systems intelligible, increasing trust and adoption.
  • Governance
    Reasoning makes AI decisions defensible and helps to meet compliance rules.

Let's briefly look at different design patterns for advanced LLM apps to see how LLMs can be turned into reasoners, and why the above benefits may accrue.

Chain of Thought

Chain-of-thought (CoT) is a pillar of prompt engineering. It's as simple as it is effective: just ask the LLM to produce a problem analysis before submitting its answer, rather than having it answer directly, as in:

<|user|>
C: {general business information}{financial data}{proposed strategies}
Q: Should our company expand in market {x}?
<|assistant|>
A: Yes.
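
For comparison, here is a minimal sketch of issuing the same query once directly and once with a CoT instruction. It assumes the OpenAI Python SDK (v1 or later); the model name, the helper function ask, and the prompt wording are illustrative placeholders, not part of the original example.

# Minimal sketch: the same query, once as a direct prompt and once with a
# chain-of-thought (CoT) instruction. Assumes the OpenAI Python SDK (>= 1.0)
# and an OPENAI_API_KEY in the environment; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
context = "{general business information}{financial data}{proposed strategies}"
question = "Should our company expand in market X?"

def ask(prompt: str) -> str:
    """Send a single user turn and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Direct prompt: the model is asked to answer immediately.
direct_answer = ask(f"C: {context}\nQ: {question}\nAnswer with Yes or No only.")

# CoT prompt: the model is asked to reason step by step before answering.
cot_answer = ask(
    f"C: {context}\nQ: {question}\n"
    "Let's think step by step, then give a final answer (Yes or No)."
)

print("Direct:", direct_answer)
print("CoT:", cot_answer)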

The basic CoT idea has given rise to an abundance of more sophisticated prompting strategies and design patterns that refine the original idea (e.g., Divide-and-Conquer instructions), combine it with alternative approaches (e.g., reasoning and tool-use, CoT-enhanced RAG), or increase its generality and applicability (e.g., metacognitive prompting, self-correction, auto-generated CoT demonstrations).1

LLM-based Multi-Agent Systems

One of 2023's many fascinating trends in AI has been the development and deployment of multi-agent systems based on LLMs. Rather than having a single LLM-based agent (with planning, reasoning, tool-use, and decision-making capabilities) tackle a complex task, you deploy a swarm of agents that collectively solve the task by dividing it into subtasks and coordinating their actions.2

The following illustrates such a swarm of LLM-based agents in the ChatDev environment, a multi-agent simulation platform for developing and testing software.

[Figure: Example swarm of LLM-based agents in ChatDev.]

LLM-based agents in a multi-agent system will not only produce internal reasoning traces as part of their planning and task execution; they will also communicate their reasoning to other agents and collectively deliberate about decisions to be taken at the system level.
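
To make this concrete, here is a toy sketch in which two LLM "agents" share their reasoning before a moderator weighs both views. It again assumes the OpenAI Python SDK; the roles, prompts, and model name are invented for illustration and are not taken from ChatDev or any specific framework.

# Toy sketch: two LLM agents exchange their reasoning before a joint decision.
# Assumes the OpenAI Python SDK (>= 1.0); roles, prompts, and model name are
# illustrative and not taken from any specific multi-agent framework.
from openai import OpenAI

client = OpenAI()

def agent_reply(persona: str, prompt: str) -> str:
    """One agent turn: a persona-conditioned chat completion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

task = "Should we expand in market X? Context: {business and financial data}"

# Agent 1 (a 'CFO' agent) states its recommendation together with its reasons.
cfo_view = agent_reply(
    "You are the CFO agent. Give a recommendation and the reasons behind it.",
    task,
)

# Agent 2 (a 'COO' agent) reads that reasoning and responds to it explicitly.
coo_view = agent_reply(
    "You are the COO agent. Critically assess the CFO's reasoning, then give "
    "your own recommendation with reasons.",
    f"{task}\n\nCFO's reasoning:\n{cfo_view}",
)

# A moderator call weighs both lines of reasoning into a single decision.
decision = agent_reply(
    "You are the moderator. Weigh both arguments and state the final decision.",
    f"CFO's view:\n{cfo_view}\n\nCOO's view:\n{coo_view}",
)
print(decision)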

Rationales and XAI

Data-driven ML systems have been criticized for being opaque and unintelligible. This has led to a surge of interest in explainable AI (XAI) and the development of a variety of techniques for explaining the decisions of ML systems.

But the advent of LLM-based AI systems that master natural language (can reason, reflect and self-correct) and gradually reach ever higher levels of AGI3 is a game changer in the XAI arena.

The black-box is dead, long live the glass-box!

With CoT planning and reasoning at the level of individual AI agents, and collective deliberation at the swarm level, LLM-based AI systems become as transparent and intelligible as human agents (assuming humans' inner monologues were accessible and their interpersonal communications meticulously logged).

Over and above letting LLMs reason about their tasks before they submit their answers, we can also ask them to explain their decisions after they have been made.

<|user|>
C: {general business information}{financial data}{proposed strategies}
Q: Should our company expand in market {x}? Let's think step by step.
<|assistant|>
While the plan to expand in market {x} offers a good opportunity to increase short-term revenue,
it likely stretches our logistics and supply chain to the limit, and prevents us from meeting our
long-term growth targets in our key markets.
A: No.
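
The prompt above still elicits the reasoning before the answer is given. For a genuinely ex-post explanation, the rationale can be requested in a follow-up turn after the decision has been stated. A minimal sketch, again assuming the OpenAI Python SDK, with placeholder prompts and model name:

# Sketch of ex-post explanation: the model answers first, then is asked to
# justify that answer in a follow-up turn. Assumes the OpenAI Python SDK
# (>= 1.0); model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()
model = "gpt-4o-mini"  # placeholder model name

history = [
    {
        "role": "user",
        "content": "C: {general business information}{financial data}{proposed strategies}\n"
                   "Q: Should our company expand in market X? Answer with Yes or No only.",
    }
]
first = client.chat.completions.create(model=model, messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turn: ask the model to explain the decision it has already given.
history.append({"role": "user", "content": "Explain the reasons behind your answer."})
second = client.chat.completions.create(model=model, messages=history)

print("Decision:", history[1]["content"])
print("Rationale:", second.choices[0].message.content)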

As recently illustrated for autonomous driving, an LLM-based agent can not only be used to explain decisions it has taken itself, but also render intelligible the decisions of other AI systems.4

Benefits of Deliberating AIs

Performance. CoT and multi-agent design patterns have been shown to boost the accuracy of LLM-based AIs in a variety of tasks.5 Exploring alternative answers, giving reasons for an answer, checking one's arguments, and revising one's thinking in the light of what one's peers have to say are truth-conducive epistemic strategies that seem to work for humans and machines alike.6

Coordination. Division of labor requires that individual agents coordinate their behavior, which in turn requires that they understand why other agents do what they do.7 Communicating reasoning helps agents to understand each other's intentions, allows them to adapt their goals and to choose means carefully so that they engage in mutually supportive behavior rather than unintentionally obstructing each other.

Explainability. Reasoning traces show how an AI arrived at its answer, which makes it transparent and intelligible. Being able to independently probe the AI's reasoning, users can assess its reliability and gradually develop trust. This seems crucial for wider adoption, especially in cases where ground truth is lacking and performance tests are difficult.

Governance. Reasoning (ex post or ex ante) ensures that a decision taken by an AI system is defensible and can be justified to, e.g., stakeholders or important customers. It also helps an organization remain accountable and meet compliance rules, e.g., by demonstrating that a decision was made in a fair and unbiased way.

All these benefits accrue under the assumption that the reasoning produced is sound. But what if it's not?

The Perils of Poor Reasoning

Poor reasoning can be worse than no reasoning at all. Erroneous reasoning tends to decrease an AI's performance;8 it misleads AI agents and humans about each other's intentions, preventing them from coordinating their actions;9 it undermines trust, once detected, and hinders further adoption of the AI system; and it poses an additional governance risk if used in compliance contexts.

Objective        | with sound reasoning           | with flawed reasoning
Performance      | βœ… increased accuracy           | ❌ decreased accuracy
Coordination     | βœ… efficient collaboration      | ❌ no effective division of labor
Explainability   | βœ… intelligible, trustworthy    | ❌ unintelligible, untrustworthy
Governance       | βœ… defensible decisions, safe   | ❌ indefensible, extra risk

Conclusion

Building LLM-based agents that reason explicitly about their tasks is a promising approach to unlock the full potential of LLMs. It can increase performance, facilitate multi-agent coordination, improve explainability, and address governance and safety issues. But the quality of the reasoning produced is paramount. Poor reasoning can be worse than no reasoning at all.

What’s Next

We'll have a follow-up post on measuring reasoning quality, explaining Logikon's critical thinking approach.

Notes

  1. Check out our Deliberative Prompting Guide and our list of awesome deliberative prompting resources on CoT and related design patterns.

  2. Wang et al. (2023) is a continuously updated survey of LLM-based agent frameworks, including multi-agent systems. Du et al. (2023) show that multi-agent debate outperforms self-check and 'Condorcet jury' setups. Notable frameworks for setting up LLM-based agent swarms: AutoGen and ChatDev.

  3. On levels of AGI, compare the recent Morris et al. (2023).

  4. See the research at wayve.ai.

  5. E.g. Suzgun et al. (2022) or Zhong et al. (2023); for more, see our awesome list.

  6. See Du et al. (2023) and Chen et al. (2023).

  7. A theory of agency along these lines has been developed by Michael Bratman; cf., e.g., here or here.

  8. See, e.g., Mishra and Thakkar (2023).

  9. These problems arise specifically (but not exclusively) with fallacious means-ends reasoning, e.g.: Agent-1 has goal A and says so. Agent-2 knows that bringing about M is sufficient (but not necessary) for reaching A. Agent-2 wrongly infers that Agent-1 intends to bring about M, and therefore brings about M itself (to help Agent-1). But this prevents Agent-1 from choosing the optimal means to reach A.