Understanding LLM Limitations in Legal Reasoning: What New Research Reveals

As legal professionals increasingly turn to AI for tasks like evidence review and legal analysis, or consider doing so, it's critical to understand the true capabilities and limitations of Large Language Models (LLMs). Recent advances in reasoning models have generated significant interest, with claims that these systems can perform sophisticated analytical tasks. However, a new study from researchers at the University of Pittsburgh and Technical University of Munich reveals substantial limitations when these models face complex, hierarchical legal reasoning tasks.

The Challenge: Case-Based Legal Reasoning

Case-based reasoning is fundamental to common law legal systems. Legal professionals must draw analogies to favourable precedents while distinguishing their current case from unfavourable ones. This process requires abstracting specific facts into broader legal concepts, weighing conflicting evidence, and understanding how different facts contribute to an overall legal argument. It is not simple pattern matching.

In the recent study, the research team developed a formal framework to test whether LLMs can identify "significant distinctions": factual differences between a current case and a precedent that are critical enough to justify a different outcome. Building on CATO (“Case Argument TutOrial”), an influential AI system from the 1990s that pioneered the use of hierarchical factor structures for legal reasoning, they broke this complex task into three increasingly difficult subtasks:

  1. Identify distinctions: Find all factual differences that could make the precedent a poor analogy
  2. Analyse argumentative roles: Determine whether each distinction can be "emphasised" or "downplayed" by examining a hierarchical structure of legal concerns
  3. Identify significant distinctions: Synthesise the previous analyses to determine which distinctions truly matter

The framework uses a hierarchical structure where concrete factual elements (called "factors") provide evidence for abstract legal concerns, which in turn support higher-level legal issues. This multi-level abstraction mirrors how legal professionals actually reason about cases.
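To make the setup concrete, here is a minimal sketch in Python of the first subtask, identifying distinctions. The class and function names, the toy facts, and the simplified definitions are ours for illustration, not the authors' implementation; CATO's actual formalisation is considerably richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Factor:
    """A concrete factual element that favours one side and provides
    evidence for one or more abstract legal concerns."""
    name: str
    favours: str             # "plaintiff" or "defendant"
    supports: tuple = ()     # abstract legal concerns this factor evidences

def identify_distinctions(current, precedent, precedent_winner):
    """Task 1, roughly following CATO's notion of a distinction:
    - factors favouring the precedent's winner that appear only in the precedent
    - factors favouring the other side that appear only in the current case
    Either kind weakens the analogy to the precedent."""
    other = "defendant" if precedent_winner == "plaintiff" else "plaintiff"
    only_in_precedent = {f for f in precedent - current
                         if f.favours == precedent_winner}
    only_in_current = {f for f in current - precedent
                       if f.favours == other}
    return only_in_precedent | only_in_current

# Toy trade-secret-flavoured example (facts invented for illustration)
security = Factor("security-measures", "plaintiff",
                  ("efforts-to-maintain-secrecy",))
disclosure = Factor("disclosure-in-negotiations", "defendant",
                    ("secrecy-waived",))
current_case = {disclosure}
precedent = {security, disclosure}
print(identify_distinctions(current_case, precedent, "plaintiff"))
# -> the security-measures factor: the precedent's winner relied on
#    a fact the current case lacks
```

Even in this simplified form, the first subtask reduces to set comparisons over explicit facts, which helps explain why the models handled it easily, as the results below show.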

Performance Degradation Across Task Complexity

The researchers evaluated five reasoning models (GPT-5, Gemini 2.5 Pro, Gemini 2.5 Flash, Qwen3-thinking, and GPT-OSS-120b) along with a non-reasoning baseline model. Each was tested on 253 instances of moderate complexity.

The results reveal a significant pattern. All tested models achieved perfect accuracy (100%) on identifying surface-level distinctions. However, performance degraded sharply as complexity increased. On the hierarchical reasoning task, accuracy dropped to 65-92%. On the integrated analysis task requiring synthesis of all previous steps, accuracy collapsed to just 11-34%.

These are state-of-the-art reasoning models, yet on the most complex task (the one that actually integrates everything together) they reach the correct result only about a third of the time at best.

This performance pattern reveals a fundamental limitation: current LLMs excel at pattern recognition and simple comparative analysis, but struggle with the kind of multi-step, hierarchical reasoning that characterises expert legal analysis.

Why Mathematical Reasoning Falls Short for Legal Reasoning Tasks

Most reasoning models are post-trained with reinforcement learning on mathematical and logical reasoning tasks. This approach does provide benefits: the Qwen3-thinking model achieved 79% accuracy on Task 2, compared with just 30% for its non-thinking counterpart, a substantial improvement.

However, mathematical reasoning and legal reasoning require different cognitive structures. The hierarchical nature of legal reasoning presents unique challenges. Models must navigate multiple levels of abstraction, understanding how concrete facts provide evidence for intermediate concerns, which in turn support high-level legal issues. They must also apply complex rules about when evidence is "blocked" by countervailing factors or when a distinction can be "downplayed" by alternative evidence.
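The paper defines these blocking and downplaying rules over the full hierarchy. Setting blocking aside for brevity, a loose sketch of the downplay idea, reusing the illustrative Factor type from the earlier example (again, our simplification, not the authors' rules):

```python
def can_downplay(distinction, current_case):
    """Loosely: a distinction can be downplayed if the current case offers
    alternative evidence, i.e. another factor favouring the same side that
    supports at least one of the same abstract concerns."""
    return any(
        f.name != distinction.name
        and f.favours == distinction.favours
        and set(f.supports) & set(distinction.supports)
        for f in current_case
    )

def significant_distinctions(distinctions, current_case):
    """Task 3, heavily simplified: a distinction is significant when no
    alternative evidence in the current case can downplay it."""
    return {d for d in distinctions if not can_downplay(d, current_case)}
```

Even this stripped-down version requires chaining the earlier set comparison through the hierarchy before drawing a conclusion, and it is exactly this chaining step where the models' accuracy collapsed.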

The systematic failure on integrated reasoning tasks suggests that current post-training approaches, while effective for mathematical problems, do not adequately prepare models for the multi-step, hierarchical nature of legal argument. Training has been optimised for one type of reasoning without successfully generalising to legal domains.

Why These Limitations May Remain Hidden

For legal practitioners, a particularly important concern is how easily these limitations could remain hidden in practice. As the researchers note, when LLMs work with narrative legal text, especially in areas well-represented in their training data, they may appear to reason effectively by retrieving and adapting existing solutions. This is pattern matching, not genuine reasoning.

This can mask their underlying deficit in genuine hierarchical problem-solving. An AI legal assistant may appear capable on routine tasks, confidently generating analyses that seem sophisticated. However, when faced with novel situations requiring true hierarchical reasoning (the kind that defines expert legal analysis) the system lacks the necessary capabilities.

Managing Expectations: Where LLMs Excel

This study should not be interpreted as evidence that AI cannot assist with legal work. On the contrary, understanding these limitations allows legal professionals to deploy AI strategically where it delivers substantial benefits. The technology is evolving rapidly, and it has already demonstrated significant value for well-defined legal tasks.

Consider what the study actually demonstrates. Perfect accuracy on identifying distinctions (the first task) is a significant achievement in itself. Surface-level pattern recognition and comparative analysis is precisely where current LLMs deliver exceptional performance. Legal teams are realising substantial time, efficiency, consistency, and quality benefits on tasks such as:

  • Document summarisation: condensing lengthy legal documents into digestible summaries with consistent quality across large document sets
  • Chronology building: extracting and organising temporal information from case files with speed and accuracy that would be impractical manually
  • Translation and transcription: handling multilingual documents and converting audio to text with high accuracy and immediate turnaround
  • Information retrieval: finding relevant documents and clauses across large document sets far more quickly than manual review
  • Image recognition: extracting text and information from scanned documents with consistent accuracy

These capabilities represent genuine advances in legal practice efficiency. AI excels at processing large volumes of documents consistently, operating without fatigue, and maintaining quality across repetitive tasks. The time savings are substantial, freeing legal professionals to focus on higher-level reasoning and strategic analysis where human expertise remains essential.

We are also observing promising results in our own experiments with RAG-based systems for legal research and document analysis. The technology continues to improve, and staying current with these developments enables legal practices to gain competitive advantages in efficiency and service delivery.

Deployment Decisions: Making Informed Choices

The study provides the kind of rigorous evaluation that legal professionals need to make informed deployment decisions. Its structured methodology differs markedly from typical vendor benchmarks, which often rely on anecdotal evidence or simplified performance metrics. Rather than impressive demonstrations or vague claims, it reveals precisely where reasoning breaks down, giving a more reliable and nuanced picture of LLM capabilities and limitations.

This distinction matters for strategic deployment. Current LLMs deliver significant value for information processing, document handling, and routine analytical tasks, offering measurable improvements in efficiency, consistency, and cost-effectiveness. For these document-intensive processes, initial review tasks, and information extraction activities, AI offers compelling advantages. However, for tasks involving complex legal reasoning where accuracy is paramount, human expertise remains essential.

The question is not whether to adopt AI, but how to deploy it strategically to maximise benefit while managing risk appropriately. Understanding these research findings supports defensible decision-making and limits the risk of overreach, empowering legal professionals to leverage AI effectively while maintaining appropriate safeguards. This clarity is particularly valuable for organisations making decisions about AI adoption in legal contexts, helping them see where to invest in AI capabilities for immediate operational benefit and where to maintain caution.

Implications for Legal Practice

In practical terms, understanding these capability boundaries enables more effective workflow design. In evidence review, for instance, surface-level document classification may be sufficient for initial processing, so automation can streamline sorting and identification tasks. However, the nuanced legal reasoning necessary for assessing admissibility or establishing a legal threshold requires human expertise. This distinction between routine classification and complex legal analysis illustrates the productive division of labour that current AI capabilities support.

We understand how important it is for review teams to weigh proven capabilities against established limitations when shaping their evaluation processes and workflows. Through our expert advisory partnerships with clients, we remain mindful of their compliance and legal obligations, helping them navigate AI adoption in ways that deliver genuine operational benefits while maintaining appropriate risk management and professional standards.

The legal domain, with its explicit hierarchical structures and formal reasoning patterns, offers an ideal testing ground for understanding both AI capabilities and limitations. Studies like this one provide the grounded assessment needed while remaining open to future possibilities. Ongoing research and targeted training focused on legal-specific reasoning patterns will be key to evolving these capabilities. The authors suggest that specialised training for legal reasoning may help address current limitations, and continued development in this direction holds promise.

For now, current capabilities support viewing these systems as legal assistants rather than lawyers. While that may change as the technology develops, this productive division of labour enables legal professionals to leverage AI's strengths in information processing and routine analysis while maintaining human expertise for complex reasoning tasks.

The full paper, "Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning," is available on arXiv at https://arxiv.org/pdf/2510.08710. It provides detailed methodology for understanding what these systems can and cannot do, and where the practical opportunities exist for legal practice.