Law is ready for AI, but is AI ready for law?
Acknowledgements
This op-ed arose from the workshop on Automation in Legal Decision-Making: Regulation, Technology, Fairness organised on 14 November 2024 at the University of Copenhagen under the aegis of the LEGALESE project. The workshop aimed to inspire critical thinking about the growing role of automation in legal decision-making and its implications for fairness and explainability. Participants from multiple disciplines were invited to contribute and discuss blog posts reflecting on a range of pressing themes, including the role of human oversight, the challenges of explainability, and the importance of interdisciplinary dialogue for effective and responsible innovation. By engaging with both cautionary tales and constructive approaches, the event fostered a space for discussion on how legal, technical, and societal perspectives can come together to shape ‘fair’ ADM practices.
Funding: The author received funding from the Innovation Fund Denmark, grant no 0175-00011A.
Artificial intelligence is rapidly entering high-stakes fields, including the legal domain. The promise? Models that perform various tasks, such as predicting court decisions and automating legal reasoning, supposedly making legal processes faster, more transparent, and more consistent. Among the many applications of legal AI, legal judgment prediction (LJP) models have drawn particular attention, especially in recent years. Since the advent of large language models (LLMs), research on this topic has exploded. Yet much of this work amounts to a significant waste of resources. In previous work with McBride, Legal Judgment Prediction: If You Are Going to Do It, Do It Right (2023), we highlighted how most LJP models miss the mark: many fail to add real value and squander resources better spent elsewhere. This op-ed turns to an extension of prediction models, namely those that claim to offer legal reasoning alongside their predictions. These “explainable” models suffer from the same problem: many claim to provide transparency but fall short, offering explanations too thin to meet the interpretability needs of legal practice. This critique applies not only to LJP but also to other systems that may use predictive modelling to support (legal) decision-making, including those involved in algorithmic content moderation, automated notice-and-action mechanisms, or risk assessments under the Digital Services Act.
Most academic papers today take one of two general approaches to explainable legal prediction with machine learning. The first type predicts outcomes and then provides an explanation of how the model arrived at that decision. Malik et al. (2021) propose a Court Judgment Prediction and Explanation (CJPE) task, in which their model predicts case outcomes and highlights sentences as explanations. While the Indian Legal Documents Corpus (ILDC) dataset they use is valuable, it has a significant flaw: it contains facts available only after the decision, so these models are not actually predicting outcomes. We discussed this limitation in Medvedeva and McBride (2023). Other works, such as Ye et al. (2018) and Wu et al. (2022), share this issue: models that rely on post-decision facts to generate explanations ultimately provide explanations for non-predictions. Though this approach might eventually inform models that use only facts available before a decision, few researchers are trying to develop such forward-looking models.
Valvoda and Cotterell (2024) take a similar approach, but focus on identifying the precedents AI models rely on versus those prioritised by judges. While this offers some insight into how AI might mimic – or fail to mimic – human precedent use, it too relies on post-decision facts and doesn’t actually explain new predictions. Such models might show patterns, but they still fall short of producing meaningful explanations.
A second approach attempts to generate “legal reasoning” that mirrors judicial explanations. Jiang and Yang (2023) proposed Legal Syllogism Prompting (LoT), which guides models through a syllogistic reasoning process. Machine-learning-driven methods like these are relatively new in law, succeeding symbolic AI approaches such as rule-based systems and argumentation frameworks (Sergot et al., 1986; Ashley and Rissland, 1988), which struggled with challenges like open-textured legal rules and conflicting norms. Yet ML-driven models are far from generating true legal reasoning. Tested on the CAIL2018 dataset (Xiao et al., 2018), the system presented in Jiang and Yang (2023) showed some performance gains, but real legal reasoning is not simply syllogistic. Legal cases are complex, requiring statutory interpretation, ethical balancing, and more, which syllogisms cannot capture. Such a framework may work on a narrow set of cases but lacks the flexibility needed for the broad range of real legal cases. Worse, it can produce reasoning that appears rigorous without genuinely adding legal depth. Other prompting techniques, such as “legal reasoning prompting” and chain-of-thought prompting (Wei et al., 2022), also aim to structure responses like judicial logic. At times these outputs are indistinguishable in fluency from human-generated text, but they do not meet the standards of legal reasoning that professionals require.
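To make the critique concrete, the sketch below shows what a syllogism-style prompt might look like in practice. It is a minimal, hypothetical illustration: the template, function name, and example inputs are my own, not the prompt used by Jiang and Yang (2023).

```python
# Hypothetical sketch of a syllogism-style prompt for legal judgment prediction.
# The template and examples are illustrative, not the LoT prompt from Jiang and Yang (2023).

SYLLOGISM_TEMPLATE = """You are assisting with legal analysis.
Major premise (the applicable rule):
{rule}

Minor premise (the facts of the case):
{facts}

Reason step by step from the rule to the facts, then state the conclusion
(the legal outcome) in one sentence."""


def build_syllogism_prompt(rule: str, facts: str) -> str:
    """Fill the syllogistic template with a statutory rule and case facts."""
    return SYLLOGISM_TEMPLATE.format(rule=rule.strip(), facts=facts.strip())


if __name__ == "__main__":
    rule = "Whoever takes property of another without consent is liable for theft (hypothetical provision)."
    facts = "The defendant took a bicycle from a shared courtyard without the owner's consent."
    print(build_syllogism_prompt(rule, facts))
    # The resulting string would be sent to an LLM. The point made in this op-ed is that
    # forcing output into a rule-facts-conclusion shape does not, by itself,
    # guarantee that the generated reasoning is legally sound.
```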
Moreover, these techniques have rarely, if ever, been evaluated for legal soundness, leaving us with models that may look coherent but offer no assurance of legal validity or accuracy. Without rigorous evaluation, these outputs risk being persuasive but legally empty.
Evaluation is a major problem for these models. Metrics like ROUGE (Lin, 2004), used in some of these papers (e.g. Jiang and Yang, 2023), reward surface similarity but fail to assess reasoning quality, coherence, or legal soundness. “Reasoning steps” generated with chain-of-thought prompting are frequently evaluated only for word overlap, missing the substantive quality of each step. Manual evaluation by legal experts can add rigour but is costly, subjective, and impractical at scale.
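A small example illustrates the problem. Assuming the open-source rouge-score package (an implementation of Lin’s metric), two explanations that reach opposite legal conclusions can still score almost identically against the same reference, because the metric only counts word overlap:

```python
# Requires: pip install rouge-score
# Illustrative only: ROUGE rewards lexical overlap, not legal soundness.
from rouge_score import rouge_scorer

reference = (
    "The court finds that the contract is void because the defendant "
    "lacked capacity to consent."
)
# Nearly identical wording, opposite legal conclusion.
candidate = (
    "The court finds that the contract is not void because the defendant "
    "lacked capacity to consent."
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.2f}")
# Both ROUGE-1 and ROUGE-L F1 come out close to 1.0, even though the candidate
# reverses the holding; word overlap says nothing about legal validity.
```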
If research on legal prediction models continues on this path, the impacts could be considerable and problematic. Academically, this trend risks promoting technically impressive but substantively weak models, skewing research toward novelty at the expense of real-world relevance. In practice, these models could mislead legal professionals into trusting AI-driven explanations that offer no genuine interpretability. Superficial explanations could, in turn, mean judgments that compromise fairness.
Addressing these challenges is no easy task. Training models with “legal soundness” as a core criterion could help, but it’s a costly, labour-intensive process requiring legal expertise. Still, the lack of an ideal metric doesn’t mean we should settle for metrics that fail to capture what matters. What’s needed is a way to assess not only clarity but also the logical, ethical, and legal soundness that defines legal decision-making. The way forward isn’t about pushing models to simply “sound” legal; it’s about creating tools that can genuinely incorporate the complexities of legal reasoning. Practical relevance and interpretability should be the guiding principles, not just to avoid wasted resources, but to create models that can truly support the demands of legal systems. Instead of aiming for grand, all-encompassing systems, we should focus on small, incremental steps grounded in rigorous science, building models that actually do what they claim.