Figure 2 (page 6) screams problems. There are only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol
There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?
I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over
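To make the rater-variance concern concrete, here is a minimal sketch of why the number of professors matters more than the number of comparisons. The per-professor win rates below are made up for illustration and are NOT taken from the paper; the 16 raters and roughly 3k comparisons each are the figures guessed at above.

```python
import math
import statistics

def naive_se(win_rate: float, n_comparisons: int) -> float:
    """Standard error if every single comparison were independent."""
    return math.sqrt(win_rate * (1 - win_rate) / n_comparisons)

def clustered_se(per_rater_rates: list) -> float:
    """Standard error of the mean win rate, treating each rater as one cluster."""
    return statistics.stdev(per_rater_rates) / math.sqrt(len(per_rater_rates))

# HYPOTHETICAL per-professor win rates (illustrative only, not from the paper).
rates = [0.95, 0.90, 0.88, 0.85, 0.85, 0.82, 0.80, 0.78,
         0.75, 0.72, 0.70, 0.68, 0.65, 0.60, 0.55, 0.30]

mean_rate = statistics.mean(rates)
n_total = len(rates) * 3000  # assumed: about 3k comparisons per professor

print(f"mean win rate:        {mean_rate:.3f}")
print(f"naive SE (n={n_total}): {naive_se(mean_rate, n_total):.4f}")
print(f"clustered SE (n=16):  {clustered_se(rates):.4f}")
```

If the professors disagree this much, the uncertainty is driven by the 16 raters, not by the tens of thousands of comparisons, so the naive error bar would be far too tight.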
I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, e.g. drafting a will, but I don't really know; our legal system is notoriously rife with footguns.
In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.
The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.
Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.
On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.
There are certain areas of legal work that are about analyzing large amounts of text, drawing conclusions and writing other texts based on that and nothing more. That is literally the bread and butter of LLMs.
Those types of lawyers should be the first in line for unemployment, not programmers, not even close.
This is a pretty limited introductory course based on what it says in the methods of the paper itself.
The quality of LLMs depends heavily on, among other things, how you word your questions.
Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.
That's the entire point, though!
The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't reveal much new information.
If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.
Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.
Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.
But it makes me wonder: will clients be able to use these AI-attorney systems in court in the future? They would basically either just parrot what the model is instructing them to do, or - I dunno - give the model permission to speak for them (while waiving liabilities).
I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.
I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...
Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.
It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.
THEN I find a human lawyer, give the AI's answers to them and say "Can you find any errors in this? Can you improve it?"
That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?
That's the problem, you never know when the 25% deliver a true stink bomb, and that's not considering prompting: while a fair prompt/question may be considered objective, it's very easy to stray.
If you think about it and extract the semantics of any law, you get something that looks familiar, sort of like code. Of course there are some complexities where certain phrases can mean different things, but legal papers are in a way already written like programming languages, especially when it comes to law.
First we would have to define a language that can handle ambiguous operations, and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two party dispute:
This is a very simplified, pseudocode-like language; writing out a full contract would be as long as a real contract.
DEFINE DEFENDANT "A Corp"
DEFINE PLAINTIFF "B Corp"
DEFINE CONTRACT CONTRACT(PLAINTIFF, DEFENDANT, 3054-41-95)
// attaching extracted requirements, definitions and obligations of contract
FACT PLAINTIFF delivered(goods) ON 7054-34-99
FACT DEFENDANT paid(0) OF CONTRACT.amount
CLAIM breach WHEN obligation(DEFENDANT, "pay") IS NOT satisfied
PROVE breach:
  REQUIRE PLAINTIFF performed
  REQUIRE DEFENDANT.paid < CONTRACT.amount
  ASSERT delay WITHIN reasonable(time)
IF PROVE(breach):
  AWARD PLAINTIFF (CONTRACT.amount - DEFENDANT.paid) + interest()
ELSE:
  DISMISS
Then you would run a proof-based LLM to generate it into the target language, and since we already had an example of this from one of the AI labs, we know it works. Citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence. I am sure that many AI labs are working on something similar already and we will see something like that in the near future as proof-based LLMs evolve.
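For readers who want to poke at the idea, the PROVE/AWARD rules above can be sketched in ordinary Python. This is only an illustration under assumptions: the `Contract` class, the function names, the `interest` parameter and the amount 1000.0 are all hypothetical, and the delay check is left out because the pseudo-language leaves its meaning open.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Contract:
    # Hypothetical container for the parties and the amount owed.
    plaintiff: str
    defendant: str
    amount: float

def prove_breach(contract: Contract, plaintiff_performed: bool,
                 defendant_paid: float) -> bool:
    """Mirrors the two REQUIRE lines: plaintiff performed, defendant underpaid."""
    return plaintiff_performed and defendant_paid < contract.amount

def judge(contract: Contract, plaintiff_performed: bool,
          defendant_paid: float, interest: float = 0.0) -> Optional[float]:
    """Returns the award if breach is proven, or None for DISMISS."""
    if prove_breach(contract, plaintiff_performed, defendant_paid):
        return (contract.amount - defendant_paid) + interest
    return None

# Illustrative values only; the amount is an assumption, not from the example.
contract = Contract(plaintiff="B Corp", defendant="A Corp", amount=1000.0)
award = judge(contract, plaintiff_performed=True, defendant_paid=0.0)
dismissed = judge(contract, plaintiff_performed=False, defendant_paid=0.0)
print(award, dismissed)
```

The hard part is obviously not this function but deciding whether "performed" is true, which is where the ambiguity of real disputes lives.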
Attorneys will be using LLMs for convenience but they will not disappear, because there ultimately needs to be a human responsible for the decisions.
Reading it makes me extremely suspicious of how cherry-picked this was
My experience then (this was back before "Attention Is All You Need", I hadn't met the output of generative models) was that students tended to produce work that did not have a proper thread of reasoning in it. There was a tendency to repeat things they had read but rehashed in various ways.
Reviewing some of their texts it was clear that much of the writing - by law tutors - was of the same kind. Much was incorrect. The fact that someone at some time had said a particular case was a proposition for something, meant that got repeated from book to book. Many authors simply didn't read their sources or check their references. Students repeated what they had been told incuriously.
Note: this was a graduate-level course. Not wet-behind-the-ears undergraduates.
The worst material was little potted notes produced for law students. Utterly awful material in most cases.
Anyway, when LLMs became a thing, a lot of what did not feel right about their output, and many of their error patterns, reminded me of the experience of teaching master's students.
One of the saving graces of English court room practice (when I did that sort of thing) was that judges would say to you "where does it say that?" in a case you cited. You had better have them all at your fingertips and know exactly where you had cited. That avoided a lot of hallucination.
Just a random remark which might be of interest.
But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.
When AI clears the knowledge bar in a domain, the remaining moat becomes trust, accountability, and local regulatory context. That's actually good news for niche SaaS builders targeting specific jurisdictions: the generic AI layer commoditizes, but the "AI + local compliance + human accountability" bundle still has real pricing power.
Curious whether anyone has seen this play out already in contract review or compliance tooling outside the US.
75% win rate seems pretty good!
Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...
My understanding is that Civil Law (most of the world excluding UK, US, AU) is like a program: you feed it a situation, it outputs a decision, every once in a while you edit it.
Common Law (UK, US) isn't really a program, but you could stretch and say it's a state machine that has been running since the country started. Every interaction sets a new precedent and changes the state. But the programming analogy falls apart because no one in their right mind would design such a program.
LLMs might actually be the best example of such a program though: Common Law is basically one long chat with an LLM, hundreds of years long.
Before LLMs came along, a Common Law system seemed to have a finite time limit before it gets co-opted by wealthy people with the resources to read the whole history. Now I think maybe we can push it a bit further.
But it's still a terrible program.
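A toy sketch of the analogy, with the caveat that both models are deliberately crude and every name here is made up: Civil Law as a pure lookup against a statute, Common Law as an object whose state grows with every new ruling.

```python
from dataclasses import dataclass, field
from typing import Dict

def civil_law_decide(statute: Dict[str, str], situation: str) -> str:
    """Pure function: the same statute and situation always give the same decision."""
    return statute.get(situation, "no rule applies")

@dataclass
class CommonLawCourt:
    """State machine: each novel case is stored and binds later decisions."""
    precedents: Dict[str, str] = field(default_factory=dict)

    def decide(self, situation: str, ruling_if_novel: str) -> str:
        if situation in self.precedents:
            return self.precedents[situation]  # follow existing precedent
        self.precedents[situation] = ruling_if_novel  # set a new precedent
        return ruling_if_novel

# Hypothetical situations and rulings, purely for illustration.
statute = {"late delivery": "seller pays damages"}
court = CommonLawCourt()
first = court.decide("late delivery", "seller pays damages")
second = court.decide("late delivery", "buyer bears the loss")
print(civil_law_decide(statute, "late delivery"), first, second)
```

The second call ignores the ruling it was handed because the earlier case already changed the state, which is the "one long chat" behaviour described above.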
Julian Nyarko
Professor of Law
Co-Chair Stanford Law AI Initiative
Senior Fellow, Stanford Institute for Human-Centered AI (HAI)
LOL! NotebookLM was considered slightly better than 2.5 Pro by the evaluators.
So no wonder on this point.
One thing I want to mention: Law != Justice.
So while LLMs are awesome at the study of law, they will suck at justice, simply because one has to solve very emotional problems with it at times. And LLMs are not that good at finding the correct emotion.
By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.
The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.
I think, in the right hands, this could be huge.
I mean, LLMs do OK with tutoring, but it depends more on how unique the questions are, not how difficult they are.
Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.
Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.
Making people believe that the 14 year old girl is a slut that was raping your poor client- THAT is lawyering.
Just massive data where you either do calculations or interpretation.
You will replace 100 lawyers with AI and have a single lawyer to review what the AI outputs and stamp their name on it for accountability.
Recently, I tasked Opus 4.6 to study a new Czech building permit law in conjunction with some waste disposal regulations and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the full text of the new law. The usual "you are totally right" also applied, and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.
I ended up studying the relevant regulations myself over the weekend.
Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.
https://fortune.com/article/rise-in-elite-students-seeking-a...
and where they wanted to ban words such as "chief", "stupid", "karen" and "American"
https://reason.com/2022/12/21/stanford-elimination-harmful-l...
I'm getting more convinced. I mean, sure, it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obviously don't want this behavior, but I wouldn't say it's dumb.
It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and rights of the AI if it passes the Turing test, and now we just have AI bots running 24/7 writing slop.
How does everyone else feel?