Benchmarking legal work readiness

A simple explanation of how Lexi estimates first-run legal task performance today, and how attorney-scored benchmark logs will replace estimates over time.

Estimated legal task performance score, out of 100
Claude Opus 4.8: 43
ChatGPT 5.5: 41
Gemini Pro 3.1: 36

Lexi's 72/100 is its estimated first-run legal task performance score. The separate automation-readiness estimate applies the same rubric to eligible firm workflows.

Introduction

A lawyer rarely asks for useful work in a vacuum. The task is tied to a client, a docket, a jurisdiction, a deadline, an evidentiary record, a firm template, and an attorney's judgment about risk. A general-purpose AI model can reason well, but it normally starts by asking the lawyer to reconstruct that working context.

Lexi was built around the opposite assumption. The agent should start with the matter context already assembled: source documents, firm memory, templates, workflow rules, and approval gates. The benchmark should measure that practical advantage directly.

Cold starting a legal agent

Early product feedback is useful, but it can be misleading for legal agents. A demo matter might make an agent look strong because the user has already explained the file. A one-off prompt might make a general model look weak because it was denied the documents any lawyer would normally review.

That is why the benchmark has to separate two questions: how good is the model at legal reasoning, and how much legal work context does the product provide before the first prompt? The current readiness estimate gives general models credit for reasoning and follow-up questions. Lexi's additional credit comes from starting with the matter.

Building legal work environments

The benchmark object is a legal work environment: a synthetic but realistic law-firm operating context containing the materials a lawyer would need to do the work. It is more than a prompt and more than a case file. It is the working context around the assignment.

A legal work environment should include:

Source documents

Pleadings, orders, correspondence, contracts, transcripts, exhibits, or intake notes.

Procedural context

Jurisdiction, posture, deadlines, docket history, and local rule constraints.

Firm context

Preferred template, tone, citation style, review policy, and escalation rules.

Task context

What output is needed, where it goes next, and what attorney approval is required.

This mirrors how legal work actually happens. A motion, demand letter, chronology, or risk memo is only useful if it is grounded in the correct matter file and firm process.
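The four components above can be sketched as a small data structure. This is a minimal illustration, not Lexi's actual schema: the class name, field names, and the `missing_components` helper are all hypothetical, and the sample values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class LegalWorkEnvironment:
    """Hypothetical container for one synthetic law-firm operating context."""

    source_documents: list[str] = field(default_factory=list)        # pleadings, orders, exhibits
    procedural_context: dict[str, str] = field(default_factory=dict)  # jurisdiction, posture, deadlines
    firm_context: dict[str, str] = field(default_factory=dict)        # template, tone, review policy
    task_context: dict[str, str] = field(default_factory=dict)        # output needed, approval required

    def missing_components(self) -> list[str]:
        """Return the names of any of the four components that are still empty."""
        parts = {
            "source_documents": self.source_documents,
            "procedural_context": self.procedural_context,
            "firm_context": self.firm_context,
            "task_context": self.task_context,
        }
        return [name for name, value in parts.items() if not value]


# Placeholder environment with the task context deliberately left out.
env = LegalWorkEnvironment(
    source_documents=["complaint.pdf", "scheduling_order.pdf"],
    procedural_context={"jurisdiction": "synthetic state court", "posture": "pre-hearing"},
    firm_context={"template": "firm motion template"},
)
print(env.missing_components())  # ['task_context']
```

A completeness check like this is one way to confirm an environment is ready before any system is run against it.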

Tasks and grading criteria

Each environment contains tasks. A task is a concrete legal job to be done inside that context. The output is graded against attorney-written criteria instead of vague impressions.

Example task: Draft a hearing-ready motion using the pleadings, docket history, client facts, local rule, firm motion template, attorney preferences, and source citations.

Example grading criteria for that task:

  • Uses the correct procedural posture and jurisdiction-specific legal standard.
  • Pulls facts from the actual source packet rather than inventing or assuming facts.
  • Follows the firm's motion structure, tone, caption style, and citation format.
  • Flags missing facts, evidentiary weakness, deadline risk, and attorney approval points.

Task family | Representative task | Why blank chat struggles
Drafting | Motion, demand letter, client update, discovery response | Needs facts, posture, template, source cites, and attorney risk preferences.
Review | Contract issue list, pleading review, order summary | Needs the document set, review standard, jurisdiction, and next workflow step.
Fact extraction | Timeline, evidence map, party/event chronology | Needs source documents and a way to preserve the audit trail.
Risk spotting | Deadline, privilege, sanctions, evidentiary, or client-position risk | Needs matter-specific facts and firm escalation rules.
Workflow completion | Route draft to attorney, update matter notes, prepare follow-up task | Needs product context and firm workflow permissions.

Current scoring estimate

Lexi uses two linked estimates: a legal task performance score and an automation-readiness estimate. The homepage chart shows the first: how well each system is expected to do on first-run legal work before extra lawyer setup.

General models receive meaningful credit for reasoning and useful follow-up questions. Lexi receives additional credit for matter context, source trails, templates, workflow actions, and review gates. The score discounts that setup advantage so the public number stays tied to useful first work product, not just product architecture.

Category | Max | Lexi | Claude Opus 4.8 | ChatGPT 5.5 | Gemini Pro 3.1
Legal reasoning and drafting ability | 25 | 22 | 23 | 22 | 20
Useful follow-up questions | 15 | 12 | 13 | 12 | 10
Matter-grounded task execution | 20 | 14 | 0 | 0 | 0
Source-backed output and audit trail | 15 | 10 | 1 | 1 | 1
Template and workflow completion | 15 | 9 | 1 | 1 | 1
Review controls and risk flags | 10 | 5 | 5 | 5 | 4
Estimated first-run performance score | 100 | 72 | 43 | 41 | 36
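The totals in the last row are plain sums of the category rows, which can be checked with a short script. The figures below are copied from the rubric; the names `RUBRIC`, `SYSTEMS`, `CONTEXT_CATEGORIES`, and `column_totals` are illustrative, and grouping three rows as "context" categories is an assumption made here for the example.

```python
# Rubric rows: category -> (Max, Lexi, Claude Opus 4.8, ChatGPT 5.5, Gemini Pro 3.1)
RUBRIC = {
    "Legal reasoning and drafting ability": (25, 22, 23, 22, 20),
    "Useful follow-up questions": (15, 12, 13, 12, 10),
    "Matter-grounded task execution": (20, 14, 0, 0, 0),
    "Source-backed output and audit trail": (15, 10, 1, 1, 1),
    "Template and workflow completion": (15, 9, 1, 1, 1),
    "Review controls and risk flags": (10, 5, 5, 5, 4),
}
SYSTEMS = ("Max", "Lexi", "Claude Opus 4.8", "ChatGPT 5.5", "Gemini Pro 3.1")

# Assumed grouping: the rows that depend on matter context rather than model ability.
CONTEXT_CATEGORIES = (
    "Matter-grounded task execution",
    "Source-backed output and audit trail",
    "Template and workflow completion",
)


def column_totals(rubric, categories=None):
    """Sum each column over the chosen categories (all categories by default)."""
    chosen = rubric.keys() if categories is None else categories
    totals = [0] * len(SYSTEMS)
    for category in chosen:
        for i, points in enumerate(rubric[category]):
            totals[i] += points
    return dict(zip(SYSTEMS, totals))


totals = column_totals(RUBRIC)
context = column_totals(RUBRIC, CONTEXT_CATEGORIES)
print(totals)   # {'Max': 100, 'Lexi': 72, 'Claude Opus 4.8': 43, 'ChatGPT 5.5': 41, 'Gemini Pro 3.1': 36}
print(context)  # {'Max': 50, 'Lexi': 33, 'Claude Opus 4.8': 2, 'ChatGPT 5.5': 2, 'Gemini Pro 3.1': 2}
```

Under that grouping, the context rows account for 33 of Lexi's 72 points and 2 points for each general model, while the general models match or slightly exceed Lexi on the reasoning and follow-up rows.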

How to read the estimate

Lexi's score of 72 is not a claim that its underlying model is twice as intelligent as a frontier model. It is a claim about first-run legal work product. Legal work requires the right documents, the right process, and the right review path. Without that context, a general model must stop and ask for setup.

The automation-readiness estimate is separate. It measures where Lexi appears to have enough workflow fit to assist on eligible legal work while attorneys remain in control of review, strategy, and judgment.

How the estimate becomes a benchmark

The formal benchmark should work like a product development loop, not a leaderboard. The goal is to learn which parts of the legal agent help: matter context, source retrieval, templates, firm memory, attorney approval routing, and workflow actions.

01. Build legal work environments

Synthetic law-firm operating contexts across drafting, review, fact extraction, risk spotting, and workflow completion.

02. Write attorney criteria

Each task gets concrete pass/fail criteria written by attorneys before systems are scored.

03. Run identical scenarios

Lexi runs with its product context; general tools run from the same first instruction without manual setup.

04. Publish the receipt

Show exact model names, dates, prompts, outputs, scoring criteria, limitations, and raw logs.
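One way to picture a single receipt entry is as a record that carries every item listed in the last step and serializes to JSON for publication. This is a sketch under stated assumptions: the `ReceiptEntry` class, its field names, and the `to_json` helper are hypothetical, and every value in the example is a placeholder rather than a real run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReceiptEntry:
    """Hypothetical record for one scored run of one system on one task."""

    model_name: str              # exact model name
    run_date: str                # date of the run
    prompt: str                  # the first instruction every system received
    output: str                  # the system's first output
    criteria: list[str]          # attorney-written pass/fail criteria
    criteria_passed: list[bool]  # one result per criterion, in the same order
    limitations: str             # known caveats for this run
    raw_log_path: str            # where the raw log is stored


def to_json(entry: ReceiptEntry) -> str:
    """Serialize an entry, refusing records whose results do not match the criteria."""
    if len(entry.criteria) != len(entry.criteria_passed):
        raise ValueError("each criterion needs exactly one pass/fail result")
    return json.dumps(asdict(entry), indent=2, sort_keys=True)


# Placeholder entry; none of these values come from an actual benchmark run.
entry = ReceiptEntry(
    model_name="example-model",
    run_date="YYYY-MM-DD",
    prompt="Draft a hearing-ready motion.",
    output="(first output text)",
    criteria=["Uses the correct procedural posture", "Pulls facts from the source packet"],
    criteria_passed=[True, False],
    limitations="Synthetic matter; single reviewer.",
    raw_log_path="logs/example-run.jsonl",
)
print(to_json(entry))
```

Keeping criteria and results side by side in the same record lets a reader check each pass/fail call against the output without a separate scoring sheet.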

What we will report

The benchmark receipt should make the estimate auditable. At minimum, the report should include:

  • Overall criteria satisfaction across all legal tasks.
  • Attorney-ready first draft rate: how often the first output needs only review rather than a full rewrite.
  • Task-family breakdown for drafting, review, extraction, risk spotting, and workflow completion.
  • Exact model names and dates, plus the conditions each system received.
  • Representative examples showing where Lexi's context helped and where it still failed.

Public metric | What it answers | Public status
Readiness index | How prepared is the system before manual setup? | Published as estimate
Criteria satisfaction | What percentage of attorney criteria did the output satisfy? | Benchmark receipt
Attorney-ready first draft rate | How often did the first output avoid a full rewrite? | Benchmark receipt
Task-family scores | Where does Lexi help most, and where does it still need work? | Benchmark receipt
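The three receipt metrics could be computed from per-task pass/fail results along the following lines. This is a sketch, not the published method: the `TaskResult` type and function names are hypothetical, pooling all criteria before dividing (rather than averaging per task) is an assumption, and the sample results are invented purely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical scored result for one task."""

    task_family: str             # e.g. "drafting", "review"
    criteria_passed: list[bool]  # one pass/fail per attorney-written criterion
    needed_full_rewrite: bool    # reviewer's call on the first output


def criteria_satisfaction(results):
    """Share of all criteria satisfied, pooled across tasks (assumed aggregation)."""
    passed = sum(sum(r.criteria_passed) for r in results)
    total = sum(len(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def attorney_ready_rate(results):
    """Share of tasks whose first output avoided a full rewrite."""
    if not results:
        return 0.0
    return sum(not r.needed_full_rewrite for r in results) / len(results)


def family_scores(results):
    """Criteria satisfaction broken down by task family."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.task_family].append(r)
    return {family: criteria_satisfaction(rs) for family, rs in grouped.items()}


# Invented sample results, for illustration only.
results = [
    TaskResult("drafting", [True, True, False, True], needed_full_rewrite=False),
    TaskResult("drafting", [True, False, False, True], needed_full_rewrite=True),
    TaskResult("review", [True, True], needed_full_rewrite=False),
]
print(criteria_satisfaction(results))  # 0.7
print(family_scores(results))          # {'drafting': 0.625, 'review': 1.0}
```

Pooling criteria weights tasks with more criteria more heavily; a per-task average would be an equally defensible choice and should be stated in the receipt either way.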

What the estimate is saying

The score is not a model-IQ claim. It measures the practical advantage of starting with matter files, source documents, deadlines, templates, firm preferences, and attorney review rules already in place.

That setup advantage is what the benchmark should test: whether Lexi's first output is easier to review, easier to trust, and faster to revise than a general model given the same first instruction.

Claim | Status today | What backs it up
Lexi starts with more legal operating context. | Supported by rubric | Rubric categories for matter context, source access, workflow memory, legal-task framing, and review controls.
Lexi should require less setup before useful legal work begins. | Benchmark claim | Compare first-run outputs under the same instruction, with no manual file upload or prompt engineering for blank chats.
Lexi produces better first legal work product. | Benchmark claim | Attorney-scored tasks with prompts, outputs, model names, dates, scoring sheets, and reviewer notes published together.
Lexi replaces attorney judgment. | Not the claim | The benchmark should measure better drafts, clearer source trails, and lower setup burden while preserving attorney review.

The benchmark turns readiness into proof.

The public benchmark should prove the practical question attorneys care about: does this context advantage produce first drafts that are easier to review, easier to trust, and faster to revise? The receipt should be simple: task set, grading rubric, raw outputs, exact model and date conditions, and attorney reviewer notes.
