Introduction

A lawyer rarely asks for useful work in a vacuum. The task is tied to a client, a docket, a jurisdiction, a deadline, an evidentiary record, a firm template, and an attorney's judgment about risk. A general-purpose AI model can reason well, but it normally starts by asking the lawyer to reconstruct that working context.

Lexi was built around the opposite assumption. The agent should start with the matter context already assembled: source documents, firm memory, templates, workflow rules, and approval gates. The benchmark should measure that practical advantage directly.

Cold starting a legal agent

Early product feedback is useful, but it can be misleading for legal agents. A demo matter might make an agent look strong because the user has already explained the file. A one-off prompt might make a general model look weak because it was denied the documents any lawyer would normally review.

That is why the benchmark has to separate two questions: how good is the model at legal reasoning, and how much legal work context does the product provide before the first prompt? The current readiness estimate gives general models credit for reasoning and follow-up questions. Lexi's additional credit comes from starting with the matter context already assembled.

Building legal work environments

The benchmark object is a legal work environment: a synthetic but realistic law-firm operating context containing the materials a lawyer would need to do the work. It is more than a prompt and more than a case file. It is the working context around the assignment.

A legal work environment should include:

Source documents

Pleadings, orders, correspondence, contracts, transcripts, exhibits, or intake notes.

Procedural context

Jurisdiction, posture, deadlines, docket history, and local rule constraints.

Firm context

Preferred template, tone, citation style, review policy, and escalation rules.

Task context

What output is needed, where it goes next, and what attorney approval is required.

This mirrors how legal work actually happens. A motion, demand letter, chronology, or risk memo is only useful if it is grounded in the correct matter file and firm process.
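
As a rough illustration, the four components listed above could be held in a single record. This is a minimal sketch, not Lexi's actual data model: the class name `LegalWorkEnvironment`, its field names, the `missing_components` helper, and the file names are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field, fields


@dataclass
class LegalWorkEnvironment:
    """Hypothetical record for one synthetic law-firm operating context."""

    # Pleadings, orders, correspondence, contracts, transcripts, exhibits, intake notes.
    source_documents: list[str] = field(default_factory=list)
    # Jurisdiction, posture, deadlines, docket history, local rule constraints.
    procedural_context: dict[str, str] = field(default_factory=dict)
    # Preferred template, tone, citation style, review policy, escalation rules.
    firm_context: dict[str, str] = field(default_factory=dict)
    # Needed output, where it goes next, required attorney approval.
    task_context: dict[str, str] = field(default_factory=dict)

    def missing_components(self) -> list[str]:
        """Return the names of components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values for illustration only; not taken from a real matter.
env = LegalWorkEnvironment(
    source_documents=["complaint.pdf", "scheduling_order.pdf"],
    procedural_context={"jurisdiction": "placeholder", "posture": "placeholder"},
)
print(env.missing_components())  # ['firm_context', 'task_context']
```

A check like `missing_components` is one way to confirm that an environment is complete before any system is scored against it.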

Tasks and grading criteria

Each environment contains tasks. A task is a concrete legal job to be done inside that context. The output is graded against attorney-written criteria instead of vague impressions.

Example task: Draft a hearing-ready motion using the pleadings, docket history, client facts, local rule, firm motion template, attorney preferences, and source citations.

Example grading criteria for that task:

Uses the correct procedural posture and jurisdiction-specific legal standard.
Pulls facts from the actual source packet rather than inventing or assuming facts.
Follows the firm's motion structure, tone, caption style, and citation format.
Flags missing facts, evidentiary weakness, deadline risk, and attorney approval points.
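
One way to make criteria like these gradable is to record each as a pass/fail check and report the share that passed. The sketch below assumes that representation; the `Criterion` class, the `criteria_satisfaction` function, and the pass/fail values are hypothetical and do not come from a real scoring run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One attorney-written grading criterion and its pass/fail result."""

    description: str
    passed: bool


def criteria_satisfaction(criteria: list[Criterion]) -> float:
    """Return the fraction of criteria the output satisfied."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    return sum(c.passed for c in criteria) / len(criteria)


# Pass/fail values are invented for illustration, not real results.
motion_criteria = [
    Criterion("Correct procedural posture and jurisdiction-specific standard", True),
    Criterion("Facts pulled from the source packet, none invented", True),
    Criterion("Firm motion structure, tone, caption style, citation format", True),
    Criterion("Flags missing facts, weak evidence, deadline risk, approval points", False),
]
print(criteria_satisfaction(motion_criteria))  # 0.75
```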

Task family	Representative task	Why blank chat struggles
Drafting	Motion, demand letter, client update, discovery response	Needs facts, posture, template, source cites, and attorney risk preferences.
Review	Contract issue list, pleading review, order summary	Needs the document set, review standard, jurisdiction, and next workflow step.
Fact extraction	Timeline, evidence map, party/event chronology	Needs source documents and a way to preserve the audit trail.
Risk spotting	Deadline, privilege, sanctions, evidentiary, or client-position risk	Needs matter-specific facts and firm escalation rules.
Workflow completion	Route draft to attorney, update matter notes, prepare follow-up task	Needs product context and firm workflow permissions.

Current scoring estimate

Lexi uses two linked estimates: a first-run performance score and a separate automation-readiness estimate. The homepage chart shows the first of these, a legal task performance score: how well each system is expected to do on first-run legal work before extra lawyer setup.

General models receive meaningful credit for reasoning and useful follow-up questions. Lexi receives additional credit for matter context, source trails, templates, workflow actions, and review gates. The score discounts that setup advantage so the public number stays tied to useful first work product, not just product architecture.

Category	Max	Lexi	Claude Opus 4.8	ChatGPT 5.5	Gemini Pro 3.1
Legal reasoning and drafting ability	25	22	23	22	20
Useful follow-up questions	15	12	13	12	10
Matter-grounded task execution	20	14	0	0	0
Source-backed output and audit trail	15	10	1	1	1
Template and workflow completion	15	9	1	1	1
Review controls and risk flags	10	5	5	5	4
Estimated first-run performance score	100	72	43	41	36

How to read the estimate

Lexi's score of 72 is not a claim that its underlying model is twice as intelligent as a frontier model. It is a claim about first-run legal work product. Legal work requires the right documents, the right process, and the right review path. Without that context, a general model must stop and ask for setup.
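
The arithmetic behind that reading can be checked directly from the scoring table. The sketch below copies the table's numbers and splits each total into two groups. The grouping is an assumption made here for illustration: the two reasoning rows on one side, the remaining four rows on the other. The names `ROWS`, `REASONING_ROWS`, and `subtotal` are hypothetical.

```python
SYSTEMS = ["Lexi", "Claude Opus 4.8", "ChatGPT 5.5", "Gemini Pro 3.1"]

# (category, max points, scores in SYSTEMS order), copied from the scoring table.
ROWS = [
    ("Legal reasoning and drafting ability", 25, [22, 23, 22, 20]),
    ("Useful follow-up questions", 15, [12, 13, 12, 10]),
    ("Matter-grounded task execution", 20, [14, 0, 0, 0]),
    ("Source-backed output and audit trail", 15, [10, 1, 1, 1]),
    ("Template and workflow completion", 15, [9, 1, 1, 1]),
    ("Review controls and risk flags", 10, [5, 5, 5, 4]),
]

# Assumed grouping: rows that depend mainly on the model itself.
REASONING_ROWS = {
    "Legal reasoning and drafting ability",
    "Useful follow-up questions",
}


def subtotal(system: str, reasoning_only: bool) -> int:
    """Sum one system's points over the reasoning rows or over the other rows."""
    idx = SYSTEMS.index(system)
    return sum(
        scores[idx]
        for name, _max, scores in ROWS
        if (name in REASONING_ROWS) == reasoning_only
    )


for system in SYSTEMS:
    reasoning = subtotal(system, reasoning_only=True)
    other = subtotal(system, reasoning_only=False)
    print(f"{system}: reasoning {reasoning}, other rows {other}, total {reasoning + other}")
```

Under this grouping Lexi scores 34 on the reasoning rows against 36, 34, and 30 for the general models, so nearly all of the gap in the totals comes from the other four rows.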

The automation-readiness estimate is separate. It measures where Lexi appears to have enough workflow fit to assist on eligible legal work while attorneys remain in control of review, strategy, and judgment.

How the estimate becomes a benchmark

The formal benchmark should work like a product development loop, not a leaderboard. The goal is to learn which parts of the legal agent help: matter context, source retrieval, templates, firm memory, attorney approval routing, and workflow actions.

Build legal work environments

Synthetic law-firm operating contexts across drafting, review, fact extraction, risk spotting, and workflow completion.

Write attorney criteria

Each task gets concrete pass/fail criteria written by attorneys before systems are scored.

Run identical scenarios

Lexi runs with its product context; general tools run from the same first instruction without manual setup.

Publish the receipt

Show exact model names, dates, prompts, outputs, scoring criteria, limitations, and raw logs.
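
A receipt along these lines could be stored as one serializable record per run. The following is only a sketch under stated assumptions: the `RunReceipt` class and its field names are hypothetical, and every value shown is a placeholder rather than data from an actual run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunReceipt:
    """Hypothetical record of one system's run on one benchmark task."""

    system: str
    model_name: str        # exact model name
    run_date: str          # ISO date of the run
    conditions: str        # what context the system received
    prompt: str            # the first instruction, identical across systems
    output: str
    criteria: list[str]    # attorney-written criteria used for scoring
    limitations: list[str]
    raw_log_path: str


# Placeholder values only; nothing here comes from a real benchmark run.
receipt = RunReceipt(
    system="placeholder-system",
    model_name="placeholder-model",
    run_date="2026-01-01",
    conditions="first instruction only, no manual setup",
    prompt="placeholder prompt",
    output="placeholder output",
    criteria=["placeholder criterion"],
    limitations=["placeholder limitation"],
    raw_log_path="logs/placeholder.jsonl",
)

# Serialize so the receipt can be published alongside the raw logs.
serialized = json.dumps(asdict(receipt), indent=2, sort_keys=True)
print(serialized)
```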

What we will report

The benchmark receipt should make the estimate auditable. At minimum, the report should include:

Overall criteria satisfaction across all legal tasks.
Attorney-ready first draft rate: how often the first output needs only attorney review rather than a full rewrite.
Task-family breakdown for drafting, review, extraction, risk spotting, and workflow completion.
Exact model names and dates, plus the conditions each system received.
Representative examples showing where Lexi's context helped and where it still failed.

Public metric	What it answers	Public status
Readiness index	How prepared is the system before manual setup?	Published as estimate
Criteria satisfaction	What percentage of attorney criteria did the output satisfy?	Benchmark receipt
Attorney-ready first draft rate	How often did the first output avoid a full rewrite?	Benchmark receipt
Task-family scores	Where does Lexi help most, and where does it still need work?	Benchmark receipt
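
The receipt metrics in the table above could be computed from per-task results roughly as follows. This is a sketch, not the benchmark's defined method: `TaskResult`, `summarize`, and the sample numbers are hypothetical, and it assumes criteria are pooled across tasks rather than averaged per task.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical scoring record for one task run by one system."""

    family: str            # e.g. "Drafting", "Review", "Fact extraction"
    criteria_passed: int
    criteria_total: int
    needs_rewrite: bool    # attorney reviewer's judgment on the first output


def summarize(results: list[TaskResult]) -> dict:
    """Compute pooled criteria satisfaction, attorney-ready rate, and family scores."""
    if not results:
        raise ValueError("at least one task result is required")
    passed_by_family: dict[str, int] = defaultdict(int)
    total_by_family: dict[str, int] = defaultdict(int)
    for r in results:
        passed_by_family[r.family] += r.criteria_passed
        total_by_family[r.family] += r.criteria_total
    return {
        "criteria_satisfaction": sum(passed_by_family.values()) / sum(total_by_family.values()),
        "attorney_ready_rate": sum(not r.needs_rewrite for r in results) / len(results),
        "by_family": {f: passed_by_family[f] / total_by_family[f] for f in total_by_family},
    }


# Invented sample data for illustration only.
sample = [
    TaskResult("Drafting", 3, 4, needs_rewrite=False),
    TaskResult("Drafting", 2, 4, needs_rewrite=True),
    TaskResult("Review", 4, 4, needs_rewrite=False),
    TaskResult("Fact extraction", 1, 4, needs_rewrite=True),
]
summary = summarize(sample)
print(summary["criteria_satisfaction"])  # 0.625
print(summary["attorney_ready_rate"])    # 0.5
```

Pooling weights tasks with more criteria more heavily; averaging per-task fractions would weight every task equally, and the published receipt would need to state which was used.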

What the estimate is saying

The score is not a model-IQ claim. It measures the practical advantage of starting with matter files, source documents, deadlines, templates, firm preferences, and attorney review rules already in place.

That setup advantage is what the benchmark should test: whether Lexi's first output is easier to review, easier to trust, and faster to revise than a general model given the same first instruction.

Claim	Status today	What backs it up
Lexi starts with more legal operating context.	Supported by rubric	Rubric categories for matter context, source access, workflow memory, legal-task framing, and review controls.
Lexi should require less setup before useful legal work begins.	Benchmark claim	Compare first-run outputs under the same instruction, with no manual file upload or prompt engineering for blank chats.
Lexi produces better first legal work product.	Benchmark claim	Attorney-scored tasks with prompts, outputs, model names, dates, scoring sheets, and reviewer notes published together.
Lexi replaces attorney judgment.	Not the claim	The benchmark should measure better drafts, clearer source trails, and lower setup burden while preserving attorney review.

The benchmark turns readiness into proof

The public benchmark should answer the practical question attorneys care about: does this context advantage produce first drafts that are easier to review, easier to trust, and faster to revise? The receipt should be simple: task set, grading rubric, raw outputs, exact model and date conditions, and attorney reviewer notes.
