Introduction
A lawyer rarely asks for useful work in a vacuum. The task is tied to a client, a docket, a jurisdiction, a deadline, an evidentiary record, a firm template, and an attorney's judgment about risk. A general-purpose AI model can reason well, but it normally starts by asking the lawyer to reconstruct that working context.
Lexi was built around the opposite assumption. The agent should start with the matter context already assembled: source documents, firm memory, templates, workflow rules, and approval gates. The benchmark should measure that practical advantage directly.
Cold starting a legal agent
Early product feedback is useful, but it can be misleading for legal agents. A demo matter might make an agent look strong because the user has already explained the file. A one-off prompt might make a general model look weak because it was denied the documents any lawyer would normally review.
That is why the benchmark has to separate two questions: how good is the model at legal reasoning, and how much legal work context does the product provide before the first prompt? The current readiness estimate gives general models credit for reasoning and follow-up questions. Lexi's additional credit comes from starting with the matter.
Building legal work environments
The benchmark object is a legal work environment: a synthetic but realistic law-firm operating context containing the materials a lawyer would need to do the work. It is more than a prompt and more than a case file. It is the working context around the assignment.
A legal work environment should include:
- Source documents: pleadings, orders, correspondence, contracts, transcripts, exhibits, or intake notes.
- Procedural context: jurisdiction, posture, deadlines, docket history, and local rule constraints.
- Firm context: preferred template, tone, citation style, review policy, and escalation rules.
- Task context: what output is needed, where it goes next, and what attorney approval is required.
This mirrors how legal work actually happens. A motion, demand letter, chronology, or risk memo is only useful if it is grounded in the correct matter file and firm process.
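One way to picture a legal work environment is as a record with four context layers. The sketch below is illustrative only: the class name, field names, and file names are assumptions made for this example, not Lexi's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class LegalWorkEnvironment:
    """Hypothetical container for one synthetic matter (names are illustrative)."""

    # Pleadings, orders, correspondence, contracts, transcripts, exhibits, intake notes.
    source_documents: list[str] = field(default_factory=list)
    # Jurisdiction, posture, deadlines, docket history, local rule constraints.
    procedural_context: dict[str, str] = field(default_factory=dict)
    # Preferred template, tone, citation style, review policy, escalation rules.
    firm_context: dict[str, str] = field(default_factory=dict)
    # Output needed, where it goes next, required attorney approval.
    task_context: dict[str, str] = field(default_factory=dict)

    def missing_layers(self) -> list[str]:
        """Return the names of context layers that are still empty."""
        layers = {
            "source_documents": self.source_documents,
            "procedural_context": self.procedural_context,
            "firm_context": self.firm_context,
            "task_context": self.task_context,
        }
        return [name for name, value in layers.items() if not value]


# Example with placeholder values: two layers filled, two still empty.
env = LegalWorkEnvironment(
    source_documents=["complaint.pdf", "scheduling_order.pdf"],
    procedural_context={"jurisdiction": "State X", "posture": "pre-answer"},
)
print(env.missing_layers())  # ['firm_context', 'task_context']
```

A blank chat session corresponds to an environment in which every layer is empty, which is the setup burden the benchmark is meant to expose.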
Tasks and grading criteria
Each environment contains tasks. A task is a concrete legal job to be done inside that context. The output is graded against attorney-written criteria instead of vague impressions.
Example grading criteria for a motion-drafting task:
- Uses the correct procedural posture and jurisdiction-specific legal standard.
- Pulls facts from the actual source packet rather than inventing or assuming facts.
- Follows the firm's motion structure, tone, caption style, and citation format.
- Flags missing facts, evidentiary weakness, deadline risk, and attorney approval points.
| Task family | Representative task | Why blank chat struggles |
|---|---|---|
| Drafting | Motion, demand letter, client update, discovery response | Needs facts, posture, template, source cites, and attorney risk preferences. |
| Review | Contract issue list, pleading review, order summary | Needs the document set, review standard, jurisdiction, and next workflow step. |
| Fact extraction | Timeline, evidence map, party/event chronology | Needs source documents and a way to preserve the audit trail. |
| Risk spotting | Deadline, privilege, sanctions, evidentiary, or client-position risk | Needs matter-specific facts and firm escalation rules. |
| Workflow completion | Route draft to attorney, update matter notes, prepare follow-up task | Needs product context and firm workflow permissions. |
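The relationship between a task, its attorney-written criteria, and a pass/fail score can be sketched as follows. This is a minimal illustration under stated assumptions: the `Task` and `Criterion` names, the instruction text, and the reviewer verdicts are all invented for the example and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One attorney-written pass/fail check (hypothetical structure)."""
    description: str


@dataclass
class Task:
    """A concrete legal job inside an environment (hypothetical structure)."""
    family: str
    instruction: str
    criteria: list[Criterion]


def criteria_satisfaction(verdicts: list[bool]) -> float:
    """Fraction of criteria the reviewing attorney marked as satisfied."""
    if not verdicts:
        raise ValueError("a task needs at least one criterion verdict")
    return sum(verdicts) / len(verdicts)


task = Task(
    family="Drafting",
    instruction="Draft the motion from the matter file.",
    criteria=[
        Criterion("Uses the correct procedural posture and legal standard"),
        Criterion("Pulls facts from the actual source packet"),
        Criterion("Follows the firm's motion structure and citation format"),
        Criterion("Flags missing facts, risks, and attorney approval points"),
    ],
)

# Invented verdicts for illustration: one per criterion, in order.
verdicts = [True, True, False, True]
score = criteria_satisfaction(verdicts)
print(f"{score:.0%} of criteria satisfied")  # 75% of criteria satisfied
```

Keeping each criterion binary makes the score easy to audit: a reader can check every verdict against the published output instead of trusting a holistic grade.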
Current scoring estimate
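Lexi publishes two linked estimates. The first, shown in the homepage chart, is a legal task performance score: how well each system is expected to do on first-run legal work before extra lawyer setup. The second is an automation-readiness estimate, which is reported separately.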
General models receive meaningful credit for reasoning and useful follow-up questions. Lexi receives additional credit for matter context, source trails, templates, workflow actions, and review gates. The score discounts that setup advantage so the public number stays tied to useful first work product, not just product architecture.
| Category | Max | Lexi | Claude Opus 4.8 | ChatGPT 5.5 | Gemini Pro 3.1 |
|---|---|---|---|---|---|
| Legal reasoning and drafting ability | 25 | 22 | 23 | 22 | 20 |
| Useful follow-up questions | 15 | 12 | 13 | 12 | 10 |
| Matter-grounded task execution | 20 | 14 | 0 | 0 | 0 |
| Source-backed output and audit trail | 15 | 10 | 1 | 1 | 1 |
| Template and workflow completion | 15 | 9 | 1 | 1 | 1 |
| Review controls and risk flags | 10 | 5 | 5 | 5 | 4 |
| Estimated first-run performance score | 100 | 72 | 43 | 41 | 36 |
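The totals in the last row are plain sums of the six category scores. The short check below reproduces them from the table values; the variable and function names are chosen for this example only.

```python
# Category names and maximum points, copied from the table above.
CATEGORIES = [
    ("Legal reasoning and drafting ability", 25),
    ("Useful follow-up questions", 15),
    ("Matter-grounded task execution", 20),
    ("Source-backed output and audit trail", 15),
    ("Template and workflow completion", 15),
    ("Review controls and risk flags", 10),
]

# Per-system points, in the same category order as CATEGORIES.
SCORES = {
    "Lexi": [22, 12, 14, 10, 9, 5],
    "Claude Opus 4.8": [23, 13, 0, 1, 1, 5],
    "ChatGPT 5.5": [22, 12, 0, 1, 1, 5],
    "Gemini Pro 3.1": [20, 10, 0, 1, 1, 4],
}


def total(points: list[int]) -> int:
    """Sum category points after checking each one against its maximum."""
    for (name, cap), value in zip(CATEGORIES, points):
        if not 0 <= value <= cap:
            raise ValueError(f"{name}: {value} is outside 0..{cap}")
    return sum(points)


max_total = sum(cap for _, cap in CATEGORIES)
totals = {system: total(points) for system, points in SCORES.items()}
print(max_total)  # 100
print(totals)
# {'Lexi': 72, 'Claude Opus 4.8': 43, 'ChatGPT 5.5': 41, 'Gemini Pro 3.1': 36}
```

Read this way, Lexi's lead comes from the three context-dependent categories (matter-grounded execution, source trail, and template and workflow completion), where it scores 33 points against 2 for each general model. On the remaining three categories the systems are within a few points of one another.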
How to read the estimate
The 72 score is not a claim that Lexi's underlying model is twice as intelligent as a frontier model. It is a claim about first-run legal work product. Legal work requires the right documents, the right process, and the right review path. Without that context, a general model must stop and ask for setup.
The automation-readiness estimate is separate. It measures where Lexi appears to have enough workflow fit to assist on eligible legal work while attorneys remain in control of review, strategy, and judgment.
How the estimate becomes a benchmark
The formal benchmark should work like a product development loop, not a leaderboard. The goal is to learn which parts of the legal agent help: matter context, source retrieval, templates, firm memory, attorney approval routing, and workflow actions.
- Build legal work environments: synthetic law-firm operating contexts across drafting, review, fact extraction, risk spotting, and workflow completion.
- Write attorney criteria: each task gets concrete pass/fail criteria written by attorneys before systems are scored.
- Run identical scenarios: Lexi runs with its product context; general tools run from the same first instruction without manual setup.
- Publish the receipt: show exact model names, dates, prompts, outputs, scoring criteria, limitations, and raw logs.
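The "run identical scenarios" step could be sketched as a loop that sends the same first instruction to every system and keeps a raw record for the receipt. Everything here is a stand-in: the function name, the record fields, the task IDs, and the two stub systems are hypothetical and do not call any real model or product.

```python
from typing import Callable

# A system is modeled as a function from the first instruction to an output.
System = Callable[[str], str]


def run_identical_scenarios(
    tasks: dict[str, str], systems: dict[str, System]
) -> list[dict[str, str]]:
    """Give every system the same first instruction and log each raw output."""
    records = []
    for task_id, instruction in tasks.items():
        for name, system in systems.items():
            records.append(
                {
                    "task_id": task_id,
                    "system": name,
                    "instruction": instruction,  # identical across systems
                    "output": system(instruction),
                }
            )
    return records


# Stub systems for illustration only; real runs would call the actual tools.
def stub_agent_with_context(instruction: str) -> str:
    return f"[draft grounded in matter file] {instruction}"


def stub_blank_chat(instruction: str) -> str:
    return f"[clarifying questions] {instruction}"


tasks = {
    "drafting-001": "Draft the motion from the matter file.",
    "review-001": "Summarize the latest order and list next steps.",
}
systems = {
    "agent_with_context": stub_agent_with_context,
    "blank_chat": stub_blank_chat,
}

records = run_identical_scenarios(tasks, systems)
print(len(records))  # 4
```

Logging the instruction alongside each output is what lets a reader confirm that no system received extra prompting or manual file uploads.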
What we will report
The benchmark receipt should make the estimate auditable. At minimum, the report should include:
- Overall criteria satisfaction across all legal tasks.
- First-run attorney-ready rate: how often the first output needs review rather than a rewrite.
- Task-family breakdown for drafting, review, extraction, risk spotting, and workflow completion.
- Exact model names and dates, plus the conditions each system received.
- Representative examples showing where Lexi's context helped and where it still failed.
| Public metric | What it answers | Public status |
|---|---|---|
| Readiness index | How prepared is the system before manual setup? | Published as estimate |
| Criteria satisfaction | What percentage of attorney criteria did the output satisfy? | Benchmark receipt |
| First-run attorney-ready rate | How often did the first output avoid a full rewrite? | Benchmark receipt |
| Task-family scores | Where does Lexi help most, and where does it still need work? | Benchmark receipt |
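The receipt metrics can all be derived from per-task reviewer records. The sketch below assumes a simple record shape (`family`, `criteria_met`, `criteria_total`, `full_rewrite`) and uses invented numbers purely to show the arithmetic; none of these values are benchmark results.

```python
from collections import defaultdict

# Invented reviewer records for illustration only.
results = [
    {"family": "Drafting", "criteria_met": 3, "criteria_total": 4, "full_rewrite": False},
    {"family": "Drafting", "criteria_met": 2, "criteria_total": 4, "full_rewrite": True},
    {"family": "Review", "criteria_met": 4, "criteria_total": 4, "full_rewrite": False},
]


def summarize(results: list[dict]) -> dict:
    """Compute overall satisfaction, attorney-ready rate, and per-family scores."""
    if not results:
        raise ValueError("no results to summarize")

    met = sum(r["criteria_met"] for r in results)
    total = sum(r["criteria_total"] for r in results)
    # An output counts as attorney-ready if it did not need a full rewrite.
    ready = sum(1 for r in results if not r["full_rewrite"])

    by_family = defaultdict(lambda: [0, 0])  # family -> [met, total]
    for r in results:
        by_family[r["family"]][0] += r["criteria_met"]
        by_family[r["family"]][1] += r["criteria_total"]

    return {
        "criteria_satisfaction": met / total,
        "attorney_ready_rate": ready / len(results),
        "family_scores": {f: m / t for f, (m, t) in by_family.items()},
    }


summary = summarize(results)
print(summary["criteria_satisfaction"])  # 0.75
print(summary["family_scores"])  # {'Drafting': 0.625, 'Review': 1.0}
```

Publishing the per-task records next to these aggregates lets readers recompute every number in the receipt.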
What the estimate is saying
The score is not a model-IQ claim. It measures the practical advantage of starting with matter files, source documents, deadlines, templates, firm preferences, and attorney review rules already in place.
That setup advantage is what the benchmark should test: whether Lexi's first output is easier to review, easier to trust, and faster to revise than a general model given the same first instruction.
| Claim | Status today | What backs it up |
|---|---|---|
| Lexi starts with more legal operating context. | Supported by rubric | Rubric categories for matter context, source access, workflow memory, legal-task framing, and review controls. |
| Lexi should require less setup before useful legal work begins. | Benchmark claim | Compare first-run outputs under the same instruction, with no manual file upload or prompt engineering for blank chats. |
| Lexi produces better first legal work product. | Benchmark claim | Attorney-scored tasks with prompts, outputs, model names, dates, scoring sheets, and reviewer notes published together. |
| Lexi replaces attorney judgment. | Not the claim | The benchmark should measure better drafts, clearer source trails, and lower setup burden while preserving attorney review. |
The benchmark turns readiness into proof
The public benchmark should prove the practical question attorneys care about: does this context advantage produce first drafts that are easier to review, easier to trust, and faster to revise? The receipt should be simple: task set, grading rubric, raw outputs, exact model and date conditions, and attorney reviewer notes.