Why Claude 4.6 changed the math on lease abstraction.
Earlier-gen models had a long-tail accuracy problem on commercial lease riders. 4.6 cut that error rate enough that exception-based review is finally viable.
For two and a half years, the question we kept getting from heads of brokerage was the same: "how accurate is it, really?" And for two and a half years, the honest answer was "good on the headline fields, mediocre on the long tail." Base rent and term length were never the problem. The problem was everything that lives in the rider: co-tenancy clauses, escalation formulas with caps and floors, mid-term option rents, holdover language, restoration carve-outs.
These are the clauses that decide whether a renewal is actually worth what the tenant thinks it's worth, and they're the clauses that earlier-generation models hallucinated on with depressing regularity. We saw it in our own evals. We saw it at firms running side-by-side comparisons with their paralegal teams.
Claude 4.6 is the first model where the long-tail accuracy moved enough that the cost-benefit flipped. Not because it's perfect (no model is), but because the failure mode shifted from "confidently wrong about an option clause" to "flags the clause for human review." That's a different category of system.
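To make that shift concrete, here is a minimal sketch of what exception-based routing could look like downstream of the model. The `ExtractedField` type, the `needs_review` flag, and `split_for_review` are hypothetical names for illustration; they are assumptions, not the model's actual output format or our production code.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    # Hypothetical flag: True when the extraction step marked this clause as uncertain.
    needs_review: bool


def split_for_review(fields):
    """Separate fields that can be accepted as-is from those routed to a human."""
    auto_accepted = [f for f in fields if not f.needs_review]
    review_queue = [f for f in fields if f.needs_review]
    return auto_accepted, review_queue


fields = [
    ExtractedField("base_rent", "$42.00/sf", needs_review=False),
    ExtractedField("term_months", "120", needs_review=False),
    ExtractedField("option_rent", "95% of fair market value", needs_review=True),
]
accepted, queue = split_for_review(fields)
print([f.name for f in queue])
```

The point of the design is that a flagged clause costs a few minutes of human attention, while a confidently wrong one costs nothing until it costs a lot.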
What changed under the hood
We don't have inside knowledge of the model architecture. What we have is the eval data we run on every release: a held-out corpus of 4,000 commercial leases (office, industrial, retail, and mixed-use) with hand-labeled ground truth across 47 fields. The fields we care about most are the ones that move money: option rent calculations, percentage rent thresholds, exclusive use clauses, restoration obligations.
Across that corpus, 4.6 cut field-level error rates roughly in half versus 4.5, and by roughly an order of magnitude versus the 3.x-class models that the incumbent abstraction tools are still quietly running. The deltas are most dramatic on the clauses with the most legal nuance, which is exactly where you'd want them.
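For readers who want to run the same kind of comparison on their own leases, a field-level error rate can be sketched as below. This assumes exact-match scoring against the hand-labeled value, which is a simplification; the function name, the dictionary layout, and the sample values are illustrative assumptions, not our eval harness or our data.

```python
from collections import defaultdict


def field_error_rates(predictions, ground_truth):
    """Return {field: error_rate}, counting every labeled field in ground_truth.

    Both arguments map lease_id -> {field_name: value}. A missing or
    non-matching prediction counts as an error (exact-match assumption).
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for lease_id, truth_fields in ground_truth.items():
        predicted_fields = predictions.get(lease_id, {})
        for field, truth_value in truth_fields.items():
            totals[field] += 1
            if predicted_fields.get(field) != truth_value:
                errors[field] += 1
    return {field: errors[field] / totals[field] for field in totals}


# Toy data for illustration only.
ground_truth = {
    "lease-a": {"base_rent": "100", "option_rent": "FMV"},
    "lease-b": {"base_rent": "200", "option_rent": "95% of FMV"},
}
predictions = {
    "lease-a": {"base_rent": "100", "option_rent": "FMV"},
    "lease-b": {"base_rent": "200", "option_rent": "FMV"},
}
rates = field_error_rates(predictions, ground_truth)
print(rates)
```

Reporting the rate per field rather than per lease is what exposes the long tail: a model can look excellent on average while still missing half the option clauses.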
Why this matters for pricing
When the model is mediocre, you have to layer human review on every output, which means the cost structure looks like "AI plus a paralegal" rather than "AI replaces a paralegal." The economics never quite work, because you're paying for both.
When the model crosses the threshold where review is exception-based rather than universal, the math flips. A broker reviewing 8% of fields instead of 100% is doing about five minutes of work instead of an hour. The marginal cost of an abstract, which is mostly review time, drops by roughly 90%, and the per-deal margin starts to support the price points the market actually wants.
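The arithmetic behind those figures is simple enough to check. The sketch below uses the one-hour full review and the 8% review share from the paragraph above, and assumes review time scales linearly with the share of fields a person looks at; the function name is illustrative.

```python
def review_minutes(full_review_minutes, share_reviewed):
    """Review time, assuming it scales linearly with the share of fields reviewed."""
    return full_review_minutes * share_reviewed


full = 60.0                                 # universal review: about an hour per abstract
exception = review_minutes(full, 0.08)      # exception-based review: just under five minutes
reduction = 1 - exception / full            # a bit over nine-tenths of review time saved

print(f"{exception:.1f} min, {reduction:.0%} less review time")
```

Under the linear assumption the saving comes out slightly above 90%, which is why we quote the round number.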
That's the math that changed in 4.6. It's not a marketing claim; it's why we're comfortable selling the abstractor as a flat $30-49/month subscription instead of usage-priced like the legacy stack. The unit economics finally work.
What we're watching next
Two things. First, whether the next model generation closes the remaining gap on the truly adversarial leases: bespoke ground leases, sale-leasebacks with weird earn-outs, the stuff where even a senior paralegal slows down. Second, whether the rest of the stack catches up: connectors that can read the lease and write back into deal pipelines without a human shuttling files. The model is no longer the bottleneck. The plumbing is.
See it on a real lease.
Free tier is three abstracts a month, no card. Drop in a lease your team has already done by hand and check the diff.