“AI can’t do accounting” benchmarks are asking the wrong question

May 14, 2026
in Accounting

A genre of AI benchmark has emerged over the past year, and its studies all reach a similar conclusion: AI can’t do accounting. AI can’t calculate a tax return. AI can’t complete end-to-end work. AI loses the plot somewhere between minute one and minute five of any sustained task.

We’ve read these studies carefully. Some of them are technically rigorous. Most of them are well-intentioned. The problem is not that they are badly executed. The problem is that they are often used to answer a procurement question they were never designed to address. Almost all of them describe a version of AI that doesn’t resemble what we deployed at thousands of clients across some of the largest accounting firms in the country during Tax Season 2026.

This article is a practitioner-facing response. It’s not a takedown of any individual benchmark, several of which have advanced the conversation in genuinely useful ways. It’s an argument that the dominant industry framing of “what can AI do?” is structurally mismatched with how AI is being built and deployed in production today, and the conclusions firms draw from these benchmarks can leave them a tax season behind their peers.

What the benchmarks actually measure

Before pushing back, it’s worth being precise about what these studies do and don’t claim.

Column Tax’s TaxCalcBench evaluates whether a frontier language model, given structured taxpayer inputs, can natively calculate a Form 1040. The best-performing model scores in the mid-30% range on strict correctness. The authors are clear about what they’re testing: pure calculation, model-only, no scaffolding, no tax engine, no orchestration. Their conclusion of “AI can’t do your taxes on its own (yet)” is correct on its own terms.

DualEntry’s benchmark scores frontier models on accounting workflow questions: transaction classification, journal entries, AP/AR, reconciliation, financial reporting. The best model scores around 79%. Again, the framing is model-only, no surrounding system.

Harvey’s Legal Agent Benchmark tests whether agents can complete long-horizon legal work product end to end, against expert rubrics with all-pass grading. Many of the same considerations apply.

What unites these studies is their structural choice: They evaluate a language model in isolation, asking it to one-shot a task that, in real life, would never be one-shotted by a human or an agent. That choice is intellectually defensible if you’re trying to characterize raw model capability. It becomes misleading when it gets read as evidence that AI systems cannot perform this work in production.

The mismatch: benchmarks vs. production systems

Here’s the gap. The best AI tax systems in production today do not look like the systems these benchmarks measure.

A production system for tax preparation is not a frontier model handed a set of W-2s and 1099s and asked to emit a Form 1040. Preparing a return is a workflow, not a single generation. Specialized agents take on different parts of that workflow: One reads and classifies documents, another plans the return, another populates worksheets, another reviews the output for inconsistencies. The surrounding software handles the rest: client document gathering and storage, validation against a deterministic tax engine, and review interfaces that walk practitioners through changes and flag what requires human judgment before anything is filed.

In practice, this changes the work from generating a return from scratch to reviewing a cited draft, resolving flagged issues and validating judgment calls. The benchmarks measure the model. Production systems are the model plus everything that surrounds it. Conflating the two leads to the wrong conclusion.

Six reasons the current benchmark genre misses the point

1. They ignore the deterministic systems to which the model is coupled

This is the single biggest gap. A language model asked to natively compute a Form 1040 will misuse the IRS tax tables, default to bracket math, and miss eligibility rules for the Child Tax Credit. TaxCalcBench shows this clearly. But that finding tells you very little about what happens when the model is paired with a tax engine, which is how production systems are actually built.

The pairing changes the work in two ways. First, the division of labor: The tax engine handles calculation deterministically, while the model handles document understanding, reasoning and worksheet population. Each system does what it’s good at. Second, the interface between them: the model isn’t asked to “write a tax return” in some text-only format invented for the benchmark. It populates specific, typed worksheet fields that the tax engine then validates and computes from. That creates entirely different constraints, feedback signals and failure modes than free-form generation graded against a reference output.
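
To make that division of labor concrete, here is a minimal sketch of the typed-field interface described above. The field names, the validation rules and the flat rate standing in for the IRS tax tables are all invented for illustration; a real tax engine is far more involved.

```python
from dataclasses import dataclass

@dataclass
class W2Fields:
    wages: float                 # typed worksheet field the model populates (Box 1)
    federal_withholding: float   # Box 2

def validate(fields: W2Fields) -> list[str]:
    """Deterministic checks the engine runs before computing anything."""
    issues = []
    if fields.wages < 0:
        issues.append("Wages cannot be negative")
    if fields.federal_withholding > fields.wages:
        issues.append("Withholding exceeds wages; likely a field mix-up")
    return issues

def compute_liability(fields: W2Fields, standard_deduction: float = 14_600.0) -> float:
    """Engine-side calculation: placeholder math, not real bracket or table logic."""
    taxable = max(fields.wages - standard_deduction, 0.0)
    return round(taxable * 0.12, 2)   # a flat rate stands in for the IRS tax tables

# The model's output is a set of typed fields, not a free-form "written" return.
draft = W2Fields(wages=62_000.0, federal_withholding=5_400.0)
print(validate(draft))            # []
print(compute_liability(draft))   # 5688.0
```

The point of the sketch is the boundary: the model never touches the arithmetic, and the engine never touches the unstructured documents.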

2. They give the agent less context than a human preparer would have

About a quarter of the files needed to prepare a return aren’t tax documents. They’re emails, client notes, prior-year work papers and unstructured communications between the firm and the client. When we ran our first pilot, our accuracy fell short on cases where this context wasn’t available, not because the agent couldn’t reason, but because it was reasoning about an incomplete picture. We’ve since built our email inbox, dynamic binders and client knowledge base to ensure the agent has the same information the human preparer would. 

Benchmarks that hand the model only the minimum set of tax forms are testing it under conditions a human preparer would also struggle with.

3. They don’t account for the existing review hierarchy

Accounting work is reviewed work. There is a longstanding cadence of interns and junior preparers producing first drafts, seniors and managers reviewing, and partners signing off. Nobody is proposing that a client receive a final return that no human has ever looked at. What’s happening is that AI is starting to replace the first level of preparation, and the review hierarchy continues to operate above it. The relevant question isn’t “can the agent produce a perfect return?” It’s “can the agent produce a first draft that’s faster and better than the firm’s current process for starting review?”

Benchmarks built around all-or-nothing grading miss the review process that firms actually implement.

4. They one-shot tasks that, in real life, are iterative

A return doesn’t get prepared in a single pass, and modern production systems aren’t single-agent loops. Documents arrive in waves: W-2s in February, 1099s in March, K-1s in August. Within each pass, agent output is reviewed by a secondary agent that flags inconsistencies. Across passes, the agent generates a draft, the preparer reviews, follow-up questions are flagged, the client responds, the draft is updated, the cycle continues. 

A benchmark that evaluates a single pass by a single model doesn’t capture any of this. It’s testing a sprint when the actual work is a marathon, and then concluding that errors compound when the production system is specifically designed to catch them.
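
As a rough illustration of that iteration (not any particular product’s orchestration), the loop below updates a draft as each wave of documents arrives and runs a secondary review pass each time; flagged issues go back to the preparer and client and feed the next pass.

```python
def prepare_iteratively(document_batches, generate_draft, review_draft):
    """Sketch of a multi-pass workflow: draft, review, repeat as documents arrive."""
    draft, open_issues = {}, []
    for batch in document_batches:            # W-2s in February, 1099s in March, ...
        draft = generate_draft(draft, batch)  # update the draft with the new documents
        open_issues = review_draft(draft)     # secondary review flags inconsistencies
        # open issues go to the preparer and client and are resolved in a later pass
    return draft, open_issues

# Toy stand-ins so the loop runs end to end:
batches = [["W-2"], ["1099-INT"], ["K-1"]]
draft, issues = prepare_iteratively(
    batches,
    generate_draft=lambda d, b: {**d, **{doc: "populated" for doc in b}},
    review_draft=lambda d: [] if len(d) >= 3 else ["awaiting remaining documents"],
)
print(draft)   # {'W-2': 'populated', '1099-INT': 'populated', 'K-1': 'populated'}
print(issues)  # []
```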

5. They don’t measure what production systems are designed to surface

In Accrual’s design, when the agent isn’t sure about something, it raises an issue. It doesn’t guess. It tells the reviewer: “The client’s dependent Jane is over 19. Should they still be claimed as a dependent?” That’s the most valuable behavior an agent can have, because it directs reviewer attention to exactly the places where judgment matters. 

The expectation cannot and should not be 100% accuracy on high-complexity work. The right expectation is hours saved and the share of remaining work the agent can proactively identify. Think of a junior preparer: It’s better when they tell you what they need help with than when they make a hard-to-spot mistake.

A benchmark that grades on whether the model produced the correct number, full stop, treats “I don’t know, please review” as a failure. In production, it is one of the most useful outputs the system can produce.
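
One way to picture the difference is an output shape that keeps confident values separate from the items flagged for judgment. The structure below is purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class PreparerNote:
    field_name: str
    question: str

@dataclass
class DraftResult:
    populated: dict[str, float] = field(default_factory=dict)  # values the agent is confident in
    notes: list[PreparerNote] = field(default_factory=list)    # items raised for reviewer judgment

draft = DraftResult(
    populated={"wages": 62_000.0, "federal_withholding": 5_400.0},
    notes=[PreparerNote(
        field_name="dependents",
        question="The client's dependent Jane is over 19. Should she still be claimed?",
    )],
)

# An all-or-nothing grader scores this draft as a failure;
# a reviewer sees exactly where judgment is needed.
for note in draft.notes:
    print(f"[REVIEW] {note.field_name}: {note.question}")
```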

6. They miss the cadence mismatch between model improvement and accounting practice

Frontier models improve in months. Accounting firms operate on annual seasonality. You may have reviewed a benchmark in the past and missed what’s going to be possible later in the year. We saw this play out with our own client base: A year ago, most large firms were on the sidelines evaluating. Those firms also missed a season’s worth of internal learnings on how to bring practitioners along through the change. Today, demand is coming from nearly every firm with a tax practice, because the firms that moved earlier produced measurable results.

Firms that benchmarked AI tax tools last year and concluded “not yet” missed an entirely different category of tools for this tax season. 

What real accuracy measurement looks like

It’s worth saying what a meaningful production benchmark requires, because we’ve homed in on this through running pilots and deploying them into production.

We measure accuracy as a true A/B comparison: The agent receives the same inputs the preparer used, generates its draft, and we compare it worksheet-by-worksheet against the firm’s filed return. This comparison is deliberately asymmetric. The agent’s first draft is measured against the human’s final, reviewed, signed-off return. The objective isn’t autonomous completion. It’s producing the most complete and accurate draft possible, with explicit preparer notes calling out everything that requires professional judgment.

We started with dollar-weighted accuracy and quickly learned its limits. A single missing document can cascade into apparent variance across a dozen line items, making a return that’s structurally correct look wrong, or making a return with an obvious omission look fine. An experienced reviewer catches these in seconds. Aggregate dollar accuracy doesn’t.

What we’ve shifted toward is measuring the discrete number of steps required to go from agent draft to final filed return. That number translates directly into hours saved, and hours saved translate into capacity firms can redirect to advisory work, complex returns or simply earlier filings with less burnout. It’s a measurement framework that matches how the work gets done.
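
A minimal version of that step count is just a field-level diff between the agent’s draft and the filed return: every field the reviewer had to add, remove or change is one step. The field names below are invented for illustration.

```python
def steps_to_final(agent_draft: dict[str, float], filed_return: dict[str, float]) -> int:
    """Count every field the reviewer had to add, remove, or change."""
    return sum(
        1
        for name in set(agent_draft) | set(filed_return)
        if agent_draft.get(name) != filed_return.get(name)
    )

agent_draft = {"wages": 62_000.0, "schedule_c_income": 8_500.0}
filed_return = {"wages": 62_000.0, "schedule_c_income": 9_100.0, "hsa_deduction": 3_850.0}

print(steps_to_final(agent_draft, filed_return))  # 2: one corrected field, one added field
```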

When we run pilots, we often find that firms don’t have a defined accuracy standard for their existing human-prepared returns. The implicit assumption is that filed returns are 100% correct. In practice, there’s natural variation around the margins. Two preparers at the same firm may produce different drafts of the same return. AI evaluations don’t introduce that variance. They just make it visible.

What this means for firms

If you’re a firm leader trying to make a decision about AI, the honest takeaway from current benchmarks is this: Don’t read them as evidence about what AI can do for your practice. They’re measuring frontier models in isolation, which is a useful research input and a misleading procurement input.

The right diligence question isn’t “What does this AI score on a benchmark?” It’s “Is this system designed to work the way tax preparation actually works?”

  • Is the AI coupled to a deterministic tax engine, or is it being asked to do calculation natively?
  • Does the system give the agent access to all the context a human preparer would have, including prior-year returns and unstructured data such as emails?
  • Does the agent surface uncertainty as actionable preparer notes, or does it guess silently?
  • Is there a secondary review layer that catches errors before they reach a human reviewer?
  • Can every value in the output be traced back to a specific source document?
  • Does the workflow support iterative preparation, with incremental reviews as new information arrives?

Those questions describe the architecture of a system that works in production. A simplified benchmark score alone does not answer them.

What comes next

We’re confident in three things.

The first is that AI can already do a meaningful share of tax preparation work today. Across the firms we worked with this season, the agent generated review-ready drafts with full citations across tens of thousands of returns. Firms saw measurable time savings. Practitioners adapted their workflow from “prepare the return” to “review the draft and resolve the issues.”

The second is that the leap from this season to next is going to be easier than the leap from last season to this one. Frontier models are improving. Our agent orchestration keeps getting better. Each implementation surfaces edge cases we can fix before the next firm encounters them. And the work doesn’t reset each year. Every returning client gives the agent a richer starting point: prior-year worksheets, activity classifications, communications history, and the issues already resolved in the last cycle. The compounding effect is real, both across firms and within them.

The third is that the dominant industry narrative (“AI can’t do this yet”) is going to look increasingly out of step with what’s shipping. Firms that wait for benchmarks to validate the technology will be late. Firms that pilot, measure on their own data, and build internal intuition will be early. That gap is going to matter.

The question isn’t whether AI can do tax preparation. It’s already doing it, in production, this season. The question is whether to start compounding now, or spend next season catching up.
