As AI Becomes Smarter, Verification Costs Skyrocket—The True Cost Structure of Automation That Small and Medium Enterprises Need to Know
The accuracy of AI has risen from 99% to 99.9%. That’s an impressive advancement.
Now, here’s a question: How much does it really cost to prove that “0.9% improvement”?
The answer is far higher than most people imagine. And herein lies the biggest pitfall that small and medium enterprises often overlook when adopting AI.
The cost of implementing AI has fallen. There are tools available for just a few thousand yen a month. However, the cost of answering the question "Is this AI really reliable?" has not fallen at all. In fact, as AI becomes more capable, that cost rises. If you adopt AI without understanding this structure, you are setting yourself up for painful surprises.
“Verification Tax”—The Paradox of Increasing Sample Sizes as Accuracy Improves
Researchers refer to this phenomenon as “verification tax.”
The mechanism is simple. As the error rate of AI decreases, the errors themselves become “rare events.” To statistically detect rare occurrences, a vast number of samples are required.
Let’s be specific:
- An AI with a 10% error rate → with 100 samples, you can expect to find about 10 errors. Verification is possible.
- An AI with a 1% error rate → with 100 samples, only about 1 error appears on average. Statistically, almost nothing can be concluded.
- An AI with a 0.1% error rate → even with 1,000 samples, you expect to find only about 1 error. To demonstrate the claimed reliability, tens of thousands of samples are needed.
Research shows that as the error rate ε of an AI model decreases, the number of samples required for verification grows on the order of Θ((1/ε)^{1/3}). In other words, if the error rate falls to a tenth, the verification cost roughly doubles; if it falls to a hundredth, the cost grows by roughly 4.6 times.
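To make the arithmetic concrete, here is a minimal Python sketch. The function names are my own and purely illustrative: the first follows the naive intuition from the list above (you need enough samples to expect to see a handful of errors), and the second applies the (1/ε)^{1/3} growth factor cited above.

```python
def samples_for_expected_errors(error_rate: float, target_errors: int = 10) -> int:
    """Naive rule of thumb: how many samples until you *expect* to see
    `target_errors` errors at the given error rate."""
    return round(target_errors / error_rate)

def verification_growth_factor(error_rate_shrinks_by: float) -> float:
    """How much the required sample size grows, under the cited
    Theta((1/epsilon)^(1/3)) scaling, when the error rate shrinks
    by the given factor."""
    return error_rate_shrinks_by ** (1 / 3)

print(samples_for_expected_errors(0.10))   # -> 100    (10% error rate)
print(samples_for_expected_errors(0.01))   # -> 1000   (1% error rate)
print(samples_for_expected_errors(0.001))  # -> 10000  (0.1% error rate)

print(round(verification_growth_factor(10), 2))   # -> 2.15x for a 10x lower error rate
print(round(verification_growth_factor(100), 2))  # -> 4.64x for a 100x lower error rate
```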
It’s not true that “AI has become smarter, so we can relax.” The correct statement is: “AI has become smarter, so proving that we can relax is costly.”
What This Means for Small and Medium Enterprises
Large corporations can prepare tens of thousands of test samples and staff dedicated verification teams. A company with 30 employees doesn't have that luxury.
For example, let’s say you implement an AI for automatic invoice reading. The vendor claims “99.5% accuracy.” At 20,000 yen per month, it seems cheap. Let’s go ahead with the implementation.
But wait.
To verify whether that 99.5% figure is actually true, you would need humans to check at least several thousand invoices. At 3 minutes per invoice, checking 3,000 invoices takes 150 hours. At a labor rate of 2,000 yen per hour, that comes to 300,000 yen.
Against the AI’s monthly usage fee of 20,000 yen, the cost of one verification is 300,000 yen. This is the true nature of the “verification tax.”
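To see how lopsided that ratio is, here is a tiny sketch of the same calculation (the figures are the illustrative ones above, not measured data):

```python
invoices_to_check = 3_000      # sample size for testing a "99.5% accuracy" claim
minutes_per_invoice = 3        # manual check against the original document
hourly_rate_yen = 2_000
monthly_ai_fee_yen = 20_000

hours_needed = invoices_to_check * minutes_per_invoice / 60   # 150 hours
verification_cost_yen = hours_needed * hourly_rate_yen        # 300,000 yen

print(f"One verification pass: {verification_cost_yen:,.0f} yen")
print(f"That is {verification_cost_yen / monthly_ai_fee_yen:.0f} months of the AI's usage fee")
```

In this example, a single pass of manual checking costs the equivalent of fifteen months of the tool's subscription fee.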
Moreover, this isn’t a one-time issue. Verification will be necessary every time the AI model is updated. Every time the vendor claims, “The accuracy has improved,” there will be a cost to confirm that.
The More Complicated Issue of “Hidden Measurement Errors”
The verification tax alone is burdensome, but there’s another issue that cannot be overlooked: “The method of verification itself is inconsistent.”
Research on evaluating large language models (LLMs) has revealed shocking facts:
- Just slightly changing the wording of the prompt can lead to significant fluctuations in evaluation scores.
- Changing the evaluator (human or another AI) can alter the results.
- Adjusting the temperature parameter (randomness of output) can also affect the scores.
In other words, even when evaluating the same AI with the same data, the conclusions can vary depending on “how the evaluation is conducted.” When Vendor A claims “95% accuracy” and Vendor B claims “90% accuracy,” it’s impossible to distinguish whether the difference is truly due to model performance or merely a difference in measurement methods.
This is serious for small and medium enterprises. They are left with no choice but to take the vendor’s numbers at face value, even though those numbers are dependent on the measurement method and can fluctuate.
However, there is hope. Research shows that by optimizing the evaluation pipeline, the evaluation error can be halved at the same cost. Specifically, this can be achieved by preparing multiple variations of prompts and averaging the results, or by using multiple evaluators. This approach increases accuracy without raising costs, and it is feasible for small and medium enterprises to implement.
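As a rough illustration of why averaging helps, the toy simulation below gives each prompt wording and each evaluator its own bias. The biases and noise levels are invented purely to show the effect; this is not a real evaluation pipeline.

```python
import random
import statistics
from itertools import product

# Hypothetical stand-in for a real scoring call: each prompt wording and
# each evaluator adds its own bias, plus random noise, around the true score.
def evaluate(true_score, prompt_bias, evaluator_bias):
    return true_score + prompt_bias + evaluator_bias + random.gauss(0, 0.03)

random.seed(0)
true_score = 0.90
prompt_biases = {"wording A": +0.04, "wording B": -0.03, "wording C": +0.01}
evaluator_biases = {"judge 1": -0.02, "judge 2": +0.02}

# Single measurement: one wording, one judge, whatever bias they carry
single = evaluate(true_score, prompt_biases["wording A"], evaluator_biases["judge 1"])

# Averaged measurement: every (wording, judge) combination
averaged = statistics.mean(
    evaluate(true_score, p, e)
    for p, e in product(prompt_biases.values(), evaluator_biases.values())
)

print(f"single run : {single:.3f}")
print(f"averaged   : {averaged:.3f}  (true value: {true_score})")
```

A single run inherits whatever bias the chosen wording and judge happen to carry; averaging over all six combinations lets those biases largely cancel, so the reported number varies far less from one measurement setup to the next.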
The “195 Benchmarks Exist, but Few Are Usable” Problem in AI Safety
Another structural issue exists across the industry.
Currently, there are over 195 benchmarks for evaluating AI safety. However, most of these are biased towards tasks of “moderate complexity,” and very few comprehensively test truly dangerous rare cases—such as discriminatory outputs, incorrect medical information, or legally risky responses.
Furthermore, there is a significant lack of support for languages other than English. Safety evaluation benchmarks in Japanese are extremely limited. This means that when Japanese small and medium enterprises try to confirm, “Is this AI safe to use in Japanese?” there is almost no reliable metric available.
Despite having 195 benchmarks, virtually none are usable by Japanese small and medium enterprises. This is the reality.
So, What Should We Do?
I'm not saying, "Don't use AI because verification is expensive." That would just be a refusal to think the problem through.
What small and medium enterprises should do is to make decisions about AI adoption while factoring in verification costs. Specifically, there are three steps:
1. Start from Areas Where “Verification Isn’t Necessary”
Introduce AI in areas where errors won’t be fatal. For instance, summarizing internal meeting notes, brainstorming ideas, or drafting documents. In these cases, there’s no need to prove “99% accuracy.” It can be used with the assumption that humans will perform the final check. The verification tax is zero.
2. Ask Vendors About “Verification Costs”
Instead of asking, “What is the accuracy percentage?” ask, “How did you measure that accuracy?” “How many samples were used?” “Was it verified in Japanese?” Vendors who cannot answer these questions may not have verified their claims at all.
3. Run Small Tests and Verify with Your Own Data
Just 100 samples are enough to start. Run the AI on your actual data and have humans check every result. If 5 errors appear in 100 samples, the error rate is roughly 5%. If there are 0 errors in 100 samples, you can at least say, "The error rate doesn't seem to be several percent or more." It isn't statistically rigorous, but it's far better than checking nothing at all. The important thing is not to take the vendor's numbers on faith, but to see the results with your own eyes.
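If you want a bit more rigor than eyeballing the count, the well-known "rule of three" and a simple normal approximation can be computed in a few lines. This is a rough sketch, not a substitute for proper statistics, but it matches the intuition above: zero errors in 100 samples caps the plausible error rate at around 3%.

```python
import math

def error_rate_summary(errors: int, samples: int) -> str:
    """Summarize a small manual test. With zero observed errors, apply the
    'rule of three': at ~95% confidence the true error rate is below about 3/n.
    Otherwise report the observed rate with a rough one-sided 95% margin
    from the normal approximation (good enough for a ballpark figure)."""
    if errors == 0:
        return f"0/{samples}: error rate likely below {3 / samples:.1%}"
    p = errors / samples
    margin = 1.645 * math.sqrt(p * (1 - p) / samples)
    return f"{errors}/{samples}: roughly {p:.0%} observed (could plausibly be as high as ~{p + margin:.0%})"

print(error_rate_summary(5, 100))   # 5/100: roughly 5% observed (could plausibly be as high as ~9%)
print(error_rate_summary(0, 100))   # 0/100: error rate likely below 3.0%
```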
The True Cost Is Not the “Implementation Fee” but the “Cost of Maintaining Trust”
The usage fee for AI tools ranges from a few thousand yen to tens of thousands of yen per month. Consulting fees for implementation can reach hundreds of thousands of yen. This much is visible.
However, the cost of continuously verifying that the AI is really reliable (the verification tax) is rarely included in any estimate. And this cost tends to grow as AI becomes more capable.
Whether or not you understand this structure will determine the success or failure of AI adoption.
Technological advancements and decreasing costs pertain to the “cost of running AI.” The “cost of trusting AI” is, in fact, increasing. This asymmetry is the most important fact that every small and medium enterprise looking to adopt AI should be aware of.