Are You Verifying the Content of AI-Written Incident Reports? A Practical Solution to Prevent ‘Quiet Fabrication’ for 50,000 Yen a Month
Conclusion First: Do Not Trust AI-Written Reports Blindly
A phenomenon is quietly spreading through small and medium-sized enterprises.
A system failure occurred. It has been restored. And the report? — “I had ChatGPT write it.”
This in itself is not a bad thing. The time taken to create a report has decreased from two hours to fifteen minutes. This is a correct approach for cost reduction. The problem is that no one is verifying the content of that report.
The text generated by large language models (LLMs) has a characteristic that can be called “quiet fabrication.” The grammar is perfect. The structure is logical. Readers feel that it is “well-organized.” However, the content is incorrect. Non-existent error codes are listed. Events that did not actually occur are cited as causes. And without anyone noticing, that report becomes the basis for decision-making.
This is the essence of the problem of being “plausibly incorrect.”
The Reality of ‘Fabrication’ in Numbers
A study published on arXiv (Investigation of the Accuracy of LLM-Generated GPU Kernels) reports multiple cases where serious bugs remained hidden in LLM-generated code that had passed the existing tests. In other words, the premise that “passing the test means it is correct” is collapsing.
When this is translated into incident reports, it looks like this:
- A manager reads the AI-written report and judges that “there is no problem.”
- In reality, the description of the root cause is incorrect.
- The same failure recurs.
- The on-site team becomes confused, saying, “But that is what the previous report said.”
At one small system integrator (SIer), an LLM wrote an incident report that cited as the cause a process name that did not exist in the actual logs. The person in charge happened to be an experienced veteran and noticed it. What if they had been a newcomer? It is highly likely that the report would have been submitted to the client as is.
Why Does AI Make ‘Plausibly’ Incorrect Statements?
The reason is simple. LLMs simply arrange the “most probable next words”; they have no built-in capability to verify facts.
In terms of incident reports, they pull patterns from the training data that say, “In this type of failure, these causes are generally common.” However, they are not looking at the actual logs or metrics of the failure occurring in front of them. As a result, the descriptions may be correct in general but do not apply to the current incident.
What’s more troublesome is the evaluation bias among AI agents. A study called Contagion Networks shows that in an environment where LLMs evaluate each other, the bias of one agent can propagate to other agents. The idea that cross-checking with multiple AI tools ensures safety is naive; there is a possibility that they will all make the same mistake.
Where Do AI Agents Fail? — Three Patterns of Failure
The AgentArmor framework categorizes AI agent failures into three types. When translated to the context of small and medium-sized enterprises, it looks like this:
1. Undefined Specification Errors
If you do not instruct the AI on “how to write in such cases,” it will fill in the gaps on its own. If the definition of “scope of impact” is not explicitly stated in the incident report, the AI will fabricate a “plausible” range from the training data.
2. Capability Errors
Even though it should be able to write correctly, it does not comply. Even if prompted to “write based on the logs,” it may ignore the log content and resort to general descriptions.
3. Harness Errors
The AI cannot access the necessary information in the first place. It is not allowed to reference log files, nor is it provided with data from monitoring tools. If the input is insufficient, the output will be filled with assumptions.
The most common issue in small and medium-sized enterprises is the third one. When having AI write a report, are you just verbally conveying the overview of the failure and saying, “I’ll leave the rest to you”? That is equivalent to requesting fabrication.
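One way to guard against the first and third patterns is to refuse to request a report at all unless the evidence and the definitions travel with the prompt. The following is a minimal sketch under stated assumptions: `IncidentInput` and `build_prompt` are hypothetical names, the prompt wording is illustrative, and the resulting string would still need to be sent to whichever LLM API you use.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentInput:
    """Everything the model is allowed to rely on (hypothetical structure)."""
    summary: str
    log_lines: list[str] = field(default_factory=list)
    definitions: dict[str, str] = field(default_factory=dict)


def build_prompt(incident: IncidentInput) -> str:
    """Assemble a report-writing prompt that carries the evidence with it."""
    if not incident.log_lines:
        # Harness error: without logs the model can only guess.
        raise ValueError("no log lines supplied; refusing to request a report")
    if "scope of impact" not in incident.definitions:
        # Undefined specification: the model would invent its own definition.
        raise ValueError("'scope of impact' is not defined")

    definitions = "\n".join(
        f"- {term}: {meaning}" for term, meaning in incident.definitions.items()
    )
    logs = "\n".join(incident.log_lines)
    return (
        "Write an incident report using ONLY the evidence below.\n"
        "If the logs do not support a statement, write 'unknown'.\n\n"
        f"Summary:\n{incident.summary}\n\n"
        f"Definitions:\n{definitions}\n\n"
        f"Logs:\n{logs}\n"
    )
```

This does not address the second pattern: a model can still ignore the logs it was given, which is why the output has to be checked against real data afterwards.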
So, What Should Be Done? — A Verification System for 50,000 Yen a Month
The conclusion of “let’s stop having AI write reports” is incorrect. The benefit of being able to produce a report that humans took two hours to write in just fifteen minutes is significant. The problem lies in the lack of a verification mechanism, and the solution is to create one.
Here’s a breakdown of the costs involved.
Example Structure: Verification System Under 50,000 Yen a Month
| Item | Content | Monthly Estimate |
|---|---|---|
| LLM API Costs (for verification) | Cross-check outputs with GPT-4o or Claude, etc. | About 10,000 to 20,000 Yen |
| Log Reconciliation Script | A simple tool (self-made or OSS) to match report content with actual logs | 0 to 5,000 Yen |
| Checklist Template | Confirmation items such as “Does this process name exist?” or “Does this timestamp match the logs?” | 0 Yen (reusable once created) |
| Human Review (15 minutes per case) | Final confirmation by a human. However, with a checklist, it can be done in 15 minutes | Within existing personnel costs |
Total: 10,000 to 25,000 Yen + review time within existing personnel costs.
It does not even reach 50,000 Yen. However, the presence or absence of this mechanism makes a significant difference in the risks of misreporting to clients, overlooking recurring failures, and damaging credibility.
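The “Log Reconciliation Script” row in the table can be quite small. The sketch below is one possible shape, not a finished tool: it assumes timestamps look like `2024-05-01 10:15:02` (or the same with a `T` separator), and it assumes the reviewer copies the process names and error codes out of the report into a list. The function name `unsupported_claims` and the sample values in the usage note are hypothetical.

```python
import re

# Assumed timestamp shape; adjust the pattern to your own log format.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def _normalize(timestamp: str) -> str:
    """Treat '2024-05-01T10:15:02' and '2024-05-01 10:15:02' as equal."""
    return timestamp.replace("T", " ")


def unsupported_claims(report: str, terms: list[str], log_text: str) -> list[str]:
    """Return report timestamps and terms that never appear in the logs."""
    log_timestamps = {_normalize(t) for t in TIMESTAMP_RE.findall(log_text)}
    missing = []

    # Every timestamp quoted in the report should exist in the logs.
    for timestamp in TIMESTAMP_RE.findall(report):
        if _normalize(timestamp) not in log_timestamps:
            missing.append(timestamp)

    # Process names and error codes: plain case-insensitive substring match.
    # This is deliberately crude; a short term such as "app" would also
    # match "application", so treat a hit as a hint, not as proof.
    log_lower = log_text.lower()
    for term in terms:
        if term.lower() not in log_lower:
            missing.append(term)
    return missing
```

Called with the report text, the terms copied from it (for example `["nginx", "cache-worker", "E2002"]`), and the raw log text, the function returns whatever the logs do not support. An empty list does not prove the report is correct; it only means that none of the checked items is obviously invented.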
Why This Issue is Critical for Small and Medium-Sized Enterprises
In large companies, there are SRE teams, established post-mortem processes, and systems for reviewing reports. Even if AI makes some mistakes, someone somewhere will notice.
Small and medium-sized enterprises are different. It is not uncommon for the person writing the report to also be the one reviewing and submitting it. Because there are fewer eyes for checking, AI’s “quiet fabrication” can go through unnoticed.
Conversely, if a verification mechanism is in place, even a small team can produce high-quality reports in volume. What large companies achieve with many people can be achieved with AI plus a system. This is where the real value of AI for small and medium-sized enterprises lies.
Three Actions to Take
1. Increase the Input Given to AI
Not just the overview of the failure, but also the actual logs, screenshots from monitoring tools, and timelines. The quality of the input determines the quality of the output.
2. Systematize Fact-Checking of Outputs
Create a checklist and reconcile the report content with real data. Make sure the process does not depend on one individual, so that anyone can achieve the same results.
3. Review with the Premise that ‘AI Wrote It’
The points to focus on when reviewing a report written by a human and one written by AI are different. AI does not make mistakes in grammar or structure. The errors are in the facts. Focus on checking the facts alone.
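The second and third actions can be combined into a checklist that is kept as data rather than in someone's head. The sketch below is illustrative only: `CheckItem`, `new_checklist`, `pending`, and `ready_to_submit` are hypothetical names, and the third question is an added example alongside the two checklist questions mentioned earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckItem:
    question: str
    passed: Optional[bool] = None  # None means "not checked yet"
    evidence: str = ""             # e.g. the log line that confirms the fact


def new_checklist() -> list[CheckItem]:
    """Fact-only questions; grammar and structure are deliberately absent."""
    return [
        CheckItem("Does this process name exist?"),
        CheckItem("Does this timestamp match the logs?"),
        CheckItem("Does each error code appear in the logs?"),  # added example
    ]


def pending(items: list[CheckItem]) -> list[str]:
    """Questions that nobody has answered yet."""
    return [item.question for item in items if item.passed is None]


def ready_to_submit(items: list[CheckItem]) -> bool:
    """Every item must pass AND point at concrete evidence."""
    return all(item.passed is True and bool(item.evidence) for item in items)
```

Requiring an `evidence` string for each item is a design choice: it makes a reviewer point at a concrete log line instead of ticking boxes from memory, and it lets a second person repeat the check with the same result.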
The Question of Whether This is Acceptable
The act of having AI write reports will not stop; rather, it will accelerate. The issue is that the unconscious trust in “what AI wrote is correct” may become established without verification.
The choice is whether or not to implement a verification system costing about 20,000 Yen a month. That choice will determine whether you lose the trust of clients six months from now or come to be known as “that company whose reports are accurate.”
What will emerge as the cost of AI decreases is a “world where anyone can write plausible reports.” At that time, the companies that can guarantee the accuracy of their reports will be chosen. This will be one of the few points of differentiation for small and medium-sized enterprises against large companies.
First, here is what to do today. Take out a recent report written by AI in your company and reconcile the facts listed in it with the logs. You will probably find at least one mistake. That will be the starting point.