RAG Accuracy Declines with More Documents — Structural Reasons Why Small Businesses Can Outperform Large Corporations in AI
Let’s get straight to the point. The era where more data makes AI smarter is over.
The more documents you feed into a RAG (Retrieval-Augmented Generation) system, the lower the accuracy of its responses tends to become.
This is not just an impression; published numbers back it up. In an experiment using data from the Wyoming Department of Transportation, increasing the number of documents fed into RAG from 54 to 1,128 caused accuracy to plummet from 75% to below 40%. The document count grew roughly twentyfold (1,128 / 54 ≈ 21), and accuracy was roughly halved.
In other words, the common belief that “larger companies with more data have an advantage” has been turned on its head in the world of RAG.
This presents a structural tailwind for small and medium-sized enterprises (SMEs).
Why Does Accuracy Decline with More Documents?
The reason lies in a phenomenon known as “vector search dilution.”
The mechanism of RAG is simple. It receives a user’s question, retrieves potentially relevant documents through vector search, and then generates a response based on those documents. In essence, it’s like “answering while looking at a cheat sheet.”
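The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the sample documents are invented, and the final call to an LLM is left out, so the sketch stops at building the prompt (the "cheat sheet").

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, documents, k=2):
    """Assemble the retrieved documents and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Answer using only the notes below.\n{context}\nQuestion: {question}"

# Invented sample documents, for illustration only.
docs = [
    "Estimates for custom brackets are valid for 30 days.",
    "The warehouse is closed on national holidays.",
    "Returns of standard parts are accepted within 14 days.",
]
prompt = build_prompt("How long are estimates valid?", docs, k=1)
print(prompt)
```

Whatever `retrieve` returns is all the model gets to see, which is why the quality of the retrieved set matters so much in what follows.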
The problem arises when there are too many cheat sheets.
When there are few documents, most of the retrieved documents are likely to be “hits.” However, as the number of documents increases, a large number of documents that are similar but subtly different, or related but not essential, start to appear. Since the LLM (Large Language Model) generates responses based on this noise-filled cheat sheet, accuracy naturally declines.
With 54 documents, it’s almost a one-to-one match: “this document answers this question.” But with 1,128 documents, it becomes “there are 30 documents that seem relevant to this question,” leaving the LLM unsure which to trust.
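A toy simulation can make the dilution effect concrete. The sketch below is not the Wyoming experiment: it uses random unit vectors as stand-ins for document embeddings, and the dimension, noise level, and query count are arbitrary assumptions. Each query is a noisy copy of one of the first 54 documents, and we check how often that intended document is still the top match when the same 54 sit inside a corpus of 1,128.

```python
import random

def random_unit_vector(rng, dim):
    """A random direction, standing in for a document embedding."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top1_hit_rate(corpus, targets, queries):
    """Fraction of queries whose most similar document is the intended one."""
    hits = 0
    for target, query in zip(targets, queries):
        best = max(range(len(corpus)), key=lambda i: dot(corpus[i], query))
        hits += (best == target)
    return hits / len(queries)

rng = random.Random(0)
DIM, NOISE = 8, 0.35  # assumed values, chosen only for illustration

corpus = [random_unit_vector(rng, DIM) for _ in range(1128)]
targets = [rng.randrange(54) for _ in range(200)]  # intended documents
queries = [[x + rng.gauss(0, NOISE) for x in corpus[t]] for t in targets]

small = top1_hit_rate(corpus[:54], targets, queries)  # 54-document corpus
large = top1_hit_rate(corpus, targets, queries)       # same docs plus 1,074 others
print(f"54 docs: {small:.2f}   1,128 docs: {large:.2f}")
```

Because the small corpus is a subset of the large one, the hit rate on the large corpus can never exceed the hit rate on the small one: every extra document is only another chance for a near-miss to outrank the right answer.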
This is the trap that large corporations often fall into. They assume that feeding RAG with manuals, meeting minutes, regulations, FAQs, and past reports from every department will make it smarter. In reality, the opposite happens. The more they feed it, the more noise accumulates, the vaguer the responses become, and frontline staff decide it is “useless” and stop using it.
The Structural Advantage of SMEs’ “Smallness”
Now, let’s get to the main point.
SMEs inherently have less data, and their operational scope is narrow. This becomes a tremendous strength.
For example, consider a manufacturing company with 30 employees. They handle a few dozen types of products, have a few dozen clients, and their operational manuals are at most a few dozen pages. What happens when this data is fed into RAG?
The number of documents is in the range of dozens to hundreds. Search dilution is almost nonexistent. Relevant documents are pinpointed in response to questions, so accuracy stays high.
Moreover, the quality of the data is uniform. In large corporations, terminology varies by department and formats are not standardized. That problem is far less likely in a company with 10 employees, where everyone refers to the same thing using the same terms. This “consistency of terminology” enhances the accuracy of vector searches.
Let’s put this into specific numbers.
When large corporations build RAG systems, the cost of data organization and cleansing can reach several million yen. This is because they need cross-departmental data integration, terminology standardization, and authority management.
On the other hand, SMEs can operate simply by organizing manuals and past inquiry histories directly related to their business and feeding them into RAG. By using cloud vector databases and APIs, they can run it for a monthly cost of several thousand to tens of thousands of yen. Initial setup can be done for a few hundred thousand yen, depending on the approach.
Several million yen for a large corporation’s RAG with 40% accuracy vs a few hundred thousand yen for an SME’s RAG with 75% accuracy.
This is the reality that is currently unfolding.
“Domain Specialization” Is Not a Strategy but a Natural State for SMEs
When large corporations try to improve RAG accuracy, they need to adopt a “domain specialization” approach. This means building RAG not across the entire company but segmented by department or business area. While this is a correct approach, the larger the organization, the higher the execution costs.
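As a rough sketch of what segmenting by business area means in practice, the example below keeps one small collection per domain and searches only the relevant one. The domain names, sample documents, and the word-overlap `search` function are all invented for illustration; a real system would use vector search and would need some way to pick the domain.

```python
import re

def tokens(text):
    """Lowercased word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(question, documents, k=1):
    """Rank documents by shared-word count (toy stand-in for vector search)."""
    q = tokens(question)
    return sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

# Hypothetical per-domain collections instead of one company-wide index.
indexes = {
    "estimates": ["Estimates are valid for 30 days from issue."],
    "hr": ["Paid leave requests are valid once approved by a manager."],
}

def domain_search(domain, question, k=1):
    """Search only the documents belonging to one business area."""
    return search(question, indexes[domain], k)

question = "How long is a quote valid?"
all_docs = [d for docs in indexes.values() for d in docs]

company_wide = search(question, all_docs)            # picks up the HR document
scoped = domain_search("estimates", question)        # stays inside estimates
print(company_wide[0])
print(scoped[0])
```

In this toy case the company-wide search is pulled toward an HR document that merely shares more words with the question, while the scoped search has no such distractor to choose from.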
SMEs are different. Since their operational domains are already narrowed down, they naturally end up with a “domain-specialized RAG” without even trying. It’s not a strategy they choose; it simply happens.
This difference is significant.
Large corporations, complaining that “RAG accuracy is lacking,” spend hundreds of thousands on consultants, redesign their data structures, and take six months to rebuild. Meanwhile, SMEs can have a functioning RAG by simply saying, “Let’s try feeding in the documents we have for now.”
Another Pitfall — AI’s “False Confidence”
The accuracy issue is not limited to RAG. There is a troublesome characteristic of LLM agents known as “false success.”
This refers to the phenomenon where AI confidently reports that it has “completed a task” when, in fact, it has failed. Research indicates that in certain benchmark environments, a significant share of the cases the AI itself judged “successful” had actually produced incorrect results.
This problem becomes more pronounced as the data grows larger and more complex. When AI agents run inside the complex workflows of large corporations, the AI confidently returns “plausible answers” that are clearly incorrect to the people doing the work. This situation occurs frequently.
In the case of SMEs, the limited complexity of their operations allows human workers to quickly determine whether the AI’s responses are correct. The scale is such that the CEO themselves can notice, “This is wrong.” The speed of this feedback accelerates the cycle of improving RAG accuracy.
In large corporations, even when the AI’s responses are incorrect, they pass through multiple layers of approval before anyone notices. In SMEs, the user can immediately say, “That’s wrong,” and correct the data within the same day.
So, What Should Be Done?
There are three key points for SMEs to achieve results with RAG.
1. Narrow down the documents to be included. “Including everything” is the worst approach.
Only select documents that are directly related to the business. Do not include documents that might be useful someday. Having fewer documents is not a weakness but the greatest weapon for maintaining accuracy.
2. Start with one business area. Company-wide deployment can come later.
Focus on one specific area, such as “creating estimates,” “responding to customer inquiries,” or “verifying product specifications,” and start small. Once accuracy is confirmed in that area, expand to the next business.
3. Create a system to immediately reflect the “wrong” feedback from the field.
When RAG returns an incorrect answer, implement a system where field workers can provide feedback with one click. If this cycle is in place, the accuracy of RAG will improve the more it is used. This is a speed that large corporations cannot replicate.
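The one-click feedback idea in point 3 can be sketched as a small log that records which documents were behind each answer marked “wrong.” Everything here is hypothetical: the `FeedbackLog` class, the document IDs, and the review threshold of two are illustrative choices, not part of any particular RAG product.

```python
from collections import defaultdict

class FeedbackLog:
    """Collects 'wrong' clicks and surfaces documents that keep causing them."""

    def __init__(self, review_threshold=2):
        # Assumed cutoff: flag a document once it is tied to this many wrong answers.
        self.review_threshold = review_threshold
        self.flags = defaultdict(list)  # doc id -> questions answered wrongly

    def mark_wrong(self, question, retrieved_doc_ids):
        """Called when a user clicks 'wrong' on an answer."""
        for doc_id in retrieved_doc_ids:
            self.flags[doc_id].append(question)

    def docs_to_review(self):
        """Documents linked to enough wrong answers to deserve a human look."""
        return sorted(d for d, qs in self.flags.items()
                      if len(qs) >= self.review_threshold)

log = FeedbackLog()
log.mark_wrong("What is the lead time for part A?", ["manual-03", "faq-12"])
log.mark_wrong("How long does part A take to ship?", ["manual-03"])
print(log.docs_to_review())  # ['manual-03']
```

In a small company, the person who reviews this list can fix or remove the flagged document the same day, which is the short feedback cycle the third point relies on.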
Smallness is No Longer a Weakness
For a long time in the world of AI, it has been said that “those with data win.” This has been true for training data, but a reverse structure has emerged in the operation of RAG.
Less data leads to higher accuracy. A narrow business scope prevents noise. A smaller organization enables faster improvements.
The “smallness” of SMEs has transformed into a structural advantage in AI utilization.
This is not a temporary trend. Due to the nature of RAG, the problem of vector search dilution will inevitably occur as the number of documents increases. Even as technology evolves, this structure will not easily change.
There is no need to mimic large corporations. Rather, SMEs should leverage their ability to create “small, fast, and sharp RAGs” that large corporations cannot achieve.
54 documents, 75% accuracy. The question is whether you can turn this number into a weapon for your company. The answer lies in trying it out first.