The Notion That ‘Feeding Internal Data Makes AI Smarter’ Is a Myth—Research Findings on Increased Hallucinations from Fine-Tuning Shatter Common Sense About AI Investment in SMEs


By Kai

Conclusion First

“The more you train AI with specialized data, the more lies it generates.”

“Feeding internal data to AI makes it smarter.”

This common belief should be discarded.

Recent research has shown that fine-tuning LLMs with scientific data actually increases hallucinations. The more specialized knowledge you provide, the more confidently the AI will tell lies.

This isn’t just an academic issue. If small and medium-sized enterprises (SMEs) are considering spending hundreds of thousands of yen to fine-tune AI with their own data, they should take a moment to read this.

Fine-Tuning Leads to Decreased Reliability Across All Metrics

The findings of the paper “Finetuning with Scientific Data Increases Hallucinations” are simple yet shocking.

The research team used 2,500 prompts to evaluate the factual accuracy of 18 LLMs that had been fine-tuned on scientific data. The results were as follows:

  • Factual reliability fell for every type of hallucination measured.
  • The model’s internal confidence decreased, yet its output became more assertive.
  • In other words, the AI ends up asserting falsehoods confidently even when it is unsure of the facts.

Consider what this means in practice.

For example, an AI fine-tuned with quality control data from the manufacturing industry might assert, “The heat resistance of this part is 350°C.” In reality, it is 280°C. However, because the AI speaks with such confidence, the people on the ground do not doubt it.
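One way to catch this kind of error before it reaches the shop floor is to spot-check model answers against values you have already verified. The following is a minimal sketch, not a complete evaluation method: the `VERIFIED_SPECS` table, the question key, and the function names are all hypothetical, and the model answer is hard-coded rather than fetched from a real model.

```python
import re
from typing import Optional

# Hypothetical table of values you have verified yourself (assumption, not from any study).
VERIFIED_SPECS = {
    "heat resistance of part A": 280.0,  # degrees Celsius, matching the example above
}

def extract_number(answer: str) -> Optional[float]:
    """Pull the first number out of a model answer, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

def check_answer(question: str, answer: str, tolerance: float = 0.0) -> bool:
    """Return True only if the number in the answer matches the verified value."""
    expected = VERIFIED_SPECS[question]
    value = extract_number(answer)
    return value is not None and abs(value - expected) <= tolerance

# Hard-coded here; in practice this string would come from the fine-tuned model.
model_answer = "The heat resistance of this part is 350°C."
print(check_answer("heat resistance of part A", model_answer))  # False
```

The check ignores how confident the answer sounds and looks only at whether the value is right, which is the point: assertive wording is not evidence of accuracy.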

The most dangerous belief of all is that specialized data automatically makes the AI reliable.

The cost of fine-tuning can reach hundreds of thousands of yen if outsourced to external vendors. Even if done in-house, GPU computing costs and data preparation can easily run into tens of thousands of yen. You may be paying to build a machine that lies.

Why Does This Happen?—LLMs’ “Memory” Is More Capricious Than You Think

Another study, “Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process,” sheds light on the structure of this problem.

The study examined how LLMs retrieve facts and found the following:

  • The memory of facts is distributed throughout the model.
  • There are multiple redundant computational pathways for retrieving the same fact.
  • These pathways are not contiguous; they are processed in a fragmented manner.

In short, the “knowledge” of an LLM is nothing like the neatly organized database that people tend to imagine. It is more like searching a cluttered warehouse by a different route each time.

What happens when fine-tuning shoves new data into this structure? It is akin to throwing more items into the cluttered warehouse, which disrupts consistency with existing knowledge. As a result, the model produces more outputs that sound plausible but are factually incorrect.

The risk is not that “adding data makes it smarter,” but rather that “adding data can break existing knowledge.”

Diving into fine-tuning without understanding this structure is like heading into the mountains without a map.

So, Should We Just Use Knowledge Graphs?—They’re Not a Panacea Either

A common response is that if fine-tuning does not work, you can use an external knowledge graph instead. This means having the AI consult a structured database when it answers.

However, the research “Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge” indicates that this approach may not yield the expected results.

  • Knowledge graphs are effective only for new information that the model has not learned.
  • For facts that the model already knows, they are either ineffective or become noise.
  • In the medical field, there have been cases where general LLMs outperformed specialized retrieval tools.

In other words, knowledge graphs are not a guaranteed way to improve accuracy. The correct approach is to assess what the model knows and does not know, and only supplement the unknown areas with external data.
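Assessing what the model already knows can start with a handful of probe questions asked without any attached context. The sketch below is only an illustration under stated assumptions: `find_knowledge_gaps`, `fake_model`, and the probe questions are hypothetical, the model is a canned stand-in rather than a real API call, and matching on a substring is a deliberately crude test of correctness.

```python
from typing import Callable, Dict, List

def find_knowledge_gaps(
    probes: Dict[str, str],
    ask_model: Callable[[str], str],
) -> List[str]:
    """Return the probe questions whose closed-book answer lacks the verified fact."""
    gaps = []
    for question, verified_fact in probes.items():
        answer = ask_model(question)
        # Crude check (assumption): the verified fact must appear verbatim in the answer.
        if verified_fact.lower() not in answer.lower():
            gaps.append(question)
    return gaps

# Stand-in for a real model: it "knows" one general fact and nothing company-specific.
def fake_model(question: str) -> str:
    canned = {
        "What is the boiling point of water at sea level?": "About 100 degrees Celsius.",
    }
    return canned.get(question, "I am not sure.")

probes = {
    "What is the boiling point of water at sea level?": "100",
    "What is the heat resistance of our part A?": "280",
}
print(find_knowledge_gaps(probes, fake_model))
# ['What is the heat resistance of our part A?']
```

Only the questions returned as gaps would be candidates for external data; the rest can be left to the model as is.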

This holds significant implications for SMEs.

I’m Not Saying SMEs Should Stop Fine-Tuning—I’m Saying They Should Change Their Design

Let’s summarize the research findings so far.

| Common Practice | What Actually Happens |
| --- | --- |
| Fine-tuning with internal data | Increases hallucinations and leads to confident lies |
| Connecting knowledge graphs without thought | Noise for known information, effective only for unknown information |
| Pouring in large amounts of specialized data | Disrupts consistency with existing knowledge |

So, what should SMEs do?

The answer is to switch to a design centered on RAG (Retrieval-Augmented Generation) instead of fine-tuning.

With RAG, the core model remains untouched. When a question arrives, the system retrieves relevant passages from internal documents or databases and the model generates its answer from them. Because this does not rewrite the model’s “memory,” the risk of hallucinations is structurally lower.
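The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production design: word-count cosine similarity stands in for a real embedding model and vector database, the function names and sample documents are hypothetical, and the final call to an LLM is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, documents: List[str]) -> str:
    """Put the retrieved text in front of the question; the model itself is unchanged."""
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Hypothetical internal documents.
docs = [
    "Part A has a heat resistance of 280 degrees Celsius.",
    "The price list is revised every April.",
]
prompt = build_prompt("What is the heat resistance of part A?", docs)
print(prompt)
```

The verified figure travels with the question inside the prompt, so the model does not have to recall it from its own weights.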

Cost-wise, RAG is overwhelmingly advantageous.

  • Fine-tuning: GPU costs + data preparation + validation can reach 1 to 5 million yen.
  • RAG construction: Building a vector database + search pipeline can cost 100,000 to 500,000 yen.

At roughly one-tenth the cost on a like-for-like basis, and with a lower risk of hallucinations, it is clear which option is more rational for SMEs.
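The ratio depends on which ends of the two ranges you compare, so it is worth doing the arithmetic with your own quotes. A small sketch using only the figures listed above (the function name is hypothetical):

```python
# Cost ranges quoted above, in yen (low end, high end).
FINE_TUNING_YEN = (1_000_000, 5_000_000)
RAG_YEN = (100_000, 500_000)

def cost_ratio_range(rag, fine_tuning):
    """Return (best case, worst case) ratio of RAG cost to fine-tuning cost."""
    best = rag[0] / fine_tuning[1]   # cheapest RAG vs. most expensive fine-tuning
    worst = rag[1] / fine_tuning[0]  # most expensive RAG vs. cheapest fine-tuning
    return best, worst

best, worst = cost_ratio_range(RAG_YEN, FINE_TUNING_YEN)
like_for_like = RAG_YEN[0] / FINE_TUNING_YEN[0]  # low end vs. low end
print(f"best {best:.2f}, like-for-like {like_for_like:.2f}, worst {worst:.2f}")
# best 0.02, like-for-like 0.10, worst 0.50
```

Even in the least favorable pairing of these ranges, RAG comes out at half the cost of fine-tuning.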

Furthermore, with RAG, data updates take effect as soon as the documents are replaced. Fine-tuning requires retraining, whereas RAG only requires swapping documents. When internal manuals change or price lists are updated, RAG can keep up with such routine changes in near real time.
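To see why a document swap is enough, consider a toy store in which each document has an ID and a new version simply overwrites the old one. This is a sketch under assumptions: `DocumentStore` is a hypothetical in-memory class with keyword search, the prices are made up for illustration, and a real system would re-index the replaced document in its vector database.

```python
from typing import Dict, Optional

class DocumentStore:
    """Toy in-memory store; a real system would re-index into a vector database."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a document, or replace the existing one with the same ID."""
        self._docs[doc_id] = text

    def search(self, keyword: str) -> Optional[str]:
        """Return the first document containing the keyword, or None."""
        for text in self._docs.values():
            if keyword.lower() in text.lower():
                return text
        return None

store = DocumentStore()
store.upsert("price-list", "Part A price: 1,200 yen")  # made-up price
store.upsert("price-list", "Part A price: 1,350 yen")  # revised list replaces the old one
print(store.search("part a price"))  # Part A price: 1,350 yen
```

No retraining step appears anywhere in this flow; the next question simply retrieves the current version.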

Distinguishing Between “What Is Known” and “What Is Unknown” Is the Essence of AI Design

The research findings ultimately point to one key takeaway:

Design for “what to provide from the outside” rather than “what to make the AI remember.”

General knowledge that the LLM already knows can be used as is. Company-specific information—product specifications, customer interaction histories, internal rules—should be provided externally via RAG. Knowledge graphs should only be used for areas that the model clearly does not know.

Whether you can make this “distinction” will significantly affect the outcomes of AI implementation.

The good news for SMEs is that this design judgment does not require GPUs or large datasets. What is needed is an understanding of the business and the ability to tell “what the AI knows” apart from “what only the company possesses.” SMEs, being closer to day-to-day operations, are likely to be better at this than large corporations.

What to Do Tomorrow

  1. If you are getting estimates for fine-tuning, pause for a moment. Consider whether the same goals can be achieved with RAG.
  2. Conduct an inventory of internal data. Identify “information that the AI does not know and is unique to us.”
  3. Start small. Attach internal documents to ChatGPT or Claude and ask questions. This will help you grasp the concept of “RAG-like usage.”

Before spending hundreds of thousands of yen on fine-tuning, first try RAG for 50,000 yen. If that yields sufficient results, the remaining hundreds of thousands can be allocated to other investments.

The era of “feeding data to make it smarter” is over. The era of “designing how to provide data” is upon us.
