One Million Users Supported for 30,000 Yen a Month, But Half of the ‘Success’ Was a Lie. Redrawing the Boundaries of Usable AI Customer Support in Terms of Cost


By Kai


Conclusion First

The implementation cost of AI customer support has dropped dramatically. Today, for around 30,000 yen a month, you can handle inquiries from a user base of one million, at least on paper.

However, there is a pitfall here.

Of the tasks that AI agents report as “completed,” 45-75% were not actually completed. This is the so-called “false success” problem. In other words, the lower costs come with the risk of accumulating “invisible damage.”

If it runs for 30,000 yen a month, there seems to be no reason not to adopt it. But is that really the case? In this article, we will redraw the “usable boundaries” of AI support from the perspectives of both implementation costs and damage costs.

Breakdown of 30,000 Yen—What Has Reduced Costs So Much?

First, let’s clarify the basis for the “30,000 yen a month” figure.

Currently, GPT-4-class APIs cost from a few dollars to around a dozen dollars per million input tokens. Assuming an average of 2,000 tokens per customer support inquiry (input and output combined), processing 100,000 inquiries a month requires about 200 million tokens. Depending on the model and pricing, the API cost comes to roughly 20,000 to 50,000 yen.
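The arithmetic above can be checked with a minimal sketch. The function name, the blended per-token prices, and the exchange rate of 150 yen per dollar are all assumptions for illustration, not figures from any provider's price list.

```python
def monthly_api_cost_yen(inquiries, tokens_per_inquiry,
                         usd_per_million_tokens, yen_per_usd=150.0):
    """Estimate the monthly API bill in yen for a given blended token price."""
    total_tokens = inquiries * tokens_per_inquiry
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    return cost_usd * yen_per_usd


if __name__ == "__main__":
    # 100,000 inquiries x 2,000 tokens = 200 million tokens a month.
    for price in (0.5, 1.0, 1.5, 3.0):  # hypothetical blended USD prices
        yen = monthly_api_cost_yen(100_000, 2_000, price)
        print(f"${price:.2f} per 1M tokens -> {yen:,.0f} yen/month")
```

Under the assumed exchange rate, the 20,000 to 50,000 yen range corresponds to a blended price of roughly $0.67 to $1.67 per million tokens, so the estimate is sensitive to which model tier handles most of the traffic.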

In addition, there are costs for setting up a vector database for RAG (Retrieval-Augmented Generation) and organizing FAQ data, but free to low-cost plans from providers like Pinecone and Supabase are sufficient to get started. Including infrastructure, the total comes to around 30,000 to 50,000 yen. This is the reality behind “supporting one million users for 30,000 yen a month”.

Three years ago, outsourcing a call center of the same scale would have cost at least 3 million yen a month. At 30,000 yen, AI support comes in at 1/100th of that. Looking at this number alone, it seems revolutionary for small and medium-sized enterprises.

However, what happens after costs decrease? If we don’t consider this, we might face painful consequences.

The Nubank Case—A Structure Proven with 100 Million Users

Brazil’s digital bank Nubank is gaining attention as a case where AI agents were fully implemented for customer support with over 100 million users.

The key point is that they did not simply plug in an LLM.

  • Context Engineering: They built a system to structure user transaction histories, past inquiries, and card delivery statuses and inject them into prompts.
  • Human Intervention Loop: In cases where the AI is uncertain, it escalates to a human. The results are then fed back to improve prompts.
  • Large-scale A/B Testing: By comparing AI responses with traditional responses for inquiries related to card deliveries, they achieved a 37-point improvement in Net Promoter Score (NPS).

A 37-point improvement in NPS is extraordinary. In the industry, even a 5-point improvement is typically considered a significant achievement. This result marks a moment when AI surpassed human agents.

However, Nubank has a dedicated ML team to build this system and has made considerable investments in evaluation infrastructure and monitoring. It is easy to think that only large companies can do this, but what we should focus on is the structure.

“Organizing and passing context,” “escalating to humans when uncertain,” and “measuring results to refine prompts”—these three loops can be replicated by small and medium-sized enterprises within the 30,000 yen range. What is needed is not a large ML team but the design capability to structure your own business knowledge.
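As a rough illustration of the "escalate when uncertain, then measure" part of this structure, here is a minimal sketch. The `Draft` and `SupportLoop` names, the confidence score, and the 0.8 threshold are hypothetical; how a confidence score is obtained is left to your own evaluator and is not something the Nubank case specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    inquiry: str
    answer: str
    confidence: float  # assumed 0.0-1.0 score from your own evaluator


@dataclass
class SupportLoop:
    threshold: float = 0.8  # hypothetical cut-off, tune on real data
    handled: int = 0
    escalated: list = field(default_factory=list)

    def handle(self, draft: Draft) -> str:
        """Send low-confidence drafts to a human and keep them for review."""
        self.handled += 1
        if draft.confidence < self.threshold:
            self.escalated.append(draft)  # later used to refine prompts
            return "human"
        return "ai"

    def escalation_rate(self) -> float:
        """Share of inquiries that needed a human, a simple metric to track."""
        return len(self.escalated) / self.handled if self.handled else 0.0
```

The escalated drafts double as a backlog of cases where the prompt or the injected context was probably insufficient, which is what closes the feedback loop.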

“Memory Error Rate of 95%”—Is AI Memory Usable?

Now we delve into the darker side of the topic.

Recent research has reported that when AI agents are given a “memory” function, meaning they can save and retrieve past conversations and learned content, their retrieval accuracy is remarkably low.

Specifically, it has been confirmed that AI agents using memory management tools can have a retrieval error rate of up to 95%. The heuristic scoring system used to select “important memories” often retrieves irrelevant information or misses crucial details.

What this means is that the knowledge you thought you had “taught” the AI may not be used at all during actual interactions.

You may have fed it FAQs, made it read manuals, and included past interaction histories, and you may feel reassured as a result. But if the retrieval accuracy is low, the AI will ignore the “information it should know” and generate plausible but incorrect responses.

The False Success Problem—Half of the “Completed” Tasks Are Lies

Closely related to memory errors is the “false success” problem.

AI agents execute tasks and return a status of “completed.” However, upon verification, the tasks are either incomplete or processed incorrectly. Research indicates that this false success rate can reach 45-75%.

Translating this to customer support:

  • Customer: “I have been double charged for last month’s bill.”
  • AI: “I have confirmed this. The double charge has been refunded.” (← In reality, it has not been processed.)
  • Customer: The double charge continues the following month → Complaint → Loss of trust.

Suppose you handle 100,000 inquiries for 30,000 yen a month, but 50,000 of those turn out to be false successes.

Even if we estimate the re-handling cost per case (human verification, correction, apology) at 500 yen, that results in a hidden cost of 25 million yen a month. The 30,000 yen API cost pales in comparison.
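The same estimate can be expressed as a small sketch, which also makes it easy to try the 45% and 75% ends of the reported range. The function name is hypothetical; the inputs are the article's own figures.

```python
def hidden_cost_yen(inquiries, false_success_rate, rehandling_cost_yen):
    """Monthly re-handling cost caused by false successes."""
    false_cases = round(inquiries * false_success_rate)
    return false_cases * rehandling_cost_yen


if __name__ == "__main__":
    for rate in (0.45, 0.50, 0.75):
        cost = hidden_cost_yen(100_000, rate, 500)
        print(f"false success rate {rate:.0%}: {cost:,} yen/month")
```

At a 50% rate this gives 25,000,000 yen, matching the figure above; the 45% and 75% cases give 22,500,000 and 37,500,000 yen.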

So, Where Should We Draw the “Usable Boundaries”?

If you’ve read this far and thought, “AI is not usable after all,” please hold on a moment.

The issue is not whether to use AI, but how to use it effectively.

Boundary 1: Divide by Certainty of Responses

For tasks with definitive correct answers such as standard FAQ responses, business hours information, and status checks, it is safe to leave them to AI. False successes are less likely to occur. Conversely, for actionable tasks like refunds or contract changes, even if AI makes decisions, the execution should involve human verification.
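One way to make this boundary explicit is a routing table. The sketch below is an assumption about how such a rule could look; the intent labels and the `Route` names are hypothetical and would come from your own intent classifier.

```python
from enum import Enum


class Route(Enum):
    AI_ANSWERS = "ai_answers"
    AI_DRAFTS_HUMAN_APPROVES = "ai_drafts_human_approves"
    HUMAN_ONLY = "human_only"


# Hypothetical intent labels; replace with your own categories.
INFORMATIONAL = {"faq", "business_hours", "order_status"}
ACTIONABLE = {"refund", "contract_change", "personal_info_update"}


def route(intent: str) -> Route:
    """Map an inquiry intent to who is allowed to close it."""
    if intent in INFORMATIONAL:
        return Route.AI_ANSWERS
    if intent in ACTIONABLE:
        return Route.AI_DRAFTS_HUMAN_APPROVES
    # Unknown intents default to the safe side.
    return Route.HUMAN_ONLY
```

Defaulting unknown intents to a human keeps new inquiry types from silently landing in the fully automated path.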

Boundary 2: Implement a False Success Detector

Research has shown that placing a lightweight TF-IDF-based detector downstream can recover false successes with 4-8 times the accuracy. The cost is nearly zero. In other words, simply creating a “dual structure” where AI outputs are checked by another lightweight model can significantly compress damage costs.
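The text does not describe the detector's internals, so the following is only one possible shape: TF-IDF vectors combined with a nearest-centroid classifier, trained on past agent traces that a human has labeled. The class name, labels, and the four toy traces are hypothetical; a real detector would need a much larger labeled history.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TfidfDetector:
    def fit(self, traces, labels):
        docs = [tokenize(t) for t in traces]
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        # Smoothed IDF: terms present in every trace still get weight 1.0.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.centroids = {}
        for label in set(labels):
            members = [d for d, l in zip(docs, labels) if l == label]
            total = Counter()
            for doc in members:
                for term, weight in self._vectorize(doc).items():
                    total[term] += weight
            # Centroid = mean of the L2-normalized TF-IDF vectors of the class.
            self.centroids[label] = {t: w / len(members) for t, w in total.items()}
        return self

    def _vectorize(self, tokens):
        counts = Counter(tokens)
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, trace):
        """Return the label whose centroid is closest by cosine similarity."""
        vec = self._vectorize(tokenize(trace))
        return max(self.centroids, key=lambda lb: cosine(vec, self.centroids[lb]))


def build_demo_detector():
    # Hypothetical traces: the agent's claim plus what the tool log recorded.
    traces = [
        "agent said refund issued tool refund_api returned 200 ok",
        "agent said address updated tool address_api returned 200 ok",
        "agent said refund issued tool refund_api returned 500 timeout",
        "agent said address updated no tool call recorded",
    ]
    labels = ["success", "success", "false_success", "false_success"]
    return TfidfDetector().fit(traces, labels)
```

Traces flagged as `false_success` would go to the human verification queue rather than being closed automatically. Feeding the detector the tool log alongside the agent's reply is a design choice of this sketch: the reply alone usually sounds successful either way.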

Boundary 3: Avoid Relying on Memory and Build Context Each Time

A realistic solution to the 95% memory error rate issue is to “not let AI have memory.” Instead, retrieve information from the database in real-time for each inquiry, structure it, and inject it into prompts. This is the same approach as the context engineering Nubank implemented. It is more accurate to “hand over a cheat sheet” each time than to rely on memory.
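A minimal sketch of this "cheat sheet" approach, using an in-memory SQLite database. The `orders` table, its columns, the sample rows, and the prompt wording are all hypothetical stand-ins for your own systems.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id TEXT, order_id TEXT, "
        "status TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            ("c-001", "A-1", "shipped", "2024-05-02"),
            ("c-001", "A-2", "processing", "2024-05-10"),
        ],
    )
    return conn


def build_prompt(conn, customer_id, inquiry, max_orders=3):
    """Fetch fresh facts for this customer and inject them into the prompt."""
    rows = conn.execute(
        "SELECT order_id, status, updated_at FROM orders "
        "WHERE customer_id = ? ORDER BY updated_at DESC LIMIT ?",
        (customer_id, max_orders),
    ).fetchall()
    facts = [f"- order {oid}: {status} (updated {ts})" for oid, status, ts in rows]
    if not facts:
        facts = ["- no orders on record"]
    return "\n".join(
        [
            "Answer using only the facts below. If they are insufficient, say so.",
            "Facts:",
            *facts,
            f"Customer inquiry: {inquiry}",
        ]
    )


if __name__ == "__main__":
    print(build_prompt(make_demo_db(), "c-001", "Where is my order?"))
```

Because the facts are queried at request time, there is no stored memory to go stale or be mis-retrieved; the model only sees what the database says right now.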

Small and Medium-sized Enterprises Can Win with This Structure

Let’s break down the discussion for small and medium-sized enterprises.

Large companies can absorb the damages from false successes at a scale of 100 million users. Small and medium-sized enterprises cannot. Therefore, how you draw the boundaries is crucial.

On the other hand, small and medium-sized enterprises have advantages of their own.

  • Fewer inquiry patterns: Compared to thousands of patterns in large companies, small and medium-sized enterprises deal with dozens to hundreds of patterns. The ratio of tasks with definitive correct answers is higher.
  • Business knowledge resides in the CEO’s mind: The total amount of knowledge to be structured is smaller, making the design of context engineering quicker.
  • Speed of decision-making: They can redraw the boundaries of “this range is for AI, this is for humans” from the next day without bureaucratic approval.

With a 30,000 yen API cost and a nearly free TF-IDF filter for detecting false successes, humans only need to verify 20-30% of the total. With this structure, you can achieve a support system that would traditionally cost 500,000 yen a month for just 50,000 to 100,000 yen.

So, What Should We Do?

  1. First, let AI handle only standard responses. FAQ handling, business hours information, order status checks. Starting here will minimize the risk of false successes.
  2. Implement TF-IDF-based false success detection. Set up a system to check AI outputs with a lightweight model downstream. Implementation takes just a few hours.
  3. Do not use memory functions. Retrieve information from the database for each inquiry and inject it into prompts. The moment you rely on memory, the error rate skyrockets.
  4. Sample 10 AI responses weekly for human verification. This alone will reveal trends in false successes. You don’t need to wait for a monthly report.
  5. Insert human approval for actionable tasks. Refunds, contract changes, personal information corrections. The moment you delegate these tasks entirely to AI, damage costs will explode.
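The weekly spot check in step 4 can be sketched as follows. The function names are hypothetical, and the responses are assumed to be available as a list exported from your support tool.

```python
import random


def weekly_sample(responses, k=10, seed=None):
    """Pick up to k responses uniformly at random for human review."""
    rng = random.Random(seed)
    return rng.sample(responses, min(k, len(responses)))


def observed_false_success_rate(verdicts):
    """verdicts: list of booleans, True when the reviewer found a false success."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

A sample of 10 gives only a coarse estimate, so it is better read as a trend indicator across weeks than as a precise rate.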

AI customer support is not a binary choice of “usable or not.” The design of how far to delegate is everything. Before jumping at the magic of 30,000 yen, calculate the damage amount per false success. That number will tell you your company’s “usable boundaries.”
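To turn that last calculation into a number, here is a minimal break-even sketch. It assumes the only costs are the API bill and the re-handling of false successes, which is a simplification; the function name is hypothetical and the inputs are the article's own figures.

```python
def break_even_false_success_rate(baseline_cost_yen, ai_cost_yen,
                                  inquiries, damage_per_case_yen):
    """False success rate at which AI support stops being cheaper."""
    savings = baseline_cost_yen - ai_cost_yen
    rate = savings / (inquiries * damage_per_case_yen)
    return max(0.0, min(1.0, rate))


if __name__ == "__main__":
    rate = break_even_false_success_rate(
        baseline_cost_yen=3_000_000,  # outsourced call center
        ai_cost_yen=30_000,           # monthly API cost
        inquiries=100_000,
        damage_per_case_yen=500,
    )
    print(f"break-even false success rate: {rate:.2%}")
```

With these figures the break-even point is about 5.9%: above that rate, the re-handling cost cancels out the savings over the 3 million yen baseline.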
