LLM Inference Costs Cut by 70%, API Bills Cut by 95%: ‘AI is Expensive’ is Over. Five Things Companies Still Paying 100,000 Yen a Month Should Do This Week
Conclusion First: “AI is Expensive” is Over
This week, three news items emerged simultaneously.
- An 81,000-person survey of Claude users revealed that the majority are using it for coding, writing, and analysis, with 80% of applications being “routine repetitive tasks.”
- A technology called MixLLM was announced, demonstrating that it can reduce LLM inference costs by up to 70%.
- There have been multiple case studies of semantic caching, reporting over 90% cache hit rates for similar queries and up to 95% cuts in API call costs.
When we look at these three together, a clear structure emerges.
We have entered an era where “companies throwing the same questions at full price every time” are at a significant disadvantage.
Small and medium-sized enterprises (SMEs) paying 100,000 yen a month in API costs could potentially reduce that to 10,000 to 30,000 yen simply by changing their “system.” The issue is not the technology; it’s whether they are willing to reconsider.
—
The 81,000-Person Survey Reveals “Usage Bias”
The 81,000-person survey of Claude users released by Anthropic highlights the breakdown of “what they are using it for.”
The top uses included software development (code generation and review), writing and editing, and data summarization and analysis. In other words, the majority of tasks follow repeatable patterns.
This is the crucial point. Work that follows a pattern means:
1. Past responses can be reused (caching is effective)
2. Not needing the highest-performing model (smaller models can suffice)
3. Prompts can be standardized instead of depending on individuals (the workflow can be systematized)
Despite this, many companies are still paying full costs with “just the top-tier plan for Claude” or “just GPT-4o.” For example, a Pro plan at 20,000 yen multiplied by five people results in a monthly cost of 100,000 yen. If billed by usage via API, some companies may see costs balloon to 200,000 to 300,000 yen depending on usage.
The question is not whether “using it adds value.” It is about “how much can we achieve the same results for?”
—
Inference Costs Reduced by 70%—How MixLLM Changes Cost Structure
MixLLM is a quantization technique that assigns different precisions (bit widths) to different parts of the model during LLM inference, according to how much each part contributes to output quality.
Traditional quantization reduces precision uniformly, leading to unavoidable quality degradation. MixLLM maintains almost the same level of accuracy while reducing memory usage and costs by up to 70% by mixing high precision for important outputs and low precision for others.
For SMEs, the implications are straightforward:
- Processes that previously required high-performance GPUs (like A100) can now run on affordable GPUs or low-cost cloud instances.
- The barrier to self-hosting drops, so companies are no longer locked into paying per-call API fees.
- Monthly cloud GPU costs could potentially drop from, say, 50,000 yen to 15,000 yen.
Of course, not all SMEs need to host models themselves. However, the important thing is that “options exist.” The era of being forced to rely solely on API usage is coming to an end.
—
Semantic Caching—Don’t Pay Twice for the Same Question
Another revolution is semantic caching.
Standard caching only hits on “exact matches.” However, semantic caching can detect semantically similar queries and reuse past responses.
For example:
- “Tell me the risks of this contract” and “Summarize the risk points of this contract” are nearly identical questions.
- Traditionally: Full API requests for both → Cost for two requests.
- After implementing caching: The second request is answered immediately from the cache → Zero cost, with latency drastically reduced.
In implementation cases, cache hit rates over 90% and API cost reductions of up to 95% have been reported. Response times can be shortened from several seconds to mere milliseconds.
Consider this in the context of SMEs:
- 80% of the questions asked in customer support are similar in nature.
- Queries thrown in internal knowledge searches are also limited in pattern.
- Daily report summaries, organizing meeting minutes, drafting emails—these are all repetitive tasks.
The more repetitive the tasks, the more dramatic the effects of caching will be. And this is precisely the daily operations of SMEs.
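To make “semantically similar” concrete, here is a minimal sketch of the matching step using the two contract questions above. A real system would compare vectors from an embedding model; the bag-of-words vector and the 0.6 threshold below are deliberately crude stand-ins so the example runs with the standard library alone.

```python
# Minimal sketch of semantic similarity matching.
# NOTE: the bag-of-words "embedding" and the 0.6 threshold are illustrative
# stand-ins for a real embedding model and a tuned cutoff.

def embed(text: str) -> dict:
    """Toy embedding: counts of lightly normalized words."""
    vec = {}
    for word in text.lower().split():
        word = word.rstrip("s")  # naive plural folding: "risks" -> "risk"
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

THRESHOLD = 0.6  # above this, treat the query as "the same question"

q1 = "Tell me the risks of this contract"
q2 = "Summarize the risk points of this contract"
q3 = "What is the weather in Tokyo tomorrow"

print(f"q1 vs q2: {cosine(embed(q1), embed(q2)):.2f}")  # similar -> cache hit
print(f"q1 vs q3: {cosine(embed(q1), embed(q3)):.2f}")  # unrelated -> miss
```

The paraphrased pair scores well above the threshold while the unrelated query falls far below it, which is the entire mechanism: one similarity check decides whether a cached answer can be reused.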
—
From 100,000 Yen to 20,000 Yen—Five Things to Do This Week
Now that the reasoning is clear, what should be done?
Here are five specific actions you can take this week.
1. Extract API Usage Logs and Measure “Rate of Similar Questions”
First, assess the current situation. Both OpenAI and Anthropic provide access to API usage logs. Analyze the requests from the past month to see “how many semantically similar queries there are.” Based on experience, 6 to 8 out of 10 business uses are similar queries. Just knowing this number will clarify the cost-effectiveness of implementing caching.
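The measurement itself can be a short script. A minimal sketch, assuming your exported log is a list of query strings: greedily cluster queries, counting anything that matches an earlier query as a “repeat” a cache could have served. The word-overlap similarity and 0.6 threshold stand in for a real embedding comparison, and the log lines are invented sample data.

```python
# Sketch: estimate the share of semantically similar queries in an API log.
# similarity() is a crude word-overlap proxy for embeddings; the sample
# log entries are invented for illustration.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def similar_rate(queries: list[str], threshold: float = 0.6) -> float:
    """Fraction of queries that match an earlier one (cacheable repeats)."""
    representatives: list[str] = []
    repeats = 0
    for q in queries:
        if any(similarity(q, rep) >= threshold for rep in representatives):
            repeats += 1
        else:
            representatives.append(q)
    return repeats / len(queries) if queries else 0.0

log = [
    "summarize today's daily report",
    "summarize today's daily report please",
    "draft a reply email to the client",
    "summarize today's daily report",
    "draft a reply email to the client about the delay",
]
print(f"similar-question rate: {similar_rate(log):.0%}")
```

If that number comes back at 60 to 80 percent, as it does for most routine business logs, the caching investment in step 2 is straightforward to justify.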
2. Implement Semantic Caching
There are open-source options like GPTCache and LangChain’s caching features. Combined with vector extensions for Redis or PostgreSQL, the software cost is essentially zero. If you have one engineer, they can set up a testing environment in 1 to 2 days. If you don’t have an engineer, outsourcing can be done for 100,000 to 200,000 yen. If the monthly API cost drops from 100,000 yen to 10,000 to 20,000 yen, you recoup the investment in 1 to 2 months.
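The core of any such implementation is a thin wrapper around the API call. A minimal sketch, assuming nothing beyond the standard library: `call_llm` is a hypothetical stand-in for a real paid API client (in production, GPTCache or LangChain would replace most of this class), and the word-overlap similarity and 0.7 threshold are illustrative assumptions.

```python
# Minimal sketch of a semantic cache wrapped around an LLM call.
# call_llm is a hypothetical stand-in for a paid API client; the similarity
# measure and 0.7 threshold are illustrative, not tuned values.

class SemanticCache:
    def __init__(self, llm_fn, threshold: float = 0.7):
        self.llm_fn = llm_fn
        self.threshold = threshold
        self.store: list[tuple[set, str]] = []  # (query words, cached answer)
        self.api_calls = 0

    def _similarity(self, a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def ask(self, query: str) -> str:
        words = set(query.lower().split())
        for cached_words, answer in self.store:
            if self._similarity(words, cached_words) >= self.threshold:
                return answer  # cache hit: no API cost, millisecond latency
        self.api_calls += 1          # cache miss: pay for one real call
        answer = self.llm_fn(query)
        self.store.append((words, answer))
        return answer

def call_llm(query: str) -> str:
    """Hypothetical paid API call."""
    return f"answer to: {query}"

cache = SemanticCache(call_llm)
cache.ask("summarize the risks of this contract")
cache.ask("summarize the risks of this contract briefly")  # served from cache
print(f"API calls: {cache.api_calls}")  # one paid call for two questions
```

Two near-identical questions trigger one paid call; everything else is the hit-rate arithmetic from the previous section.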
3. Reassess the Need for the “Top-Tier Model”
The input token cost for GPT-4o is about 20 times that of GPT-4o-mini. There is also a significant price difference between Claude 3.5 Sonnet and Haiku. The top-tier model is unnecessary for summarizing daily reports or drafting emails. By using different models for different tasks, costs can be reduced to less than half.
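Routing by task can start as a simple heuristic. A sketch under stated assumptions: the model names and yen-per-million-token prices below are illustrative placeholders, not current list prices, and the keyword rule is deliberately naive; a production router might classify tasks with a small model instead.

```python
# Sketch of task-based model routing. Model names and prices are assumed
# placeholders (yen per 1M input tokens), not real list prices.

PRICE_PER_M_TOKENS = {"large-model": 375.0, "small-model": 22.5}  # assumed

ROUTINE_KEYWORDS = ("summarize", "summary", "draft", "email", "minutes")

def route(task: str) -> str:
    """Send routine summarization and drafting to the cheap model."""
    if any(k in task.lower() for k in ROUTINE_KEYWORDS):
        return "small-model"
    return "large-model"

def cost_yen(task: str, input_tokens: int) -> float:
    return PRICE_PER_M_TOKENS[route(task)] * input_tokens / 1_000_000

task = "summarize today's daily report"
print(route(task))
print(f"routed: {cost_yen(task, 2000):.3f} yen; "
      f"top-tier: {PRICE_PER_M_TOKENS['large-model'] * 2000 / 1_000_000:.3f} yen")
```

Even with this crude rule, any task the cheap model handles acceptably stops paying the premium multiple on every token.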
4. Calculate the Break-even Request Count (“N_break”)
Calculate the implementation costs of caching, the labor involved in switching models, and the monthly costs of self-hosting. For each, determine “how many requests are needed to break even.” In many cases, if you have over 1,000 requests a month, implementing caching will pay off immediately. If you have around 100 requests a month, a single Pro plan account may be sufficient.
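The arithmetic fits in a few lines. A worked sketch using the figures above: a one-off implementation cost of 150,000 yen (midpoint of the outsourcing range) and a monthly bill dropping from 100,000 yen to 15,000 yen. The per-request cost and hit rate at the end are labeled assumptions, not measurements.

```python
# Worked break-even sketch. All figures are the article's illustrative
# numbers or explicitly assumed averages, not measured data.

def payback_months(impl_cost: float, monthly_saving: float) -> float:
    return impl_cost / monthly_saving

def break_even_requests(impl_cost: float, saving_per_request: float) -> float:
    """How many cached requests repay the one-off implementation cost."""
    return impl_cost / saving_per_request

impl_cost = 150_000                  # yen, one-off outsourced implementation
monthly_saving = 100_000 - 15_000    # yen per month after caching
print(f"payback: {payback_months(impl_cost, monthly_saving):.1f} months")

avg_cost_per_request = 20.0  # yen per API call, assumed average
cache_hit_rate = 0.85        # assumed, below the reported ~90% hit rates
n_break = break_even_requests(impl_cost, avg_cost_per_request * cache_hit_rate)
print(f"N_break: {n_break:,.0f} requests")
```

Plug in your own numbers from step 1; the decision usually stops being close once real request volumes are in the formula.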
5. Visualize “Who is Spending What on What”
This is the most important and the least done. By breaking down API costs by department and purpose, you can see facts like, “This department is using 30,000 yen worth of API just for meeting minute summaries.” If you can see it, you can fix it. If you can’t see it, it will continue to drain resources indefinitely.
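Visualization here can mean nothing fancier than an aggregation over tagged request logs. A minimal sketch, assuming each API request is tagged with a department and purpose; the log records below are invented sample data.

```python
# Sketch: break down API spend by department and purpose.
# The log entries are invented sample data; in practice the tags would be
# attached as metadata on each API request.
from collections import defaultdict

log = [
    {"dept": "sales",   "purpose": "email draft",     "cost_yen": 120},
    {"dept": "sales",   "purpose": "meeting minutes", "cost_yen": 340},
    {"dept": "support", "purpose": "reply draft",     "cost_yen": 560},
    {"dept": "sales",   "purpose": "meeting minutes", "cost_yen": 300},
]

totals = defaultdict(float)
for entry in log:
    totals[(entry["dept"], entry["purpose"])] += entry["cost_yen"]

# Largest spenders first: this is the line item you act on.
for (dept, purpose), yen in sorted(totals.items(), key=lambda x: -x[1]):
    print(f"{dept:8s} {purpose:16s} {yen:8.0f} yen")
```

A report like this is what turns “our API bill is 100,000 yen” into “minutes summarization in sales is the thing to cache first.”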
—
The True Meaning of “AI Becoming Cheaper”
Finally, I want to consider what lies beyond cost reduction.
What happens when the cost of AI decreases to one-tenth?
“AI that was only usable by large companies becomes accessible even to companies with five employees.”
Customer support that large corporations were running on 1 million yen a month of AI chatbot spend can be matched for 10,000 to 20,000 yen. Internal knowledge systems that cost 3 million yen to build could be constructed for 50,000 yen.
What lies beyond reduced costs is the disappearance of “scale advantages.”
Tasks that large companies were managing with teams of 100 can now be handled by SMEs with three people plus AI. This is not a dream; it is already happening. However, this applies only to companies that have realized the cost reduction and changed their systems accordingly.
The three news items released this week are eliminating the last excuse of “AI is expensive.”
The question is simple. Is your company still using this year’s AI at last year’s prices?
Start by opening this month’s API invoice.