What Happens When Token Costs Drop from 500,000 to 50,000 Yen a Month? — Three Technologies Small Businesses Should Know: KV Cache Compression, Proxy, and Cost Tracking
Conclusion First
The operational costs of LLMs are on the verge of a significant change.
“AI is expensive” — this has been the common understanding among small and medium-sized enterprises (SMEs). Running a model like GPT-4 for business purposes can easily rack up token costs of several hundred thousand yen a month. For a company with ten employees, that’s equivalent to hiring a new staff member.
However, in 2025, technologies that could overturn this understanding have emerged in rapid succession: the KV cache compression technology “TriAttention” developed by MIT and NVIDIA, agent proxies that cut wasted token usage, and the runtime “Ark,” which visualizes costs at the level of individual decisions. Combined, these technologies make it plausible that a monthly token bill that was once 500,000 yen could drop below 50,000 yen.
When costs are reduced to one-tenth, what happens? The discussion shifts from “Should we use AI?” to “How can we maximize its use?” This presents a significant opportunity for local SMEs to close the gap with larger corporations.
—
KV Cache Compression “TriAttention” — A Game Changer That Reduces Memory Consumption by One-Tenth
First, let’s focus on TriAttention. This technology was announced through joint research by MIT, NVIDIA, and Zhejiang University.
When LLMs handle long texts, they retain past context in a memory area called the “KV cache.” The problem is that this cache grows in proportion to context length. For instance, inference over 32K tokens (approximately 20,000 characters in Japanese) can quickly exhaust GPU memory. When memory runs short, you either need a more expensive GPU or must split the processing, which lengthens processing time. Either way, costs rise.
TriAttention addresses this issue. Here are some specific figures:
- Generation throughput: 2.5 times (processing 2.5 times more in the same amount of time)
- KV memory usage: reduced to 1/10.7
- Accuracy: on par with full attention
Translating this into cost implications:
For example, if you are running a self-hosted model like Llama 3, workloads that previously required an A100 (80GB) could potentially be run on an RTX 4090 (24GB). In cloud terms, the hourly rate for an A100 instance is about 400-500 yen/hour, while an instance equivalent to an RTX 4090 costs 100-150 yen/hour. Infrastructure costs would drop to less than one-third.
Moreover, a 2.5× increase in throughput means the same hardware can handle 2.5 times as many requests in the same time frame. Combining the cheaper GPU (roughly one-third the hourly cost) with the 2.5× throughput gain, the cost per request falls to roughly 1/7 to 1/10 of what it was.
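The arithmetic behind that range can be checked in a couple of lines, using the article's own figures as inputs:

```python
# Rough arithmetic behind the per-request cost claim.
gpu_cost_ratio = 1 / 3      # RTX 4090-class instance vs. A100, hourly cost
throughput_gain = 2.5       # TriAttention: 2.5x requests in the same time

per_request_ratio = gpu_cost_ratio / throughput_gain
print(per_request_ratio)    # about 0.13, i.e. roughly 1/7.5
```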
Even if you use an API-based service, this technology will be reflected in pricing once providers adopt it. In fact, OpenAI’s and Anthropic’s API prices have fallen by more than half over the past year, driven by exactly this kind of infrastructure advance.
—
Agent Proxy — Stopping “Unnecessary Tokens” Before They Accumulate
Next, we should pay attention to the mechanism of AI agent proxies.
The primary reason LLM costs run high is wasteful token consumption. For instance, agents may repeatedly send the same information to the API, cram unnecessary past logs into the context window, or resend the full prompt on every retry. This invisible waste inflates the monthly bill.
An agent proxy is middleware that sits between the application and the LLM API. It performs three specific functions:
1. Cache Hit: Detects identical or similar requests and reuses past responses. If the same question is asked 100 times, only the first triggers an API call.
2. Context Compression: Automatically summarizes and reduces redundant information in prompts, decreasing the number of input tokens.
3. Model Routing: Automatically directs requests to either GPT-4o or GPT-4o-mini based on the difficulty of the request.
The effect of the third function, model routing, is particularly significant. In practice, 70-80% of requests can be adequately handled by models like GPT-4o-mini or Claude Haiku. The input token cost for GPT-4o is $2.50/1M tokens, while for GPT-4o-mini, it is $0.15/1M tokens. This results in a price difference of about 17 times. Simply automating this routing can halve the token costs.
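The cache and routing functions can be sketched as a thin middleware layer. The snippet below is a minimal illustration, not any particular product’s API: it uses an exact-match cache and a naive length/keyword routing heuristic (the heuristic and its thresholds are assumptions made for the example).

```python
import hashlib

class ProxySketch:
    """Minimal agent-proxy sketch: exact-match response cache + naive routing."""

    def __init__(self):
        self.cache = {}  # prompt hash -> cached response

    def route(self, prompt: str) -> str:
        # Toy difficulty heuristic (an assumption for this sketch):
        # long or analysis-heavy prompts go to the expensive model.
        hard = len(prompt) > 2000 or "analyze" in prompt.lower()
        return "gpt-4o" if hard else "gpt-4o-mini"

    def complete(self, prompt: str, call_llm):
        """Return (answer, source); source is a model name or 'cache'."""
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.cache:          # cache hit: no API call at all
            return self.cache[key], "cache"
        model = self.route(prompt)     # route by difficulty
        answer = call_llm(model, prompt)
        self.cache[key] = answer
        return answer, model
```

In production, `call_llm` would be a real API client, the cache would use semantic (embedding-based) similarity rather than exact hashing, and entries would expire; context compression (function 2) would summarize the prompt before the `call_llm` step.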
Some estimates suggest that just implementing the proxy can lead to a 40-60% reduction in monthly token costs. If the monthly cost was 200,000 yen, it could drop to 80,000 to 120,000 yen.
—
Cost Tracking Runtime “Ark” — Making Invisible Costs Visible
The third technology is Ark. This runtime tracks costs associated with each decision made by AI agents.
Why is this important? Many SMEs experience the shock of a surprise API bill at the end of the month, with no idea which agent incurred what cost for which process. Spending is a black box, and a black box cannot be improved.
Ark provides what can be described as “AI cost accounting.”
- Agent A’s customer support task: average 12 yen per request
- Agent B’s document summarization task: average 3 yen per request
- Agent C’s data analysis task: average 45 yen per request
Costs become visible at this level of granularity. This allows for specific actions to emerge, such as “Improving the prompt for Agent C could save 20,000 yen a month” or “Agent A has a low cache hit rate, so we need to revisit the proxy settings.”
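Ark’s actual API is not shown here; as an illustration of what decision-level cost accounting looks like, a homegrown tracker could be as simple as the sketch below (the agent and task names are hypothetical, matching the examples above).

```python
from collections import defaultdict

class CostTracker:
    """Sketch of per-agent, per-task cost accounting (not Ark's actual API)."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"yen": 0.0, "requests": 0})

    def record(self, agent: str, task: str, yen: float):
        # Record the cost of a single agent decision/request.
        entry = self.totals[(agent, task)]
        entry["yen"] += yen
        entry["requests"] += 1

    def report(self):
        # Average yen per request for each (agent, task) pair.
        return {
            key: round(v["yen"] / v["requests"], 2)
            for key, v in self.totals.items()
        }
```

Wiring `record()` into the proxy layer, so every outgoing API call is logged automatically, is what turns surprise bills into line items.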
This should resonate with business owners in SMEs. It’s akin to cost management in manufacturing. Without knowing how much each process costs, cost reduction is impossible. The same applies to AI.
—
What Can Be Done with 50,000 Yen a Month? — A Concrete Estimate
So, when these technologies are combined, what can actually be achieved with a budget of 50,000 yen a month? Let’s estimate.
Assumptions:
- Using GPT-4o-mini as the main model (input $0.15/1M tokens, output $0.60/1M tokens)
- Only routing complex tasks to GPT-4o (20% of total)
- Proxy cache hit rate of 30%
- Monthly budget of 50,000 yen (approximately $330)
Estimated Results:
- Number of requests that can be processed: approximately 15,000 to 20,000 requests/month
- Per day: approximately 500 to 670 requests
This means that in a company with ten employees, each employee could make 50 to 67 inquiries to AI per day. Tasks such as drafting emails, summarizing meeting minutes, generating customer response texts, and assisting with data analysis could cover a significant portion of daily operations.
A year ago, doing the same would have cost 200,000 to 300,000 yen a month. Now it’s just 50,000 yen. This results in an annual difference of 1.8 to 3 million yen, a significant figure for SMEs.
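The estimate above can be reproduced with a few lines of arithmetic. The per-request token counts below are assumptions added for this example (agent-style requests carrying long context), as is the GPT-4o output price; the other inputs come from the stated conditions.

```python
# Reproducing the monthly-budget estimate under stated + assumed inputs.
BUDGET_USD = 330.0   # ~50,000 yen/month
CACHE_HIT = 0.30     # proxy cache hit rate
SHARE_4O = 0.20      # complex tasks routed to GPT-4o

# Per-1M-token prices (USD); the GPT-4o output price is an assumption.
PRICE = {
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
    "gpt-4o":      {"in": 2.50, "out": 10.00},
}

# Assumed request size: long-context agent prompts.
IN_TOK, OUT_TOK = 20_000, 5_000

def request_cost(model: str) -> float:
    p = PRICE[model]
    return (IN_TOK * p["in"] + OUT_TOK * p["out"]) / 1_000_000

blended = (1 - SHARE_4O) * request_cost("gpt-4o-mini") \
          + SHARE_4O * request_cost("gpt-4o")
effective = blended * (1 - CACHE_HIT)   # cache hits cost nothing
print(int(BUDGET_USD / effective))      # about 19,000 requests/month
```

With these assumptions the result lands inside the 15,000–20,000 range; shorter prompts or a higher cache hit rate push it considerably higher.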
—
So, What Should Be Done?
The three actions SMEs should take now are:
1. First, visualize costs
Implement Ark, or an LLM observability tool such as LangSmith or the open-source Langfuse. Without visibility into the current situation, improvement is impossible. Implementation takes a matter of hours.
2. Implement model routing with proxies
Stop sending all requests to the highest-performing model. For 80% of tasks, a cheaper model is sufficient. This alone can halve costs.
3. Keep self-hosting options available
As technologies like TriAttention mature, we will enter an era where practical LLMs can run on a single RTX 4090. With an initial investment of 300,000 to 400,000 yen, monthly API costs could drop to zero. If you’re spending 50,000 yen a month, you could recoup your investment in 6 to 8 months.
—
What Happens After Costs Drop
Finally, let’s discuss what happens in the longer term.
When LLM costs drop below 50,000 yen a month, the boundary between “companies that can use AI” and “those that cannot” will disappear. It will no longer be a privilege reserved for large corporations.
In fact, SMEs, which often make faster decisions and are closer to the field, may benefit more from AI. While large corporations take three months for security reviews and approvals, SMEs can start using AI next week.
Lower costs mean lower barriers to entry. Lower barriers to entry mean that the difference will be not about “whether to do it or not,” but “how to make the most of it.”
The technology is in place. Now, it’s just a matter of trying it out.