Five Simultaneous Technologies That Cut LLM Inference Costs by 70%—The Structural Reasons Why ‘AI is Expensive’ No Longer Holds

By Kai

Conclusion

Let’s get straight to the point. The notion that “AI is expensive” is structurally coming to an end.

In June 2025, five papers were published in quick succession addressing the reduction of LLM inference costs. These include KV cache compression, memory bandwidth optimization, parallel decoding, cost-aware model selection, and the efficiency of reinforcement learning. Different teams are pursuing different approaches but are all heading in the same direction.

This is not a coincidence. The challenge of “reducing inference costs” has become the top priority for the entire research community.

So, what does this mean for small and medium-sized enterprises (SMEs)? Let’s delve into that.

The Five Technologies and Their Impact

1. KV Cache Compression—Dramatically Reducing Memory Usage

A paper titled “Sequential KV Cache Compression via Probabilistic Language Tries” has been released. When LLMs perform inference, the “KV cache,” which retains information about past tokens, consumes a significant amount of memory. The longer the text being handled, the more this becomes a bottleneck.

This research compresses the KV cache using a probabilistic trie structure. The reported compression rates are extraordinarily high. Importantly, the output quality remains virtually unchanged even after compression. In other words, it’s not a case of “cheap and poor quality.”

What changes practically? The context length that can fit in GPU memory increases. Tasks involving long texts that previously required multiple expensive A100 GPUs may now be handled with just one. The monthly cost per GPU in the cloud is approximately 300,000 to 500,000 yen. Halving this cost could save between 1.8 to 3 million yen annually.
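To make the bottleneck concrete, here is a back-of-the-envelope calculation of KV cache size. The configuration below (roughly a Llama-2-70B-style model with grouped-query attention) and all figures are illustrative assumptions, not numbers from the paper:

```python
# Back-of-the-envelope KV cache size for a Llama-2-70B-like model
# (80 layers, grouped-query attention with 8 KV heads, head dim 128, fp16).
# All figures here are illustrative assumptions, not numbers from the paper.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 for the separate K and V tensors cached at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
full_ctx = kv_cache_bytes(80, 8, 128, seq_len=32_768)

print(f"{per_token / 1024:.0f} KiB per token")          # 320 KiB per token
print(f"{full_ctx / 1024**3:.1f} GiB at 32k context")   # 10.0 GiB at 32k context
```

Roughly 10 GiB per concurrent 32k-token sequence, on top of the model weights. Compressing that cache by even 2x frees gigabytes per sequence, which is where the "one GPU instead of several" effect comes from.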

2. Ragged Paged Attention—Utilizing TPU Memory Bandwidth Up to 86%

Optimized for Google’s TPU, “Ragged Paged Attention” processes LLM workloads with dynamic memory slicing, increasing memory bandwidth utilization to 86% and FLOPs utilization to 73%.

Traditional attention processing often wastes memory due to fixed-size blocks, leading to a significant amount of unused space. Ragged Paged Attention dynamically manages this, extracting more inference from the same hardware.
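A toy calculation shows the kind of waste fixed-size blocks cause. The block size and sequence lengths below are made-up numbers; this illustrates the general fragmentation problem, not the paper's exact memory model:

```python
# Toy illustration of internal fragmentation with fixed-size attention
# blocks. Block size and sequence lengths are made-up numbers; this shows
# the general problem, not the paper's exact memory model.

BLOCK = 512  # tokens reserved per fixed-size block
seq_lens = [37, 300, 256, 90, 700]  # concurrent sequences of varying length

# Fixed blocks: every sequence is padded up to a whole number of blocks.
fixed = sum(-(-n // BLOCK) * BLOCK for n in seq_lens)  # ceil division
# Ragged/paged layout: allocate close to exactly what each sequence needs.
ragged = sum(seq_lens)

print(f"fixed layout : {fixed} token slots")
print(f"ragged layout: {ragged} token slots")
print(f"memory wasted: {1 - ragged / fixed:.0%}")  # 55%
```

With many short, uneven sequences, over half the reserved slots can sit empty; ragged allocation reclaims that space for more concurrent requests.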

This improves the efficiency of cloud infrastructure. If the costs for API providers decrease, the price of APIs will also drop. The inference cost of OpenAI’s GPT-4 has already decreased to less than one-tenth between 2023 and 2025, and this trend is expected to accelerate. SMEs do not need to directly interact with TPUs; the benefits will manifest as lower API prices.

3. DepCap—Increasing Generation Speed Without Compromising Quality

“DepCap” is a block-based parallel decoding method. LLMs generate tokens sequentially, which is a cause of slowness. While generating in parallel could speed things up, it typically compromises quality.

DepCap adaptively determines block boundaries, dynamically deciding where parallel generation can occur without quality loss. Compared to fixed block methods, this significantly improves the trade-off between quality and speed.
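The general shape of draft-and-verify block decoding can be sketched as follows. The `propose_block` and `verify` functions are hypothetical stand-ins; DepCap's actual adaptive boundary detection is defined in the paper, not reproduced here:

```python
# Shape of draft-and-verify block decoding. propose_block and verify are
# hypothetical stand-ins; DepCap's adaptive boundary detection is defined
# in the paper, not reproduced here.

EOS = -1

def decode(prompt, propose_block, verify, max_len=32):
    out = list(prompt)
    while len(out) < max_len and out[-1] != EOS:
        draft = propose_block(out)      # several tokens drafted in parallel
        out.extend(verify(out, draft))  # accept the agreed prefix (>= 1 token)
    return out

# Toy stand-ins so the loop runs: the drafter proposes the next four
# integers; the verifier accepts values up to 10, then emits EOS.
def propose_block(ctx):
    return [ctx[-1] + i for i in range(1, 5)]

def verify(ctx, draft):
    accepted = [t for t in draft if t <= 10]
    return accepted or [EOS]

print(decode([0], propose_block, verify))  # [0, 1, ..., 10, -1]
```

The speedup comes from the number of drafted tokens accepted per verification step; adaptively choosing where blocks end is what keeps that acceptance rate high without quality loss.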

What happens when speed increases? The number of requests that can be processed per unit time on the same GPU rises, improving throughput. Higher throughput means lower costs per request, enhancing user experience. If response times drop from three seconds to one second, the adoption rate of internal tools changes dramatically.

4. Cost-Aware Model Orchestration—A System to Stop Using “All GPT-4”

“Cost-Aware Model Orchestration for LLM-based Systems” may have the most significant practical impact.

In essence, it’s a system that automatically switches models based on task difficulty. For simple questions, it uses GPT-4o-mini, while for complex reasoning, it routes to GPT-4, automating the process.

The paper reports that model selection accuracy improves by up to 11.92%, and energy efficiency increases by 54%.

This is precisely what SMEs should implement. In fact, 80% of the questions that come into internal chatbots are “routine inquiries.” There’s no need to use the highest-performing model for every single one. By routing sufficient processing to a cheaper model, monthly API costs can be cut to less than half.

Specifically, the input token cost for GPT-4o is $2.50/1M tokens, while for GPT-4o-mini, it’s $0.15/1M tokens—a difference of about 17 times. If 80% of requests can be routed to mini, costs could be reduced by over 70%. This isn’t technically challenging; it simply requires implementing a routing mechanism.
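A minimal router sketch makes the arithmetic concrete. The prices are the ones quoted above; the difficulty heuristic and the two model names are illustrative placeholders, not the paper's learned selector:

```python
# Minimal cost-aware routing sketch. Prices are the article's published
# input-token prices; the difficulty heuristic is an illustrative
# placeholder, not the paper's learned selector.

PRICE_PER_1M_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def pick_model(prompt: str) -> str:
    # Placeholder heuristic: long or reasoning-heavy prompts go to the big
    # model. A production router would use a trained classifier instead.
    hard = len(prompt) > 500 or any(
        kw in prompt.lower() for kw in ("prove", "analyze", "step by step")
    )
    return "gpt-4o" if hard else "gpt-4o-mini"

def blended_cost(share_to_mini: float) -> float:
    """Expected input cost per 1M tokens when this fraction of traffic
    is routed to the cheaper model."""
    return (share_to_mini * PRICE_PER_1M_INPUT["gpt-4o-mini"]
            + (1 - share_to_mini) * PRICE_PER_1M_INPUT["gpt-4o"])

print(pick_model("What are our office hours?"))  # gpt-4o-mini
print(f"${blended_cost(0.8):.2f}/1M tokens")     # $0.62 vs $2.50, about 75% less
```

At an 80% routing rate the blended price is $0.62 per million input tokens against $2.50 for sending everything to the large model, which is where the "over 70%" figure comes from.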

5. Adaptive Entropy Regularization—Enhancing Reinforcement Learning Efficiency

This technology improves the efficiency of RLHF (Reinforcement Learning from Human Feedback) that boosts LLM performance. It prevents the collapse of policy entropy while maintaining exploration capabilities during learning.

While this isn’t a direct cost-reduction technology for users, it has indirect effects. Improved learning efficiency reduces costs for model developers. Lower development costs are reflected in API pricing. Moreover, it allows for the creation of higher-performing models with the same computational resources, raising the performance ceiling for cheaper models. In other words, the range of tasks that can be handled by cheaper models expands even further.
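One common way to keep policy entropy from collapsing is to adapt the entropy bonus coefficient toward a target entropy. The controller below follows the SAC-style temperature-tuning recipe purely as an illustration of the idea; it is not necessarily the paper's specific regularizer:

```python
# SAC-style target-entropy controller, shown as an illustration of adaptive
# entropy regularization; not necessarily the paper's specific method.

def update_coef(coef, measured_entropy, target_entropy, lr=0.05):
    # Entropy below target -> raise the bonus (more exploration);
    # entropy above target -> lower it (let the policy sharpen).
    return max(coef + lr * (target_entropy - measured_entropy), 0.0)

coef = 0.01
for h in [2.0, 1.5, 1.0, 0.6, 0.4]:  # entropy collapsing over training
    coef = update_coef(coef, h, target_entropy=1.0)

print(round(coef, 3))  # 0.05: bonus grows as entropy falls below target
```

The point is the feedback loop: when the policy starts to collapse, the bonus automatically strengthens, so exploration is preserved without hand-tuning a fixed coefficient per run.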

What’s Happening Structurally

The five technologies reveal that cost reduction is not a “single silver bullet” but rather a “multi-layered, simultaneous effort.”

  • Memory Layer: Improved GPU memory efficiency through KV cache compression
  • Computational Layer: Increased hardware utilization with Ragged Paged Attention
  • Generation Layer: Enhanced throughput with DepCap
  • Operational Layer: Reduced unnecessary high-performance model usage through cost-aware orchestration
  • Development Layer: Improved performance of cheaper models through reinforcement learning efficiency

These factors compound. If memory efficiency doubles (halving memory cost), throughput doubles (halving compute cost per request), and model selection cuts the remainder by 70%, the combined factor is 0.5 × 0.5 × 0.3 ≈ 0.08. A total cost below one-tenth of a year ago is simple arithmetic, not hype.

In fact, the trajectory of OpenAI’s API prices already shows this. GPT-4’s input tokens cost $30/1M at its March 2023 release. As of June 2025, GPT-4o is priced at $2.50/1M tokens, a factor-of-12 reduction; GPT-4o-mini, at $0.15/1M tokens, is a factor of 200. This downward trend has not stopped.

What Changes for SMEs

Now we get to the crux of the matter.

What happens when “AI is expensive” no longer holds?

First, the structure of “AI adoption decision-making” changes.

The primary reason SMEs have postponed AI adoption is that the cost-effectiveness was hard to see. With monthly API bills running to hundreds of thousands of yen, plus the expense of building a GPU environment, the returns were uncertain, and it was hard to make the case for “let’s try it first.”

However, if practical AI functions can operate at monthly costs of a few thousand to tens of thousands of yen, the conversation changes. If the amount is low enough that failure doesn’t hurt, experimentation becomes possible. Once experimentation is possible, companies can discover how to use AI effectively for their needs.

Second, the gap between “companies that can use AI” and “companies that cannot” will widen.

Lower costs mean reduced barriers to entry. However, at the same time, the gap will widen between companies that use AI because it has become cheaper and those that do not use it even at lower costs. This is similar to the structure seen a decade ago when cloud services became cheaper. The question “Are you using AWS?” is now shifting to “Are you integrating LLMs into your operations?”

Third, SMEs will gain access to the same tools as large corporations.

This is the most crucial point. Large corporations have dedicated AI teams and can develop proprietary models. SMEs lack those resources. However, if API costs dramatically decrease and orchestration mechanisms are provided as open-source, SMEs can incorporate AI functionalities comparable to those of large corporations into their operations.

In fact, SMEs may even have an advantage: decision-making is faster, and they are closer to day-to-day operations. “Let’s roll this tool out company-wide starting next week” is a sentence an SME can actually say. A change that takes a large corporation three months to approve can be implemented in three days.

So, What Should Be Done?

Let me suggest three actions.

1. First, implement model orchestration.

This can be done immediately and has the most significant impact. Simply not sending every request to the top-tier model can cut API costs to less than half. Open-source tools such as LiteLLM, and routing services such as OpenRouter, already provide this.

2. Check API price trends quarterly.

Use cases that were deemed “too expensive” six months ago may now be at a realistic price point. Prices continue to decline. Don’t fixate on past judgments.

3. Start creating a list of “What to do when AI becomes cheaper.”

It’s certain that costs will decrease in the future. The question is whether you are prepared for “what to do when it drops.” Companies that are prepared will win. Those that are not will miss out on even realizing that prices have dropped.

LLM inference costs will continue to decrease structurally due to multiple technologies progressing simultaneously across layers. This is not a trend; it’s a structure.

The phrase “AI is expensive” is no longer an excuse. What is being asked is, “What will you do with cheaper AI?”
