Seven Technologies That Reduce LLM Token Costs by 90% — Read as ‘Structural Change’ Rather Than ‘Savings’
Do you have any idea how much your monthly API costs are?
If you use a GPT-4o-class model seriously for business, the bill can easily run to hundreds of thousands of yen a month. For a company with ten employees, that could mean 300,000 yen a month, or 3.6 million yen a year. That is not an amount you can write off as a small experiment.
But what if that 3.6 million yen could be reduced to 360,000 yen? The conversation fundamentally changes. The discussion shifts from “whether to use it” to “how to make the most of it.”
The seven technologies introduced here are precisely what create that turning point. They are not mere cost-saving techniques. A structural reduction in LLM usage costs moves the very boundary of whether small and medium-sized enterprises can use AI at all.
—
Seven Tactics Demonstrated by “Local-Splitter” — 45% to 79% Token Reduction
A recently published study titled “Local-Splitter” systematizes seven specific tactics to reduce the number of tokens sent to cloud LLMs. The key idea is to “stop throwing everything to the cloud and handle what can be processed locally.”
Let’s go through the seven tactics one by one.
1. Local Routing
Simple queries are handled by local small models (around 7B to 13B parameters), and only the hard ones are sent to the cloud. In other words, don't send everything to GPT-4. In practice, 60-70% of the queries that arise in day-to-day business can be answered with sufficient quality by local models. This alone can cut the volume of tokens sent to the cloud by more than half.
The implications for small and medium-sized enterprises are clear. The monthly API cost of 300,000 yen could potentially drop below 150,000 yen just by implementing routing.
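The routing idea can be sketched in a few lines. The thresholds and keyword list below are illustrative assumptions, not values from the study; a production router would typically use a small classifier model instead.

```python
# Minimal sketch of a local/cloud router. The word-count threshold and the
# keyword list are illustrative assumptions, not values from the study.
COMPLEX_HINTS = {"analyze", "strategy", "legal", "negotiate", "architecture"}

def route(query: str, max_local_words: int = 40) -> str:
    """Return 'local' for simple queries, 'cloud' for hard ones."""
    words = query.lower().split()
    if len(words) > max_local_words:
        return "cloud"      # long, context-heavy prompts go to the cloud
    if COMPLEX_HINTS & set(words):
        return "cloud"      # domain keywords that warrant a stronger model
    return "local"          # default: handle with a 7B-13B local model

print(route("Summarize this meeting note"))          # local
print(route("Draft a legal negotiation strategy"))   # cloud
```

The point is that the decision itself is cheap: a heuristic or a tiny classifier runs in microseconds, while the saved cloud call would have cost real money.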
2. Prompt Compression
Before sending prompts to the cloud, eliminate redundant parts. Human-written instructions often contain unnecessary details. A polite preface like “Please summarize the following text. Ensure that important points are not omitted…” is unnecessary for the model. Using automatic compression tools can reduce the number of tokens in prompts by 30-50%.
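As a toy illustration of the idea, the sketch below strips polite filler phrases before a prompt is sent. Real tools like LLMLingua perform learned compression; the hand-written patterns here are assumptions for demonstration only.

```python
import re

# Illustrative sketch: strip polite filler before sending a prompt to the API.
# Real tools (e.g. LLMLingua) do learned compression; this only shows the idea.
FILLERS = [
    r"please\s+",
    r"kindly\s+",
    r"i would like you to\s+",
    r"ensure that important points are not omitted\.?",
]

def compress(prompt: str) -> str:
    out = prompt
    for pat in FILLERS:
        out = re.sub(pat, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()   # collapse leftover whitespace

p = "Please summarize the following text. Ensure that important points are not omitted."
print(compress(p))   # summarize the following text.
```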
3. Semantic Caching
When similar questions arise, reuse past answers. The key is to determine whether they are “semantically similar” rather than requiring an exact match. In tasks like customer support, where similar questions are frequently repeated, cache hit rates can exceed 50%. For the hits, there are zero API calls. This means zero cost.
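The structure of a semantic cache looks like this. A real system (e.g. GPTCache) would use a sentence-embedding model; the bag-of-words vector and the 0.8 threshold below are stand-in assumptions.

```python
import math
from collections import Counter

# Toy semantic cache. A real deployment would use a sentence-embedding model;
# the bag-of-words vector and the 0.8 threshold are stand-in assumptions.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []   # list of (vector, answer) pairs

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer          # hit: zero API calls, zero cost
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password please"))   # cache hit
```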
4. Local Drafting with Cloud Review
First, generate a draft using a local small model, and then request only review and corrections from a large cloud model. It consumes significantly fewer tokens to revise than to generate from scratch. In document generation tasks, this can reduce cloud-side token consumption by 60-70%.
This is particularly effective for editing tasks and report creation. If your company already uses LLMs for automated meeting minutes or drafting reports, this is the place to start.
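The split can be sketched as two small functions. `local_draft` below is a placeholder for a 7B-13B model call (e.g. via Ollama); only the output of `build_review_prompt` would be sent to the expensive cloud model.

```python
# Sketch of the draft-locally, review-in-the-cloud split. local_draft() is a
# placeholder for a local 7B-13B model call; only the review prompt below
# would be sent to the expensive cloud model.
def local_draft(task: str) -> str:
    # placeholder: a real implementation calls the local model here
    return f"[local draft for: {task}]"

def build_review_prompt(draft: str) -> str:
    # The cloud model receives the draft plus a short instruction, which is
    # far cheaper than asking it to generate the document from scratch.
    return ("Review the draft below. Fix errors and awkward wording, "
            "changing as little as possible:\n\n" + draft)

prompt = build_review_prompt(local_draft("weekly sales report"))
print(prompt.splitlines()[0])
```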
5. Minimal-Diff Edits
When requesting document revisions, send only the necessary differences instead of resending the entire document. It is common to pay for 10,000 tokens of document when only a small section needs to change; sending just the diff can cost a few hundred tokens.
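Python's standard library already produces exactly this kind of payload. The sketch below builds a unified diff of the changed section, which can be sent in place of the full document.

```python
import difflib

# Sketch: send only a unified diff of the changed section instead of the
# whole document, so the model sees (and you pay for) just the hunk.
def minimal_diff(old: str, new: str) -> str:
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="before", tofile="after",
    ))

old = "Q1 revenue: 10M yen\nQ2 revenue: 12M yen\nQ3 revenue: 11M yen\n"
new = "Q1 revenue: 10M yen\nQ2 revenue: 13M yen\nQ3 revenue: 11M yen\n"
print(minimal_diff(old, new))
```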
6. Structured Intent Extraction
Extract the "intent" from user input as structured data and send it to the cloud. Sending the intent in JSON format significantly reduces tokens compared to sending it in natural language. For example, "I want to book a shinkansen to Osaka next Monday" can be sent as `{"action": "book", "type": "shinkansen", "destination": "osaka", "date": "next_monday"}`.
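A toy extractor for the example above might look like this. A production system would use a local model with constrained JSON output; the hand-written rules here are illustrative assumptions that only cover this one sentence.

```python
import json
import re

# Toy intent extractor for the example sentence. A production system would use
# a local model with constrained JSON output; these rules are assumptions.
def extract_intent(text: str) -> dict:
    intent = {}
    lowered = text.lower()
    if "book" in lowered:
        intent["action"] = "book"
    if "shinkansen" in lowered:
        intent["type"] = "shinkansen"
    m = re.search(r"to ([A-Z]\w+)", text)    # capitalized word after "to"
    if m:
        intent["destination"] = m.group(1).lower()
    if "next monday" in lowered:
        intent["date"] = "next_monday"
    return intent

payload = json.dumps(extract_intent("I want to book a shinkansen to Osaka next Monday"))
print(payload)
```

The compact JSON payload is what actually goes over the wire, replacing the longer natural-language sentence.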
7. Vendor Prompt Caching and Batch Processing
Utilize prompt caching features provided by OpenAI and Anthropic to reduce costs for common system prompt parts. Additionally, tasks that do not require real-time processing can be grouped into batch APIs, potentially halving the unit cost.
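Batch submission mostly means packaging tasks as a JSONL file. The sketch below follows the field names in OpenAI's Batch API documentation, but verify them against the current API reference before relying on them; the model name is an assumption.

```python
import json

# Sketch: package non-urgent tasks as a JSONL file for OpenAI's Batch API
# (discounted pricing, results within 24 hours). Field names follow the
# documented format; verify against the current API reference.
def to_batch_jsonl(tasks, model="gpt-4o-mini"):
    lines = []
    for i, task in enumerate(tasks):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": task}]},
        }))
    return "\n".join(lines)

jsonl = to_batch_jsonl(["Summarize yesterday's sales log",
                        "Tag new support tickets"])
print(jsonl)
```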
—
Combining these seven tactics has been shown to cut cloud tokens by up to 79% for editing workloads and 45% for explanatory workloads.
This means the monthly API cost of 300,000 yen could drop to between 60,000 and 160,000 yen. That translates to annual savings of 1.7 to 2.9 million yen. For small and medium-sized enterprises, this difference can determine the viability of AI projects.
—
“CascadeDebate” — Reducing the Need for Heavy Models by Debating with Lightweight Models
Another noteworthy technology is the “CascadeDebate” system.
The mechanism is simple. First, multiple lightweight models (agents) debate internally. If they reach a consensus, they return the answer as is. Only when opinions diverge do they escalate to a higher-performance model.
To put it in terms of a human organization, it’s like saying, “If three frontline staff can solve it, don’t escalate it to the department head.”
With this approach, accuracy can improve by up to 26.75% while significantly reducing the frequency of calls to high-cost models. Accuracy increases while costs decrease. Normally, these are trade-offs, but simply automating the obvious step of “not using high-end models for simple problems” allows for both to coexist.
In the context of small and medium-sized enterprises, this applies to internal chatbots and inquiry handling. Eighty percent of questions are routine, and there is no need to process them with GPT-4o. Handle them with lightweight models and route only the genuinely complex inquiries upward, and API costs drop dramatically.
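The consensus-then-escalate logic fits in a few lines. The three lightweight "agents" below are stubbed as plain functions; in practice each would be a small model call, and the agreement threshold is an assumption.

```python
from collections import Counter

# Sketch of the consensus-then-escalate idea. The lightweight "agents" are
# stubbed as functions; in practice each would be a small model call, and
# the agreement threshold (2 of 3) is an illustrative assumption.
def cascade(query, agents, escalate, agree=2):
    answers = [agent(query) for agent in agents]
    top, count = Counter(answers).most_common(1)[0]
    if count >= agree:
        return top, "local-consensus"      # cheap path: no heavy model call
    return escalate(query), "escalated"    # disagreement: pay for the big model

agents = [lambda q: "refund ok", lambda q: "refund ok", lambda q: "deny"]
heavy = lambda q: "refund ok with conditions"
print(cascade("customer asks for a refund", agents, heavy))
```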
—
KV Cache Compression and Speculative Decoding — For Companies Running LLMs on Their Own Servers
Technologies are also advancing for companies hosting models on their own servers rather than using cloud APIs.
KV cache quantization is a technique that compresses the memory-hungry KV cache during inference. According to the study “Quantization Dominates Rank Reduction for KV-Cache Compression,” quantization can achieve up to a 75% reduction in KV cache with almost no loss in accuracy. This means that the same GPU can handle longer contexts, or that the same processing can be done with a cheaper GPU.
Processing that previously required an A100 (over 2 million yen) could potentially run on an RTX 4090 (around 300,000 yen). This lowers the barrier for small and medium-sized enterprises to adopt local LLMs.
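A back-of-the-envelope calculation shows what a 75% reduction means in gigabytes. The model shape below is an assumption roughly matching a 7B-class model, not figures from the paper.

```python
# Back-of-the-envelope KV-cache memory, to see what a 75% reduction means.
# The model shape is an assumption roughly matching a 7B-class model.
def kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                   seq_len=8192, bytes_per_value=2, batch=1):
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch

fp16 = kv_cache_bytes(bytes_per_value=2)     # 16-bit cache
int4 = kv_cache_bytes(bytes_per_value=0.5)   # 4-bit quantized: 75% smaller
print(f"fp16: {fp16 / 2**30:.1f} GiB, int4: {int4 / 2**30:.1f} GiB")
```

Shrinking the cache by 4x frees VRAM for longer contexts or larger batches on the same card, which is exactly where the A100-to-RTX-4090 trade becomes plausible.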
The speculative decoding method "MARS" generates tokens with a small model first and then validates them with a large model. By separating generation from validation and committing only once the judgment is stable, unnecessary rollbacks are reduced. Faster inference means more requests processed in the same time: throughput rises, and the cost per request falls.
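The basic accept/reject loop can be sketched with the models stubbed out. MARS's "stable judgment" heuristic is not reproduced here; this shows only the core mechanism of keeping cheap draft tokens until the verifier first disagrees.

```python
# Sketch of speculative decoding's draft-then-verify loop, with both models
# stubbed. MARS's "stable judgment" heuristic is not reproduced; this shows
# only the basic accept-until-first-disagreement mechanism.
def speculative_step(draft_tokens, verify):
    accepted = []
    for tok in draft_tokens:
        if verify(accepted, tok):
            accepted.append(tok)   # large model agrees: keep the cheap token
        else:
            break                  # first disagreement: discard the rest
    return accepted

draft = ["The", "meeting", "is", "tomorrow"]
# Stub verifier: the large model rejects only "tomorrow".
verify = lambda ctx, tok: tok != "tomorrow"
print(speculative_step(draft, verify))   # ['The', 'meeting', 'is']
```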
—
So, What Should We Do?
Having discussed the technologies, let’s organize what to do starting this week.
Step 1: Visualize Current API Costs
First, check the monthly token consumption and costs on the dashboards of OpenAI or Anthropic. Understand what applications are using how much. You cannot optimize without knowing this.
Step 2: Try Local Routing
Set up a local model in the 7B to 13B class using tools like Ollama. Route simple classification, summarization, and template generation tasks locally. The criterion is “things that won’t be fatal if they are wrong.” This alone can halve the amount sent to the cloud.
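Calling a local model through Ollama's HTTP API looks like the sketch below; it assumes Ollama is running on its default port with a model already pulled (e.g. `ollama pull llama3`). Field names follow Ollama's API documentation, but check them against your installed version.

```python
import json
import urllib.request

# Sketch of querying a local model through Ollama's HTTP API. Assumes Ollama
# runs on its default port with a model pulled (e.g. `ollama pull llama3`).
# Field names follow Ollama's API docs; verify against your version.
def build_payload(prompt: str, model: str = "llama3") -> dict:
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local(prompt: str, model: str = "llama3") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
# print(ask_local("Classify this ticket: 'invoice not received'"))
```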
Step 3: Implement Prompt Compression and Caching
Try prompt compression tools like LLMLingua. At the same time, implement semantic caching tools like GPTCache. The more repetitive the tasks, the greater the effect.
Step 4: Identify Tasks That Can Switch to Batch APIs
Redirect processes that do not require real-time processing (like daily report generation or data organization) to batch APIs. OpenAI’s Batch API is usually half the cost.
With these four steps, many small and medium-sized enterprises should be able to reduce their API costs by 50-70%.
—
This is Not “Savings” but “Structural Change”
Finally, I want to emphasize the most important point.
If the cost of LLMs drops to one-tenth, it means that AI can enter operations that were previously deemed “not cost-effective.”
With a monthly cost of 50,000 yen, even a small factory with five employees can use it. It can be used by local tax offices. It can be used by individual shops.
Large companies allocate AI budgets in the tens of millions of yen, set up specialized teams, and build large-scale systems. Small and medium-sized enterprises cannot do that. However, if costs decrease, the conversation changes. Start small, and if it proves effective, expand. The cycle can be accelerated by the faster decision-making of small and medium-sized enterprises.
When costs drop to one-tenth, the advantage of large companies shifts from “size of budget” to “speed of decision-making.” And in terms of decision-making speed, small and medium-sized enterprises can win.
The question is whether we are aware of this structural change. The belief that “AI is for large companies” is becoming the biggest cost.
This week, I hope you will start by opening the API dashboard.