Seven Tactics to Cut Cloud LLM Token Costs by 70%: Breaking Down the “Savings Structure” That Reduces Monthly Expenses from 100,000 Yen to 20,000 Yen for Small and Medium Enterprises

By Kai

Monthly Token Costs of 100,000 Yen: Will You Really Keep Paying That?

An increasing number of small and medium enterprises (SMEs) are incorporating GPT-4o and Claude 3.5 into their operations. From automatic generation of estimates to handling inquiries and summarizing meeting minutes, while these tools have made work more convenient, token costs are gradually swelling.

Monthly costs range from 50,000 to 150,000 yen. For large corporations, this may be negligible, but for a company with ten employees, it equates to the cost of “one additional employee’s communication expenses.”

A recently published study has demonstrated seven tactics that can reduce these token costs by 45% to 79%, supported by empirical data. A 79% reduction means that a monthly cost of 100,000 yen could drop to about 20,000 yen. Moreover, several of these tactics can be implemented at the level of “try it out this afternoon.”

In this article, we will re-prioritize the seven tactics from the paper based on their practical applicability for SMEs and provide explanations along with specific cost implications.

Overview: Seven Tactics and Estimated Reduction Rates

| # | Tactic | Estimated Reduction Rate | Implementation Difficulty | SME Priority |
|---|--------|--------------------------|---------------------------|--------------|
| 1 | Prompt Compression | 20-40% | ★☆☆ | Highest Priority |
| 2 | Semantic Caching | 30-60% | ★★☆ | Highest Priority |
| 3 | Batch Processing + Vendor Caching | 20-50% | ★☆☆ | Highest Priority |
| 4 | Minimal Diff Editing | 15-35% | ★★☆ | High |
| 5 | Local Draft + Cloud Review | 30-50% | ★★★ | Medium |
| 6 | Structured Intent Extraction | 10-25% | ★★☆ | Medium |
| 7 | Local Routing | 40-70% | ★★★ | Future Investment |

The order of tactics in the paper differs from the priority in this article. The reason is simple: they are rearranged in order of “effect ÷ effort”. For SMEs, tactics that can save 20,000 yen this month are more valuable than those that take a month to implement.

[Highest Priority] Three Tactics to Implement This Week

1. Prompt Compression — Halving “Overly Long Prompts”

What to Do: Remove redundant instructions, duplicate contexts, and unnecessary examples from prompts sent to the API.

A common sight in many workplaces is the "copy-pasted prompt" that has accumulated over time: someone wrote it initially, extra notes were bolted on with every incident, and before you know it a single request exceeds 3,000 tokens, half of it text included "just in case."

Specific Actions:

  • List all the prompts currently in use.
  • Measure the token count of each prompt using tools like [tiktoken](https://github.com/openai/tiktoken).
  • Organize instructions in the order of “conclusion → conditions → constraints” and eliminate duplicates.
  • Compare the output quality between the edited and original versions (doing this for 10 instances will reveal trends).

Cost Implications: The input token price for GPT-4o is $2.50 per million tokens. A company using 1 million input tokens per month that compresses prompts by 40% saves $1.00 a month on the input side, or $12.00 a year. That sounds trivial, but the saving scales linearly with price and volume: on a GPT-4 Turbo class model ($10 per million input tokens) at 5 million tokens monthly, input costs about $50 a month ($600 a year, roughly 90,000 yen), and 40% compression cuts that to about $360 a year. Because shorter prompts also shrink every call that reuses them, a company whose total bill is 100,000 yen a month can realistically shave 20,000 to 40,000 yen off it just by organizing prompts.

2. Semantic Caching — Avoid Paying for the Same Question Twice

What to Do: If a similar question has been asked in the past, return the answer from cache instead of calling the API. This involves matching at a level of “semantic similarity” rather than requiring exact matches.

Those operating an internal chatbot will notice that 70% of inquiries are “almost identical rephrases.” Questions like “How do I apply for paid leave?” and “How do I take paid leave?” can all be answered with the same response, yet the API is called every time.

Specific Actions:

  • Build a simple caching layer using [GPTCache](https://github.com/zilliztech/GPTCache) or Redis + vector search.
  • Set the similarity threshold around 0.90 to 0.95 and monitor hit rates and response quality for a week.
  • If the hit rate exceeds 30%, API calls can be reduced to zero for those inquiries.

Cost Implications: A company using 2 million tokens monthly for inquiry responses could save 800,000 tokens if the cache hit rate is 40%. While it may seem like a small saving of $8 for GPT-4 Turbo or $2 for GPT-4o, including output tokens (which cost $10 to $15 per million tokens) results in a reduction of $15 to $25 monthly. This translates to an annual savings of approximately 30,000 to 45,000 yen. While this may appear minor, the key point is that the implementation cost is nearly zero. Redis for caching can be started on a free tier.
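The caching idea can be sketched in a few dozen lines. This is a toy: the bag-of-words "embedding" below stands in for a real embedding model, and a production setup would persist entries in Redis or use GPTCache as the article suggests. Only the structure (embed, compare, threshold, hit or miss) carries over.

```python
# Minimal semantic-cache sketch. The bag-of-words "embedding" is a toy
# stand-in for a real embedding model; persist entries in Redis or use
# GPTCache in production.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy embedding: lower-cased word counts. Replace with a real model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.90):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, question: str):
        q = embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: no API call needed
        return None         # miss: call the API, then put() the answer

    def put(self, question: str, answer: str):
        self.entries.append((embed(question), answer))


cache = SemanticCache(threshold=0.90)
cache.put("How do I apply for paid leave?", "Submit the leave form on the portal.")
print(cache.get("How do I apply for paid leave?"))  # hit
print(cache.get("What is the weather today?"))      # miss -> None
```

With a real embedding model, the 0.90-0.95 threshold from the steps above is what you tune against a week of hit-rate and quality logs.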

3. Batch Processing + Vendor Prompt Caching — Send Requests in Batches and Eliminate Repetitions

What to Do: Stop sending requests one by one and send them in batches instead. Additionally, use the “prompt caching” feature provided by OpenAI or Anthropic to reduce charges for common system prompt segments.

OpenAI’s Batch API can be used at a 50% discount. Processes that do not require real-time responses—such as daily report generation, bulk creation of product descriptions, or analyses that can be completed by the next morning—can all be handled in batches.

Anthropic’s prompt caching lets you mark common system prompt prefixes (1,024 tokens or more on most models) for caching, with subsequent reads charged at a 90% discount. If you are sending the same prompt, such as “You are a customer support representative for XX. Please follow the rules below…” every time, that portion can be discounted by 90%.

Cost Implications: A company generating 500 product descriptions monthly with GPT-4o, averaging 2,000 tokens per request (say 1,000 input and 1,000 output), uses 1 million tokens monthly. At GPT-4o rates ($2.50 per million input tokens, $10 per million output tokens) that is about $6.25 a month; the Batch API’s 50% discount cuts it to about $3.13, an annual saving of roughly $37.50 (around 5,600 yen). Modest at this scale, but at 5,000 requests and 10 million tokens the annual saving reaches about $375 (around 56,000 yen). Furthermore, when combined with prompt caching, the common prompt portion (which often accounts for 30-50% of input) can be discounted by 90%, pushing the overall cost reduction toward 30-50%.
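The batch flow can be sketched as follows. The JSONL request shape is OpenAI's documented Batch API input format; the product names, file path, and the commented upload/create steps are illustrative placeholders.

```python
# Build the JSONL input file for OpenAI's Batch API. Each line is one
# request object; the file is then uploaded and a batch job created
# (those two SDK calls are sketched in the trailing comments).
import json


def build_batch_file(products, path="batch_input.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for i, product in enumerate(products):
            request = {
                "custom_id": f"desc-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o",
                    "messages": [
                        {"role": "system", "content": "Write a product description."},
                        {"role": "user", "content": product},
                    ],
                },
            }
            f.write(json.dumps(request, ensure_ascii=False) + "\n")
    return path


path = build_batch_file(["Stainless water bottle 500ml", "Canvas tote bag"])
# Then, with the openai SDK:
#   file = client.files.create(file=open(path, "rb"), purpose="batch")
#   client.batches.create(input_file_id=file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
```

Results land in an output file within the 24-hour window, which is exactly the "done by next morning" pattern described above.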

[Next Steps] Two Tactics That Work When a Structure is Established

4. Minimal Diff Editing — Stop Rewriting Entire Texts

When asking an LLM to revise text, are you throwing the entire piece at it with a request like “Please improve this text”? Even if only three lines need changing, you are inputting the whole text and getting the whole text back. Both input and output incur token charges.

What to Do: Specify only the changes in JSON format and have it return just the differences. For instance, instructing it to “Change the subject in paragraph three and update the number in paragraph five” will drastically reduce both input and output token counts.

Prompt Example:

```
Please make corrections only to the sections specified in the following JSON and return only the modified sections in JSON format.
{"changes": [{"paragraph": 3, "action": "Change the subject from 'our company' to 'our firm'"}, {"paragraph": 5, "action": "Update the sales figure from 120 million to 150 million"}]}
```

If you input the entire text, it would be 3,000 tokens for input plus 3,000 tokens for output, totaling 6,000 tokens. With a diff instruction, it could be 500 tokens for input plus 200 tokens for output, totaling 700 tokens. This results in an 88% reduction. If this occurs 20 times a day, it could save (6,000−700)×20×22 days = 2.33 million tokens monthly.

5. Local Draft + Cloud Review — Use a Cheaper Model for Drafts

Structure: Create drafts using a local or inexpensive model (like GPT-4o mini, Gemini Flash, or local Llama) and rely on a more expensive model (like GPT-4 Turbo or Claude 3.5 Sonnet) only for “checking and correcting.”

The input cost for GPT-4o mini is $0.15 per million tokens, while GPT-4 Turbo is $10 per million tokens, indicating a price difference of about 67 times. If 80% of the draft can be used as is, the amount of tokens processed by the expensive model can be reduced to just 20%.

While implementing this requires a routing mechanism, it can be set up in a few days using LangChain or LiteLLM. For companies with monthly token costs exceeding 100,000 yen, this is a structure worth considering first.

[Mid-Term Considerations] Two Tactics That Are Effective but Require Design

6. Structured Intent Extraction

Instead of directly throwing user free input at the LLM, first classify the intent and route it to templated processes. For example, “I want a quote” would go to the quote flow, while “I want to know the delivery date” would go to the inventory inquiry flow.

Intent classification can be done using small models or rule-based systems. In subsequent flows, cases where LLMs are not used (or where token consumption is minimal due to standardized processing) will increase, thus reducing the overall token count.

7. Local Routing

This tactic has the highest reduction rate in the paper, but it requires a local GPU server. While used servers with NVIDIA T4 can be obtained for 150,000 to 200,000 yen, this poses a high barrier for SMEs to “try it out first.”

However, there is also the reality that a Mac mini M4 (around 100,000 yen) can run Llama 3.1 8B at practical speeds. There are emerging cases where routine internal questions (like FAQ responses and format conversions) are handled locally, thereby halving the calls to cloud APIs. If monthly API costs exceed 50,000 yen, it is worth calculating the break-even point for using local models.

Where to Start

What to Do Today:
1. Inventory all the prompts currently in use and measure their token counts.
2. Eliminate redundant parts (prompt compression).
3. Switch processes that do not need real-time responses to the Batch API.

What to Do This Month:
4. Implement semantic caching and measure hit rates.
5. Establish operational rules to change from “rewriting entire texts” to diff instructions.

From Next Month Onward:
6. Set up a two-tier structure of drafting with a cheaper model and reviewing with a more expensive model.
7. Estimate the break-even point for local models.

By following this order, it is expected that prompt compression and batching alone can lead to 20-40% cost reductions within the first week. For a company spending 100,000 yen monthly, this translates to savings of 20,000 to 40,000 yen, amounting to 240,000 to 480,000 yen annually. This is equivalent to the labor cost of one part-time employee for SMEs.

The Essence Is Not Just “Reducing Tokens”

If you’ve read this far and thought, “So, how much are we spending monthly?” you are already one step ahead.

Many SMEs have adopted LLMs but do not know how many tokens they consume monthly. Without this knowledge, optimization is impossible. The first step is measurement. Just doing that can change perspectives.

Moreover, these seven tactics illustrate not just cost-saving techniques but a design philosophy of “what to rely on the cloud for, what to handle locally, and what to return from cache.” SMEs capable of this design can compete differently from large corporations that have ample budgets to call APIs extensively.

Achieving maximum results with limited budgets is something SMEs have excelled at for a long time. This remains unchanged even in the age of AI.
