From 80,000 yen to 16,000 yen: Three Technologies Dramatically Reducing ‘AI Electricity Costs’ — A Monthly Cost Analysis for Small and Medium Enterprises
Conclusion
To put it simply, we have entered an era where the operational costs of AI can be reduced to one-fifth immediately.
Many small and medium enterprises are paying monthly AI operational costs of around 80,000 yen. Companies that have integrated the ChatGPT API into their operations, using it for internal chatbots, meeting-minutes summarization, and customer support automation, can easily spend between 50,000 and 100,000 yen on token charges alone.
That 80,000 yen will drop to 16,000 yen. An 80% reduction.
You might think, “Is this just another cost-cutting story?” This time, however, is different. With three technologies reaching practical maturity at the same time, we are entering a phase where prices are not merely ‘getting cheaper’; the cost structure itself is changing.
What exactly is happening? Let’s examine each technology while converting it into monthly costs for small and medium enterprises.
—
Technology 1: AI Cost Optimization Tools — 80% of API Calls Were Wasteful
The first technology to introduce is AI cost optimization tools, represented by distillfast.com.
What this tool does is simple: it caches API calls for the same or similar questions, eliminating unnecessary recalculations. Additionally, it automatically compresses prompts and optimizes responses.
Consider this: what percentage of the questions directed at your internal chatbot are genuinely new? In reality, about 70-80% resemble past inquiries. “How do I apply for paid leave?” “What is the expense reimbursement process?” “When is the delivery date for XX?” Yet every one of them is sent to GPT-4, consuming tokens each time.
By implementing this tool, the number of API calls can drastically decrease. The official claim is up to an 80% reduction. Based on actual trials, a 60-70% reduction is likely for FAQ-type uses.
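As a sketch of the caching idea (not distillfast.com’s actual implementation, whose internals are not public), an exact-match cache takes only a few lines; commercial tools add semantic similarity matching on top:

```python
import hashlib

class PromptCache:
    """Exact-match cache: normalized prompt -> stored response."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Normalize case and whitespace so trivial variants share a key.
        norm = " ".join(prompt.lower().split())
        return hashlib.sha256(norm.encode()).hexdigest()

    def get_or_call(self, prompt: str, call_api):
        """Return a cached answer, or call the API once and cache it."""
        k = self._key(prompt)
        if k in self.store:
            self.hits += 1
            return self.store[k]
        self.misses += 1
        result = call_api(prompt)
        self.store[k] = result
        return result
```

In a real deployment the lookup would use embedding similarity rather than a hash, so that paraphrased questions (“How do I take paid leave?”) also hit the cache.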
Monthly Cost Breakdown:
- Before Implementation: 80,000 yen (using GPT-4 API, assuming about 500 calls per day)
- After Implementation: 24,000 to 32,000 yen (if cache hit rate is 60-70%)
- Tool Usage Fee: A few thousand yen per month (many are pay-per-use)
Net Savings: 40,000 to 50,000 yen per month. This alone brings costs down by nearly half, and it can be implemented with minimal changes to the code. There’s no reason not to do it.
—
Technology 2: int4 Quantization — The Shock of Running GPT-4 Level on Just One Mac
Next is the evolution of quantization technology, which can be considered the main focus of this discussion.
Quantization is a technique that intentionally reduces the computational precision of AI models to make them lighter. Traditionally, fp16 (16-bit floating point) was the standard, but recent research has demonstrated that even reducing it to int4 (4-bit integer) can yield performance equal to or exceeding fp16 under specific conditions.
This is particularly compatible with Apple Silicon (M2/M3/M4 chips), leading to the following changes:
- Memory Usage: About one-fourth compared to fp16 (e.g., a 70B parameter model that uses 140GB in fp16 can be reduced to about 35GB in int4)
- Inference Speed: Improved by 1.5 to 2 times (including KV cache optimization)
- Required Hardware: Cloud GPU server → Just one Mac Studio
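The memory figures above follow directly from bytes per parameter; a quick sanity check:

```python
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory: parameters x (bits / 8) bytes each."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(model_memory_gb(70, 16))  # fp16: 140.0 GB
print(model_memory_gb(70, 4))   # int4:  35.0 GB
```

Note this counts weights only; the KV cache and activations add overhead on top, which is why a 128GB machine is a comfortable fit for a 35GB int4 model but a tight one for fp16.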
What does this mean?
You will no longer need to keep paying for API usage.
The monthly API cost of 80,000 yen could potentially be replaced by purchasing one Mac Studio (M4 Max, 128GB memory), allowing for local processing at a similar level. The price of a Mac Studio is about 600,000 yen. If you’re spending 80,000 yen monthly on API fees, you can recover the cost in 8 months. From the ninth month onwards, it’s just the electricity bill, which is only a few thousand yen.
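The payback arithmetic is simple to verify; the 600,000-yen hardware price and 80,000-yen monthly API bill are the figures assumed above:

```python
def payback_months(hardware_yen: float, monthly_saving_yen: float) -> float:
    """Months until the hardware purchase is recovered from API savings."""
    return hardware_yen / monthly_saving_yen

print(payback_months(600_000, 80_000))  # 7.5 -> roughly 8 months
```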
You might wonder, “But isn’t the quality of local models inferior to GPT-4?” This concern is valid. However, what’s crucial here is the segmentation of use cases.
For internal FAQs, meeting minutes summarization, template email generation, and data formatting — 80% of such tasks can be sufficiently handled by open models of 70B class (like Llama 3, Qwen 2.5) quantized to int4. GPT-4 is only necessary for complex reasoning or creative tasks.
Monthly Cost Breakdown:
- Before Implementation: 80,000 yen (processing all tasks with GPT-4 API)
- After Implementation: 5,000 to 10,000 yen (electricity cost + remaining API usage)
- Initial Investment: 600,000 yen for Mac Studio (recouped in 8 months)
For small and medium enterprises, the idea of “owning a server” may sound grand, but in reality, it just means placing one Mac on your desk. No racks or data centers are needed.
—
Technology 3: Stepwise Routing — Automatically Optimizing the ‘Amount of Thinking’
The third technology is stepwise routing. While it may seem the least flashy, it could be the most practical.
Here’s how it works: the moment it receives input from a user, it automatically assesses the complexity of the task and allocates it to the most appropriately sized model.
- “What’s the weather today?” → Small model (cost: 0.01 yen/call)
- “Analyze the risks in this contract.” → Large model (cost: 5 yen/call)
The router automates a judgment humans previously made by hand: which model should handle this task. It can even switch at the level of individual inference steps, so a single response can use a small model for the easy parts and a large model for the complex sections.
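A minimal router can be sketched with crude heuristics; the keyword list and length threshold below are illustrative placeholders, and production routers typically use a small learned classifier instead:

```python
# Illustrative complexity hints; a real router would use a trained
# classifier, not a hand-written keyword list.
COMPLEX_HINTS = ("analyze", "contract", "risk", "legal", "strategy")

def route(prompt: str) -> str:
    """Pick a model tier from rough complexity signals in the prompt."""
    text = prompt.lower()
    if len(text.split()) > 40 or any(h in text for h in COMPLEX_HINTS):
        return "large"  # GPT-4-class, ~5 yen/call
    return "small"      # GPT-3.5-class, ~0.01 yen/call
```

With this in front of the API call, “What’s the weather today?” goes to the small model and “Analyze the risks in this contract.” goes to the large one, with no human in the loop.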
What’s remarkable about this is that it allows you to break free from the mindset of ‘throwing everything at GPT-4.’
In fact, analyzing API usage logs shows that 60-70% of tasks sent to GPT-4 could be handled adequately by GPT-3.5 Turbo or even smaller models. GPT-4’s token cost is roughly 20 times that of GPT-3.5 Turbo, so routing 70% of tasks to smaller models cuts API costs to less than half.
Monthly Cost Breakdown:
- Before Implementation: 80,000 yen (all tasks handled by GPT-4)
- After Implementation: 30,000 to 40,000 yen (after routing, with 70% allocated to small models)
- Implementation Effort: Just adding one layer of router to the API call section.
—
What Happens When You Combine the Three?
Now we get to the main point. These three technologies are not mutually exclusive. They can be used in combination.
| Measure | Reduction Rate | Monthly Cost (based on 80,000 yen) |
|---|---|---|
| ① Cache Optimization Only | ▲60-70% | 24,000-32,000 yen |
| ② Local Quantization Only | ▲85-95% | 5,000-10,000 yen |
| ③ Routing Only | ▲50-60% | 30,000-40,000 yen |
| ① + ③ (if continuing API use) | ▲80-85% | 12,000-16,000 yen |
| ② + ③ (using local + API) | ▲90% or more | Below 5,000 yen |
Even if you continue using the API, with cache + routing, you can expect a monthly cost of 12,000 to 16,000 yen. If you focus on local models, a world where costs drop below 5,000 yen becomes visible.
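The combined rows in the table can be reproduced with a toy cost model. The hit rate and routing share below are illustrative; the article’s published ranges reflect slightly more conservative assumptions:

```python
def combined_cost(base_yen: float, cache_hit_rate: float,
                  small_share: float, small_cost_ratio: float) -> float:
    """Apply caching first, then route the calls that survive the cache."""
    after_cache = base_yen * (1 - cache_hit_rate)
    return after_cache * ((1 - small_share) + small_share * small_cost_ratio)

# 60% cache hits, then 70% of remaining calls routed to a model
# costing 1/20 as much per call:
print(combined_cost(80_000, 0.60, 0.70, 1 / 20))  # 10720.0 yen/month
```

The key point the model makes visible: the two measures multiply rather than add, which is why the combined reduction exceeds either measure alone.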
From 80,000 yen to 5,000 yen. That’s a reduction of about 900,000 yen annually. For small and medium enterprises, this 900,000 yen represents part of labor costs, seed money for new businesses, or employee bonuses.
—
So, What Should You Do?
I won’t say to implement all three at once. The realistic steps are as follows:
What to Do Today (Time Required: 30 minutes)
- Check your company’s API usage logs. Understand how much you are paying monthly and what tasks you are using it for.
What to Do This Week (Time Required: Half a day)
- Implement a cache optimization tool. You only need to change one API endpoint. This alone could potentially cut your monthly costs by nearly half.
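“Changing one API endpoint” usually means pointing your existing client at the tool’s proxy. A hedged sketch; the environment-variable name and proxy URL below are placeholders, not distillfast.com’s documented interface:

```python
import os

DEFAULT_BASE = "https://api.openai.com/v1"

def api_base() -> str:
    """Use the caching proxy when configured, else call the API directly."""
    return os.environ.get("LLM_PROXY_BASE", DEFAULT_BASE)
```

Most OpenAI-compatible SDKs accept a base-URL override at client construction, so the rest of the calling code stays unchanged.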
What to Do This Month (Time Required: 2-3 days)
- Introduce routing. Implement a system that automatically switches between GPT-4 and GPT-3.5 Turbo (or GPT-4o mini) based on task complexity.
Considerations for Next Month and Beyond
- Consider introducing a local quantization model. Purchase a Mac Studio, select an open model, and apply int4 quantization. This step may have a slightly higher technical hurdle, so bringing in external support could be beneficial.
The important thing is not to try to do everything at once. Just by optimizing the cache, you can save 40,000 to 50,000 yen monthly. Start there and use the saved budget to move on to the next step.
—
What Truly Changes is Not the ‘Cost’ but the ‘Playing Field’
Finally, I want to step back from the technical discussion.
The operational cost of AI dropping from 80,000 yen to 16,000 yen is not just about cost reduction.
The systems that large corporations built by investing hundreds of thousands of yen monthly in AI can now be replicated by small and medium enterprises for 10,000 to 20,000 yen per month. Financial scale will no longer be an advantage.
In fact, small and medium enterprises, which can make decisions quickly and are directly aware of on-site challenges, will find themselves in a better position to benefit from AI. While large corporations take six months to get approvals, small and medium enterprises can start moving as early as next week.
What falling AI costs will bring is a “reversal between large and small enterprises.”
The question is whether one will realize this structural change and take action. The technology is ready. Now, it’s just a matter of taking action.