The Era of Running Multiple AIs on a Single GPU: Technology Halving Inference Costs Makes “50,000 Yen Monthly AI Operations” a Reality
Conclusion
To put it simply, technology that can reduce AI operational costs to less than half is already in the practical stage.
Many small and medium-sized business owners have encountered the barrier of wanting to use AI in-house but finding it too expensive.
The API costs for GPT can reach tens of thousands of yen per month. If you try to run your own LLM, you’ll need a server with a GPU costing around 800,000 yen, plus monthly cloud fees in the tens of thousands. Many have likely given up, thinking, “It’s impossible for a company of our size.”
However, that assumption is beginning to crumble.
This article will discuss three technologies:
- Prism — Running multiple AI models simultaneously on a single GPU, halving inference costs.
- TWLA — Compressing AI model sizes to less than one-fifth, allowing them to run on smaller machines.
- Edge RAG — Completing AI search and answer generation on a single laptop.
All three technologies make AI more affordable, but cost is not the whole story. What they fundamentally overturn is the common belief that “the same infrastructure as large corporations is necessary to use AI.” Let’s take a closer look at each.
—
1. Prism: A System for “Sharing a Single GPU Among Multiple AIs”
What’s Happening
GPUs are expensive. An NVIDIA H100 costs over 4 million yen. Even renting one in the cloud can cost hundreds to thousands of yen per hour.
However, in actual operations, the majority of GPU memory is held by models that are “idle.” While chatbot model A is not responding, the memory it occupies sits unused. The same goes for translation model B. When multiple AI models are loaded, memory fills up while computational capacity remains underutilized. That is the reality.
Prism addresses this issue.
What is “Memory Ballooning”?
The mechanism is simple. Memory held by models that are not in use is shrunk like a balloon and reallocated to the models that are. When model A is busy, memory is concentrated on A; when B becomes busy, B’s share expands. It applies to GPU memory the same idea an operating system uses to manage main memory virtually.
Traditional methods offered a choice between “time-sharing” (models take turns on the GPU) and “partitioning” (the GPU is split into fixed slices). Both waste significant capacity. Prism dynamically reallocates memory based on real-time load, greatly increasing the operational efficiency per GPU.
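The ballooning idea can be illustrated with a toy allocator. This is only a sketch under stated assumptions, not Prism's actual algorithm: the function `rebalance`, the model names, the per-model memory floors, and the load numbers are all hypothetical. Each model keeps a fixed floor it cannot give up, and the remaining GPU memory is shared in proportion to current load.

```python
# Toy model of load-proportional memory sharing between models on one GPU.
# All names and numbers are hypothetical; this is not Prism's actual algorithm.

def rebalance(total_gb, min_gb, load):
    """Split total_gb among models: each keeps a floor (min_gb), and the
    remaining memory is shared in proportion to each model's load."""
    floor = sum(min_gb.values())
    if floor > total_gb:
        raise ValueError("models do not fit even at their minimum size")
    spare = total_gb - floor
    total_load = sum(load.values())
    alloc = {}
    for name in min_gb:
        # With no traffic at all, split the spare memory evenly.
        share = load[name] / total_load if total_load else 1 / len(min_gb)
        alloc[name] = min_gb[name] + spare * share
    return alloc


min_gb = {"chatbot": 16, "translator": 14, "summarizer": 10}

# Chatbot is busy, the others are nearly idle: its allocation inflates.
busy_chat = rebalance(80, min_gb, {"chatbot": 90, "translator": 5, "summarizer": 5})
# Traffic shifts to the translator: memory flows to it instead.
busy_trans = rebalance(80, min_gb, {"chatbot": 5, "translator": 90, "summarizer": 5})

for label, alloc in [("chat busy", busy_chat), ("translator busy", busy_trans)]:
    print(label, {k: round(v, 1) for k, v in alloc.items()})
```

In both cases the allocations add up to the full 80GB, so nothing sits reserved for a model that is not using it. A production system would also have to handle the cost of moving memory and latency targets, which this sketch ignores.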
Effects in Numbers
- Currently in commercial operation on over 10,000 GPUs (proven in ByteDance’s production environment).
- The number of models that can be loaded per GPU increases, reducing inference costs by up to 50%.
- The SLO (Service Level Objective) violation rate is also lower than traditional methods.
For small and medium-sized enterprises, this means that instead of renting two cloud GPU instances, they can manage with just one. If two instances cost 200,000 yen per month, for example, the bill drops to 100,000 yen. Alternatively, they can run an internal chatbot, meeting minutes summarization, and translation simultaneously on a single GPU. The era of “setting up servers for each use case” is over.
—
2. TWLA: Quantization Technology Compressing Models to One-Fifth
Why Are Models So Large?
A typical LLM (for example, the 70B-parameter Llama 3 model) needs about 140GB of memory for its weights at 16-bit precision (70 billion parameters × 2 bytes). That alone requires two 80GB GPUs.
TWLA (Ternary Weights and Low-Bit Activations) pushes this “precision” to the limit.
What Exactly Does It Do?
- Compresses weights to 1.58 bits — about one-tenth of the usual 16 bits.
- Quantizes activations (intermediate calculation values) to 4 bits — one-fourth of the usual 16 bits.
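What does 1.58 bits mean? Each weight is rounded to one of three values (-1, 0, +1), which is why the scheme is called ternary; three possible states carry log2(3) ≈ 1.58 bits of information. Multiplying by -1, 0, or +1 reduces to subtracting, skipping, or adding, so multiplication is replaced by addition and subtraction, which opens the door to running inference without a dedicated GPU.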
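The ternary idea can be sketched in a few lines of plain Python. Note the assumptions: the source does not give TWLA's exact rounding rule, so the scaling by the mean absolute weight used here is borrowed from BitNet b1.58-style schemes, and the function names `ternarize` and `ternary_dot` and the sample numbers are hypothetical.

```python
# Sketch of ternary (-1, 0, +1) weight quantization in plain Python.
# The absmean scaling rule is an assumption borrowed from BitNet b1.58-style
# schemes; the source does not specify TWLA's exact rule.
import math

def ternarize(weights):
    """Map each weight to -1, 0, or +1 and return (ternary, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def ternary_dot(ternary, scale, activations):
    """Dot product using only additions and subtractions, plus one final
    multiplication by the shared scale."""
    acc = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0 contributes nothing, so it is skipped entirely
    return acc * scale

weights = [0.9, -0.05, -1.1, 0.4, 0.02, -0.7]
activations = [1.0, 2.0, -1.0, 0.5, 3.0, 1.5]

ternary, scale = ternarize(weights)
exact = sum(w * a for w, a in zip(weights, activations))
approx = ternary_dot(ternary, scale, activations)

print("ternary weights:", ternary)
print("bits per weight:", round(math.log2(3), 2))  # three states -> log2(3)
print("exact:", round(exact, 3), "approx:", round(approx, 3))
```

On a six-element toy vector the approximation is rough. The point of the sketch is the arithmetic inside the loop, which contains no multiplications, not the accuracy.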
Is the Accuracy Okay?
This is a natural concern. The TWLA paper validated it using Llama 3 models, keeping accuracy degradation within an average of 5% compared to full precision. Depending on the use case, this is sufficiently practical for internal FAQ responses or document summarization.
Impact on Costs
A 70B model shrinks from 140GB to about 30GB or less. This means it can easily fit on a single GPU (80GB).
- From two GPUs to one: halving hardware costs.
- Significant reduction in power consumption: since multiplication is eliminated, inference power consumption can be reduced by up to 80%.
- The grade of cloud GPU instances can be downgraded: further compressing monthly costs.
Specifically, calculations suggest that a 70B-class model could run on an AWS g5.xlarge instance (one A10G GPU, about 50,000 yen per month), provided the compressed model fits within the A10G’s 24GB of memory. Previously, a p4d.24xlarge (eight A100 GPUs, about 3 million yen per month) was required. 3 million yen becomes 50,000 yen. That is a roughly 60-fold reduction, approaching two orders of magnitude.
Of course, processing speed differs between running eight GPUs at full capacity and running a compressed model on one. However, for internal use in small and medium-sized enterprises, where the load is a few hundred requests a day, monthly operating cost matters far more than raw speed.
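The size and cost figures in this section can be checked with back-of-the-envelope arithmetic. The sketch below counts weights only; activations, KV cache, and any layers kept at higher precision are ignored, which is an assumption of this example rather than something the source states. The helper name `weight_gb` is hypothetical, and the 4-bit row is included only as a familiar point of comparison.

```python
# Back-of-the-envelope weight-memory estimate. Counts weights only;
# activations, KV cache, and packing overhead are ignored (assumption).

def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

PARAMS_70B = 70e9

fp16_gb = weight_gb(PARAMS_70B, 16)       # 16-bit baseline
ternary_gb = weight_gb(PARAMS_70B, 1.58)  # ternary weights
int4_gb = weight_gb(PARAMS_70B, 4)        # 4-bit weights, for comparison

print(f"16-bit:   {fp16_gb:.1f} GB")
print(f"1.58-bit: {ternary_gb:.1f} GB")
print(f"4-bit:    {int4_gb:.1f} GB")

# Monthly cost figures quoted in this article (yen).
cost_before, cost_after = 3_000_000, 50_000
print(f"cost ratio: {cost_before / cost_after:.0f}x")
```

The weights-only estimate for 1.58 bits comes out well under the “about 30GB or less” quoted above, which leaves room for the overhead this sketch leaves out.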
—
3. Edge RAG: Completing AI Search on a Single Laptop
What is RAG? (In 30 seconds)
It’s a system that lets AI read internal documents and answer questions based on them. RAG stands for Retrieval-Augmented Generation: it combines “Retrieval” (search) with “Generation.” Think of it as feeding internal manuals into ChatGPT.
Traditionally, RAG was commonly run on cloud-based GPU servers, as both vector searches and LLM generation require computational power.
What Has Changed?
Research has emerged using the NPU (Neural Processing Unit) embedded in Qualcomm’s Snapdragon X Elite to complete the entire RAG process on a single laptop.
- Embedding (vectorizing documents)
- Re-ranking (reordering search results)
- LLM generation (creating answers)
All three steps are processed on-device.
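The three steps above can be sketched as a toy pipeline. Every component here is a hypothetical stand-in chosen so the example runs anywhere: word-count vectors instead of a neural embedding model, a word-overlap score instead of a neural re-ranker, and a template instead of an LLM. It shows the shape of the flow, not the NPU implementation described in the research.

```python
# Toy end-to-end RAG pipeline in plain Python. Every component is a
# hypothetical stand-in: bag-of-words "embeddings", a word-overlap
# re-ranker, and a template "generator" instead of real neural models.
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Step 1: rank documents by embedding similarity and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(query, candidates):
    """Step 2: reorder candidates by the fraction of query words they contain."""
    q_words = set(query.lower().split())
    def overlap(doc):
        return len(q_words & set(doc.lower().split())) / len(q_words)
    return sorted(candidates, key=overlap, reverse=True)

def generate(query, context):
    """Step 3: stand-in for the LLM; it just quotes the best passage."""
    return f"Q: {query}\nA (from internal docs): {context[0]}"

docs = [
    "paid leave requests must be submitted five days in advance",
    "expense reports are due on the last business day of the month",
    "the office wifi password is rotated every quarter",
]

query = "when are expense reports due"
answer = generate(query, rerank(query, retrieve(query, docs)))
print(answer)
```

In a real Edge RAG setup each of the three functions would call a model running on the device, but the data flow from query to retrieved passages to generated answer stays the same.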
How’s the Performance?
- The NPU is up to 18.1 times faster than the CPU.
- Energy consumption is one-fourth that of the CPU.
- No network required. It operates offline.
What This Means for Small and Medium-Sized Enterprises
If this becomes practical, monthly cloud API costs could drop to zero.
Currently, building RAG with OpenAI’s API can easily cost between 30,000 and 100,000 yen per month, depending on document volume and question frequency. With Edge RAG, the only cost is the initial investment in the laptop (150,000 to 250,000 yen). Ongoing costs are just electricity.
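A quick way to read these figures is as a break-even period. The sketch below uses only the yen ranges quoted above; electricity and maintenance are ignored, which is a simplifying assumption, and the function name `break_even_months` is hypothetical.

```python
# Break-even estimate: one-time laptop purchase vs. recurring API fees.
# Uses only the yen ranges quoted above; electricity and maintenance
# are ignored (assumption).

def break_even_months(laptop_yen, api_yen_per_month):
    """Months of API fees that add up to the laptop's purchase price."""
    return laptop_yen / api_yen_per_month

best = break_even_months(150_000, 100_000)   # cheap laptop, heavy API use
worst = break_even_months(250_000, 30_000)   # expensive laptop, light API use

print(f"best case:  {best:.1f} months")
print(f"worst case: {worst:.1f} months")
```

Under these assumptions the laptop pays for itself within the first year across the whole quoted range.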
Moreover, data does not leave the company. For industries that handle personal information, such as healthcare, nursing care, legal, and human resources, this is the dividing line between being able to use AI and not. Companies that cannot send customer data to the cloud can still run AI search on their own PCs. That matters a great deal.
—
Considering a Structure for In-House AI Operations at 50,000 Yen per Month
By combining these three technologies, a structure like this emerges:
| Item | Configuration | Monthly Cost |
|---|---|---|
| LLM Inference | TWLA Quantized Model + 1 Cloud GPU (g5.xlarge equivalent) | About 50,000 yen |
| Internal RAG | Edge RAG (NPU-equipped laptop) | 0 yen (initial investment only) |
| Multi-Model Operation | Consolidated with Prism-like memory sharing | No additional cost |
| Total | | About 50,000 yen |
A configuration that would have cost 300,000 to 500,000 yen per month a year ago now costs just 50,000 yen. This fits within the “IT budget” of small and medium-sized enterprises.
Of course, this is a theoretical best-case configuration; in practice, model selection, tuning, and operational maintenance all take effort. However, there is a world of difference between “technically impossible” and “possible but requires effort.”
—
So, What Should Be Done?
None of these three technologies needs to be adopted right away. Prism is designed for large-scale environments, so its direct use by small and medium-sized enterprises is still limited. Much of TWLA-style quantization is still in the research phase. Edge RAG is waiting for compatible devices to become widespread.
However, the direction is clear.
- AI inference costs will continue to decrease.
- The required hardware specifications will continue to drop.
- Dependence on the cloud will become just one of many options.
Small and medium-sized enterprises should focus on three things:
1. Decide “what to use AI for” first. It’s too late to think about it after the technology becomes cheaper. Identify where AI can be integrated into operations for effective results.
2. Start small. Quantized models can be tested on a personal PC today using tools such as llama.cpp. We are in an era where you can validate whether “AI can be used in-house” at zero monthly cost. There’s no reason not to do it.
3. Plan with the assumption of changing cost structures. What costs 300,000 yen this month might drop to 50,000 yen in six months. Instead of saying, “It’s too expensive, let’s pass,” decide “at what price we will proceed.”
The decrease in AI costs means that “financial strength” will no longer be a competitive advantage. A system equivalent to what large corporations built for tens of millions of yen can now be obtained for 50,000 yen per month.
This is not a threat. For small and medium-sized enterprises, it is rather an opportunity. Being faster in decision-making and closer to the field than large corporations, small and medium-sized enterprises are in a position to leverage cheaper AI first.
There’s no need to wait for technology to become cheaper. It has already begun to get cheaper.