A 1 Trillion Parameter LLM Runs on 768GB of Memory — The Conventional Wisdom That ‘AI is Something to Borrow’ is Coming to an End
Conclusion First: The Cost of ‘Owning AI’ is Starting to Break
A 1 trillion parameter LLM can run on a single server.
With a machine equipped with 768GB of Intel Optane Persistent Memory and one GPU, it is now possible to run local inference on a model whose parameter count is comparable to what GPT-4 is reported to have. Throughput is approximately 4 tokens per second. It’s not fast, but the fact that it “works” is significant.
Why is this important? Until now, running a model of this scale required a GPU cluster costing tens of millions of yen or relying on APIs from OpenAI or Google. Now, it has become accessible with an initial investment of about 2 to 3 million yen for the entire hardware setup.
I urge small and medium-sized business owners to consider what this means.
“AI is for large corporations” and “AI is something to borrow from the cloud” — these assumptions are being fundamentally shaken by the collapse in hardware prices.
What Happened: A Tectonic Shift in Memory Prices
Intel Optane Persistent Memory (PMem) was originally an expensive memory solution aimed at data centers. However, since Intel announced its withdrawal from the Optane business, a large stock has been released into the market. As a result, a 768GB configuration (6 x 128GB DIMMs) is now available for around $5,000 (approximately 750,000 yen).
If you were to assemble 768GB using standard DDR5 DRAM, it would cost over 1 million yen just for the memory. While Optane has narrower bandwidth and higher latency than standard DRAM, these weaknesses are not critical for the purpose of “keeping the parameters of a large model in memory.” The bottleneck during inference is the GPU’s computational speed, meaning that memory only needs to function as storage for parameters.
In other words, this setup is not a compromise on performance but a configuration optimized for the use case. This is a crucial point.
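To see why 768GB is enough to hold a 1 trillion parameter model, it helps to work out the size of the weights at different precisions. The following is a minimal sketch, not a description of the actual setup: the article does not state which quantization is used, so the precisions below are assumptions, GB is treated as 10^9 bytes, and overhead such as the KV cache and activations is ignored. The function name is hypothetical.

```python
NUM_PARAMS = 1e12   # 1 trillion parameters (from the article)
MEMORY_GB = 768     # Optane PMem capacity (from the article)


def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Size of the model weights in GB (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# Largest average bits per weight that still fits, ignoring all overhead.
max_bits = MEMORY_GB * 1e9 * 8 / NUM_PARAMS

for bits in (16, 8, 4):   # assumed precisions, not stated in the article
    size = weights_size_gb(NUM_PARAMS, bits)
    verdict = "fits" if size <= MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:,.0f} GB -> {verdict} in {MEMORY_GB} GB")

print(f"Upper bound: {max_bits:.2f} bits per weight")
```

Under these assumptions, 16-bit and 8-bit weights exceed 768GB, while 4-bit weights (500GB) fit with room to spare, so a configuration like this presumably relies on a quantized model.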
Calculating the Break-Even Point Between ‘Borrowing’ and ‘Owning’
Let’s compare with specific numbers.
[If Owned In-House]
- Intel Optane PMem 768GB configuration: Approximately 750,000 yen
- GPU (equivalent to NVIDIA A100 80GB, second-hand market): Approximately 800,000 to 1,200,000 yen
- Server body (CPU, motherboard, power supply, storage, etc.): Approximately 500,000 to 800,000 yen
- Total Initial Investment: Approximately 2,000,000 to 2,800,000 yen
- Monthly electricity cost (assuming continuous operation at about 600W): Approximately 10,000 to 15,000 yen
- Maintenance and operational costs: Approximately 10,000 yen per month
- Monthly Running Cost: Approximately 20,000 to 30,000 yen
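The running cost figures above can be checked with a short calculation. This is a rough sketch under stated assumptions: the electricity tariffs of 25 to 35 yen per kWh and the 30-day month are not given in the article, and the function names are hypothetical.

```python
POWER_WATTS = 600            # continuous draw (from the article)
MAINTENANCE_YEN = 10_000     # monthly maintenance (from the article)
HOURS_PER_MONTH = 24 * 30    # assumption: 30-day month, running 24 hours a day


def monthly_kwh(watts: float) -> float:
    """Energy used in one month of continuous operation, in kWh."""
    return watts * HOURS_PER_MONTH / 1000


def monthly_running_cost_yen(watts: float, yen_per_kwh: float) -> float:
    """Electricity plus maintenance for one month, in yen."""
    return monthly_kwh(watts) * yen_per_kwh + MAINTENANCE_YEN


for rate in (25, 30, 35):    # assumed tariffs in yen per kWh
    electricity = monthly_kwh(POWER_WATTS) * rate
    total = monthly_running_cost_yen(POWER_WATTS, rate)
    print(f"{rate} yen/kWh: electricity {electricity:,.0f} yen, total {total:,.0f} yen")
```

At 600W the server uses 432 kWh per month, so the assumed tariffs give an electricity cost of roughly 10,800 to 15,120 yen, in line with the estimate above. Adding maintenance gives about 21,000 to 25,000 yen, which sits in the lower half of the stated range for the monthly running cost.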
[If Using Cloud API]
- API usage fee for GPT-4 class: Input $30/1 million tokens, Output $60/1 million tokens (as of 2024)
- If consuming 500,000 tokens per day internally, that amounts to approximately 15 million tokens per month
- Monthly Cost: Approximately 60,000 to 100,000 yen
- Costs can skyrocket with increased usage
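The monthly API cost depends heavily on how the 15 million tokens split between input and output, which the article does not specify. The sketch below is an illustration under assumptions: the exchange rate of 150 yen per USD is inferred from the "$5,000 (approximately 750,000 yen)" figure earlier, and the output shares are hypothetical.

```python
# Figures from the article: GPT-4 class pricing (as of 2024) and monthly volume.
INPUT_USD_PER_MILLION = 30
OUTPUT_USD_PER_MILLION = 60
MONTHLY_TOKENS_MILLIONS = 15   # 500,000 tokens per day * 30 days

# Assumption: 150 yen per USD, implied by "$5,000 (approximately 750,000 yen)".
YEN_PER_USD = 150


def monthly_api_cost_yen(output_share: float) -> float:
    """Monthly API cost in yen when `output_share` of all tokens are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    input_millions = MONTHLY_TOKENS_MILLIONS * (1 - output_share)
    output_millions = MONTHLY_TOKENS_MILLIONS * output_share
    usd = (input_millions * INPUT_USD_PER_MILLION
           + output_millions * OUTPUT_USD_PER_MILLION)
    return usd * YEN_PER_USD


for share in (0.0, 0.25, 0.5, 1.0):   # hypothetical input/output mixes
    print(f"output share {share:.0%}: {monthly_api_cost_yen(share):,.0f} yen/month")
```

Under these assumptions the cost runs from about 67,500 yen (all input) to 135,000 yen (all output), so the estimate above corresponds to an input-heavy workload. A different exchange rate shifts the whole range proportionally.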
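For a company spending 80,000 yen per month on API usage, switching saves about 50,000 to 60,000 yen per month once the running costs of an in-house server are subtracted. At the low end of the hardware estimate (2,000,000 yen), the break-even point is around 33 months, which means the investment is recouped in just under three years. At the high end (2,800,000 yen, with 30,000 yen in monthly running costs), it stretches to roughly 56 months.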
However, there is a critical factor that should not be overlooked here. API usage tends to only increase. As AI utilization becomes more ingrained within a company, it’s not uncommon for token consumption to double or triple within six months. In such cases, the break-even point can shrink to 12 to 18 months.
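One way to sanity-check the break-even figures is to divide the initial investment by the monthly saving, where the saving is the API spend minus the in-house running cost. The sketch below assumes exactly that simple model (no interest, depreciation, or resale value) and uses the low and high ends of the hardware estimate. The function name and the scenario labels are hypothetical.

```python
import math
from typing import Optional


def break_even_months(investment_yen: float,
                      api_yen_per_month: float,
                      running_yen_per_month: float) -> Optional[int]:
    """Months until cumulative savings cover the investment, or None if never."""
    saving = api_yen_per_month - running_yen_per_month
    if saving <= 0:
        return None   # owning costs as much as or more than the API each month
    return math.ceil(investment_yen / saving)


# (label, initial investment, monthly running cost), ranges from the article
scenarios = [
    ("low-end hardware", 2_000_000, 20_000),
    ("high-end hardware", 2_800_000, 30_000),
]

# 80,000 yen per month today, then usage doubling and tripling
for api_spend in (80_000, 160_000, 240_000):
    for label, investment, running in scenarios:
        months = break_even_months(investment, api_spend, running)
        print(f"API {api_spend:>7,} yen/month, {label}: {months} months")
```

The result is sensitive to which end of the estimate is used, and it drops sharply as API spend grows, which is the point of the growth argument above.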
Moreover, owning the hardware provides a qualitative difference of “unlimited use.” Many companies limit their usage due to concerns over API pay-as-you-go pricing. “I won’t use it because it feels wasteful” — this represents the greatest opportunity loss in AI utilization.
Is 4 Tokens Per Second Enough?
The question naturally arises: “Isn’t 4 tokens per second too slow?”
Four tokens per second translates to about 2 to 3 characters per second in Japanese. It is indeed slow for chat responses and would be challenging for real-time customer support.
However, consider the use cases where AI is genuinely needed in small and medium-sized enterprises:
- Nighttime Batch Processing: Summarizing daily reports, classifying inquiry emails, generating drafts for quotes. These tasks only need to be completed by morning. Even at 4 tokens per second, the server can generate more than 100,000 tokens overnight (about 115,000 in eight hours).
- Internal Knowledge Search: Generating answers from manuals or past meeting minutes. A wait time of a few seconds to tens of seconds is significantly shorter than the time it takes a human to search for materials.
- Generating Data Analysis Reports: Feeding sales data or customer data to produce analysis reports. Even if it takes 30 seconds, it automates tasks that previously took a human half a day.
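To judge whether a given workload fits these use cases, the throughput can be translated into tokens per batch window and seconds per response. This is a rough sketch that assumes the 4 tokens per second figure applies to generated output at a constant rate; the article does not give a prompt-processing speed, so input tokens are left out. The function names and the example sizes are hypothetical.

```python
TOKENS_PER_SECOND = 4   # generation speed (from the article)


def tokens_in_window(hours: float) -> int:
    """Tokens that can be generated in a batch window of the given length."""
    return int(TOKENS_PER_SECOND * hours * 3600)


def seconds_for(tokens: int) -> float:
    """Time needed to generate a response of the given length."""
    return tokens / TOKENS_PER_SECOND


for hours in (8, 12, 24):          # assumed batch windows
    print(f"{hours:>2} h window: {tokens_in_window(hours):,} tokens")

for tokens in (120, 500, 2000):    # hypothetical response lengths
    print(f"{tokens:>5} tokens: {seconds_for(tokens):,.0f} s")
```

An eight-hour night yields 115,200 generated tokens, and a 30-second wait corresponds to a response of about 120 tokens, so longer reports are better scheduled as batch jobs than awaited interactively.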
For tasks requiring speed, APIs can be utilized, while cost-sensitive bulk processing can be handled on in-house servers. This hybrid operation is likely the most realistic solution for small and medium-sized enterprises.
The Option of “Joint Purchase of AI Servers”
An initial investment of 2 to 3 million yen is not a trivial amount for small and medium-sized enterprises.
Here, I want to highlight the concept of decentralized AI compute cooperatives. Essentially, it is about “joint purchasing and sharing of AI servers.”
If five local small and medium-sized enterprises jointly own one AI server, the initial investment per company would be around 400,000 to 600,000 yen. Monthly running costs can also be shared. Since different industries will have varied usage times, the utilization rate can increase.
In fact, several platforms for joint GPU utilization have been established overseas. In Japan, if chambers of commerce or local industry support organizations take the lead, this model could be sufficiently viable.
The essence of this system lies in “reproducing the cost advantages that large corporations gain through economies of scale by collaborating among small and medium-sized enterprises.” One company cannot compete with a large corporation, but if five or ten companies collaborate, they can procure computational resources at costs equal to or lower than those of large corporations.
It is important to design the system in such a way that data remains separate for each company while only the computational resources are shared. This way, confidentiality issues can be resolved.
The Question: “Is it Okay to Continue with API Billing?”
Many small and medium-sized enterprises are starting to utilize AI through APIs from OpenAI or Google. This is a valid approach. APIs, which can be started with zero initial investment, are optimal as a first step.
However, I urge you to pause and consider.
The API billing model increases costs the more you use it. As AI utilization progresses, it creates a structure that pressures profits. Moreover, the pricing power of APIs lies with the providers. If prices increase, companies have no choice but to comply. If a model is discontinued, they must rebuild their workflows from scratch.
This is not “utilizing AI” but rather a state of “being dependent on AI.”
The greatest value of owning hardware is not cost reduction but regaining control.
- Freedom to choose models (Llama, Mistral, Qwen, Japanese specialized models, etc.)
- Data does not leave the organization
- Ability to experiment without worrying about usage
- Not being swayed by changes in API specifications or price adjustments
Particularly, the third point — “the ability to experiment without worrying about usage” — is critically important for AI utilization in small and medium-sized enterprises. The success of AI utilization is determined by how much trial and error can be conducted. Many companies are more hindered by the fear of pay-as-you-go billing than they realize.
So, What Should We Do?
I won’t say that every company should buy an AI server right now.
However, companies that meet the following conditions should seriously consider it:
- Monthly API usage fees exceed 50,000 yen — the break-even point falls within a realistic range
- Want to process highly confidential data with AI — customer information, financial data, contracts, etc.
- Want to implement AI utilization company-wide, but pay-as-you-go pricing is a barrier — an unlimited use environment can be a breakthrough
- There are companies in the region facing the same challenges — costs can be distributed through joint ownership
A 1 trillion parameter LLM runs on 768GB of memory. This fact indicates that the democratization of AI is occurring at the hardware level.
Will we continue to depend on the “convenient but expensive” services provided by cloud giants? Or will we take control of AI ourselves?
The day we are forced to make this choice is coming sooner than we think. In fact, it has already arrived.