How AI Chips Overcome the "Memory Wall"

Traditionally, consumer GPUs are designed for gaming and rendering. However, they can also handle other tasks that require parallel computing.

For instance, a GPU can run a proof-of-work (PoW) miner for cryptocurrency. But after losing the competition to specialized hardware, GPU farms have become a solution only for niche projects.

A similar situation is unfolding in the AI sector. Graphics cards have become the primary computing tool for neural networks. However, as the industry evolves, there is a growing demand for specialized solutions for AI tasks. ForkLog delves into the latest developments in the AI arms race.

Silicon Optimization for AI

There are several approaches to creating specialized hardware for artificial intelligence tasks.

Consumer GPUs can be seen as a starting point on the path to specialization. Their ability to perform parallel matrix computations has been beneficial for deploying neural networks, particularly deep learning, but there is still ample room for improvement.

One of the main challenges of using GPUs for AI is the constant need to transfer large volumes of data between system memory and the GPU. These transfers can consume more time and energy than the computations themselves.
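To see why data movement can outweigh arithmetic, it helps to compare the number of operations with the number of bytes moved. The sketch below is a rough illustration, not a model of any particular GPU: the function name, the layer size, and the 2-byte element size are all assumptions chosen for the example.

```python
def arithmetic_intensity(m, n, batch, bytes_per_element=2):
    """FLOPs per byte moved for an (m x n) weight matrix times an (n x batch) input.

    Assumes every operand is read once and the result is written once,
    with no caching effects (a deliberate simplification).
    """
    flops = 2 * m * n * batch  # one multiply and one add per weight per input column
    elements_moved = m * n + n * batch + m * batch  # weights + inputs + outputs
    return flops / (elements_moved * bytes_per_element)


# Hypothetical layer size, used only for illustration.
single = arithmetic_intensity(4096, 4096, batch=1)
batched = arithmetic_intensity(4096, 4096, batch=256)
print(f"batch=1:   {single:.2f} FLOPs per byte")
print(f"batch=256: {batched:.2f} FLOPs per byte")
```

With a batch of one, roughly one operation is performed per byte transferred, so the memory link rather than the arithmetic units sets the pace. Larger batches reuse each weight many times and shift the balance back toward computation.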

Another issue with GPUs stems from their versatility. The architecture of graphics cards is designed for a wide range of tasks—from rendering graphics to general-purpose computations. As a result, some hardware blocks become redundant for specialized AI workloads.

A separate limitation is the data format. Historically, graphics processors have been optimized for FP32 operations on 32-bit floating-point numbers. Training and inference typically use lower-precision formats: 16-bit FP16 and BF16, and integer INT8 and INT4.
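As an illustration of what moving to a lower-precision format involves, here is a minimal sketch of symmetric INT8 quantization with a single shared scale factor. This is one common textbook scheme, not a description of how any specific chip or framework does it, and the function names and sample values are made up for the example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max error: {max_error:.5f}")
```

Each weight now occupies one byte instead of four, at the cost of a rounding error of at most half a quantization step.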

Nvidia H200 and B200

Among the most popular products for inference and training are the H200 chips and DGX B200 server systems, which essentially represent "enhanced" GPUs for data centers.

The primary AI-oriented components of these accelerators are the tensor cores, designed for the fast matrix operations that dominate model training and batch inference.

To reduce data access latency, Nvidia equips its cards with a massive amount of high-bandwidth memory (HBM). The H200 features 141 GB of HBM3e with a bandwidth of 4.8 TB/s, while the B200 has even higher specifications depending on the configuration.
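The bandwidth figure matters because, when generating text one token at a time for a single request, the accelerator has to read roughly all model weights for every token. A back-of-the-envelope upper bound follows from dividing bandwidth by model size. The sketch below uses the 4.8 TB/s figure quoted above; the round 8-billion parameter count and the function name are assumptions, and the estimate ignores the KV cache, activations, and batching.

```python
def max_decode_tokens_per_second(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-stream decoding speed if every weight is read once per token."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


H200_BANDWIDTH = 4.8e12     # bytes per second, from the figure quoted above
PARAMS_8B = 8_000_000_000   # round number for an "8B" model (assumption)

fp16_bound = max_decode_tokens_per_second(PARAMS_8B, 2, H200_BANDWIDTH)
int8_bound = max_decode_tokens_per_second(PARAMS_8B, 1, H200_BANDWIDTH)
print(f"FP16 weights: at most {fp16_bound:.0f} tokens/s per stream")
print(f"INT8 weights: at most {int8_bound:.0f} tokens/s per stream")
```

Under these assumptions the bound is about 300 tokens per second at FP16 and 600 at INT8. Real single-stream throughput is lower, and serving many requests in one batch changes the picture, but the estimate shows why memory bandwidth and data format are so closely linked.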

Tensor Processing Unit

By 2015, Google had developed the Tensor Processing Unit (TPU), an ASIC based on systolic arrays and designed for machine learning.

In conventional processors such as CPUs and GPUs, each operation involves reading data, processing it, and writing intermediate results back to memory.

TPUs pass data through an array of blocks, each performing a mathematical operation and passing the result to the next. Memory access occurs only at the beginning and end of the computation sequence.
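The idea can be simulated in a few lines. The sketch below models an output-stationary systolic array, a textbook variant chosen for simplicity and not necessarily the dataflow Google uses: values of A move one cell to the right per cycle, values of B move one cell down, and each cell multiplies whatever passes through it and accumulates the result. The function name is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # running sum held inside each cell
    a_reg = [[0] * m for _ in range(n)]  # value of A currently passing through each cell
    b_reg = [[0] * m for _ in range(n)]  # value of B currently passing through each cell

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]   # shift right by one cell
            idx = t - i                          # row i is fed with a delay of i cycles
            a_reg[i][0] = A[i][idx] if 0 <= idx < k else 0
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]   # shift down by one cell
            idx = t - j                          # column j is fed with a delay of j cycles
            b_reg[0][j] = B[idx][j] if 0 <= idx < k else 0
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # multiply-accumulate in place
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Inputs enter only at the left and top edges of the grid, and results are read out once at the end. Every intermediate value travels between neighboring cells rather than through memory.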

This approach spends less time and energy on AI computations than a non-specialized graphics processor, but access to external memory remains a limiting factor.

Cerebras

The American company Cerebras has found a way to use an entire silicon wafer as a single processor. Normally, a wafer is cut into many smaller pieces to produce individual chips.

In 2019, the developers introduced their first Wafer-Scale Engine, made from a 300-mm wafer. In 2024, the company released the upgraded WSE-3, a chip with an area of roughly 46,000 mm² (about 460 cm²) and 900,000 cores.

The Cerebras architecture involves distributing SRAM memory blocks in close proximity to logic modules on the same silicon wafer. Each core operates with its own 48 KB of local memory, eliminating competition for access among cores.

According to the developers, a single WSE-3 can handle inference for many models. For larger tasks, multiple chips can be assembled into a cluster.
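A quick calculation shows how much on-chip memory these figures add up to. The sketch below multiplies the core count by the per-core memory quoted above and checks, very roughly, whether a model's weights would fit. It assumes that "KB" means 1,024 bytes and ignores activations, the KV cache, and any memory reserved for other purposes. The helper name and the example model sizes are hypothetical.

```python
CORES = 900_000
SRAM_PER_CORE_BYTES = 48 * 1024  # assumes "KB" means 1,024 bytes

total_sram_bytes = CORES * SRAM_PER_CORE_BYTES


def weights_fit_on_chip(num_params, bytes_per_param, capacity_bytes=total_sram_bytes):
    """Rough check: do the raw weights alone fit into the combined local memory?"""
    return num_params * bytes_per_param <= capacity_bytes


print(f"Combined local memory: {total_sram_bytes / 1e9:.1f} GB")
print("8B parameters at 16 bits fit: ", weights_fit_on_chip(8_000_000_000, 2))
print("70B parameters at 16 bits fit:", weights_fit_on_chip(70_000_000_000, 2))
```

The total comes to about 44 GB, which is enough for the weights of smaller models and explains why larger ones call for a cluster.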

Groq LPU

The company Groq (not to be confused with Grok from xAI) offers its own ASICs for inference based on the Language Processing Unit (LPU) architecture.

One of the key features of Groq chips is their optimization for sequential operations.

Inference relies on the sequential generation of tokens: each step depends on the completion of the previous one. In such conditions, performance depends more on the speed of a single thread than on the number of threads.

Unlike conventional general-purpose processors and some AI-specialized devices, Groq chips do not schedule instructions on the fly. Each operation is planned in advance in a kind of "timetable" that ties it to a specific moment in the processor's operation.
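The principle of planning everything ahead of time can be shown with a toy example. The sketch below is not Groq's compiler: the operation names, latencies, and the assumption of unlimited functional units are all invented for illustration. It only shows that when every latency is known in advance, each operation can be assigned a fixed start cycle before the program runs.

```python
def build_static_schedule(ops, deps, latency):
    """Give every operation a fixed start cycle before anything runs.

    `ops` must be listed in dependency order. Functional units are assumed
    to be unlimited, which a real compiler could not assume.
    """
    start = {}
    for op in ops:
        # An operation starts as soon as its slowest prerequisite has finished.
        start[op] = max((start[d] + latency[d] for d in deps.get(op, [])), default=0)
    return start


# Hypothetical operations and latencies (in cycles).
ops = ["load_input", "load_weights", "matmul", "activation"]
deps = {"matmul": ["load_input", "load_weights"], "activation": ["matmul"]}
latency = {"load_input": 2, "load_weights": 2, "matmul": 4, "activation": 1}

schedule = build_static_schedule(ops, deps, latency)
total_cycles = max(schedule[op] + latency[op] for op in ops)
for op in ops:
    print(f"cycle {schedule[op]:>2}: {op}")
print(f"total: {total_cycles} cycles")
```

Because nothing is decided at run time, the total of 7 cycles in this example is known before execution starts and is the same on every run.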

Like several other AI accelerators, the LPU combines logic and memory modules on a single chip to minimize data transfer costs.

Taalas

All the examples mentioned above imply a high degree of programmability. The model and required weights are loaded into rewritable memory. At any moment, an operator can load an entirely different model or make adjustments.

With this approach, performance depends on the availability, speed, and volume of memory.

Developers at Taalas went further, deciding to "embed" a specific model with pre-set weights directly into the chip at the transistor architecture level.

A model that typically operates as software is implemented at the hardware level, which eliminates the need for separate general-purpose data storage and its associated costs.

In its first solution—the HC1 inference card—the company used the open model Llama 3.1 8B.

The card uses low-precision 3-bit and 6-bit parameters, which speeds up processing. According to Taalas, the HC1 processes up to 17,000 tokens per second while remaining a relatively inexpensive device with low power consumption.

The company claims a thousandfold improvement over GPUs when energy consumption and cost are taken into account.

However, this method has a fundamental drawback—the inability to update the model without completely replacing the chip.

At the same time, the HC1 supports LoRA, a method for fine-tuning LLMs by adding a small set of extra weights. With the right LoRA configuration, the model can be turned into a specialist in a specific field.
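The generic LoRA formulation leaves the original weight matrix W untouched and adds a correction built from two much smaller matrices, A and B. The sketch below shows that formulation in plain Python with made-up numbers. It is not a description of how the HC1 implements LoRA, and the function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    W: d_out x d_in frozen weights; A: r x d_in; B: d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)                # frozen path, fixed in advance
    update = matvec(B, matvec(A, x))   # low-rank path, cheap to store and swap
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]   # frozen base weights (toy values)
A = [[1, 1]]           # rank-1 adapter, 1 x 2
B = [[1], [2]]         # rank-1 adapter, 2 x 1
x = [3, 4]

print(lora_forward(W, A, B, x, alpha=1))  # [10.0, 18.0]
```

Only A and B need to live in rewritable memory, and they are far smaller than W. This is what makes the approach compatible with a base model whose weights cannot be changed.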

Another challenge is related to the design and production process of such "physical models." Developing ASICs is costly and can take years. In the highly competitive AI industry, this is a significant limitation.

Taalas claims to have developed a new method for generating processor architecture aimed at addressing this issue. An automated system converts a model and a set of weights into a ready chip design within a week.

According to the company's estimates, the production cycle from receiving a previously unseen model to shipping finished chips that physically embody it will take about two months.

The Future of Local Inference

New specialized AI chips are primarily deployed in massive data centers that provide paid cloud services. Unconventional solutions, including "physical models" implemented directly in silicon, are no exception.

For consumers, these engineering advances will translate into lower service costs and faster performance.

At the same time, the emergence of simpler, cheaper, and more energy-efficient chips creates the groundwork for the popularization of local inference solutions.

Specialized AI chips are already present in smartphones, laptops, surveillance cameras, and even doorbells. They enable local task execution, ensuring low latency, autonomy, and privacy.

Radical optimization, even at the cost of flexibility in model selection and replacement, significantly expands the capabilities of such devices and allows for the integration of simple AI components into inexpensive mass-market products.

If most users begin directing their requests to models running on local devices, the load on data centers may decrease, reducing the risk of industry overload. Perhaps then there will be no need to seek radical ways to increase computing power, such as launching data centers into orbit.