How Designers Are Building Ultra-Low Power AI Chips

Cadence System Analysis


Although the electronics supply chain was the big story of 2022, many advances in AI flew under the radar until late in the year. The software world now makes copious use of AI for everything from photo touch-up to learning your favorite content on social media. Even with all the advances in AI implemented in software and on the internet, the hardware world continues to lag behind in AI capabilities.

One of the major problems inhibiting the implementation of AI at the edge and in IoT devices is processor architecture. The standard logic architecture available on the market is simply not optimized for power consumption or throughput. In other words, the number of compute operations completed per unit of power consumed by the device is too small.
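The efficiency figure described above can be written as a simple ratio. The following is a minimal sketch; the function name `ops_per_watt` and all of the numbers are hypothetical and chosen only to illustrate the comparison, not measurements of any real device.

```python
def ops_per_watt(ops_per_second: float, power_watts: float) -> float:
    """Compute efficiency: operations per second delivered for each watt drawn."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return ops_per_second / power_watts


# Hypothetical figures for a general-purpose core and a dedicated AI block.
general_purpose = ops_per_watt(ops_per_second=2e9, power_watts=0.5)
accelerator = ops_per_watt(ops_per_second=50e9, power_watts=0.1)

# Ratio of the two efficiencies (about 125x with these made-up inputs).
improvement = accelerator / general_purpose
print(f"general purpose: {general_purpose:.3g} ops/s/W")
print(f"accelerator:     {accelerator:.3g} ops/s/W")
print(f"improvement:     {improvement:.0f}x")
```

Because operations per second per watt is the same as operations per joule, the same ratio also describes how much work a battery-powered device gets out of a fixed energy budget.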

To overcome these challenges, semiconductor companies have focused on developing processor architectures that implement compute operations at the hardware level, rather than as software running on top of a general-purpose architecture. While this enables new capabilities not seen in earlier generations of processors, it also demands a heterogeneous design concept in order to meet performance demands.

Making Embedded AI Lower Power

The approach to implementing on-device AI inference and training has evolved over time, initially starting purely as a software implementation on CPUs. Next, implementation was done on GPUs, and a few years ago we began to see the introduction of AI accelerator chips that implement systolic array execution. FPGAs offer another approach, but the vendor support and IP needed to quickly build a product with edge AI inference capabilities surfaced only recently.

Companies are approaching AI compute on edge devices in one of three ways:

Heavily optimized models run on MCUs with TensorFlow Lite/TinyML
FPGAs (as custom accelerators), GPUs, or AI processors (as off-the-shelf accelerators)
Fully custom processors that include an AI compute block

All three of these options can provide low-power, low-latency inference on an edge device. MCUs are appropriate for simpler models with smaller inputs, whereas larger models with many more data inputs are being run on FPGAs and GPUs. GPUs are known for their large size and power consumption, meaning they are not appropriate for smaller edge devices and IoT devices. For this reason, the industry has focused on accelerators and custom processors, and FPGAs to a lesser extent.
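One common way models are slimmed down for MCU-class targets is weight quantization, where 32-bit floating point values are stored as 8-bit integers plus a scale factor. The sketch below assumes a symmetric per-tensor scheme and uses hypothetical helper names (`quantize_int8`, `dequantize`); it is written in plain Python to show the arithmetic and is not the TensorFlow Lite API.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to ints in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, worst_error)
```

Each stored weight shrinks from 4 bytes to 1 byte, and the rounding error per weight is bounded by half the scale factor. Whether that accuracy loss is acceptable depends on the model and has to be checked against real validation data.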

Low Power AI Processor Requirements

The industry seems to have gravitated toward fully custom AI processors built from heterogeneous dies. Some of the important requirements for these processors are summarized below.

Power consumption targets	<100 mW in inference tasks
Process node	10x power enhancements can still be achieved at older nodes (e.g., 14 nm or 22 nm)
Parallelization	Processors for AI inference could run with hundreds or thousands of cores in parallel
Power delivery	Burst power delivery as cores run high compute algorithms
Material	Currently all on silicon, but other materials (e.g., MoSe₂) have been researched
Compute process	Systolic array execution on silicon with floating point computation

An AI core that can meet these performance requirements must be designed to implement systolic array execution for highly efficient computation of the matrix/tensor calculations used in neural network models. Systolic arrays perform highly parallelized multiply-accumulate calculations, as shown below.


Systolic array in a neural network.

In a traditional processor architecture, executing a single iteration of a large matrix calculation through combinational logic can require millions of clock cycles. A systolic array spreads the same multiply-accumulate operations across many processing elements working in parallel, so the power benefit comes from a reduction in the total energy expended to execute a matrix/tensor computation in the AI core.
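To make the cycle-count argument concrete, here is a minimal functional sketch of a systolic array computing a matrix product. It assumes an output-stationary dataflow (each processing element keeps one output value while operands stream past it), which is only one of several possible arrangements and is not stated in the discussion above. The function name `systolic_matmul` is hypothetical, and the model counts cycles only; it does not model power or real hardware timing.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Processing element (i, j) holds the accumulator for output[i][j].
    Rows of `a` enter from the left and columns of `b` from the top, each
    skewed by one cycle per row/column, so at cycle t the element (i, j)
    sees the operand pair with inner index k = t - i - j.
    """
    n, inner, m = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions must match")

    acc = [[0] * m for _ in range(n)]
    cycles = n + m + inner - 2  # cycles until the last element finishes
    for t in range(cycles):
        # In hardware, every (i, j) below operates in the same clock cycle.
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < inner:
                    acc[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return acc, cycles


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, array_cycles = systolic_matmul(a, b)
sequential_macs = len(a) * len(b) * len(b[0])  # one MAC per cycle on a single unit
print(product, array_cycles, sequential_macs)
```

For this 2x2 example the array finishes in 4 cycles, compared with 8 multiply-accumulate steps on a single unit. The gap widens with matrix size, since the array's cycle count grows with the sum of the dimensions while the sequential count grows with their product.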

The AI core in newer AI-capable processors is the first place to start optimizing a design for minimal power consumption, but it is not the only place where power savings can be found. As newer processors take a heterogeneous integration design approach, other peripherals in the device also must be optimized for minimal power consumption.

Low Power is More Than Just the AI Core

Without a doubt, reducing power consumption in edge devices during AI inference tasks relies heavily on designing the right logic architecture in hardware, right down to the level of individual transistors. Whether you run transistors as analog compute elements or design hardware implementations of vector/tensor compute, the AI core is the first place to reduce power consumption.

The core design is not the only place to optimize AI chips and maximize their flops per unit power rating. Other areas where power consumption in AI-capable IoT/edge devices can be reduced include:

Use of AI acceleration techniques in models and in logic
More efficient memory optimized for AI
Management techniques like clock gating and power gating
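The effect of the last item can be estimated with a simple duty-cycle model. The sketch below is a simplification under stated assumptions: clock gating is assumed to remove all dynamic power while a block is idle, power gating is assumed to remove idle leakage as well, and wake-up energy and residual leakage are ignored. The function name, parameters, and all numbers are hypothetical.

```python
def average_power_mw(active_dynamic_mw, idle_dynamic_mw, leakage_mw,
                     duty_cycle, gating="none"):
    """Average power of a block that is busy for `duty_cycle` of the time.

    gating="none":  idle block still burns clock-tree dynamic power and leakage
    gating="clock": idle dynamic power is removed, leakage remains
    gating="power": idle dynamic power and idle leakage are both removed
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    idle = 1.0 - duty_cycle
    busy_power = duty_cycle * (active_dynamic_mw + leakage_mw)
    if gating == "none":
        return busy_power + idle * (idle_dynamic_mw + leakage_mw)
    if gating == "clock":
        return busy_power + idle * leakage_mw
    if gating == "power":
        return busy_power
    raise ValueError(f"unknown gating mode: {gating}")


# Hypothetical block: 80 mW when computing, 20 mW of idle clock power,
# 10 mW of leakage, busy 10% of the time.
for mode in ("none", "clock", "power"):
    print(mode, round(average_power_mw(80, 20, 10, 0.1, gating=mode), 3), "mW")
```

With these made-up inputs the average falls from about 36 mW with no gating to about 18 mW with clock gating and about 9 mW with power gating. The model shows why gating matters most for blocks that sit idle for most of the time, which is typical of burst-driven inference workloads.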

Eventually, you can only get so efficient at the individual chip level, and then you must get more efficient at the system level.

Because processor architectures are now shifting decisively to multi-die (i.e., chiplet) based packaging, designing low-power AI chips is really about designing low-power functional blocks and integrating these dies into the package. Low-loss in-package power delivery continues to be a challenge as well. Under this paradigm, designers can take a systems-level approach to their AI-capable processors and focus on optimizing the functional blocks in the package individually.

Design teams that want to create intelligent embedded systems need the best set of system analysis tools from Cadence for high-level design and evaluation. Only Cadence offers a comprehensive set of circuit, IC, and PCB design tools for any application and any level of complexity.
