The software world currently has the upper hand in producing advanced AI systems, and it sometimes feels like software developers take the hardware they use for granted. The real future of scalable, flexible, and fast AI is to bring these capabilities to embedded systems as a standard part of the processor and module architecture. Major semiconductor companies and new semiconductor startups are building chipsets specifically to enable a transition of AI out of the data center and to the edge.
When discussing AI, particularly for embedded systems, the conversation usually centers on the processing power required to instantiate models and run inference tasks. It’s true that a sufficiently powerful processor is needed, but the processor does not tell the whole story, and it won’t be the only enabler of AI systems. The other factor needed to improve AI performance is memory, specifically memory that can provide low-latency, high-rate data transfer as part of compute operations.
Getting Away From von Neumann
If you studied computer science, then you know about the von Neumann architecture, the classic architecture used to describe modern computers and computing components. The architecture is based on a 1945 description by John von Neumann, which describes everything required in an electronic digital computer. The architecture includes:
- A host processor that executes program instructions in universally extensible logic (sequential + combinational with register control)
- Temporary memory that stores program instructions and temporary data
- Permanent memory that stores permanent data
- An I/O interface or HMI elements
The important part of this description is to understand that the architecture separates memory and logic, and indeed different types of logic, into different functional blocks. The result is that these functional blocks became separate components in the first commercially available computers built with mass-manufactured semiconductor integrated circuits.
This is where a major bottleneck in embedded AI still lies. The architecture has instilled in designers and in the semiconductor industry a physical design that provides plenty of available memory, but slow access to that memory.
Processor and Memory Requirements for AI at the Edge
Rather than only focusing on the processor requirements, implementation of AI at the edge also needs to account for memory requirements, as well as access to accelerator modules used with general-purpose processors. The typical processor requirements include:
- Multiple cores (at least 4 for the simplest models)
- PCIe interface to access solid state storage, a GPU, and/or AI accelerator modules
- SerDes/Ethernet interface to stream data to an external host as needed
- DDR4 or faster interface to access RAM
- Slower memory interface (e.g., SPI) to access Flash memory
These are only high-level requirements that allow a processor to interface with many other types of components, ranging from simple accelerator SoCs to GPUs/FPGAs via a PCIe or SerDes interface. In addition, many I/Os are needed to receive data in serial or parallel formats, depending on the data being given to the processor for training or inference tasks. SATA/mSATA and SAS storage interfaces are omitted above as they are more common in servers, and we would not expect them to be used in smaller embedded devices.
Anyone looking at the above list should immediately recognize the three main bottlenecks: the standard serial processing architecture of general-purpose processors, the limited bandwidth of SDRAM, and the limited bandwidth of Flash memory. In each case, data has to be sent back and forth between the processor and the memory.
Why SDRAM Is a Major Bottleneck
Flash and solid-state drives are the least useful for embedded AI aside from their use as permanent storage. SDRAM is still the best contender for memory at the device level, but it has its own bottleneck because compute does not happen in the same location as the memory. The interface between the processor and external memory is a major bottleneck that increases latency and power consumption.
This is why newer processors built with a 2.5D or 3D approach are including high-bandwidth memory in the package, rather than adding it as an external component. Although the memory die is still separate from the processor die, placing them in the same package allows them to be connected with very low latency and in a highly parallelized architecture. For example, in GPU cards, the GPU die might only contain kilobytes of memory per core, but it could need access to dozens of GB of RAM to perform inference and training tasks. Newer GPU packages integrate the memory into the package alongside the processor die with a direct connection between them.
Architecture with high-bandwidth memory (HBM) integrated into the package. [Source: Rambus]
Today’s high-bandwidth memories are connected directly to the host processor die over a highly parallelized DDR interface (up to 128 bits per channel, with multiple channels per memory stack). This architecture has enabled massive data transfer between memory and the host processor, with transfer rates reaching above 100 GB/s. With stacked dies in a 3D package, the allowed data rates under JEDEC specifications can exceed 1 TB/s.
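To see why interface width matters so much, the theoretical peak bandwidth of a parallel memory interface can be sketched as bus width times per-pin data rate. This is a minimal sketch: the function name and the HBM-style numbers below are illustrative assumptions rather than figures from a specific datasheet, and real sustained bandwidth will be lower than the peak.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s for a parallel memory interface.

    Each of the bus_width_bits data pins is assumed to transfer
    gbit_per_s_per_pin gigabits per second; dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * gbit_per_s_per_pin / 8


# A single 64-bit DDR4-3200 channel (3.2 Gb/s per pin).
ddr4_channel = peak_bandwidth_gb_per_s(64, 3.2)

# Assumed, illustrative in-package memory: one 128-bit channel at 2 Gb/s per pin,
# then a stack of eight such channels (1024 data bits in total).
hbm_channel = peak_bandwidth_gb_per_s(128, 2.0)
hbm_stack = peak_bandwidth_gb_per_s(8 * 128, 2.0)

print(f"DDR4-3200, 64-bit channel: {ddr4_channel:.1f} GB/s")
print(f"Assumed 128-bit channel:   {hbm_channel:.1f} GB/s")
print(f"Assumed 8-channel stack:   {hbm_stack:.1f} GB/s")
```

Under these assumptions, the gain comes mostly from the much wider interface that in-package routing makes practical, not from a faster per-pin data rate.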
What Determines RAM Requirements for AI?
If we’re talking about embedded systems, RAM will typically be placed on the board as a discrete chip, not as a module connected through a DIMM connector. The DIMM connector increases the size of the system, but in devices that need much more memory, this will be a simple way to provide large memory banks while also making the board upgradable. The other issue here is that many processors are off-the-shelf and do not integrate a huge amount of RAM in the package. For example, outside of GPUs or other advanced processors, all the RAM will be placed as peripheral components.
The main design question here is how much RAM is needed for machine learning operations in an embedded system. This depends on the following factors:
- Data structure (bit width)
- Number of neural network layers
- Number of neural network weights
- Activation functions
- Use of mini-batches for processing
This is not a trivial problem for an embedded device. In PCs and servers, RAM problems are solved by just adding more and more DIMMs onto the motherboard, as well as using GPUs with greater memory in the GPU package. In an embedded system, where power and board space are some of the biggest functional constraints, you can’t throw memory at your AI compute problems and expect them to be solved. You need to make smart decisions about how memory is parallelized to match your processor capabilities.
Memory Calculation Example
For small neural networks, each with a small number of neurons and activations, the memory required in the network may be small enough to run on the internal memory in a small microcontroller or CPU. For larger networks requiring larger parallelization to process, the total amount of memory and parallelization requirements will increase dramatically.
Two examples are shown below, and it is assumed that the entire network (all layers) is loaded into memory. If there is not enough memory available to store all weights, then each layer would need to be loaded into memory sequentially, which adds computation time.
Example 1 - 10 neurons (weights), 10 activations, 2 layers, 8-bit precision
Suppose we take an example of a small neural network with 10 neurons.
This network could have up to 10 data inputs (1 per neuron) that are cycled through two layers. If this were implemented on an MCU, 8-bit serial data streams would be supplied to 8 I/Os and the data fed into the neural network. The results would then be given in the output layer in up to 8 I/Os after a series of multiply-accumulate operations.
Example 2 - 6,220,800 neurons (weights), 6,220,800 activations, 4 layers, 32-bit precision
Now let’s look at a significantly larger neural network. The size of this neural network would be used for production-level image classification or object recognition with high resolution (1080p) RGB images. The amount of required memory would be:
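A rough estimate for both examples can be sketched in Python under some loudly stated assumptions: the weight and activation counts in the example headings are treated as per-layer figures, every value is stored at the stated precision, and all layers are resident in memory at once. The function name `model_memory_bytes` and the `batch_size` parameter are illustrative and do not come from any particular library.

```python
def model_memory_bytes(weights_per_layer: int,
                       activations_per_layer: int,
                       layers: int,
                       bits_per_value: int,
                       batch_size: int = 1) -> int:
    """Estimate memory for a network with all layers resident at once.

    Assumptions (not from the article): counts are per layer, weights are
    shared across a mini-batch, and activations scale with batch_size.
    """
    values = layers * (weights_per_layer + batch_size * activations_per_layer)
    total_bits = values * bits_per_value
    return (total_bits + 7) // 8  # round up to whole bytes


# Example 1: 10 weights, 10 activations, 2 layers, 8-bit precision
example_1 = model_memory_bytes(10, 10, layers=2, bits_per_value=8)

# Example 2: 1920 x 1080 x 3 = 6,220,800 weights and activations,
# 4 layers, 32-bit precision
pixels_rgb = 1920 * 1080 * 3
example_2 = model_memory_bytes(pixels_rgb, pixels_rgb, layers=4, bits_per_value=32)

print(f"Example 1: {example_1} bytes")
print(f"Example 2: {example_2} bytes ({example_2 / 1e6:.1f} MB)")
```

Under these assumptions, Example 1 needs 40 bytes, which fits comfortably in the on-chip memory of a small microcontroller, while Example 2 needs about 199 MB, which is far more than the on-die memory of a typical embedded processor.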
Obviously, as the size of the network scales, the memory requirements scale with it. There is an important conclusion here:
- With a small neural network, no external memory is needed. However, for the most challenging applications in AI and current production-grade uses, processors do not include enough on-die memory.
- With a large neural network, external memory must be dedicated just to storing the AI model.
The next question would be to determine the number of clock cycles per multiply-accumulate operation required in each example and use the clock frequency to determine the total calculation time. Unfortunately, the number of clock cycles needed to execute multiply-accumulate operations in each case will depend on the specific processor architecture, the specific instructions being used, and the input data format.
For example, some processors may have dedicated hardware for performing multiplication and accumulation operations, which could allow them to complete the operation in a single clock cycle. Other processors may not have dedicated hardware for these operations and may need to use a sequence of instructions to perform the same operation, which could take multiple clock cycles to complete. Parallelization is currently the primary strategy used to reduce the latency by reducing the total number of clock cycles needed for training/inference.
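The relationship between cycles per multiply-accumulate (MAC), clock frequency, and parallelization can be sketched as follows. This is a simplified model with hypothetical names and illustrative numbers: it ignores memory stalls, pipeline effects, and instruction overhead, all of which matter on real hardware.

```python
import math


def compute_time_s(mac_ops: int,
                   cycles_per_mac: int,
                   clock_hz: float,
                   parallel_macs: int = 1) -> float:
    """Idealized time to execute mac_ops multiply-accumulate operations.

    parallel_macs is the number of MAC units working at the same time;
    the operations are issued in ceil(mac_ops / parallel_macs) rounds.
    """
    rounds = math.ceil(mac_ops / parallel_macs)
    return rounds * cycles_per_mac / clock_hz


mac_ops = 1_000_000   # assumed workload size
clock_hz = 100e6      # assumed 100 MHz clock

single_cycle = compute_time_s(mac_ops, cycles_per_mac=1, clock_hz=clock_hz)
multi_cycle = compute_time_s(mac_ops, cycles_per_mac=4, clock_hz=clock_hz)
parallel = compute_time_s(mac_ops, cycles_per_mac=4, clock_hz=clock_hz,
                          parallel_macs=16)

print(f"Dedicated MAC hardware (1 cycle):  {single_cycle * 1e3:.2f} ms")
print(f"Instruction sequence (4 cycles):   {multi_cycle * 1e3:.2f} ms")
print(f"4 cycles with 16 parallel units:   {parallel * 1e3:.2f} ms")
```

In this idealized model, 16 parallel units more than offset a 4-cycle MAC, which illustrates why parallelization is the main lever for reducing latency.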
Can We Reduce Memory Burdens and Latency?
The answer is “yes” to both questions. Reducing model size reduces the total memory needed to implement a neural network, and this is the goal of model compression techniques such as pruning (removing weights that contribute little) and quantization (storing values at a lower bit width).
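As a minimal sketch of how quantization shrinks the memory footprint, the example below maps 32-bit floating-point weights to 8-bit integers using a single shared scale factor (symmetric per-tensor quantization). The weight values and function names are made up for illustration, and production toolchains use more careful schemes such as per-channel scales and calibration data.

```python
from array import array


def quantize_int8(weights):
    """Map float weights to int8 using one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


# Illustrative weights stored as 32-bit floats
weights = array("f", [0.5, -1.0, 0.25, 0.0, 0.75])
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(quantized) * quantized.itemsize
print(f"float32 storage: {fp32_bytes} bytes, int8 storage: {int8_bytes} bytes")
print(f"largest rounding error: "
      f"{max(abs(w - r) for w, r in zip(weights, restored)):.5f}")
```

The storage drops by a factor of four, at the cost of a rounding error of at most half a quantization step per weight.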
The best solution is to execute neural network compute tasks directly inside the memory. This is a new approach that is being embraced by some startup companies, and it is a major departure from the von Neumann architecture. This implementation eliminates the need to pass weights in each layer from memory to the host processor, and then load new layers into memory.