
Evolving silicon choices in the AI age

Written by Dr Peter Debenham, Senior Consultant

Only a few years ago, choosing silicon for a processing job seemed simple.

You either used a CPU, used a GPU (if your task had a lot of parallel sections), or possibly created a totally bespoke FPGA or ASIC design (much more expensive in engineering effort, but often worthwhile for high-value jobs).

Now it is much more difficult. The old options still exist, but new ones have joined them, such as TPUs (Tensor Processing Units) and their cousins, NPUs (Neural Processing Units). What changed?

The change, of course, is Artificial Intelligence (AI). In the past few years AI has gone from a subject only talked about dryly in academic journals, or displayed in science fiction where, usually, the AI is out to kill people in various ways (e.g. Ava in Ex Machina, 2014; Schwarzenegger’s Terminator, 1984; or HAL in Kubrick’s 2001: A Space Odyssey, 1968 – all three great films, by the way), into something we meet in the real world.

Artificial Intelligence in the real world means taking a complicated set of inputs (the pixels of a picture, perhaps), applying a series of weights and biases to those inputs across a number of layers, and producing a simple output (the picture contains a person and a cat). Performing any single set of calculations is not usually too slow for a modern computer, but away from those dry academic journals most people want to perform a great many sets of calculations, and to perform them fast. They want some kind of real-time response to a changing situation, say frames from a video camera, and they generally prefer hardware that is physically small, efficient, and not doubling up as a fan heater – silicon and electricity cost money, after all, even when you are plugged into the mains.
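
To make that concrete, here is a minimal sketch in Python (NumPy) of what “applying weights and biases across a number of layers” looks like: a toy two-layer network that turns a flattened image into two scores. The weights here are random placeholders, not a trained model.

    # A toy two-layer network: weights and biases applied layer by layer to turn
    # a complicated input (an image) into a simple output (two scores).
    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder "trained" parameters: 64x64 greyscale image in, 2 scores out.
    W1, b1 = rng.standard_normal((4096, 128)) * 0.01, np.zeros(128)
    W2, b2 = rng.standard_normal((128, 2)) * 0.01, np.zeros(2)

    def forward(image):
        """Run one image through the network and return two scores."""
        x = image.reshape(-1)                 # flatten 64x64 pixels to a vector
        h = np.maximum(0.0, x @ W1 + b1)      # layer 1: weights, bias, ReLU
        logits = h @ W2 + b2                  # layer 2: weights and bias
        return 1.0 / (1.0 + np.exp(-logits))  # independent "person"/"cat" scores

    scores = forward(rng.random((64, 64)))
    print(f"person: {scores[0]:.2f}, cat: {scores[1]:.2f}")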

If running a trained AI model is computationally intensive, training one is even worse. Training is an iterative process: a set of carefully chosen training data is fed through a putative AI model and the accuracy of the resulting output is measured. Based on how well the model performs, changes are made to the model and the process is repeated. Training a large AI model requires a large set of carefully chosen training data and many cycles of the loop: process data, check accuracy, refine model. Essentially, the training process must run the AI model an enormous number of times. While silicon cost and power consumption remain a concern, the big driver here is usually the elapsed time needed to train the model, which can easily be tens of hours or even tens of days. Reducing this is, relatively speaking, worth a lot of electricity and silicon cost, especially where the silicon can be rented from Amazon, Google or similar.
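
The loop itself is simple enough to sketch. The toy example below (plain NumPy, with made-up data) runs the full cycle – process data, check accuracy, refine model – for a tiny logistic-regression “model”; real AI training is the same loop scaled up by many orders of magnitude.

    # The training loop in miniature: process data, check accuracy, refine model.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 20))              # carefully chosen training data
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # labels for that data

    w, b, lr = np.zeros(20), 0.0, 0.1
    for epoch in range(200):                          # many cycles of the loop
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # process data through the model
        accuracy = np.mean((p > 0.5) == y)            # check accuracy
        grad = p - y                                  # refine the model from the error
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()

    print(f"final training accuracy: {accuracy:.2%}")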

Given the requirement to run AI models quickly and efficiently, what type of processing silicon should be used? For a fixed model it is possible that a bespoke FPGA is the fastest and least power-hungry option. But the time and cost of designing and implementing a bespoke FPGA remain considerable and are unaffordable in most circumstances. Add to this that AI development is moving fast enough that a good model today may be a poor model tomorrow, and a bespoke FPGA becomes even less likely to be a good solution.

The Central Processing Unit/Graphics Processing Unit (CPU/GPU) equation remains much as before. A single CPU core can perform a complicated, general set of mathematical operations very quickly. Each core in a multi-core CPU can perform these calculations almost entirely independently of the other cores. A high-end processor such as the AMD Threadripper Pro 5995WX has 64 cores, and, particularly when training an AI, it is possible to devise algorithms that use each of them efficiently. Even a more mainstream, lower-cost and lower-power Intel i7 may have 12 or more cores.
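
As an illustration of that independence, the sketch below uses Python’s standard process pool to hand separate chunks of work to separate CPU cores; the chunk function and the sizes involved are purely illustrative.

    # Each CPU core runs an independent work item with its own data; no lockstep,
    # no shared state. The standard library pool spreads the items across cores.
    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def score_chunk(seed):
        """Independent work item: build and reduce a matrix on one core."""
        rng = np.random.default_rng(seed)
        a = rng.standard_normal((500, 500))
        return float(np.linalg.norm(a @ a.T))

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:       # one worker per CPU core by default
            results = list(pool.map(score_chunk, range(16)))
        print(f"processed {len(results)} independent chunks")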

Conventional GPUs operate differently to CPUs. Rather than a small number of very powerful processing units (cores), they contain vastly more numerous but individually less powerful cores. The Nvidia H100 GPU has 16,896 cores. Even a “humble” GeForce home PC graphics card can have over 9,000 cores. But there is a catch. As well as each core being less powerful than a CPU core, GPU cores cannot operate individually. They are grouped together (often in groups of 32 or 64 cores), and each group must operate in lockstep, without touching the memory used by other groups. Where software can be written to make efficient use of this arrangement, the system can run much faster than on a CPU thanks to the sheer number of cores. This has been done for many software packages that train common AI model architectures.
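
In practice this is usually hidden behind a library. The sketch below (PyTorch, assuming a CUDA-capable GPU is available) shows the typical pattern: the same matrix multiplication is run on the CPU and then handed to the GPU, where the library spreads the work across the core groups for us.

    # The same bulk arithmetic on a few CPU cores and then on thousands of GPU
    # cores; the library handles the grouping and lockstep scheduling.
    import torch

    a = torch.randn(4096, 4096)
    b = torch.randn(4096, 4096)

    c_cpu = a @ b                                  # runs on the CPU cores

    if torch.cuda.is_available():
        a_gpu, b_gpu = a.cuda(), b.cuda()          # copy the data to GPU memory
        c_gpu = a_gpu @ b_gpu                      # same maths, spread over GPU cores
        print((c_cpu - c_gpu.cpu()).abs().max())   # only tiny rounding differences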

More recently, Google created a new type of processor designed specifically around the needs of a particular type of machine learning, the convolutional neural network (CNN). This is the Tensor Processing Unit (TPU), which Google announced in 2016, though it had already been using them in-house. TPU workloads are available as part of Google’s cloud architecture (Cloud TPU) and use Google’s own TensorFlow software. The TPU is a bespoke ASIC designed for high throughput of low-precision calculations, specifically matrix processing. TPUs were originally designed for running already-trained CNN models, where they are more power-efficient per operation than the more general GPU or CPU, but since version 2 they can also be used for training such models. Waymo, for example, uses TPUs to train its self-driving software.
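
For reference, reaching a Cloud TPU from TensorFlow typically looks something like the hedged sketch below: resolve and initialise the TPU, then build the model under a TPUStrategy so its matrix work lands on the TPU. The exact arguments depend on the particular Cloud setup, so treat this as an outline rather than a recipe.

    # Outline of attaching TensorFlow to a Cloud TPU; details vary with the
    # Cloud environment, so the resolver arguments here are assumptions.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():                         # variables placed on the TPU cores
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # model.fit(...) would then stream training batches out to the TPU.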

At a similar time, Nvidia added Tensor Cores to its data centre GPUs (2017) and its consumer GPUs (2018), again targeting matrix multiplication and accumulation operations. This adds efficient AI acceleration alongside the other advantages GPUs have over CPUs. The data centre H100 GPU adds 528 Tensor Cores to its 16,896 CUDA cores. In 2022 Tensor Cores started to appear in lower-power devices such as the Jetson Orin Nano.
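
Tensor Cores are normally engaged indirectly, by running the matrix-heavy parts of a model in reduced precision. The sketch below (PyTorch, assuming an Nvidia GPU with Tensor Cores) shows the usual autocast pattern; whether a given operation actually lands on the Tensor Cores is decided by the library and driver, not by this code.

    # Mixed-precision matrix work: inside autocast, eligible operations run in
    # float16, which is what makes them candidates for Tensor Core execution.
    import torch

    if torch.cuda.is_available():
        a = torch.randn(4096, 4096, device="cuda")
        b = torch.randn(4096, 4096, device="cuda")

        with torch.autocast(device_type="cuda", dtype=torch.float16):
            c = a @ b                              # candidate for Tensor Core execution
        print(c.dtype)                             # torch.float16 inside autocast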

In general, GPUs are preferred to CPUs for running large AI model training workloads. Often the choice of which to use comes down not to what is theoretically “best” but to which hardware you have available, either physically in a machine you control or for rent in the cloud. For much of the past few years, those wishing to purchase high-end GPUs have faced significant, many-months-long backlogs as demand overwhelmed the supply chains. OpenAI used GPUs to train its large-scale AI models such as GPT-3.

Neural Processing Units (NPUs) are similar in spirit to Google’s TPUs. They again offer hardware specifically designed to accelerate aspects of neural networks and AI. NPUs may be standalone data centre cards or integrated alongside CPUs, both in PCs (Intel Core Ultra series, AMD Ryzen 8040) and in mobile (Qualcomm, Huawei) or even lower-power edge processors.

Of particular interest to Plextek is adding intelligence “at the edge” for Internet of Things (IoT) devices, where per-unit cost, computing power, battery capacity and communication bandwidth are all tightly limited. The option of transmitting everything to a more powerful server to do the number crunching is simply not available. This means an IoT device must be able to run an AI model locally, efficiently and rapidly, if it is to respond to its environment only as and when needed.

In addition to their sensors, IoT devices typically need an embedded microcontroller. For moderately complicated devices this is often some version of an Arm Cortex. These are capable of running AI workloads, but are often too slow or too power-hungry for this to be practical. Some kind of AI accelerator is then required, both to cut the time taken to run the model and to reduce the energy drawn from the limited power budget.

Examples of currently available “at the edge” accelerators include Google’s Edge TPU (a smaller version of its data centre TPUs), Arm’s own Helium technology (Armv8.1-M Cortex-M chips such as the Cortex-M85) and NXP’s eIQ Neutron NPU (in the Arm Cortex-M33-based MCX N series). At the higher-power end of IoT there are devices such as Nvidia’s Jetson Nano boards, incorporating an Arm A57 and a 128-core Nvidia GPU (an older design lacking AI-specific Tensor Cores), or the Orin Nano, which contains a more recent GPU with Tensor Core support.

How much faster are the accelerators compared to just using the conventional Arm core? Numbers can be hard to find, but NXP and Google publish some public data. Google shows the difference in inference time between an Arm A53 core on its own and its development board (Arm A53 plus Edge TPU) when running various AI models trained on the ImageNet dataset. The Edge TPU is typically 30+ times faster than the Arm core by itself. This comes with the limitation that the Edge TPU is only designed to process a limited range of AI model types.
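
A timing of this sort can be taken with a few lines of Python on a Coral board. The sketch below is a hedged outline rather than Google’s benchmark code: the model filenames are hypothetical, and it assumes the standard tflite_runtime package and the Edge TPU delegate library that Coral’s images normally provide.

    # Time the same quantised model with and without the Edge TPU delegate.
    # Model paths are hypothetical placeholders.
    import time
    import numpy as np
    import tflite_runtime.interpreter as tflite

    def time_inference(model_path, delegates=None, runs=50):
        interpreter = tflite.Interpreter(model_path=model_path,
                                         experimental_delegates=delegates or [])
        interpreter.allocate_tensors()
        inp = interpreter.get_input_details()[0]
        dummy = np.zeros(inp["shape"], dtype=inp["dtype"])  # dummy input image
        interpreter.set_tensor(inp["index"], dummy)
        start = time.perf_counter()
        for _ in range(runs):
            interpreter.invoke()                             # one inference pass
        return (time.perf_counter() - start) / runs

    cpu_ms = time_inference("model_quant.tflite") * 1000
    tpu_ms = time_inference("model_quant_edgetpu.tflite",
                            [tflite.load_delegate("libedgetpu.so.1")]) * 1000
    print(f"Arm core: {cpu_ms:.1f} ms, Edge TPU: {tpu_ms:.1f} ms")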

NXP does not give precise inference times, but shows the “ML Operator Acceleration” of the NPU compared to just using the Arm Cortex-M33 core for three typical AI operations. As with the Edge TPU, the acceleration is found to be around 30+ times.

Nvidia makes the point that its GPU solution can process models which the Edge TPU cannot, by showing the frames per second the Nano can manage when running classification and object-detection models, with frequent DNRs (did not report) for the Coral Edge TPU development board. Where both platforms could run a model, their speeds were similar.

Which type of AI processing platform is best? As the Nvidia blog shows, it depends on precisely what type of AI you are trying to run. Some processing platforms can only handle a restricted set of AI model types. Others provide acceleration for a wide variety of AI processing loads but are tightly coupled to a particular manufacturer’s CPU cores. As the use of AI continues to increase, the only certainty is that the CPU/GPU/TPU/NPU acronym list is going to keep growing.


