IBM's AI Accelerator: Better not just be a science project

Big Blue was one of the system designers that caught the accelerator bug early on, and it stated quite forcefully that in the long run all kinds of supercomputers would have some sort of acceleration: that is, a specialized ASIC to which a CPU could offload its math homework.

Perhaps IBM is re-learning some lessons from that early HPC era a decade and a half ago, when it developed the PowerXCell vector math accelerator and used it in the petaflops-crushing “Roadrunner” supercomputer at Los Alamos National Laboratory, and applying those lessons to the modern AI era.

One can at least hope, just to keep things interesting in the AI arena, that the company gets serious about at least some sort of HPC (which AI training certainly is), as its IBM Research arm seems to be doing with the new AI acceleration unit it has unveiled.

Many of the details behind the IBM Research AIU have not been divulged, and so far there is no paper laying out the history of IBM’s vector and matrix math units (which are by no means computational slouches) and their use of mixed precision; the blog post that introduces the AIU is especially light on specifics.

The AIU device IBM Research has revealed is implemented in a 5-nanometer process and is believed to be made by Samsung, IBM’s foundry partner for its 7-nanometer “Cirrus” Power10 processors for enterprise servers and its “Telum” System z16 processors for its mainframes. The Power10 chips have very powerful vector and matrix math units that are an evolution of designs IBM has used for many decades, but the Telum chip uses a third generation of IBM Research’s AI Core mixed-precision matrix math unit as an on-chip accelerator for AI inference and low-precision AI training.

Announced in 2018, the first AI Core chip was capable of FP16 half-precision math with FP32 single-precision accumulation, and it was instrumental in IBM’s push to bring even lower-precision data and processing to neural networks. After creating the AI accelerator for the Telum z16 processor, which we detailed here back in August 2021, IBM Research took that AI accelerator as a basic building block and scaled it up across a single device.

Before we delve into the new AIU, let’s review the AI accelerator on the Telum chip.

On the z16 chip, the AI accelerator consists of 128 processor tiles (PT), probably laid out in a torus configuration measuring 4 x 4 x 8, although IBM has not confirmed this. This systolic array does FP16 matrix math (and its mixed-precision variants) with accumulation on FP32 floating-point units, and it was explicitly designed to support machine learning matrix math and convolutions – including not just inference but also the low-precision training that IBM expects could happen on enterprise platforms. We think it is also likely to support the FP8 quarter-precision format for AI training and inference, as well as INT2 and INT4 for AI inference, which we saw in the experimental quad-core AI Core chip that IBM Research announced in January 2021 for embedded and mobile devices. The AI accelerator in the Telum CPU also has 32 complex function (CF) tiles that support FP16 and FP32 SIMD instructions and are optimized for activation functions and complex operations. The list of supported special functions, with a small mixed-precision sketch to follow, includes:

  • LSTM activation
  • GRU activation
  • Fused Matrix Multiply, Bias op
  • Fused Matrix Multiply (with Broadcast)
  • Batch Normalization
  • Fused Convolution, Bias Add, ReLU
  • Max Pool 2D
  • Average Pool 2D
  • Softmax
  • ReLU
  • Tanh
  • Sigmoid
  • Add
  • Subtract
  • Multiply
  • Divide
  • Min
  • Max
  • Log
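
IBM has not published the innards of the PT tiles, but the mixed-precision trick they embody – multiply in FP16, accumulate in FP32 – is easy to demonstrate. Here is a minimal NumPy sketch of our own (the vector length is arbitrary and none of the names come from IBM) showing why the wider accumulator matters:

```python
import numpy as np

# Products and running sums rounded to FP16 at every step, the way a
# pure-FP16 MAC pipeline would behave. Vector length is illustrative.
rng = np.random.default_rng(42)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# Same FP16 inputs, but with the multiply-accumulate carried out in FP32,
# which is what the Telum PT tiles are described as doing.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)

# FP64 reference dot product over the same FP16 inputs.
reference = np.dot(a.astype(np.float64), b.astype(np.float64))

print(f"FP16 accumulate error: {abs(float(acc16) - reference):.6f}")
print(f"FP32 accumulate error: {abs(float(acc32) - reference):.6f}")
```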

A prefetcher and write-back unit hooks into the z16 core and L2 cache ring and also links to a scratchpad, which in turn feeds the AI cores through a data mover and formatter unit that, as the name suggests, formats data so it can be run through the matrix math units to do the inference and deliver a result. The prefetcher can read data from the scratchpad at more than 120 GB/s and write data to the scratchpad at more than 80 GB/s; the data mover can push data into and pull it out of the PT and CF cores in the AI unit at 600 GB/s.
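
To put those bandwidth figures in perspective, here is a little back-of-the-envelope Python using the numbers quoted above; the tensor size is an arbitrary example of ours, not an IBM workload:

```python
# Back-of-the-envelope timings for the Telum AI accelerator data paths,
# using the bandwidth figures quoted above.

BYTES_PER_FP16 = 2

SCRATCHPAD_READ_GBPS = 120    # prefetcher reads, "more than" per IBM
SCRATCHPAD_WRITE_GBPS = 80    # prefetcher writes
DATA_MOVER_GBPS = 600         # scratchpad to/from the PT and CF cores

def stream_time_us(num_elements: int, gb_per_sec: float) -> float:
    """Microseconds needed to move num_elements FP16 values at gb_per_sec."""
    return num_elements * BYTES_PER_FP16 / (gb_per_sec * 1e9) * 1e6

elements = 256 * 1024   # a 256 x 1024 FP16 activation tensor, about 0.5 MB
print(f"read into scratchpad: {stream_time_us(elements, SCRATCHPAD_READ_GBPS):.2f} us")
print(f"write back out:       {stream_time_us(elements, SCRATCHPAD_WRITE_GBPS):.2f} us")
print(f"feed the PT/CF cores: {stream_time_us(elements, DATA_MOVER_GBPS):.2f} us")
```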

On System z16 iron, IBM’s own Snap ML framework and Microsoft Azure’s ONNX framework are supported in production, and support for Google’s TensorFlow framework just went into open beta two months ago.
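
IBM has not spelled out how the z16 software stack hands work to the accelerator, but at the framework level the flow looks like any other ONNX inference call. The sketch below is generic ONNX Runtime code; the model file name, input shape, and execution provider are placeholders of ours, since the provider that would actually route work to the on-chip accelerator is not named here:

```python
import numpy as np
import onnxruntime as ort

# Generic ONNX Runtime inference. On a z16, the platform's execution
# provider would route this work to the on-chip AI accelerator; the
# provider, model file name, and input shape used here are placeholders.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 32).astype(np.float32)   # illustrative shape

outputs = session.run(None, {input_name: batch})
print("model output:", outputs[0])
```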

Now imagine copying this AI accelerator from the Telum chip and pasting it into a design 34 times, like this:

These 34 cores and their uncore regions for scratchpad memory and for the links between the cores and the outside world have a total of 23 billion transistors. (IBM says there are 32 cores on the AIU, but you can clearly count 34, so we think two of them are there to increase chip yield on devices with 32 usable cores.)

The Telum z16 processors clock in at 5 GHz, but the AIU is unlikely to run at anywhere near that speed.

If you look at the AIU die, it has sixteen I/O controllers, which are probably generic SerDes that can be used for memory or I/O (as IBM did with its OpenCAPI interfaces for I/O and memory on the Power10 chip). There also appear to be eight banks of LPDDR5 memory from Samsung on the package, which would work out to a total of 48 GB of memory delivering around 43 GB/s of aggregate bandwidth. If all of those controllers drive memory, capacity could be doubled to 96 GB and aggregate bandwidth to 86 GB/s.
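
For what it is worth, here is the simple arithmetic behind those memory guesses, with the per-bank figures backed out of the quoted totals; none of these numbers have been confirmed by IBM:

```python
# Arithmetic behind the LPDDR5 guesses; per-bank figures are backed out
# of the quoted totals and are not confirmed by IBM.

BANKS = 8
TOTAL_CAPACITY_GB = 48
TOTAL_BANDWIDTH_GBPS = 43

capacity_per_bank = TOTAL_CAPACITY_GB / BANKS        # 6 GB per bank
bandwidth_per_bank = TOTAL_BANDWIDTH_GBPS / BANKS    # ~5.4 GB/s per bank

# If all sixteen controllers on the die drove memory rather than I/O:
full_capacity = capacity_per_bank * 16               # 96 GB
full_bandwidth = bandwidth_per_bank * 16             # ~86 GB/s

print(f"per bank:        {capacity_per_bank:.0f} GB at {bandwidth_per_bank:.2f} GB/s")
print(f"fully populated: {full_capacity:.0f} GB at {full_bandwidth:.0f} GB/s")
```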

The controller complex at the top of the AIU is probably a PCI Express 4.0 controller, but hopefully it is a PCI Express 5.0 controller with built-in support for the CXL protocol.

IBM hasn’t said what kind of performance to expect from the AIU, but we can make a few guesses. Back in January 2021, a quad-core AI Core test chip, etched by Samsung in 7 nanometers, debuted at the ISSCC chip conference and delivered 25.6 teraflops of FP8 training performance and 102.4 teraops of INT4 inference performance at 1.6 GHz. This test chip burned 48.6 watts and had 8 MB of on-chip cache.

This AIU has 34 cores, 32 of which are active, so it should have 8x the performance, assuming the clock speed stays the same (whatever that was), and 8x the on-chip cache. That works out to 204.8 teraflops of FP8 AI training performance and 819.2 teraops of INT4 AI inference performance, with 64 MB of on-chip cache, at a little south of 400 watts when implemented in 7 nanometers. But IBM is implementing it in 5 nanometers with Samsung, and that probably puts the AIU device at around 275 watts.
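
Here is that scaling estimate spelled out; the assumptions that the clock speed holds and that the 5 nanometer shrink claws back roughly 30 percent of the power are ours, not IBM’s:

```python
# Our scaling estimate for the AIU, extrapolated from the quad-core 7 nm
# AI Core test chip shown at ISSCC in January 2021. The constant clock
# speed and the 30 percent power saving from the 5 nm shrink are
# assumptions, not IBM figures.

TEST_CHIP_CORES = 4
TEST_CHIP_FP8_TFLOPS = 25.6     # FP8 training
TEST_CHIP_INT4_TOPS = 102.4     # INT4 inference
TEST_CHIP_WATTS = 48.6
TEST_CHIP_CACHE_MB = 8

AIU_ACTIVE_CORES = 32
scale = AIU_ACTIVE_CORES / TEST_CHIP_CORES      # 8x, clocks assumed constant

aiu_fp8_tflops = TEST_CHIP_FP8_TFLOPS * scale   # 204.8 teraflops
aiu_int4_tops = TEST_CHIP_INT4_TOPS * scale     # 819.2 teraops
aiu_cache_mb = TEST_CHIP_CACHE_MB * scale       # 64 MB
aiu_watts_7nm = TEST_CHIP_WATTS * scale         # ~389 W if it stayed at 7 nm
aiu_watts_5nm = aiu_watts_7nm * 0.7             # ~272 W with the 5 nm shrink

print(f"FP8 training:   {aiu_fp8_tflops:.1f} teraflops")
print(f"INT4 inference: {aiu_int4_tops:.1f} teraops")
print(f"on-chip cache:  {aiu_cache_mb:.0f} MB")
print(f"power:          ~{aiu_watts_7nm:.0f} W at 7 nm, ~{aiu_watts_5nm:.0f} W at 5 nm")
```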

For comparison, a 350 watt PCI Express 5.0 version of Nvidia’s “Hopper” GH100 GPU delivers 2 TB/s of bandwidth out of 80 GB of HBM3 memory and 3.03 petaflops of FP8 AI training performance with sparsity support turned on.
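
On a rough performance-per-watt basis, using our AIU estimates and Nvidia’s published numbers for the PCI Express Hopper card (which include sparsity, flattering Nvidia a bit), the gap looks like this:

```python
# Rough performance-per-watt comparison of our AIU estimate against
# Nvidia's published figures for the PCI Express Hopper card. The Hopper
# number includes sparsity, so the gap is somewhat overstated.

AIU_FP8_TFLOPS, AIU_WATTS = 204.8, 275
HOPPER_FP8_TFLOPS, HOPPER_WATTS = 3030, 350   # H100 PCIe, FP8 with sparsity

print(f"AIU:    {AIU_FP8_TFLOPS / AIU_WATTS:.2f} teraflops per watt")
print(f"Hopper: {HOPPER_FP8_TFLOPS / HOPPER_WATTS:.2f} teraflops per watt")
print(f"Hopper raw FP8 advantage: {HOPPER_FP8_TFLOPS / AIU_FP8_TFLOPS:.1f}x")
```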

IBM Research will need AI cores. Lots of AI cores.
