While AI training dims the lights of hyperscalers and cloud builders and costs billions of dollars a year, in the long run there will be far more aggregate processing done on AI inference than on AI training. It could be a factor of 2 to 3 times more computing capacity soon, and 10 to 100 times more capacity within a decade. Nobody really knows.
What we all suspect, however, is that there will be relatively few heavy-duty AI training devices and platforms using them, along with a myriad of AI inference devices. Thus, the relative performance and price/performance of the compute engines that run inference will be important when they are deployed at scale.
Meta Platforms helped invent many of the machine learning techniques and technologies that are deployed in production these days, and it is no surprise to us that the company has created a unified inference framework, called AITemplate, which it has open sourced and described earlier this month in a Meta AI engineering blog post.
Everyone was very excited that Meta Platforms released performance data for this new AITemplate inference framework, especially because some of the datasets – which were later removed from the blog – appeared to allow a direct comparison between Nvidia “Ampere” A100 GPU accelerators and AMD “Aldebaran” Instinct MI250 GPU accelerators. It turns out that most of these datasets were illustrative within their respective product families, but not across them. However, there was one chart for the BERT transformer model for natural language processing that did make a direct comparison, and we managed to get it in hand before it was taken down. And because we are The Next Platform, we put prices against the devices being compared and speculated on the performance of the Nvidia A100 and “Hopper” H100 GPU accelerators using Nvidia’s own “Triton” TensorRT inference platform to give a broader sense of the inference value landscape.
We did this primarily as a way to talk about how difficult it is for system architects to choose inference platforms, and we expect many organizations will throw up their hands and scatter inference all around rather than consolidate it in their data centers. The nature of inference – which has to be an integral part of applications and is therefore highly sensitive to latency – will demand this.
What we learned from reading the AITemplate blog and talking to Meta Platforms behind the scenes is that the PyTorch framework (originally deployed six years ago by Facebook), which was built from the Torch library (itself two decades old), is not particularly good at inference. Even when PyTorch is running in so-called “eager” mode – which is also supported in Google’s TensorFlow framework, and which evaluates tensor operations immediately rather than building compute graphs that can be run later – its performance leaves something to be desired. That is especially true in the FP16 half-precision data format (and its BF16 variant), which Meta Platforms likes for AI training on CPUs and GPUs and also seems to use for AI inference.
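To make the eager-versus-graph distinction concrete, here is a minimal sketch of our own (not from Meta’s post) that compares PyTorch eager execution with a TorchScript-traced version of the same toy model, assuming a CUDA GPU is on hand. AITemplate goes much further by generating fused GPU kernels ahead of time, but the basic trade-off between dispatching operations one at a time and running a pre-built graph is the same in spirit:

```python
import torch
import torch.nn as nn

# A stand-in model; the real workloads in the post are ResNet-50 and BERT-Base.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768)).half().cuda().eval()
x = torch.randn(1, 768, dtype=torch.float16, device="cuda")

with torch.no_grad():
    # Eager mode: each operation is dispatched to the GPU one at a time as the Python code runs.
    y_eager = model(x)

    # Graph mode: trace once, then run the recorded graph, which skips Python dispatch
    # overhead and lets the runtime fuse some operations.
    traced = torch.jit.trace(model, x)
    y_traced = traced(x)

# Same numerics (within FP16 tolerance), different execution path.
print(torch.allclose(y_eager, y_traced, atol=1e-3))
```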
The AITemplate framework that Meta Platforms created explicitly for inference is much better, and more importantly it is enabled to run on both Nvidia and AMD GPUs and will be modified to support the mixed-precision matrix and vector units that are being integrated into CPUs now or will be added in the future. We presume that Meta Platforms will code an AITemplate backend itself for Intel “Ponte Vecchio” or “Rialto Bridge” Xe HPC GPUs, and will similarly create backends for any device or math accelerator it puts into production, but will rely on the makers of other devices to create the backends for them.
When talking about inference performance, you always have to consider batch size. Inferences can be processed one at a time – a batch size of 1 – or grouped into multiples and run through the vector or matrix math units in bunches. A batch size of one means absolute real-time processing and has the lowest latency. Larger batch sizes will have longer latencies, on average, but the overall system will have higher throughput because there is less communication overhead between the application on the CPU and the inference processing on the GPU.
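As a rough illustration of that latency versus throughput trade-off, here is a sketch of our own (not Meta’s benchmark harness) that times a stand-in model at the batch sizes used in the AITemplate tests; the model and its dimensions are made up for illustration and a CUDA GPU is assumed:

```python
import time
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).half().cuda().eval()  # stand-in for ResNet-50 or BERT-Base

for batch in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    x = torch.randn(batch, 1024, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / 100 * 1000   # time per batch: grows as the batch gets bigger
    throughput = batch * 100 / elapsed  # items per second: also grows with batch size
    print(f"batch {batch:3d}: {latency_ms:6.2f} ms/batch, {throughput:9.0f} items/sec")
```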
Batch sizes in the AITemplate test runs ranged from 1 to 256, and depending on the architecture and the test, the performance increases of AITemplate over PyTorch in eager mode vary. Here is what it looks like on Nvidia A100 GPU accelerators running CUDA 11.6 for the ResNet-50 image processing model and the BERT-Base transformer model:
And here is a graph showing the speedups of AMD MI250 GPU accelerators running the ROCm 5.2 environment:
You have to be careful with these charts. They show performance relative to a specific GPU, not across the GPUs.
What we see in these charts is that PyTorch’s eager mode was not particularly good at inference with small batch sizes on the Nvidia A100 GPU, and the AITemplate inference framework is very good at it by comparison. The relative performance increases across batch sizes for BERT and ResNet-50 are more consistent on the AMD MI250 GPU. We do not know why, and Meta Platforms did not discuss it.
The chart in the AITemplate post that was interesting – and perhaps accidentally useful, given that it was later deleted – is this one:
This chart above absolutely makes a direct comparison between the two platforms, with their performance under AITemplate compared to that of PyTorch running on an Nvidia A100. As you well know, we do not think comparisons are ever odious, although IT vendors certainly do not like them unless they come out on top.
Just for fun, we added pricing information to this data after normalizing the performance into something that could reasonably be divided into the cost of these GPUs. We also added some BERT-Base performance figures from Nvidia for its Triton TensorRT inference platform, which does inference in INT8 format rather than FP16 or BF16 format, and which Nvidia says has about 1.92 times the throughput on BERT-Base compared to AITemplate on the same GPU.
And for even more fun, we extrapolated that INT8 performance to the currently available “Hopper” GH100 GPU accelerators. We took our best stab at GPU pricing as it stands in the market right now, and we realize that the data is thin, especially on the H100s. But we have it on good authority that the H100 price will be higher than the floor we calculated when we did a price/performance analysis of Nvidia GPUs back in May.
And so, here is a bang for the buck table for the BERT-Base model on these two GPUs and two AI inference frameworks:
As far as we know, Meta Platforms uses Nvidia GPUs and TensorRT for at least some of its production inference workloads, and it is curious to us that Facebook wants to stick with 16-bit floating point data for many of its AI workloads.
We understood why it wanted BF16 format support added to Intel’s AVX-512 vector engines on the “Cooper Lake” Xeon SP processors that are used in Meta Platforms’ AI training systems, which complement the BF16 format used for AI training on the Nvidia A100 GPUs in its “Zion” and “ZionEX” systems. Not having to move data between floating point and integer formats when going from training to inference, and not truncating the data, can make things easier. Perhaps Meta Platforms does not want to sacrifice any resolution in natural language processing because it would reduce the accuracy of its DLRMs. If so, then running TensorRT in FP16 mode, the performance increases shown in bold red italics above would be roughly cut in half, and significantly, the performance advantage of TensorRT over AITemplate would vanish on the A100 GPU. They would be roughly the same, with AITemplate coming in about 4 percent higher based on data we got from Nvidia through a third party.
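To show what is at stake in that format choice, here is a toy sketch of ours (nothing Meta has published) that compares a BERT-sized linear layer computed with FP32 weights, with the weights rounded to FP16, and with stock PyTorch dynamic INT8 quantization; the layer and its dimensions are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for a single BERT-Base projection layer; dimensions are illustrative.
model = nn.Sequential(nn.Linear(768, 768)).eval()
x = torch.randn(8, 768)

# Simulate FP16 storage: round the weights to half precision, then compute in FP32.
w, b = model[0].weight.data, model[0].bias.data
y_fp16_weights = F.linear(x, w.to(torch.float16).float(), b.to(torch.float16).float())

# INT8 dynamic quantization: weights are rounded to 8-bit integers and activations
# are quantized on the fly at inference time; faster on supported hardware, lower resolution.
model_int8 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    y_fp32 = model(x)
    y_int8 = model_int8(x)

print("fp16-weight max deviation from fp32:", (y_fp16_weights - y_fp32).abs().max().item())
print("int8 max deviation from fp32:", (y_int8 - y_fp32).abs().max().item())
```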
We would love to have actual throughput in sequences processed per second for the BERT-Base workload in the table above, but we don’t, and so we had to content ourselves with relative performance.
So what does this table show? For a batch size of 1, the performance of AITemplate on the AMD MI250 and on the Nvidia A100 is the same – 1.8 times better than inference run through PyTorch in eager mode on the A100. The Nvidia A100 with 40 GB costs $10,000, and we estimate the AMD MI250, with its much larger 128 GB of memory, at $12,000. The price/performance of AITemplate running on the A100 is 44 percent better, and running on the MI250 it is 33 percent better, than PyTorch in eager mode on the A100. If you push the batch size up to 2, both GPU chips on the MI250 card kick in and performance doubles to 3.6X on the AMD GPU while it holds steady at 1.8X on the Nvidia GPU; at that point, the AMD MI250 plus AITemplate combination has 67 percent better bang for the buck than the Nvidia A100 plus PyTorch combination, and the Nvidia A100 plus AITemplate combination stays at a 44 percent improvement in price/performance. At a batch size of 4, the Nvidia A100 plus AITemplate advantage over the A100 plus PyTorch drops (it is only a 20 percent increase), and the MI250 plus AITemplate advantage over the A100 plus PyTorch rises slightly.
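To spell out how we turned those relative performance numbers and street prices into bang for the buck figures, here is the arithmetic, with cost per unit of performance as the yardstick (the prices are our estimates, as noted above):

```python
# Relative BERT-Base performance, normalized to PyTorch eager mode on an A100 = 1.0,
# and our estimated street prices for the accelerators.
baseline = {"perf": 1.0, "price": 10_000}                      # A100 40GB + PyTorch eager
combos = {
    "A100 + AITemplate (batch 1)":  {"perf": 1.8, "price": 10_000},
    "MI250 + AITemplate (batch 1)": {"perf": 1.8, "price": 12_000},
    "MI250 + AITemplate (batch 2)": {"perf": 3.6, "price": 12_000},
}

base_cost_per_perf = baseline["price"] / baseline["perf"]
for name, c in combos.items():
    cost_per_perf = c["price"] / c["perf"]
    improvement = 1 - cost_per_perf / base_cost_per_perf       # lower cost per unit of work
    print(f"{name}: {improvement:.0%} better price/performance")
# Prints roughly 44%, 33%, and 67%, matching the figures cited above.
```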
Assuming TensorRT scales similarly, delivering 1.92X the performance of AITemplate running on the same A100 with 40 GB, as Nvidia says, then you can see we are projecting a 3.46X multiplier for TensorRT on the A100 at batch sizes of 1 and 2, and a 2.3X multiplier at a batch size of 4. And if you move up to the H100 GPU, and assume INT8 performance scales by the 3X delta in peak performance between the A100 and the H100, you get multipliers of 8.86X and 5.91X. But since the H100 might cost around $26,000, you are getting 3X the work for 2.6X the cost.
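And here is the projection arithmetic behind the TensorRT rows, carrying the same caveats: the 1.92X factor is Nvidia’s claim and the $26,000 H100 price is our guess.

```python
# TensorRT INT8 on the A100, projected from the AITemplate-on-A100 multipliers
# using Nvidia's claimed 1.92X advantage at the same batch size.
ait_on_a100 = {"batch sizes 1 and 2": 1.8, "batch size 4": 1.2}
for batch, mult in ait_on_a100.items():
    print(f"TensorRT on A100, {batch}: {mult * 1.92:.2f}X over PyTorch eager")
# ~3.46X for batch sizes 1 and 2, ~2.30X for batch size 4, as in the table.

# Rough H100 value check: about 3X the projected inference work for 2.6X the price.
h100_price, a100_price = 26_000, 10_000
print(f"H100 costs {h100_price / a100_price:.1f}X an A100 for ~3X the projected throughput")
```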
We would also like to see how AITemplate performs on the Nvidia P4, T4, and L40 inference cards, which are probably what Meta Platforms would want to use, if it uses them at all.
What is the lesson in all of this? Do your own benchmarks on your own workloads, and think very carefully about how and where you want to deploy AI inference.
Nothing will likely be cheaper than the few hundred dollars of extra cost for the modest AI inference performance that will be built into CPUs. That is why enterprises, despite what hyperscalers and cloud builders are doing with their disaggregated and networked compute engines, will likely run a lot of their inference on their CPUs – or maybe even on their DPUs – and certainly at their edges for the foreseeable future. For places that need to do a lot on GPUs – HPC simulation, AI modeling, and AI inference as part of a hybrid HPC simulation – we will definitely see GPUs get the inference action.