Powering the Future of Sustainable AI through Specialized AI Hardware Accelerators by Kaoutar El Maghraoui


Video Transcription

Thank you so much, everyone, for joining this session. Hello everyone, wherever you are. My name is Kaoutar El Maghraoui, and I'm a principal research scientist at IBM T.J. Watson Research Center. Today it is my pleasure to present as part of the Women Tech Global Conference and talk about the future of sustainable AI through specialized AI hardware accelerators. First, I would like to start by mentioning that AI, and especially deep learning, has achieved incredible performance in numerous fields, as we have all noticed, including computer vision, speech recognition, natural language processing, et cetera. However, with this huge success comes increased complexity. First, these neural network models are getting larger and deeper, with millions and even billions of parameters. In fact, the number of parameters in deep learning models is increasing on the order of 10x per year. GPT-3, for example, has approximately 175 billion parameters, about two orders of magnitude more than its immediate predecessor GPT-2 and more than 10 times as many as the previous record holder, the Turing-NLG model from Microsoft. And this year, a couple of models have been published that are even larger, breaking the trillion-parameter threshold for the first time.

One example is the Switch Transformer, published by Google in January 2021. The other trend we see is that with this increased size and complexity of deep learning models come high computational demands.

In fact, compute requirements for large AI training jobs have been doubling every 3.5 months. Additionally, most of these models are too computationally expensive to run on devices like cell phones, embedded devices, or edge devices. The third thing, which is also the most important, is that this increase in computational needs also translates into an increased carbon footprint and high costs. Researchers at the University of Massachusetts Amherst, for example, released a startling report in 2019 estimating that the amount of power required for training and searching a certain neural network architecture involves emissions of roughly 626,000 pounds of carbon dioxide.

That's equivalent to nearly five times the lifetime emissions of an average US car, including its manufacturing. If we keep going at this rate, we're going to consume the entire energy budget of the world just to train AI models, and that's really not sustainable. So we've seen all these trends: the increase in model complexity, the unbounded computational demands, and the increase in carbon footprint. All of this necessitates approaches and methodologies for efficient and sustainable AI.

As the future of AI is moving to the edge, and AI becomes increasingly important at the edge, we need to consider more seriously a shift from state-of-the-art model accuracy to state-of-the-art model efficiency. Learning at the edge, especially with reinforcement learning, is going to play a key role in the future.

With edge intelligence we also face various challenges, such as resource-constrained devices (CPU, memory, power, et cetera), the difficulty of deploying these models on a fleet of devices, and unstable network connectivity, which is often not guaranteed.

There is also the fact that the data often cannot leave the edge, especially because of privacy and security concerns, for example in the case of healthcare. For all of these reasons, it has become very important to work on software and hardware innovations across the entire stack for purpose-built AI compute infrastructures. This requires a holistic approach across the entire stack, from materials and architecture to algorithms all the way to software. In this grand landscape of building efficient neural networks, we see three approaches coming into play. The first is increasing model efficiency through the design of accurate and efficient neural networks. The second is increasing hardware efficiency through purpose-built AI accelerators. And the third is increasing design efficiency by automating the design of efficient neural network architectures using neural architecture search techniques. I'm going to cover examples from each one of these key pillars. Looking at the first one, which focuses on model efficiency, some of the popular approaches that the research community and industry have been looking at include pruning, compact convolution filters, reduced precision, and knowledge distillation.

So what do we mean by pruning? This is a way of compressing models. The goal of pruning is to optimize the model by eliminating values of the weight tensors. The aim is to get a computationally cost-efficient model that takes less time in training. Pruning is also a way of reducing the size of the neural network through compression: after the network is pre-trained, it is fine-tuned to determine the importance of the various connections of the neural network, as the small sketch below illustrates.
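As a hedged illustration of the idea (this is my own minimal sketch using PyTorch's built-in pruning utilities, not the specific method used at IBM; the layer size and the 30% pruning amount are arbitrary choices), magnitude-based pruning can look like this:

# Minimal magnitude-pruning sketch (illustrative only): zero out the 30%
# smallest-magnitude weights of a layer, fine-tune, then make it permanent.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)                               # stands in for a layer of a pretrained model
prune.l1_unstructured(layer, name="weight", amount=0.3)   # mask the smallest 30% of weights

# ... fine-tune the model here so the remaining weights compensate ...

prune.remove(layer, "weight")                             # bake the mask into the weights
sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2%}")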

Another approach is looking at more efficient neural network architectures. For example, in the case of convolutional neural networks, compact convolutions aim at optimizing the downsampling and also reducing the number of channels and kernel sizes, for example reducing kernel sizes from 3x3 filters to 1x1 filters; the short example below shows how much that saves.
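To make the kernel-size point concrete, here is a small illustrative comparison (my own example; the 64-channel width is an arbitrary choice) of the parameter counts of a 3x3 versus a 1x1 convolution in PyTorch:

# Parameter-count comparison: 3x3 vs 1x1 convolution, 64 -> 64 channels.
import torch.nn as nn

conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
conv1x1 = nn.Conv2d(64, 64, kernel_size=1, bias=False)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print("3x3 conv parameters:", n_params(conv3x3))   # 64 * 64 * 3 * 3 = 36,864
print("1x1 conv parameters:", n_params(conv1x1))   # 64 * 64 * 1 * 1 = 4,096 (9x fewer)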

Reduced precision, or quantization, is another popular approach. The fundamental idea behind quantization is that if we convert the weights and inputs into integer types, the multiply-accumulate operations in hardware consume much less memory, and these operations also become much faster. So you achieve two things: faster compute and a smaller energy footprint. The last approach here is knowledge distillation, which is also a type of model compression. The way this is done is by teaching a smaller network, step by step, exactly what to do using a bigger, already-trained network. The soft labels refer to the output feature maps produced by the bigger network after every convolution layer, and the smaller network is then trained to learn the exact behavior of the bigger network by trying to replicate its output at every level; a common formulation of the loss is sketched below.
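As a hedged sketch of one common way the student is taught from the teacher's softened outputs (the temperature T, the weighting alpha, and the function name are illustrative choices of mine, not the specific recipe discussed here):

# Sketch of a standard knowledge-distillation loss: the student mimics the
# teacher's softened output distribution while still fitting the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard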

So this is also a very interesting technique to compress models and make them much more efficient and practical for deployment. With regard to hardware efficiency, what are we doing in this space? This is a huge area of investment and focus for us at IBM Research. What we're doing here is reimagining the future of AI compute and realizing it beyond the world of GPUs, or graphics processing units, which are the current state of the art for AI hardware acceleration. GPUs are best suited for the workloads we have today, but they are by no means the ideal architecture to implement some of these AI architectures, especially emerging ones. To address these challenges, on February 7th of 2019, IBM Research launched a research collaboration center, the AI Hardware Center, to drive next-generation AI hardware accelerators.

IBM made a commitment here to invest $2 billion, and New York State also invested $300 million, over the next five years. This has generated a lot of interest in the technical community, and it is a very serious endeavor for IBM Research and the IBM company to build an ecosystem of enterprise and academic partners and to drive next-generation AI hardware accelerators. When we founded the center, the idea was to address these compute challenges, and we believe that addressing them requires a full-stack approach. You cannot solve these issues by focusing on just a single piece of the stack; it is important to focus on the entire stack. The four technical thrusts we're focusing on are, first, cores and architectures, which focuses on how we reinvent compute for deep learning and AI, focusing mainly but not exclusively on digital compute and taking strong advantage of reduced-precision scaling to increase compute efficiency.

If we move to the right here, when precision is reduced enough, analog computation starts to become feasible. Analog compute is challenging if you need to do, for example, 32-bit multiply-accumulate (MAC) operations; it works much better if your precision requirements are in the range of 2 to 4 bits.

So here we're taking an interesting approach where we're combining fundamental materials innovations with architectural and algorithmic innovations to get to even more compute efficiency with analog compute. Moving to heterogeneous integration, the focus here is to figure out how we design systems that are well balanced.

If you scale the compute efficiency very high but do not maintain system balance, such as scaling the bandwidth, the connectivity, and the memory capacity, you end up with a lot of unused silicon. To maintain that overall balance, we focus on using heterogeneous integration techniques. And then finally, the end-user thrust brings in the end users and the software ecosystem. The center is hardware-heavy, but it is also important to show these innovations in the context of key workloads and end-user use cases and applications, and to study these emerging workloads from a software perspective to see how they will inform the future design of our hardware.

What we show here is IBM Research's roadmap for what's beyond the GPU. We've plotted some existing and projected performance metrics of mostly GPU hardware that exists on the market or has been announced by major players. As you see, there has been steady progress at a rate of about 2.5x per year over the last few years. So the question is: what technology do we need to develop to continue this rate of progress and go beyond the GPU? The first approach we're taking is to build accelerators that exploit the fact that machine learning models are highly tolerant to reduced-precision computation. The next advances will come from exploiting analog computation, or in-memory compute technologies, which could potentially offer another 100x in energy efficiency. This figure shows the key ingredients we are following for building efficient AI systems.

Building software stacks that can maximize the peak efficiency of the accelerator while at the same time allowing ease of programmability is the focus of our efforts. Three ingredients come into play here: approximate computing, which leverages resiliency to approximate computation to benefit efficiency; hardware accelerators, which are specialized computing systems for AI; and custom compilers, which build the software stack that extracts that efficiency without sacrificing end-user productivity.

Those all need to work together to drive AI efficiency. So let's start by looking at what we're doing in the reduced-precision space. First, let me explain what we mean by reduced-precision computation. As you probably know, the transactional systems we have built, which have enabled so many things we take for granted in the modern economy, are built on the premise that 32 or 64 bits of precision are required. In many cases this is true: if you're a bank, or you deal with hundreds of millions of transactions a day, even small errors can translate to millions of dollars lost every year. But on the other hand, machine learning and AI systems are very tolerant to errors in the model, because in the end the errors can be averaged out and high accuracy can still be achieved; they are statistical by nature. Take as an example the picture of the Mona Lisa that we see here on the top right: even with all of these missing pixels, you can still recognize that this is the Mona Lisa.

We have been pushing the boundaries of precision scaling. Precision scaling involves performing the core computations of the neural network, such as the convolutions or the GEMM operations (general matrix multiplies), at scaled precision, that is, at lower bit widths.

However, the auxiliary operations, such as the activations, the pooling, and the normalization, are done at higher precision, after which they are scaled and quantized and then fed back into the next layer, and so on. We have pioneered several advances to enable the use of ultra-low precision for both training and inference. For example, we have demonstrated training in FP16 (16-bit floating point) and FP8 (8-bit floating point) formats, and more recently we have also demonstrated training in INT4 (4-bit integer) formats. In the context of inference, we have driven the scaling all the way down to 4 bits and 2 bits. One thing to notice here is that if you build a core at 16-bit precision, it's going to be 4x smaller than a core at 32-bit precision, so the footprint of the hardware is quadratically smaller, and that also drives a quadratic improvement in compute efficiency. This just shows our reduced-precision scaling roadmap; a simplified sketch of the quantize-then-rescale idea follows below.
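To give a flavor of what scaled precision means for the GEMM while an auxiliary operation stays at higher precision, here is a deliberately simplified, symmetric fake-quantization sketch of my own; the real FP8/INT4 training recipes mentioned above are considerably more sophisticated:

# Simplified symmetric fake-quantization sketch: run the matrix multiply on
# INT4-quantized values, keep the activation function in full precision.
import torch

def quantize_int4(x):
    scale = x.abs().max() / 7.0                    # symmetric INT4 range is [-8, 7]
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

w = torch.randn(128, 64)
a = torch.randn(64, 32)
qw, sw = quantize_int4(w)
qa, sa = quantize_int4(a)

y = (qw @ qa) * (sw * sa)      # low-precision GEMM, rescaled back to real values
y = torch.relu(y)              # auxiliary op (activation) kept at higher precision
print("max error vs FP32 GEMM:", (y - torch.relu(w @ a)).abs().max().item())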

It is important that we do this while maintaining iso-accuracy, meaning that we aim to preserve the accuracy of the model that you would get at full precision. In 2015 we proposed the world's first reduced-precision neural network training and started this precision-scaling journey. From there on, we have demonstrated at several conferences how we scale precision by 2x while improving hardware performance by 4 to 6x, without losing model accuracy. Now, moving on to analog: the next step for specialized AI hardware is really solving the performance and efficiency loss that comes from data movement between computational units and memory. What we're looking at here is using the laws of physics to perform these computations in the analog domain. Let me explain what analog AI is all about: non-volatile memory, or NVM, technology enables in-memory computing, which eliminates the von Neumann bottleneck and allows performing computations directly in memory.

This is accomplished by mapping the neural network onto analog hardware and storing the neural network weights as a property of the non-volatile material itself. What this accomplishes is eliminating the data movement. This approach offers unparalleled speed-up and energy efficiency for AI workloads.

Let me explain how this concept works fundamentally. First, I would like to go over some of the key building blocks of deep learning. We have the multiply-accumulate operations (multiply and add), the updates that happen to the weights during gradient descent, and the activation functions that are applied between the layers to introduce non-linearity into the model, things like sigmoid, softmax, ReLU, and so on. So matrix manipulations and non-linear activation functions are recurring operations in deep neural networks, and they are fundamental to how neural networks operate; the tiny sketch below spells this out.
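As a tiny, purely illustrative sketch of these building blocks (the shapes, values, and the placeholder gradient are arbitrary):

# One layer: multiply-accumulate (matrix-vector product) plus a non-linearity,
# followed by a plain gradient-descent weight update.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))          # weights of one layer
x = rng.standard_normal(8)               # input vector

y = np.maximum(W @ x, 0.0)               # MAC operations, then ReLU activation

grad_W = rng.standard_normal(W.shape)    # stand-in for the gradient from backpropagation
W = W - 0.01 * grad_W                    # the weight-update step of gradient descent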

Let's look at how this is done conceptually. A digital accelerator is typically set up like this: you have the processor, the CPU, on one side, and the memory on the other side, and they are connected by a bus. This is known as the von Neumann architecture.

Let's say that you want to multiply two numbers, A and B. They start by sitting in memory and are sent to the processor; whatever operation you want is computed in the CPU, and then the result is sent back again to the memory. This data movement back and forth between the memory and the CPU through the bus consumes a lot of energy, and additionally, the bus can become a bottleneck if there is a lot of data movement. Analog accelerators, on the other hand, get rid of the processor and do everything in memory: by storing A in the memory and sending B as a voltage, the computation just happens inherently in the memory itself, so there is no data movement. Let's look at how the neural network is mapped into memory and how this multiplication is performed. I'll explain here one of the key concepts for analog AI computation, the resistive-memory dot product.

Consider a simple circuit that has three wires and two resistors, and assume the two resistors have conductance values G1 and G2. I hope you still remember some of your physics fundamentals. If we apply voltages V1 and V2 across these wires, the first wire injects a current V1 × G1 into the vertical line, and the second wire injects a current V2 × G2 into the vertical line. These currents add up to give a total current of V1 × G1 + V2 × G2. Thus, the resulting current in the vertical line is simply the dot product of the input vector of voltages V and the vector of conductances G. All we're doing here is relying on Ohm's law and Kirchhoff's current law to perform the multiplications and additions in the analog domain. Using these laws of electrophysics, we can extend this concept further and create a dense grid of resistances, which can be programmed beforehand to give us a matrix of conductances G. When I apply a vector of voltages V as input, the current in each vertical line is the dot product of the input voltage vector and the vector of conductances programmed into that column, so the currents coming out are the result of a vector-matrix multiplication.

We call this grid of resistances an analog crosspoint array. It's kind of magic: you inject these voltages, having programmed the conductances ahead of time, and boom, your matrix-vector multiplication is done instantly, all in one step. This is what makes analog in-memory computing very fast and very efficient. You can see here that each weight is encoded with a conductance pair, G-plus and G-minus, to account for positive and negative values, and the product xi × wij and the sum are obtained using Ohm's law and Kirchhoff's current law, respectively. This is the fundamental operation behind the matrix-vector multiplications (MVMs) that happen in neural networks, and this just shows how a neural network is typically mapped onto these analog arrays. With this mechanism, we achieve parallel, fast, and energy-efficient multiply-accumulate compute; the toy numerical sketch below mimics the idea in software.
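Here is a small numerical toy model of that idea (my own sketch, not IBM's hardware): encode a signed weight matrix as a pair of non-negative conductance arrays G-plus and G-minus, apply the inputs as row voltages, and read the column currents, which equal the matrix-vector product:

# Toy model of an analog crosspoint array: weights stored as differential
# conductance pairs (G+ and G-), inputs applied as voltages, outputs read as currents.
import numpy as np

W = np.array([[ 0.5, -0.2,  0.1],
              [-0.3,  0.8, -0.6]])
x = np.array([1.0, 0.5, -1.0])          # input vector, applied as row voltages

G_plus  = np.maximum(W, 0.0)            # positive part of each weight
G_minus = np.maximum(-W, 0.0)           # negative part of each weight

# Ohm's law gives per-device currents V * G; Kirchhoff's current law sums them per column.
i_out = G_plus @ x - G_minus @ x
print(i_out)                            # equals W @ x, computed "in memory" in one step
print(W @ x)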

Let me now state some of the advantages of analog AI, for example for inference. First, due to the parallel in-memory compute, the matrix-vector product is very power efficient, because the weight data does not need to be fetched from memory as in digital accelerators; the operations are done directly on the memory itself. Second, because the analog matrix-vector product is performed using the laws of electrophysics, it is computed in constant time. This implies that, to a first approximation, a large weight matrix takes the same time to compute as a smaller one: whether the matrix is large or small, you achieve the same speed in computing these products.

This is not the case for conventional digital MAC operations, where the time to compute the matrix-vector product typically scales with the number of matrix elements. This also translates into low latency, taking advantage of the pipelined, weight-stationary nature of this architecture, with latencies around one millisecond for most models and workloads. It is also very advantageous for low mini-batch streaming workloads. However, with all of these great advantages come key challenges for analog AI accelerators. Most importantly, the analog matrix-vector product is not exact: there are many possible noise sources and non-idealities in the analog domain. For instance, as shown in this graph, repeating the same matrix-vector product with the same matrix W will result in slightly different outcomes; you can see some noise here. So the fundamental questions that we have to ask ourselves, and explore in research, are: how do we achieve acceptable accuracy for large-scale DNNs, and can we improve algorithms to cope with this inherent noise and these non-idealities in the analog domain? This was the motivation behind developing a flexible simulation toolkit for analog accelerators; the toy sketch below shows why the noise matters.
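Here is a toy sketch of that non-determinism (the Gaussian noise model and its magnitudes are arbitrary assumptions of mine, not measured device characteristics): repeating the same matrix-vector product with small random conductance and readout noise gives a slightly different answer each time:

# Toy non-ideality sketch: the same matrix-vector product, repeated with small
# random weight (conductance) noise and additive output noise, varies run to run.
import numpy as np

rng = np.random.default_rng(42)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)

def noisy_mvm(W, x, weight_noise=0.02, output_noise=0.01):
    W_prog = W + weight_noise * rng.standard_normal(W.shape)              # programming/drift noise
    return W_prog @ x + output_noise * rng.standard_normal(W.shape[0])    # readout noise

for _ in range(3):
    print(noisy_mvm(W, x))   # slightly different answer each time
print(W @ x)                 # the exact digital reference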

There is a need to investigate whether we can achieve acceptable accuracy when we accelerate important AI workloads with these future analog chips. Additionally, a simulation toolkit that simulates the expected noise sources of analog compute will also help develop new algorithms and error-compensation methods to improve the accuracy and reduce any potential impact of the expected non-idealities of analog compute.

What I have shown so far is just a glimpse of the many innovations that AI is driving in hardware. We have open-sourced this toolkit, which is a first-of-a-kind open-source toolkit built on the PyTorch framework. It has full GPU support, and it provides a suite of tools that enable you to explore and accelerate neural network architectures with analog hardware technology and power more sustainable AI models. We're also actively working on connecting this toolkit with real analog accelerator prototypes coming from IBM Research. I've put a link to the toolkit here; I hope you can explore it offline if you're interested in this technology. This toolkit has many capabilities.

It allows you to simulate analog AI training and analog AI inference, and it also has capabilities around hardware-aware training for inference, which basically makes models cope better with, or be more resilient to, analog AI noise during inference. You can also simulate a wide range of analog AI devices and crossbar configurations, because the research community is still trying to figure out the right or ideal material for these analog memory arrays: it could be phase-change memory, it could be ECRAM, it could be resistive random-access memory; many different types of resistive memory are possible for implementing these kinds of hardware accelerators. In addition to this toolkit, we also have a front-end cloud experience that we call the Analog AI Hardware Composer. This is an interactive, no-code cloud experience that allows you to do cloud-based simulations, and it also has preconfigured analog device presets. It showcases new algorithmic advancements, and it is very interactive: you can select templates, compose your neural network, select from a wide range of preconfigured device presets, choose analog-friendly optimizers, visualize the accuracy results, and configure various noise models for inference simulation. A small hedged sketch of the toolkit's training API follows below.
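For instance, a minimal training sketch in the spirit of the toolkit's introductory examples might look like the following; I am paraphrasing the public aihwkit API from memory, so exact class names and required setup calls may differ between releases:

# Hedged sketch of training a tiny analog layer with the open-source analog AI
# toolkit (aihwkit): an AnalogLinear layer simulates a crossbar tile in software,
# and AnalogSGD drives the analog-aware updates.
from torch import Tensor
from torch.nn.functional import mse_loss
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD

x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])

model = AnalogLinear(4, 2)                    # analog tile simulated in software
opt = AnalogSGD(model.parameters(), lr=0.1)   # some releases also need opt.regroup_param_groups(model)

for epoch in range(10):
    opt.zero_grad()
    loss = mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())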

So the Composer presents lots of different capabilities in a visual and user-friendly manner. If you look at the roadmap, we also have a lot in store in terms of capabilities that we're currently working on and planning to add: analog hardware access, backend integration with supercomputers (this is an example of a supercomputer that we're working on) so you can have faster simulation, more neural network templates and types supported, being able to bring your own model and datasets, and also more to explore for the materials scientists and the people interested in the hardware side of things.

They will be able to bring new device models and materials and explore them for neural network training and inference. I hope you can try this out; it is open to the public, and I've put the URL here if you're interested in exploring it further. In addition to the simulation capabilities, we're also building real analog AI inference hardware, where we have showcased first-of-a-kind industrial demonstrations of neural network inference using analog AI chips. For the first time, the IBM Research AI Hardware Center announced at the VLSI symposium, a really top conference, in 2022, a 14-nanometer, fully on-hardware deep learning inference technology used in two types of analog AI compute chips based on phase-change memory technology. We show demonstrations using the ResNet-9 neural network on the CIFAR-10 dataset for image recognition, and also a one- and two-layer multi-layer perceptron with fully connected layers on the MNIST dataset for image recognition; really cool demonstrations.

I also provide references here to the papers, if you're interested in learning more about these demonstrations. All right, this brings me to the other pillar of building efficient neural networks, which is design efficiency.

What do we mean by that? This third approach relies on building efficient neural networks by bringing in more automation. Today's neural networks are extremely complex, easily reaching millions of parameters, billions as we've seen, and even trillions more recently.

A big part of the complexity is the synthesis of the model from each dataset: you need to find the best network architecture, do the preprocessing, train the parameters to fit the given data, do hyperparameter optimization, and so on. Often this requires highly skilled data scientists to create good neural network architectures, relying on many years of accumulated experience in designing them. There are multiple things that need to be taken into account: the number, type, and order of the layers; the size and depth of each layer; the connections between the layers; the non-linear functions you need to use; and the training parameters, such as the learning rate, the optimizer choice, the loss function, and many other things. And this expertise that is built up

is basically lost from one model synthesis to another; it is not transferred, and this is not a scalable approach. This was the motivation behind AutoAI, for automating model synthesis for emerging and existing AI models. However, in the past this has mostly focused on maximizing the accuracy of the model: you basically search for the architecture that gives you the best accuracy. What we're really trying to steer this toward, and what has also been a trend in the part of the research community that cares about sustainability, is to automatically synthesize models that are not only better in accuracy

but at the same time have lower latency and a reduced energy footprint. This is basically automating the design efficiency of these models; accuracy is still important, but now we have multi-objective functions. This is an important step towards democratizing AI and achieving efficiency and automation, especially since, as I mentioned, handcrafting these neural network architectures is an expensive, time-consuming, and ad hoc process. So there is an emerging need for hardware-aware neural architecture search, which will be a key step towards democratizing AI and also achieving sustainable models. If you look at the components of neural architecture search, you typically need to compose a search space to search from; it could be, for example, a convolutional search space or an NLP, transformer-based search space. Then you need to use a search algorithm, such as a reinforcement learning technique, an evolutionary search technique, or a gradient-based (differentiable) technique. And then you need to evaluate the model accuracy while you go through your big search space, to figure out the right architecture to keep progressing toward in your search strategy, the one that gives you the best model in terms of accuracy.

Now with hardware-aware NAS, or hardware-aware neural architecture search, the picture is augmented, because we not only look at the architecture search space for these neural networks; we can also look at the hardware search space, in terms of the different components of the hardware.

We also evaluate the hardware costs in addition to evaluating the accuracy, for example by looking at real-time measurements, using performance predictors, or building surrogate models; there are different ways to evaluate. Then, once you get the best model, sometimes there are post-optimization strategies that you can use, things like transferring the model, doing quantization, or compressing the model through pruning or sparsification, and so on. This is a very active area of research right now, especially as we've seen AI models increasing in complexity and cost, with a huge carbon footprint; these factors really limit the practicality of deploying these models, and the energy efficiency of these models really needs to be a top priority for everyone. So these hardware-aware neural architecture search strategies are going to be key to automating the design efficiency of these neural architectures; the toy search sketch below illustrates the multi-objective idea.
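As a toy illustration of the multi-objective idea (the candidate space, the accuracy and latency estimators, and the 0.05 trade-off weight are all hypothetical placeholders of mine, not an actual NAS system), a hardware-aware search scores each candidate on estimated accuracy minus a hardware-cost penalty:

# Toy hardware-aware random search: score candidates on (estimated) accuracy
# minus a penalty for (estimated) latency, and keep the best trade-off.
import random

search_space = {"depth": [8, 16, 32], "width": [32, 64, 128], "bits": [2, 4, 8]}

def estimate_accuracy(cfg):          # placeholder for training/validating a candidate
    return 0.70 + 0.02 * cfg["depth"] ** 0.5 + 0.001 * cfg["width"] - 0.01 / cfg["bits"]

def estimate_latency_ms(cfg):        # placeholder for a measurement or surrogate model
    return cfg["depth"] * cfg["width"] * cfg["bits"] / 2000.0

best, best_score = None, float("-inf")
for _ in range(50):
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = estimate_accuracy(cfg) - 0.05 * estimate_latency_ms(cfg)   # multi-objective score
    if score > best_score:
        best, best_score = cfg, score
print("best candidate:", best, "score:", round(best_score, 3))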

This just shows an example: with quantization, you can significantly reduce the memory and compute requirements for edge deployments. Deployment on the edge, especially for real-time inference, is key to many application areas, and it significantly reduces the cost of communication with the cloud in terms of network bandwidth, so quantization is becoming a very popular optimization technique to apply. Here we see two examples using SSDLite with a MobileNet architecture: one for live object detection at the edge, and another that shows live image segmentation. With this ultra-low precision, and with innovations in AI hardware, many more applications will become possible, and this will be necessary to accelerate, and also reduce the carbon footprint of, multimodal, multitask AI applications on the way towards AGI, or artificial general intelligence. Basically, many applications like smart healthcare, where doctors perform remote surgery, diagnostics, and monitoring of patients online, and smart factories, where we have smart machines to improve safety and productivity and operate heavy machines,

especially those located in hard-to-reach areas; also transportation, with self-driving cars; and smart energy, where multitask and multimodal AI is connected, for example, to wind turbines and wind farms; and many more applications. I just want to conclude here by reiterating the mission of the AI Hardware Center, which really looks at all these areas to drive more energy efficiency and build sustainable AI models. I thank you for your time and your attention.