Attention? Attention! - Attention Mechanism In Deep Learning by Nicole Koenigstein
Video Transcription
Welcome everyone to my talk today about the attention mechanism in deep learning. I'm very excited to be part of the Women in Tech Global Conference this year. Let me shortly introduce myself: I'm Nicole Koenigstein, and I'm currently working as the data science and technology lead at Impactvise, a big data and analytics company that uses machine learning to analyze companies. All right. So today's talk is structured as follows: first I will give you a short background on sequence-to-sequence modeling and the attention mechanism in general. Then we go over the different types of attention, then we dive deeper into self-attention and multi-head attention, and then we round up with some practical use cases showing how you can actually apply the attention mechanism in deep learning. All right. So before attention and transformers, sequence-to-sequence models looked pretty much like the illustration on the current slide. The elements of the sequence, x1, x2, up to xN, are usually called tokens, and they can be literally anything, for instance text representations, pixels or, in the case of videos, images. Further, the two sequences can be of the same or of arbitrary length.
And in case you are now asking yourself: yes, recurrent neural networks dominated this kind of task, and the reason is really simple. We just want to treat sequences sequentially, and that sounds straightforward, right? But interestingly, the transformers proved to us that it is not. To look at it from another angle and get a bit more understanding of the encoder and decoder: they are nothing more than stacked RNN layers, such as LSTMs. The encoder processes the input and produces one compact representation, called z, from all the input time steps.
It can be regarded as a compressed format of the input. The decoder, on the other hand, receives the context vector z and then generates the output sequence. The most common application of sequence-to-sequence models is language translation, where we can think of the input sequence as a sentence in English and the output as the same sentence in Spanish. RNN-based architectures used to work very well here, especially with long short-term memory (LSTM) and gated recurrent unit (GRU) cells as components. All right. So now you might ask yourself: if that worked well, what's the problem? The problem is that they only work well for small sequences, like the one shown on the current slide, with fewer than roughly 20 time steps. The intermediate representation z cannot encode information from all the input time steps, which is commonly known as the bottleneck problem: the vector z needs to capture all the information about the source sentence.
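To make the encoder-decoder idea a bit more concrete, here is a minimal sketch, assuming a PyTorch-style LSTM encoder and decoder (my own illustration with made-up vocabulary sizes and dimensions, not code from the talk): the encoder compresses the whole input into the state z, and the decoder has to generate the output from z alone, which is exactly the bottleneck just described.

```python
# Minimal encoder-decoder sketch: the decoder only ever sees the final encoder
# state z, so all information about the source sequence must pass through it.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_in=1000, vocab_out=1000, hidden=128):
        super().__init__()
        self.embed_in = nn.Embedding(vocab_in, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.embed_out = nn.Embedding(vocab_out, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_out)

    def forward(self, src, tgt):
        _, z = self.encoder(self.embed_in(src))       # z: the compact context
        out, _ = self.decoder(self.embed_out(tgt), z)  # decoder starts from z only
        return self.proj(out)

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 1000])
```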
And the mathematics in theory indicate that this is possible. In practice, however, how far we can see into the past, the so-called reference window, is finite, and RNNs tend to forget information from time steps that lie further back. Moreover, stacked RNN layers usually create the well-known vanishing gradient problem, as visualized on the current slide. Alex Graves, a research scientist at DeepMind, stated in 2020 that "memory is attention through time". So let's dive a bit deeper into what attention is really all about. To give you a bit more of an example: you might know that we as humans seldom utilize all available inputs to complete a task. Think about listening to one of your friends talking, maybe in a coffee shop with many other people around you, different conversations going on, people placing their orders. All of that tends to fall into the background, because you, with your complex and sophisticated brain and ears, are capable of paying attention only to what's really important to you.
That is, in that moment, your friend, while you selectively ignore the things occurring around you that are simply not relevant to you at that moment. The important thing here is that your attention is adaptive to the situation: you will ignore the background and just listen to your friend, as long as there is nothing more important than your friend. If, for instance, a fire alarm went off, you would stop paying attention to your friend and shift your focus, your attention, to this new important thing. Thus, attention is about making the importance of the inputs adaptive, or, in machine learning terms, the importance of the features. All right. So now that we have some basic understanding of what attention is all about, let's go over the different types of attention. We have soft and hard attention, also global and local attention, and lastly we will look deeper into self-attention. With soft attention, the alignment weights are learned and "softly" applied, for instance, to all patches in the original picture. By doing so, the model stays smooth and differentiable, which is an advantage because we can train it with gradient descent: soft attention is fully differentiable because it is a deterministic mechanism. The disadvantage, however, is that it becomes costly when the source input is large.
In contrast, hard attention chooses just one area of the picture to work on at a time. The advantage here is that less computation is required during inference. However, the model is non-differentiable, which necessitates a more complex training method: we have to estimate the gradients by Monte Carlo sampling, and because we use Monte Carlo sampling, it is now a stochastic process. You see here in the upper row the examples of applying soft attention, and in contrast, in the bottom row, we have hard attention, and here is how the formulas look side by side. All right. So attention is often calculated across the entire input sequence, which is then simply global attention. Despite its simplicity, this calculation can be computationally costly and sometimes really unneeded. As a consequence, several papers have advocated local attention as a possible remedy.
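As a rough NumPy sketch of this contrast (my own illustration with made-up feature vectors and scores, not the formulas from the slide): soft attention takes a differentiable weighted sum over all patches, while hard attention samples a single patch, which is why Monte Carlo estimation is needed during training.

```python
# Soft attention: deterministic, differentiable weighted sum over all patches.
# Hard attention: stochastic choice of one patch, sampled from the same weights.
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(5, 8))              # e.g. 5 image patches, 8-dim each
scores = rng.normal(size=5)                    # alignment scores (normally learned)
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights

soft_context = alpha @ patches                 # weighted sum over all patches
hard_context = patches[rng.choice(5, p=alpha)] # one sampled patch (Monte Carlo)
```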
Obviously, this can be preferable for extremely lengthy sequences. Local attention can also be seen as a kind of hard attention, since one must make a choice to reject certain input units. The main distinction between hard and local attention models is that local attention is nearly always differentiable while, as you already know by now, hard attention is not.
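To illustrate the global-versus-local distinction with a small, hedged NumPy sketch (the sequence, dimensions and window size are all assumptions for the example): global attention scores the current position against every position in the sequence, while local attention restricts the scores to a small window around it.

```python
# Global attention attends over all positions; local attention only over a window.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

x = np.random.default_rng(1).normal(size=(12, 4))  # 12 tokens, 4-dim each
i, window = 6, 2                                   # current position and window size
global_weights = softmax(x @ x[i])                 # weights over all 12 positions
local_idx = np.arange(i - window, i + window + 1)  # positions 4..8 only
local_weights = softmax(x[local_idx] @ x[i])       # weights over the window
```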
All right. So now let's dive a bit deeper into what self-attention is all about. It's really one of the most commonly used attention types, and it's called key-value self-attention. Self-attention is used in transformer-based architectures, which are primarily applied to language understanding tasks. With the use of attention, they eschew recurrence in the neural network, as they rely entirely on the self-attention mechanism to draw global dependencies between the inputs and the outputs. Remember, at the beginning of the talk I showed you how sequence-to-sequence models usually work, with all the recurrent cells in between. Instead of looking for an input-output sequence association or alignment, we are now looking for scores between the elements of the sequence itself, as we see here on the current slide: we have a probability score matrix and then simply look for those relations between the elements. All right. So what is the math behind this? Let's look a bit deeper. Self-attention is also a sequence-to-sequence operation, composed of the input, which is a sequence of tensors, and the output, which again is a sequence of tensors, where each output is a weighted sum over the input sequence.
To produce the output vector y_i, the self-attention operation simply takes a weighted average over all the inputs, and the simplest option to compute the weights is the dot product. This is how it looks graphically: we have our inputs at the bottom and our outputs at the top, and in between there is just some combination of all these inputs. So really, at heart, the operation of self-attention is very simple: every output is simply a weighted sum over the inputs. The trick is that the weights in this sum are not parameters, as you might know them from normal neural networks; they are derived from the inputs. In mathematical terms, the weight w_ij is a derived value: to get it, we first compute the raw weight w'_ij as the dot product of the i-th element of the sequence with the j-th element, w'_ij = x_i · x_j. After computing the raw weights, we simply apply the softmax function so that, for each output, the weights sum up to one. And this is how it looks graphically: we have a sequence of five inputs and five outputs, and we now focus on the computation of y_3. To do so, we take the corresponding vector x_3 to generate the weights.
We do that by taking the dot product with every vector: we start with x_1 and x_3, then we do the same with x_2 and x_3, and so on. Once we have all five raw weights, we take the softmax so that they sum up to one, then we multiply each input vector by the weight we just computed, and finally we sum them all up, which gives us the vector y_3. And of course we do the same for all output vectors. You might see the problem here: if we did that for every input and output vector one by one, we would end up with a lot of loops, and obviously this is something we don't want to do in machine learning or deep learning; we want a vectorized version of it. To vectorize the computation of all the raw weights W', we can simply compute one large matrix of all dot products of X with itself, so this W' matrix contains every dot product of every input vector with every other input vector. Then we apply the softmax function row-wise to that matrix.
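Here is a minimal NumPy sketch of that vectorized computation (my own illustration with random inputs, not the slide code): the raw weight matrix W' holds every dot product of X with itself, the softmax is applied row-wise, and a single matrix multiplication then produces all outputs at once.

```python
# Simple (unparameterized) self-attention, fully vectorized.
import numpy as np

def row_softmax(w):
    e = np.exp(w - w.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.default_rng(2).normal(size=(5, 4))  # five input vectors x_1..x_5
W_prime = X @ X.T                                 # w'_ij = x_i . x_j
W = row_softmax(W_prime)                          # every row now sums to one
Y = W @ X                                         # y_i = sum_j w_ij * x_j
```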
And then we simply multiply this weight matrix with our input matrix X, which gives us all the weighted sums in one matrix multiplication. One thing we need to consider to arrive at the actual self-attention is that every input vector occurs in three different roles. First, there is the vector that is used in the weighted sum that finally provides the output, indicated here in blue; this is called the value. Second, there is the input vector that corresponds to the current output and is matched against every other input vector.
This is called the query. And lastly, there is the vector that is matched against the query, shown here on the current slide in orange, which is called the key. Let's go over the steps a bit more in isolation to really understand what's going on. We have the input vector x_i, and as you see right now, it is used in three different ways. First, it is compared to every other vector to compute the attention weights for its own output; that is the query role. Second, it is compared to every other vector to compute the attention weights for the other outputs; we can think of that as the key role. And lastly, it is summed up with all the other vectors, weighted by the attention, to form the result; this serves as the value. To give you a bit more intuition about the keys, the query and the values, we can compare them to a dictionary, as you usually use it in any programming language; I used Python here. We have values stored under keys, and we can request a value with a query.
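As a tiny, purely illustrative Python snippet (the words are made up, not from the slide): a dictionary lookup is a "hard" version of this idea, where the query either matches a key exactly or not at all, whereas attention matches the query softly against every key and mixes the corresponding values.

```python
# A plain Python dictionary: keys map to values, a query retrieves one value by
# exact match. Self-attention does the same thing "softly" with dot products.
lookup = {"cat": "feline", "dog": "canine", "bird": "avian"}  # keys -> values
query = "dog"
print(lookup[query])  # exact key match returns "canine"
```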
Mathematically speaking, to summarize: we can think of these processes as matrix multiplications. We then introduce a softmax function to make sure the weights sum up to one, we transform each input vector with these three matrices, and we combine the results in the three aforementioned ways to produce the output y_i. If we look at scaled self-attention, which was used in the original paper "Attention Is All You Need", the starting point of the transformer models, it is basically really simple: instead of the normal dot product we use a scaled dot product, scaled by the square root of the input dimension. We do this because, as the dimension of the input vectors grows, so does the size of the dot products, and the scaling factor is the square root of d. With that we normalize the average dot product, which keeps the weights within a certain range, and with that trick we avoid vanishing gradients coming from the softmax, because the softmax is not linear and saturates for large inputs. As you remember, I already mentioned that RNNs suffer from that particular problem, and here we can remedy it by using scaled dot-product attention. And yeah, I already mentioned "Attention Is All You Need", which started the era of transformer models.
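Here is a hedged NumPy sketch of scaled dot-product self-attention as just described (my own minimal version; the three weight matrices are random placeholders that would normally be learned during training).

```python
# Scaled dot-product self-attention with query/key/value projections.
import numpy as np

def row_softmax(w):
    e = np.exp(w - w.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d = 4
X = rng.normal(size=(5, d))                                  # five input vectors
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # placeholder weights

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each input in its three roles
weights = row_softmax(Q @ K.T / np.sqrt(d))  # dot products scaled by sqrt(d)
Y = weights @ V                              # attention-weighted sums
```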
Here you will also encounter multi-head attention, and that just means that each head is a set of the three weight matrices I just mentioned. We stack these heads to learn the weights simultaneously, and important to note is that with three heads, the combined matrices simply become three times as big as for one head, so we are basically learning those transformations at once. Why might this be interesting, why do we want to do that? Well, in many sentences there are different relations to consider. On the current slide, the meaning of the word "bad" is inverted by "not", while its relation to the word "movie" is completely different: it describes a property of the movie. The idea behind multi-head self-attention is therefore that multiple relations in a sentence are best captured by different self-attention operations. And of course, since these heads are independent from each other, we can perform the self-attention computations in parallel with different workers: we might use worker one for the first head, worker two for the second, and worker three for the last attention matrix.
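A minimal NumPy sketch of multi-head self-attention (again my own illustration, not the slide code): each head has its own query, key and value matrices, the heads are independent of each other, so they could run in parallel, and their outputs are concatenated at the end.

```python
# Multi-head self-attention: independent heads, concatenated outputs.
import numpy as np

def row_softmax(w):
    e = np.exp(w - w.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return row_softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(4)
d, n_heads = 6, 3
X = rng.normal(size=(5, d))
heads = [tuple(rng.normal(size=(d, d // n_heads)) for _ in range(3))
         for _ in range(n_heads)]                  # one (W_q, W_k, W_v) per head

Y = np.concatenate([attention_head(X, *h) for h in heads], axis=-1)  # shape (5, 6)
```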
This is a real speed-up in computational terms, and it's really an advantage over RNNs, because with RNNs we cannot do parallel computing. All right. So now that we know and understand the basics of attention, let's go into some use cases of how you actually use the attention mechanism. I already mentioned "Attention Is All You Need". This work represented the first sequence transduction model based entirely on the attention mechanism, replacing the recurrent layers most commonly used in the encoder-decoder architecture, which I introduced at the beginning, with multi-head self-attention.
To give you a bit more intuition of how that actually works, let's take a brief look at some code. I added the paper here with some links to more resources, and I will skip the positional encoding used here; you can look that up later via the link to a pretty good blog post about it. So, step one: we take the word embeddings together with the positional information and stack them into a single array to get the input matrix. Step two: the weight matrices. Note that I just assigned them some numbers; in practice they would be learned by the network during training and would usually be initialized with very small random numbers. Step three: we do the mentioned matrix multiplications to obtain the queries, the keys and the values, so we can process each input vector in the different roles I just mentioned. Step four: we compute the scaled attention scores, to keep these matrices within a certain range, and then we apply the softmax. Step five: we apply the softmaxed attention scores to each value vector.
I deliberately do it here a bit more granularly, instead of using the aforementioned vectorized version, so that you get a bit more intuition and insight into how it's really done. Then we do the computation for the first score, and as the last step we sum everything up to get the first line of the output matrix. And of course, for step eight, we repeat all the steps from one to seven to do the same for all the inputs. Then here is the example of how we use multi-head attention: we just assume that we have trained these eight attention heads, and what we then do is concatenate these heads to obtain the output of the model. And to give you something to play around with, and a bit more of a feeling for what you can really achieve with transformer models in a very easy way,
I provide you with some code using the Hugging Face API. You can just run it within a Jupyter notebook: you pick the pipeline for what you want to do, for example translation from English to German, put in the sentence you want to translate, and you get the result back. As my mother tongue is German, I can assure you that it is the correct translation. As the next example, I provided text classification: here we have just the sentence "I enjoyed the movie", and the model correctly indicated, with over 99% certainty, that this is a positive review. Another interesting thing is that we can use it to answer questions from text: here we have a question and some context, and those of you who have already read "The Hitchhiker's Guide to the Galaxy" know that the number 42 is the answer to life, the universe and everything. I will provide the link to that notebook later on.
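Here is a rough sketch of the Hugging Face pipeline calls described above (my own minimal version, not the exact notebook; the default models are downloaded on first use).

```python
from transformers import pipeline

# Translation from English to German
translator = pipeline("translation_en_to_de")
print(translator("Attention is all you need."))

# Text classification / sentiment analysis
classifier = pipeline("sentiment-analysis")
print(classifier("I enjoyed the movie."))  # expect a POSITIVE label

# Question answering over a short context
qa = pipeline("question-answering")
print(qa(question="What is the answer to life, the universe and everything?",
         context="The answer to life, the universe and everything is 42."))
```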
All right. So let's look at some more examples and use cases. I also want to show you the Vision Transformer. This work is based on the idea that image patches are basically sequence tokens, just like words, as I mentioned at the beginning of the presentation. In fact, the encoder block is the same as in the original transformer model I just introduced to you. And if the Vision Transformer is trained on a data set with more than 14 million images, it can approach or even beat state-of-the-art convolutional neural networks. Let's dive in a bit deeper and look at how it actually works. We split an image into patches, then we flatten those patches, then we produce lower-dimensional linear embeddings from the flattened patches, then we add positional embeddings, and then we feed the sequence as input to a standard Transformer encoder (a sketch of this patch-embedding step follows after this paragraph). Then, as usual, we pre-train the model with image labels, fully supervised, on a large data set, and as the last part we fine-tune it downstream on the data set we actually want to classify. On the left we see the performance of the Vision Transformer pre-trained on different data sets, and on the right a comparison of the performance and computational trade-off against ResNet-based Big Transfer models, which are among the state-of-the-art convolutional neural networks for image classification.
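Here is a rough PyTorch sketch of the patch-embedding steps listed above (my own illustration; the 224x224 image size, 16x16 patch size and embedding dimension are assumptions): split the image into patches, flatten them, project them to a lower-dimensional embedding, and add learned positional embeddings before handing the sequence to the encoder.

```python
# From an image to a token sequence for a Transformer encoder.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                   # one RGB image
patch, dim = 16, 128
n_patches = (224 // patch) ** 2                     # 196 patches

patches = img.unfold(2, patch, patch).unfold(3, patch, patch)          # split
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)  # flatten

to_embedding = nn.Linear(3 * patch * patch, dim)    # linear patch embedding
pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # learned positional embeddings
tokens = to_embedding(patches) + pos                # sequence for the encoder
print(tokens.shape)                                 # torch.Size([1, 196, 128])
```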
To investigate the impact of different data set sizes on the model, the authors of the paper trained the Vision Transformer on different image data sets. They used ImageNet-21k, which consists of those 14 million images and 21,000 classes, and they also used JFT, which has 300 million images and 18,000 classes. They then compared the results against state-of-the-art convolutional neural networks, that is, Big Transfer models with a ResNet backbone, trained on the same data sets. Interestingly, the Vision Transformer performed significantly worse than the CNN and Big Transfer models when trained on the smallest data set, ImageNet with about 1 million images. However, when it is pre-trained on ImageNet-21k with 14 million images, the performance is comparable, and when it is pre-trained on the large data set with 300 million images, it outperforms the state-of-the-art ResNet-based models. Another very interesting use case is using attention for time series anomaly detection. Here, the authors of the paper used it in combination with a long short-term memory, which we see here in the architecture, within a variational recurrent autoencoder. I won't go into what the variational recurrent autoencoder is; I will provide references to the paper and you can read it up later on.
I just want to focus here on the attention part. They had an energy time series, and this is how the time series looks without any faults or anomalies; and here we have a simple example of an inverter fault. If we now look at what they did: they used the attention weights generated by the model to get more information about the anomaly in that time series. For comparison, this is the plot of a normal time series together with its accompanying attention weights. Here we have just a minor fault, a brief shading, so it's not that big of an anomaly, and we see there is almost no difference in the weights, just a little bit here and a little bit there. But if we then go to the anomaly caused by snow coverage, we see that the attention is distributed quite differently. That really helps to better understand what is going on with the anomaly in the time series, and it enables predicting the anomaly earlier on. This is usually an unsupervised task, and it is very hard, because anomalies are usually very sparse in the time series, so you need more information to really be able to predict that anomaly.
As a last example, I want to show you how to use attention for stock price prediction. This is work I recently published a paper on, so it's something I did myself: I used the attention mechanism to modulate information within a modular network structure. This is how the network structure actually looks. We have active and inactive modules; I won't go very deep into what the modules are, just note that we have sparse communication here, and the communication happens via key-value attention. The attention scores decide, or help the network decide, which information should be transferred between the previous and the next time step. Further, the network also has its input modulated by attention: the importance of the input and the activation of the modules are likewise modeled using key-value attention. And here is just an excerpt of the results from training that network. I used just the history of closing prices, and I used it in combination with closing prices and news sentiment.
If we compare the predictions with the state of the art, a long short-term memory and a simple recurrent neural network, we see that the network that uses the attention mechanism shows quite an improvement in its capability to predict the stock prices.
And if we add the second feature, the news sentiment, which is especially hard because it is an alternative data set with a lot of noise in it that most networks really struggle to handle, we see that with the attention mechanism we can significantly outperform the state-of-the-art networks. Moreover, as I just showed for the example with the energy time series, we can also use the attention weights here to enhance the explainability of the network: to really analyze what is actually going on in the network, to better understand which decisions the network makes to predict the time series in a certain way, and then to look further into them. On the current slide we see a plot of the activation pattern of that network, which is driven by the input attention mechanism. Here we have the different modules, which get activated according to the importance of the features at the respective time steps, and here we have the time axis. You can easily observe that we get quite a different pattern between using just the closing price, shown here in black, and using the combination of the closing price of that stock with the accompanying news sentiment score, shown in the other color. You can really use that to see which event drove a certain decision, and then try to give the network more to learn about that event.
All right. So, as mentioned, here are the sources for the use cases, with all the links to the papers, and here is the link to my GitHub repo. There you'll find all the code for the notebooks: my notebook explaining the attention mechanism in NumPy, and also the use cases, so the Vision Transformer notebook with an example you can run and train by yourself, and also two notebooks about the energy time series along with the code I implemented from that paper. So feel free to explore, and if you have any questions, feel free to ping me on LinkedIn, just send me a private message. I'm happy to answer any questions you might have.