NLP 101: Natural Language Processing and The Era of Transformer Models
Ana-Maria Istrate
Senior Research Scientist

The Evolution and Impact of Transformer Models in NLP: From Word Embeddings to ChatGPT
The field of natural language processing (NLP) has seen remarkable advancements in recent years, thanks to the development of transformer models. These sophisticated algorithms have revolutionized the way machines understand human language, pushing the boundaries of AI's capabilities in understanding and generating text. Let's delve into the world of NLP, its applications, and the transformative power of transformer models.
Introduction to Natural Language Processing (NLP)
Natural Language Processing, or NLP, is a branch of artificial intelligence focused on enabling machines to comprehend, interpret, and produce human language. The advent of deep learning models, particularly transformer-based models, has led to breakthroughs in this area.
Common NLP Tasks: Understanding the Range of Applications
- Question Answering: Training models to retrieve answers from text.
- Named Entity Recognition: Locating and classifying entities in text.
- Summarization: Condensing text into summaries, either through extractive or abstractive methods.
- Sentiment Analysis: Determining the tone or sentiment of text.
- Machine Translation: Translating text from one language to another.
- Entailment: Discerning whether one sentence can be inferred from another.
- Text Generation: Producing new text based on prompts.
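To make a couple of the tasks above concrete, here is a minimal sketch using the Hugging Face `transformers` pipeline API. The pipeline task names are real, but letting the library download its default checkpoints (as done here) is a simplifying assumption; in practice you would pick specific models.

```python
# Minimal sketch of two NLP tasks via Hugging Face pipelines
# (default checkpoints are downloaded on first use).
from transformers import pipeline

# Sentiment analysis: classify the polarity of a sentence.
sentiment = pipeline("sentiment-analysis")
print(sentiment("Fresh baked bread and pastries are some of the great joys of life."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Abstractive summarization: condense a passage into a shorter paraphrase.
summarizer = pipeline("summarization")
passage = (
    "Peter and Elizabeth took a taxi to attend a night party in the city. "
    "While at the party, Elizabeth collapsed and was rushed to the hospital."
)
print(summarizer(passage, max_length=30, min_length=5, do_sample=False))
```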
Building Blocks of NLP: From RNNs to Attention Mechanisms
Transforming natural language into something machines can understand begins with word embeddings, which are numerical representations of words. Early state-of-the-art models, like recurrent neural networks (RNNs), processed text sequentially to create these embeddings. A key ingredient that led to the development of transformer models, however, was the mechanism of attention, which allows models to weigh the importance of different words in a context to produce more accurate representations.
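As a toy illustration of the embedding idea, the sketch below represents a few words as hand-made vectors (the numbers are invented for illustration, not learned) and compares them with cosine similarity, the standard closeness measure in embedding space.

```python
# Toy word embeddings: each word is a vector of numbers, and cosine
# similarity measures how close two words are in that space.
import numpy as np

embeddings = {
    "oven":         np.array([0.9, 0.8, 0.1]),
    "refrigerator": np.array([0.8, 0.9, 0.2]),
    "paint":        np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["oven"], embeddings["refrigerator"]))  # high
print(cosine_similarity(embeddings["oven"], embeddings["paint"]))         # lower
```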
The Transformer Model Architecture: A Game-Changer in AI
The transformer architecture, different from its RNN predecessor, processes entire sequences simultaneously rather than one word at a time. This shift to parallel processing not only speeds up training but also solves the problem of long-range dependencies in text—that is, understanding the relationship between words that are far apart in a sequence.
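The heart of that parallel processing is the attention computation, which operates on the whole sequence as one matrix. Below is a minimal NumPy sketch of scaled dot-product self-attention; the shapes and random weights are placeholders, not a faithful reproduction of any particular model.

```python
# Minimal sketch of scaled dot-product self-attention over a whole sequence.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (sequence_length, d_model) -- every position is processed in parallel.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word attends to every other word
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8)
```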
Enter the Era of Large Language Models
Transformer models are the backbone of today’s large language models such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), and T5 (Text-to-Text Transfer Transformer).
BERT: Focusing on Context
BERT is known for its ability to consider the context from both directions—left-to-right and right-to-left. This bidirectional context is especially useful in tasks that require understanding the full context of sentences, such as sentence classification or question answering.
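A minimal, hedged example of BERT's masked-word prediction, assuming the publicly available `bert-base-uncased` checkpoint and the Hugging Face `fill-mask` pipeline:

```python
# Predict a masked word with a BERT-style model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The students opened their [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```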
GPT Models: Leading the Charge in Text Generation
GPT models, trained to predict the next word in a sequence, excel in text generation tasks. The latest iterations, like GPT-3 and GPT-4, have shown remarkable proficiency, able to perform tasks with little to no task-specific training.
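As a small illustration of the next-word-prediction objective, the sketch below uses the openly available GPT-2 checkpoint as a stand-in for its much larger successors:

```python
# Generate a continuation of a prompt with a small GPT-style model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```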
T5: The Versatile Transformer
T5 operates with an encoder-decoder framework to perform various text-based tasks. Its training involves predicting spans of text, making it highly adaptable to different NLP challenges.
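A minimal sketch of T5's text-to-text interface, assuming the small public `t5-small` checkpoint; each task, from translation to summarization, is phrased as input text with a task prefix and answered as output text:

```python
# T5: every task is "text in, text out", selected by a prefix.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The house is wonderful."))
print(t5("summarize: Peter and Elizabeth took a taxi to attend a night party "
         "in the city. While at the party, Elizabeth collapsed and was rushed "
         "to the hospital."))
```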
The Future and Beyond: Expanding the Reach of NLP
Transformer models aren't limited to text-based tasks. They're now being applied in diverse fields such as biology, with protein language models predicting protein structures—a testament to the versatile power of these AI models.
As the NLP field evolves, it continues to integrate more aspects of human cognition and interaction into its models. This integration is leading to an even greater range of AI capabilities, from conversational agents to multimodal models that can understand both text and images.
Conclusion
The development of transformer models represents a pivotal moment in the evolution of NLP. With each advancement, from word embeddings to the latest GPT releases, we edge closer to creating AI that can seamlessly interact with human language. It's a thrilling time for researchers, developers, and enthusiasts alike, as we witness these AI models reshape what's possible in understanding and generating natural language.
Get in Touch
For more insights into the remarkable world of NLP and transformer models, or to explore potential collaborations in the realm of science and education, do not hesitate to reach out to Ana-Maria Istrate at the Chan Zuckerberg Initiative. You can find her on LinkedIn or check out the research and advancements facilitated by the CZI Science team in pushing the frontiers of knowledge and technology.
Video Transcription
OK, so maybe we can just get started. Hi, thank you so much for being here today. My name is Ana-Maria Istrate and I am a senior research scientist at the Chan Zuckerberg Initiative, where I work in machine learning, and particularly in natural language processing, to support our teams in science. I am incredibly excited to be here with you today and talk about natural language processing and transformer models. I work with some of the models we'll be talking about today on a daily basis as part of my job, so I'm incredibly excited to share some of this information with you. We live in really exciting times. There are many breakthroughs in AI/ML, and particularly in natural language processing. Surely many of you have heard of or interacted with systems like ChatGPT, which is an intelligent conversational agent that is able to do things like answer questions intelligently, summarize, perform chain-of-thought reasoning, work through equations, write code, and even, as pictured here, plan a trip to Hawaii. This is a system developed by the team at OpenAI, who prior to ChatGPT also worked on DALL-E, which is another really impressive machine learning generation system.
This is a text-to-image generation system where you can give it a text input such as "an oil pastel drawing of an annoyed cat in a spaceship," and the model is able to give you back a picture of an oil pastel drawing of an annoyed cat in a spaceship, which I find pretty astonishing. There are other examples, such as AlphaFold, a machine learning model that has been used to predict the 3D folding structure of a protein from its amino acid sequence. This is a model developed by DeepMind, and it's really impressive because this is actually a problem that had not been solved in biology for over 50 years. Being able to predict the 3D folding structure of a protein from what they call its 1D structure, or just the amino acid sequence, had simply not been solved, and machine learning has been able to do it. AlphaCode is similarly a transformer-based model that has been able to generate code that performs so well it places at the top of coding competitions. Google Search uses BERT, which is a transformer-based model, to understand the meaning behind the user's question.
What all of these models and systems have in common is that they use deep learning architectures that at their core are based on transformer models, and that is what we will be talking about today. In the time we spend together, we will try to understand how we came to have these transformer model architectures, look at some examples, and understand how they work. We will start fairly broad and end up quite technical towards the end of the talk. First, we'll start with a broad introduction to natural language processing and some common NLP tasks. We will continue by looking at some foundational NLP building blocks: we'll talk about word embeddings and recurrent neural networks, which were the state-of-the-art models in natural language processing before transformer-based models, and we will also look at the mechanism of attention, which is what ultimately led to these really performant transformer models. We will then look at the transformer model architecture and dig deeper, going into more detail on some particular large language models such as BERT, GPT, and T5.
That being said, I am super excited to go on this journey with you today, so let's get started. First of all, to start with definitions: what exactly do we mean by natural language processing, or NLP? NLP is a branch of artificial intelligence. If we think about natural language as the language in which we as humans communicate with each other, whether through speech or text, then natural language processing is simply the ability of a machine to understand or process written and spoken human language. Not all natural language processing models are necessarily machine learning models, or even deep learning models.
And not all machine learning models are deep learning models either. When we say deep learning models, we usually mean these really massive model architectures that have hundreds of millions, billions, or, more recently with ChatGPT or GPT-4, trillions of parameters.
So the intersection of these really large deep learning models and natural language processing is where these large language models live. Now we want to talk about some of the things you can do with these models: what are some of the types of tasks you can train these machine learning models to do or to help you with? That's what we're going to go over in the next few slides. The first one, question answering, is where you want to train a model to retrieve answers to questions that you're posing in natural language. This is one of the features that we see in ChatGPT, for instance. Here, given a passage of text, you can pose a question in natural language to the model, and then you retrieve the answer back, and hopefully the answer is correct. That's the goal. In this example, we have a sports-based passage and some questions such as which NFL team won Super Bowl 51, what does AFC stand for, and what year was Super Bowl 50. In each case, the model learns to identify the correct answer from the text.
Here we have a passage, but of course you can extrapolate this and build models that answer questions from documents, research papers, maybe chapters of books, and other more complicated things like that. Another type of task is called named entity recognition. Here, given a text, you want to locate and classify the named entities mentioned in the text. Let's say we have the sentence: on the 50th of September, Tim Cook announced that Apple wants to acquire ABC Group from New York for $1 billion. We have a number of entities we are interested in, such as date, person, organization, GPE (which stands for geopolitical entity), or money, and we train a model that is able to identify these entities in text and tell us that the 50th of September is a date, Tim Cook is a person, Apple and ABC Group are organizations, New York is a location, and $1 billion is money.
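As a rough illustration of the named entity recognition task just described, here is a minimal sketch using a token-classification pipeline; the `dslim/bert-base-NER` checkpoint is an assumption, and any NER-fine-tuned model could be swapped in.

```python
# Named entity recognition with a token-classification pipeline.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Tim Cook announced that Apple wants to acquire ABC Group from New York for $1 billion."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))
```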
Summarization, at its very core, is when you want to reduce the size of a document by creating a summary, and you can do this mainly in two ways. Extractive summarization is where you're just extracting various sentences or words from the text and patching them together. We have an example here: Peter and Elizabeth took a taxi to attend a night party in the city; while at the party, Elizabeth collapsed and was rushed to the hospital. The summary we're getting from the model is "Peter and Elizabeth attend party city. Elizabeth rushed hospital," and you can see that this is not necessarily wrong. It does convey the most important information, and it's probably also similar to how people would text, but it might not make the most grammatical sense. The model is just learning to identify the most important words in the text and patching them together.
Abstractive summarization, in contrast, is where the model learns to paraphrase and generate a summary that makes more logical sense. You can see that the model here learned to paraphrase more intelligently, producing the summary "Elizabeth was hospitalized after attending a party with Peter."
It sounds better and more coherent. Sentiment analysis is where we want to classify the polarity of a given text. For example, maybe you want to classify something as positive, negative, or neutral, or whatever else you want to classify it as; you can even look at political types of classifications. We have two examples here from Twitter. The first one, "fresh baked bread and pastries are some of the great joys of life," is clearly a very positive take. The second one, I'm not going to read it, but it's a little bit more negative, so we're able to pick that up. Machine translation is something we've probably all used at least once or twice, especially if you've used something like Google Translate or have traveled; it's where you want to translate a piece of text or speech from one language to another. Entailment is a really interesting task, actually one of my favorites. It's when you train a model to discern whether one sentence can be inferred from another; basically, you want to understand whether two sentences mean the same thing. In this case, we have the premise "an old woman walking down the street" and multiple examples to compare it against.
For the first two, "an old woman is walking outside" and "a lady is walking down the street," we can train a model to identify that these examples can be entailed from our premise; they basically mean the same thing. The second pair of sentences, "a young lady is walking" and "a lady is sleeping on the street," are contradictory to our premise, because we know the lady is not young and we know the lady is not sleeping, since she is walking. So we can confidently say these are contradictions. The third pair is even more interesting, because we have "an old woman in a red top walking" and "a woman walking with her dog." This could or could not be true; we just don't know, because we don't have enough information in our premise to be able to tell. This is super interesting because the model has to understand not just the meaning of the sentences, but it also has to reason over them and find consistencies or inconsistencies between the premise and the sentences it's looking at.
You can also do things like sentence similarity, which is where you want to determine how similar two pieces of text are. In this case, we have the sentence "machine learning is so easy" and then a number of other sentences we want to compare it to, ranked in order of similarity score. Of course, something like "deep learning is so straightforward" is quite similar to saying "machine learning is so easy." If you say something like "I can't believe how much I struggled with this," that's almost a contradiction, so the similarity score will be very low.
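A minimal sketch of that sentence similarity setup, assuming the `sentence-transformers` library and its `all-MiniLM-L6-v2` checkpoint; sentences are embedded and then ranked by cosine similarity.

```python
# Rank candidate sentences by cosine similarity to a query sentence.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "Machine learning is so easy."
candidates = [
    "Deep learning is so straightforward.",
    "This is so difficult, like rocket science.",
    "I can't believe how much I struggled with this.",
]
query_emb = model.encode(query, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_emb)[0]
for sentence, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(round(score, 3), sentence)
```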
And you can do this kind of similarity not just at the sentence level, but also at the paragraph or document level; again, you can increase the complexity as much as you want. The last type of task I want to talk about is text generation. This is where you want to produce new text based on a prompt, and it's a lot of what the GPT models have been trained to do. You give your model a prompt such as "once upon a time," and the model learns to generate text from that prompt. One more recent, interesting area is prompt engineering, which is where you embed the description of the task you expect the model to do in the prompt itself. The model is then able to figure out the task it's supposed to do from your natural-language description of that task. So instead of programming a model for a particular task, you'd be able to just tell the model in natural language what that task is. If we go back to our example of "on the 50th of September, Tim Cook announced that Apple wants to acquire ABC Group from New York for $1 billion," we were looking at extracting various types of entities from this sentence.
So instead of training a model specifically for that, a named entity recognition model, we can tell a GPT model that we want to extract all of the important entities mentioned in the text: first extract all company names, then extract all people names, then extract whatever other types of things we're interested in.
The idea here is that your model learns to do that based on just its general knowledge, without you having to specifically tune it for that. This is why it's called one-shot learning, where the model has not been specifically trained for this task but is still able to do it really well. In some cases, you can even give it a few more examples and hope that the model learns from those few examples; that would be called few-shot learning. This is a fairly recent area that became more popular with these ChatGPT models, so it will be really interesting and exciting to see where it goes.
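As a rough sketch of the prompting idea just described, the example below embeds the task description and one worked example directly in the prompt and lets a generative model complete it. GPT-2 is used only because it is small and openly available; the point being made in the talk is about much larger models, which handle this far better.

```python
# Few-shot prompting: the task description and an example live in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Extract the company names from the sentence.\n"
    "Sentence: Google acquired DeepMind in 2014.\n"
    "Companies: Google, DeepMind\n"
    "Sentence: Tim Cook announced that Apple wants to acquire ABC Group.\n"
    "Companies:"
)
print(generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"])
```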
This concludes the first portion of our time today: we've covered what natural language processing is and some types of natural language processing tasks. We will now move into the second portion of the talk, looking at some foundational NLP building blocks, in particular word embeddings, RNNs (recurrent neural networks), and attention. Earlier in the talk, we defined natural language processing as the ability of a machine to understand or process written and spoken human language, and a word is one of the most fundamental building blocks of natural language. So one of the very first questions we want to ask ourselves is: how can we represent a word in a way that the computer is able to understand? The computer obviously can't understand human language; it doesn't have a notion for that. However, it can handle numbers. So we want to represent the meaning of a word through what we call a word embedding, or vector: a vector of numbers that corresponds to a meaningful representation of the word. The next question becomes: how do you choose these numbers such that you get meaningful representations? One of the key ideas here is that a word's meaning is given by the words it frequently appears near. There's this quote, "you shall know a word by the company it keeps." What that means is that when you're computing a word's embedding, you're looking at its context, or the set of other words that it's surrounded by.
So when you build the vector representation of that word, you use the many contexts in which you see that word and aggregate over them, and then you can expect that words appearing in similar contexts will end up having similar word embeddings. This can be measured with something like a dot product or cosine similarity, because again, these are just vectors of numbers, so you can look at how close they are in the embedding space. We have a few examples here. If you look at the cluster of yellow things at the top, you see that things like oven, refrigerator, and microwave end up being clustered together. In the cluster of red things at the bottom, finish, color, and paint are also close together, and so on. This is likely because these words had similar contexts in the sentences in which they appeared when the model was trained. Now, as we know, words don't occur in natural language in isolation; they appear in sentences. So once we have these word embeddings, the next question becomes: how can you model the relationships between words in a way that the computer is able to understand?
One of the most successful models in natural language processing to do this, and what was state of the art before transformer models came to be, is recurrent neural networks, or RNNs. The idea here is that you have a neural network that is able to model sequential data, with information flowing from left to right. So you have a sentence like "the students opened their books," and what you first do is compute some embeddings for each of these words, either through a lookup or some learned embeddings; this would be the very first, purple layer that we have here. Then you pass these embeddings through some hidden layers, and each word is connected to the ones that follow it. You can see the arrows going from left to right between the cells in the second layer, the one that's in red, and what these arrows signify is that information is flowing from left to right.
So information flows from previous words to following words, and this is how the model is able to hold information and have this notion of a past and a memory between words. Then, let's say we're trying to compute the representation of the last word in the sequence here, "their": we technically still have the information that's been flowing from each of the previous words, including the very first one, which is the word "the." This was a revolutionary idea, and it's what made these models state of the art for a while before transformers. In practice, though, it is hard to keep track of these long-range dependencies between words, particularly if your sequences get really long, and information might get lost, which is why we have seen the development of new and more performant models.
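A minimal sketch of that left-to-right flow: a vanilla RNN cell applied in a Python loop, where each step's hidden state depends on the previous one. All weights here are random placeholders; the point is the strictly sequential update.

```python
# A vanilla RNN reads a sentence one word at a time,
# carrying a hidden state from left to right.
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_hidden = 8, 16
W_xh = rng.normal(size=(d_embed, d_hidden))   # input-to-hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden (the "memory" connection)

sentence = ["the", "students", "opened", "their", "books"]
embeddings = {word: rng.normal(size=d_embed) for word in sentence}

h = np.zeros(d_hidden)                        # initial hidden state
for word in sentence:                         # strictly sequential: step t depends on step t-1
    h = np.tanh(embeddings[word] @ W_xh + h @ W_hh)
print(h.shape)                                # final state summarizes the whole sequence
```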
The other notion I want to talk about, which was also a groundbreaking concept in natural language processing, is self-attention. Here, again, we want to compute the representation, or embedding, of a word, and we want to know: what other words in the sequence should I pay attention to in order to compute my own meaning? What else in my own sentence should I be paying attention to? As an example, we have the sentence "the animal didn't cross the street because it was too wide." If we want to compute a representation for the word "it," we want to know whether it refers to the animal or the street. As humans, we are able to say that it probably refers to the street, not the animal, because the animal was not too wide. The way the model is able to understand this is by computing attention scores between the word "it" and every other word in the sentence. These attention scores end up determining how much each of the other words should contribute to the representation of the word "it," and we see that in this case the model learns to attend to the word street the most. The cool thing here is that, unlike the RNN models we've talked about, where we had this sequentiality, you don't need to feed the words in one at a time to be able to compute these scores; you can feed your whole input sequence at once.
This notion of self-attention is what sits at the core of the transformer model architecture, which is what we will go into next. The transformer model architecture is, at its core, a neural network architecture. Again, unlike recurrent neural networks, you don't have this notion of sequentiality: words don't get read one at a time, but the input gets seen by the model all at once, and the model is based on this idea of self-attention. On the left of the slide is what the transformer looks like. It's based on an encoder-decoder architecture. The encoder, at least in the very first version, has six layers and is in charge of processing the words in the input sequence. The decoder, again in the baseline version, has six layers and is in charge of outputting the answer. You can see that inside each of these components, both encoder and decoder, you have what is called multi-head attention, and this is where self-attention comes into play: you have multiple attention heads, where each attention head represents a different type of attention that the model is learning.
So the model is learning different types of relationships between words by attending to different things at different times and learning from them. Again, those can be different types of relationships between words, various degrees of semantic meaning, and other, more intelligent things. This model architecture has been used quite widely in natural language processing.
It's at the core of the large language models we're seeing, it's also used in computer vision, and it has surpassed RNNs in terms of being state of the art. So why is that? I want to take just a little bit of time to explain. For RNNs, you have this sequentiality again, reading one word at a time. If you want to compute embeddings or representations for words that are very far away from each other, say keeping track of the relationship between the first and the last word in your sequence, then that takes on the order of O(sequence length) steps.
That's the number of steps you have to take to compute those relationships, and because of the sequentiality, these operations are not parallelizable; you end up with O(sequence length) non-parallelizable operations. In contrast, for transformers, the input is read all at once, and you perform a very small, constant number of sequential steps every time.
And because we have seen many recent advancements in machine learning hardware that is really good at parallel processing, transformers are better suited to this hardware: they are far easier to parallelize than recurrent neural networks, since they're missing the sequentiality.
This in turn leads to faster training, up to an order of magnitude or more. The other reason is that in RNNs, information flows from left to right, but it might actually get lost, especially for longer sequences, because it's pretty hard to learn these long-distance dependencies between words. In contrast, in the transformer architecture you have this notion of self-attention, which means you can model the relationships between words all at once, regardless of the respective positions of the words in the sentence, and information is just better preserved between words because of that.
It's actually been shown to lead to higher accuracy, regardless of training time. To recap, the reason these transformer-based models have become so popular is that they have reduced training times compared to RNNs, and information doesn't get lost across long-term dependencies. You have this perfect combination: better hardware that's optimized for parallelism, transformer models that are easier to parallelize, and higher performance because they're able to use this notion of attention. This is why we've seen this increased development of large transformer-based models, not just language models, but transformer-based models generally. The large language models such as BERT, GPT, and T5, which are what we're going to talk about, are all based on this notion of transformers.
So how do these transformer-based models, or large language models, learn? The main way these models learn is through a process called pre-training. The idea here is that you start with some large unlabeled corpora: a lot of these models use corpora created from the web, or maybe large corpora of books, or Wikipedia, something like that. So you start with some really large unlabeled corpora, and the really cool thing is that you don't need any sort of supervised or labeled data to learn from them. You train the models in what is called a semi-supervised fashion, and there are multiple strategies; I just want to go over two that I think are among the most common. You have masked language modeling, which is where you mask a word in a sentence and then ask the model to predict it. We have an example here: "Students opened their books when the teacher entered the room. The teacher started the lesson. Someone raised their hand." What you're doing is choosing a number of words, let's say the word "books" and then the word "started," masking these words in the input, and asking the model to learn to predict them. So you don't need any sort of labeled data for this.
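As a rough sketch of how masked language modeling turns unlabeled text into training examples, the snippet below randomly replaces a fraction of the words in a sentence with a `[MASK]` token and records the originals as prediction targets (real implementations mask subword tokens and use a few extra tricks).

```python
# Build toy masked-language-modeling examples from unlabeled text.
import random

def mask_words(sentence, mask_prob=0.15, seed=0):
    random.seed(seed)
    tokens, targets = [], {}
    for i, word in enumerate(sentence.split()):
        if random.random() < mask_prob:
            targets[i] = word          # what the model should learn to predict
            tokens.append("[MASK]")
        else:
            tokens.append(word)
    return " ".join(tokens), targets

masked, targets = mask_words(
    "The students opened their books when the teacher entered the room."
)
print(masked)
print(targets)
```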
Another strategy is next sentence prediction, which is where you ask the model to predict whether two sentences follow each other in text. Again, looking at our little paragraph of three sentences, we can see that in the first example the model learns to predict that the sentences are indeed following each other, and in the second example that they are not. This is how the model learns these representations and distributions of words from a large unlabeled corpus. These large models usually have hundreds of millions or billions of parameters and take a very long time to pre-train.
So if you want to use them and you have a particular task you're interested in, you can imagine that you don't want to train them from scratch, and you might also not have the resources, whether that's compute or money. You can still get the benefits of these models by using transfer learning and fine-tuning them on particular tasks. Fine-tuning means that you take the weights of the pretrained model and use them as the initialization for a new model that learns from labeled data, usually on a supervised task. You can fine-tune these models on whatever task you want, such as named entity recognition, question answering, and others that we talked about at the beginning of the talk. Fine-tuning takes far less time because there are fewer parameters to learn and the training datasets are usually smaller, and this is how you get the benefit of these super large models on your own particular task.
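A hedged sketch of that fine-tuning recipe, using the Hugging Face `Trainer`; the checkpoint name, the two-example toy dataset, and the hyperparameters are all illustrative assumptions, not a prescription.

```python
# Transfer learning: load pretrained weights, attach a fresh classification
# head, and fine-tune on a (toy) labeled dataset.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny toy dataset; in practice you would use a real labeled dataset.
texts = ["Fresh baked bread is one of the great joys of life.",
         "This was a terrible experience."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = [{**{k: v[i] for k, v in encodings.items()}, "labels": labels[i]}
           for i in range(len(texts))]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```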
Now I want to move on and spend a little bit of time touching on some of the key model architectures we've seen in recent years. We're going to go in depth into BERT, GPT, and T5, starting with BERT. BERT stands for Bidirectional Encoder Representations from Transformers. The pre-training methodology for BERT considers bidirectional context, which means that it looks at the sentence both right-to-left and left-to-right when training. If we consider the transformer architecture, which we have here in the lower right corner, and if you remember this model architecture, the transformer has an encoder and a decoder module. BERT only uses the encoder of the transformer architecture for pre-training, which is why it's able to look at the entire sequence during pre-training. So when you're computing a word representation, you'll have access to both the words that came before it and the words that came after it. It's pre-trained on two tasks: the masked language modeling task that we've talked about, where we replace some of the words with the mask token and ask the model to predict those words,
and also on the next sentence prediction task. It's trained on a combination of the BooksCorpus and English Wikipedia corpora, which have a total of over 3 billion words. Because it's able to look at the entire sentence and has what we call bidirectional context, it's best used in tasks where you need to understand the full sentence. So if you want to classify a sentence, or you want to extract different types of entities, as in named entity recognition, word classification, or extractive question answering, BERT models would be the best suited.
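As a small illustration of the extractive question answering use case just mentioned, here is a minimal sketch with a SQuAD-fine-tuned BERT-style checkpoint (the specific model name is an assumption):

```python
# Extractive question answering: the answer is a span copied from the context.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = ("Super Bowl 50 was an American football game to determine the champion "
           "of the National Football League for the 2015 season. The Denver Broncos "
           "defeated the Carolina Panthers 24-10.")
print(qa(question="Which NFL team won Super Bowl 50?", context=context))
```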
Next, I'll talk about the GPT-3 model. The family of GPT models really started with GPT-1 and GPT-2 and then evolved from there; GPT-1, the first GPT model, came out around the same time as BERT. The GPT models are generative models, and if we go back to the transformer architecture that has both an encoder and a decoder: where BERT uses the encoder for pre-training, GPT uses the decoder. They are trained to predict the next word in a sequence, which means that attention only has access to past words; you cannot condition on future words. This is called autoregressive, and it is why GPT is a generative model: what it has actually learned to do is just predict the next word in a sentence. These models are trained on huge amounts of corpora, largely obtained by scraping the web; there are a number of datasets, such as Common Crawl, WebText, books corpora, and Wikipedia.
Again, they are best used for text generation, because this is literally what their training objective has been, and it's also why they do so well at generating all this content in ChatGPT. So then you have these iterative models, GPT-3, GPT-3.5, and GPT-4, and I want to spend a little bit of time on some of the most interesting differences between them. GPT-3.5 is a GPT-3 model that is further fine-tuned using something called reinforcement learning from human feedback. What happens here is that you start with the GPT-3 pretrained model, and then, if you remember, we talked about pre-training and fine-tuning. So what they do is supervised fine-tuning, where they have curators, human AI trainers, that provide conversations in which they play both sides. This is the training data they use for supervised learning: you have curators that give a prompt and then a desired answer to that prompt. Once you have that supervised model, you create the reward model for reinforcement learning.
Similarly, using curators, you ask them to converse with ChatGPT, have them rank the answers they get back, and then give the model specific rewards for each of the conversations.
Then, using those reward models, you can fine-tune the model again using something called proximal policy optimization. If you're interested, I highly encourage you to check out the OpenAI blog posts; they have a lot of content describing their methodologies, as well as pointers to their research papers and reports. It's super interesting to look into. GPT-4 goes one step further and is a large multimodal model, which means it's able to take both text and images as input and output text. We have an example here where a user gives an image and a question; the user asks why the image is funny, and the model is not only able to tell you, but also give you a detailed explanation of why it's funny. This is called chain of thought, where the model is able to explain how it arrived at a certain conclusion, which is super interesting in itself. One more thing: GPT-4 is generally better aligned with human intent, meaning that the answers it gives back are better aligned with what the human intended through the prompt.
It achieves really high performance on a number of professional and academic benchmarks; again, if you're interested, I encourage you to take a look. One really interesting thing is that it passed a simulated bar exam with a score around the top 10% of test takers, versus GPT-3.5, whose score was around the bottom 10%. The last type of transformer-based model I want to mention very briefly is the T5 models. Again going back to the transformer architecture, T5 uses both the encoder and decoder modules, and it's pretrained through what is called span corruption. This means that we replace random spans of text containing several words with a single special mask word, and the objective is then to predict the text that this masked word replaced.
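To make span corruption concrete, here is a minimal sketch of how one T5-style training pair might be constructed; the sentence and the hand-picked spans follow the style of the published examples, but the code is only an illustration, not T5's actual preprocessing.

```python
# Build one toy span-corruption training pair: spans in the input are replaced
# by sentinel tokens, and the target reproduces the text behind each sentinel.
original = "Thank you for inviting me to your party last week ."
tokens = original.split()

# Corrupt two spans (positions chosen by hand here; in training they are random).
spans = [(2, 4), (8, 9)]           # "for inviting" and "last"
inputs, targets, next_token = [], [], 0
for span_id, (start, end) in enumerate(spans):
    inputs.extend(tokens[next_token:start])
    inputs.append(f"<extra_id_{span_id}>")
    targets.append(f"<extra_id_{span_id}>")
    targets.extend(tokens[start:end])
    next_token = end
inputs.extend(tokens[next_token:])
targets.append(f"<extra_id_{len(spans)}>")

print("input: ", " ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print("target:", " ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```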
All of these models we've talked about are really huge, so I want to spend just one minute on the sizes we're looking at. We started by talking about BERT, which has around 300 million parameters. T5 has around 11 billion parameters. GPT-1 is not on this chart, but it started at around 117 million parameters, so roughly similar to BERT. GPT-2 ended up with 1.5 billion parameters, GPT-3 with 175 billion, and GPT-4's size is not really known, but it's estimated to have at least one trillion parameters. That concludes the technical portion of our talk; I know we're also at time. The last thing I'd like to share is a little bit about exciting applications of these large language models outside of natural language processing, and one thing I'm particularly excited about, because one thing I care about is science, is protein language models. Here, instead of working with natural language, we're working with protein sequences made of amino acids. The parallel is that one word corresponds to one amino acid, or a sequence of amino acids, in a protein sequence; instead of words in a sentence, you end up having amino acids in a protein sequence. Then you can apply these models to protein and other types of related data and make similar kinds of contributions and advancements in biological models.
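As a rough sketch of that protein-language-model parallel: treat the amino acid sequence as the "sentence" and get one embedding per residue from a pretrained protein language model (the ESM-2 checkpoint name is an assumption; any protein LM on the model hub could be substituted).

```python
# Embed a protein sequence with a pretrained protein language model.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # amino-acid "words"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state
print(embeddings.shape)   # (1, sequence length + special tokens, hidden size)
```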
So to sum up: again, really exciting times, lots of cool models, from intelligent conversational agents to multimodal models, text-to-image, and applications to biology and coding. This is really a very exciting time to do natural language processing. We definitely went on a journey together today and covered a lot. We started with an intro to natural language processing, placed NLP in the broader context of AI, and looked at various types of NLP tasks. We then moved into some foundational NLP building blocks and talked about word embeddings, RNNs, and the notion of attention, in particular self-attention. Last but not least, we talked about transformer-based models, looked at the transformer architecture, and discussed BERT, GPT, and T5. So thank you very much for being here. Here's my contact info. I work at CZI Science; if you're not familiar with us or our work, we are a tech-based philanthropy focused on science and education, so please check us out if you're interested. Also feel free to reach out to me; I am on LinkedIn and would love to stay in touch. If you have questions right now, I'd be more than happy to take them, though I don't think we have much time.
But if there are lingering questions, I'd love to connect on LinkedIn or through my email. Thank you.