Using transformer models for your own NLP task - building an NLP model End To End by Ana-Maria Istrate

Automatic Summary

Understanding Transformer Models for Natural Language Processing

Hello everyone, I'm Ana-Maria Istrate, a Senior Research Scientist at the Chan Zuckerberg Initiative. My work focuses on Natural Language Processing (NLP) and its applications in the scientific domain. Today, I am delighted to discuss transformer models for NLP and share insights into building an effective NLP model end to end.

Introduction to Transformer Models in NLP

Transformer models have gained significant traction in NLP and computer vision. Essentially, these are deep learning models built around a principle known as self-attention: when computing the representation of a word, the model weighs how relevant every other word in the input sequence is to it. These models have been revolutionary in NLP since their introduction in 2017 and are now regarded as the state of the art.
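
To make the idea concrete, here is a minimal, self-contained sketch of scaled dot-product self-attention in Python (NumPy). The dimensions and random projection matrices are illustrative only; real transformers use multiple learned attention heads.

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projections."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values for every word
        scores = q @ k.T / np.sqrt(k.shape[-1])          # how relevant each word is to every other word
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
        return weights @ v                               # context-weighted representation per word

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                          # toy sequence: 4 "words", 8-dim embeddings
    out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
    print(out.shape)                                     # (4, 8): one context-aware vector per word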

Difference Between Recurrent Neural Networks and Transformers

Before transformers, recurrent neural networks (RNNs) were the gold standard for NLP tasks. Unlike RNNs that process inputs word by word, transformers process the entire sequence simultaneously. This capability allows more parallelization, reducing model training times.

BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and RoBERTa, to mention a few, are popular high-performance models based on the transformer architecture. Because they are pre-trained on huge amounts of text, such models capture the intricacies of natural language well, making them well suited for tasks such as named entity recognition, question answering, summarization, or machine translation.

Building a Transformer-based NLP Model from Scratch

Let's delve into the process of constructing an NLP model using transformers, focusing on the training data set, model development, and evaluation. We will illustrate this process with a named entity recognition model that we, the team at the Chan Zuckerberg Initiative, built to extract mentions of datasets and experimental methods from biomedical text.

Creating the Training Data Set

Creating a high-quality training data set is the first step in building a named entity recognition model. In our case, our pre-defined categories were datasets and experimental methods.

  • Defining Entities: Have a clear-cut definition of what gets included. The boundaries will differ depending on your specialized task or domain.
  • Using Openly Available Data Sets: If your task is general, you are likely to find relevant datasets openly available; the Hugging Face Hub alone offers thousands that can be loaded with a few lines of code (see the sketch after this list).
  • Creating a Data Set from Scratch: If your domain or task is specialized, you might need to create your own data set. The challenges vary depending on whether you start from existing product or user data (often noisy and in need of cleaning) or curate entirely from scratch. This process can be time-consuming and may require domain-specific knowledge and resources such as Mechanical Turk or other data curation services.
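
As a quick illustration of the openly available route, here is a minimal sketch using the Hugging Face datasets library; "conll2003" is just an example of a general-purpose NER dataset, not the one used in this work.

    from datasets import load_dataset

    dataset = load_dataset("conll2003")      # downloads and caches the dataset
    example = dataset["train"][0]
    print(example["tokens"])                 # the tokenized sentence
    print(example["ner_tags"])               # integer-encoded entity labels per token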

Developing the Model

Transfer Learning: Adopting a pre-trained model and fine-tuning it for your specific task can save considerable time. The idea is to take a pre-trained model such as BERT, initialize the new model with its pre-trained weights (kept frozen), and add a linear classification layer on top that is trained for the task.
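
Below is a hedged sketch of that setup using PyTorch and the Hugging Face transformers library: the pre-trained encoder is frozen and only a small linear layer on top gets trained. The checkpoint name and label set are illustrative, not the exact configuration used at CZI.

    import torch
    from transformers import AutoModel, AutoTokenizer

    checkpoint = "bert-base-uncased"                      # any pre-trained encoder would do
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint)

    for param in encoder.parameters():                    # freeze the pre-trained weights
        param.requires_grad = False

    num_labels = 5                                        # e.g. O, B-DATASET, I-DATASET, B-METHOD, I-METHOD
    classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)  # the only trained part

    inputs = tokenizer("Raw data are deposited under GSE12345.", return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state          # (batch, seq_len, hidden_size)
    logits = classifier(hidden)                           # per-token scores over the entity labels
    print(logits.shape)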

Specialized Corpora: Specialized domains tend to have different word distributions than general domains because of their specialized terminology. In these cases, models pre-trained on domain-specific data often give better performance.
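
Because such domain-specific models share BERT's architecture, adopting one is usually just a matter of pointing at a different checkpoint. The model name below is a publicly hosted biomedical example, not necessarily the one used in this work.

    from transformers import AutoModel, AutoTokenizer

    checkpoint = "dmis-lab/biobert-base-cased-v1.1"   # BioBERT: BERT pre-trained on biomedical text
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint)   # drop-in replacement for the general-domain encoder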

Evaluating the Model

Every machine learning model should undergo both quantitative and qualitative evaluation. Quantitative measures may include precision, recall, accuracy, or F1 score. However, it is also essential to sanity-check the model output through human evaluation, which can reveal where the model is failing and suggest possible improvements.
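
For token-level NER, one common way to compute these metrics is the seqeval package; the labels below are made up for illustration. Printing a handful of predictions next to the text alongside the numbers makes the qualitative sanity check easy.

    from seqeval.metrics import classification_report

    # Gold labels vs. model predictions for one toy sentence (illustrative only).
    y_true = [["O", "B-DATASET", "I-DATASET", "O", "B-METHOD"]]
    y_pred = [["O", "B-DATASET", "O",         "O", "B-METHOD"]]

    print(classification_report(y_true, y_pred))   # precision, recall, F1 per entity type

    # Qualitative check: show predictions token by token for human review.
    tokens = ["data", "GSE12345", "series", "via", "RNA-seq"]
    print(list(zip(tokens, y_pred[0])))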

Conclusion

We have discussed the process of building a transformer-based NLP model end to end, addressing key components: the training data set, model development, and model evaluation. Remember, focusing on your specific task and using appropriately tailored tools and methodologies can pave the way for an efficient and successful model. Thank you for being here. We hope to continue providing valuable insights! We invite you to check out the Chan Zuckerberg Initiative for more exciting work.

Happy learning!


Video Transcription

All right. Hi everyone, and thank you so much for being here. I am really excited to talk to you today about transformer models for natural language processing and building an NLP model end to end. First of all, I am really happy to be here today. My name is Ana-Maria Istrate, or Ana. I am a Senior Research Scientist at the Chan Zuckerberg Initiative, and my interests lie at the intersection of NLP and applications of NLP in the scientific domain, in particular text mining biomedical literature, predicting the impact of research outputs, knowledge graphs, and graph-based models.

This is the outline of what we will be covering today. I would like to share with you a little bit of background information about transformer models, in particular how they relate to natural language processing tasks. Then we will go into the process of building a transformer-based NLP model end to end. We will talk about the training data set and the importance of having good definitions of what we include in the training data set, and we will touch on the options of choosing an openly available data set versus curating one from scratch. We will then move into model development and talk about transfer learning, the idea of fine-tuning a model versus pre-training from scratch.

We will also cover some of the challenges that come with specialized corpora, and we will end with model evaluation and talk about quantitative and qualitative evaluation. For the second part of the talk, we will use the example of building a named entity recognition model to mine different types of entities from biomedical text. But if you're not interested in the biomedical domain, or you feel you don't have domain knowledge for the biomedical field, that's totally fine; it's not very relevant to our talk today, and you'll still be able to follow along. Awesome. So with that in mind, I am excited to get started and talk about transformer models in natural language processing. Transformer models have become really popular, mainly in natural language processing and computer vision. They are essentially deep learning models based on this idea of self-attention, and self-attention is a concept that helps us understand the representation of a current word given its context in the input sequence.

So for instance, we have this example of a sentence on the slide: "The animal didn't cross the street because it was too wide." If we want to compute a representation for the word "it", as humans we might know that "it" refers to the street and not to the animal. However, we need to instill this sort of information in machine learning models as well, and this is exactly what self-attention aims to do. Basically, when computing a representation for a particular word, we want to know what other words in the sequence we should attend to, what other words are relevant to our current representation. The notion of self-attention is at the core of the transformer architecture, which was first introduced in 2017. Once introduced, transformer-based natural language processing models pretty much became the state of the art in natural language processing.

Before that, the state of the art for NLP tasks were recurrent neural networks, or RNNs. The difference between RNNs and transformer models is that in RNNs the input gets processed sequentially, one word at a time, whereas in transformer architectures we are able to process the input all at once. This means that transformers allow for more parallelization than recurrent neural networks, and it also means that we get reduced training times. As an example of how this happens, we have a diagram of a recurrent neural network at the bottom of the slide: when we're feeding input into a recurrent neural network, we're basically reading the input one word at a time. Because transformer models don't need to follow this sort of pattern, we get reduced training times. This led to the development of a really large number of pre-trained language models based on the transformer architecture. So maybe you've heard of models such as BERT, which stands for Bidirectional Encoder Representations from Transformers, GPT, which stands for Generative Pre-trained Transformer, T5, RoBERTa, ELECTRA, ALBERT, ERNIE, and the list can go on and on.

There are a lot of really high-performance models based on the transformer architecture. The reason why these models perform really well on a variety of tasks, such as named entity recognition, question answering, summarization, or machine translation, is because they are pre-trained on huge amounts of text.

So they learn to capture complexities in natural language from the training corpus really well. However, the really cool thing is that if you want to use one of these models, you don't have to train them from scratch every time; you can use them by fine-tuning them on your own particular task. This is the idea behind the concept of transfer learning. Fine-tuning means that you're basically taking the weights of a model that has already been pre-trained and using these weights as initialization for a new model. For instance, we have the BERT model, which has been pre-trained on the BooksCorpus and English Wikipedia data sets, and we can take this model architecture and fine-tune it on our own particular task, whether that is named entity recognition, question answering, or sentence classification.

All right. So hopefully that gave us a little bit of an understanding of transformer models and how they are being used in natural language processing. The next thing I want to go over is building a transformer-based NLP model end to end. As mentioned, we will be talking about the training data set, model development, and model evaluation. For this part of the talk, we will focus on an example of building a named entity recognition model that mines different types of entities from biomedical text.

We built this model at the Chan Zuckerberg Initiative. Our task with this model was to extract mentions of data sets and experimental methods from biomedical research articles. So essentially, imagine we have a piece of text that happens to come from a biomedical research article.

We have an example of such a paragraph on this slide. Our task here is to be able to extract data sets and experimental methods from this piece of text. Data sets can look like identifiers in a database or repository, basically the GSE mentions that we see here on the slide, and experimental methods can look like the sequencing method found in this paragraph. How we end up extracting these mentions of data sets and experimental methods from biomedical research articles is through a named entity recognition model that basically learns to recognize these entity types from text. If you're not familiar with the named entity recognition task, it aims to locate and classify named entities in unstructured text into predefined categories. In our case, our predefined categories are data sets and experimental methods, and we want to be able to locate and classify all of the named entities in this piece of text that fall into either one of these two categories.
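
As a rough illustration of what such a model does once trained, here is a hedged sketch using the Hugging Face token-classification pipeline; the checkpoint path is a placeholder for whatever fine-tuned model you have, not a real published model name.

    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="path/to/your-finetuned-ner-model",   # placeholder for a fine-tuned checkpoint
        aggregation_strategy="simple",              # merge sub-word pieces into entity spans
    )

    text = "Raw reads are available under GSE12345 and were generated with RNA-seq."
    for entity in ner(text):
        print(entity["entity_group"], entity["word"], round(entity["score"], 3))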

So the first thing I wanted to talk about in this process is building the training data set. The first step we took in building our training data set was to ask ourselves what gets included, and I do want to emphasize the importance of having entity definitions. For us, it was very important to know what exactly we mean by datasets and experimental methods, and what exactly gets included in each of these predefined categories. Definitions are particularly important for specialized domains or tasks where the boundaries are not very clear. For us, we wanted to be able to identify experimental methods, but it wasn't very clear if we should include things like software, equipment, regions, or other types of entities. Luckily, we do have a team of biomedical experts at CZI that we can collaborate with and that can give us domain expertise, so we collaborated with them to come up with these definitions. Once we have good definitions for what we want to include in our data set, the next step is actually creating the data set. Here we usually have two main options. One, we can use an openly available data set, and there are lots of data sets that are openly available; there are more than 5,000 data sets available just through the Hugging Face API, and these are easy to load and use with just a few lines of code. The more general your task is, the more likely it is that you will be able to find an openly available data set for your particular task. The other option is to create a data set from scratch.

So the more niche or specialized your domain or task is, the more likely you are to have to create your data set from scratch. In our case, we not only dealt with a specialized domain like the biomedical field, but we also had a very specialized task of identifying particular mentions of data sets and experimental methods. Creating a data set from scratch comes with its own set of challenges, and it usually involves some type of curation. Maybe you have product or user data and you want to be able to build a model that learns from it; maybe you have things like user logs or user interaction data, which is likely to be very noisy, and it is very likely you'll have to clean up this data in order to build a high-quality training data set for a machine learning model. Or maybe you don't have product or user data and you just want to build your data set from scratch. In our case, we do have the bio-curation team that I mentioned, which we can work with to create these data sets from scratch. If you don't have access to that type of resource, there are other resources such as Mechanical Turk, which uses crowdsourcing, and there are also curation companies that specialize in building high-quality training data sets.

Some of them also have specialized expertise; we've been using some of them for the biomedical field as well. Creating a data set from scratch, again, comes with its own set of challenges, because it is time-consuming, can be expensive, and might involve specialized domain knowledge. In our case, we used a combination of both, starting from openly available annotations. There is a platform for the biomedical field called Europe PMC that provides openly available annotations from biomedical research articles. However, their annotations did not exactly fit our own definitions, so we took this openly available data and further curated it with the help of our bio-curation team; for us, that meant excluding terms that did not fit our own definitions. Cool. So we've talked about the training data set, about the importance of having good definitions, and about the options of using an openly available data set versus curating one from scratch. The next part of our talk today will focus on model development, going back to this notion of transformer models and transfer learning, where you can use one of these pre-trained language models on your own particular task by fine-tuning them, without having to train from scratch.

We define fine-tuning again as taking the weights of a trained model and using them as initialization for a new model. What does this look like in practice? In practice, we took a pre-trained model such as BERT; this is the blue module that you see on this diagram. We initialize the model architecture with the frozen weights from this pre-trained language model, and then we add a linear classification layer on top. This is essentially the fine-tuning part; this is what gets trained. In our case, the linear classification layer was in charge of classifying each of the tokens as being part of either a method or a data set. I do want to talk a little bit about specialized domains and mention the fact that specialized domains have different word distributions than general domains. That's because specialized domains usually contain specialized terms; for the biomedical field, these could be genes, cell lines, or diseases.

There are other specialized domains, such as finance or law, that have the same types of challenges. The idea here is that representations learned by models from general-domain text might not be able to capture all of these complexities of specialized domains. The solution then is to use models that are pre-trained on your own specialized domain, because these tend to give better performance on your specialized task. So for the biomedical field, we have models such as BioBERT, SciBERT, and PubMedBERT, and these are essentially models that have the same model architecture as BERT; the only difference between them and BERT is the corpora they have been trained on. Some of these models have been trained on biomedical research articles or scientific articles. There are similar models for other types of domains, such as FinBERT for the financial field and legal-domain BERT variants for law, and I'm sure there are others for other domains. Cool. So we went over the training data set and talked about definitions and about using an openly available data set versus curating from scratch. We touched on model development and the idea of transfer learning, fine-tuning versus pre-training a model, and some of the challenges of specialized corpora.

The last thing I want to talk about is model evaluation, and in particular quantitative versus qualitative evaluation. The first thing you do when evaluating a machine learning model is look at quantitative measures such as precision, recall, F1 score, accuracy, or BLEU score. Whatever your task is, there's usually some widely accepted quantitative metric that you can use to evaluate your machine learning model.

One very important thing that I do want to advocate for is human evaluation, just sanity-checking the output. More often than not, you can get really high metrics, such as 95% accuracy or 99% F1 score, and yet when actually looking at the output of the model, it doesn't make sense, or it's not exactly what we would expect. Without sanity-checking the output, we would not be able to catch that and go back and fix it. You don't really need to have domain expertise for that; however, sometimes domain expertise is helpful to have. For us, again, we have this team of biomedical curators that we are able to consult at different stages of model development. More often than not, we take the output of some of our trained machine learning models and put it in front of them, and they are able to give us feedback on where the model is failing, what the model is not picking up, and where the model could be improved. This feedback is very valuable for us because we can then go back and incorporate it into model development. So this is all I wanted to share with you today. As a recap of what we went over: we talked about transformer models for natural language processing.

We then talked about the process of building a transformer-based NLP model end to end. We talked about the training data set and the importance of having good definitions of what to include and what not to include, and about using an openly available data set versus curating one from scratch. We then touched on model development and talked about transfer learning, the idea of fine-tuning versus pre-training a model, and some of the challenges of specialized corpora. We ended by looking at model evaluation and talked about quantitative versus qualitative evaluation.

Thank you so much for listening. If you want to check us out, please check out the Chan Zuckerberg Initiative or CZI handles, and also check out the Science initiative within CZI. Thank you so much for being here. I hope you have a great rest of the conference and a great rest of the day. Thank you.