Using transformer models for your own NLP task - building an NLP model End To End by Ana-Maria Istrate
Introduction to Transformer Models for Natural Language Processing and Building an NLP Model
Hello everyone, my name is Ana-Maria Istrate, and I am a senior research scientist at the Chan Zuckerberg Initiative. I am thrilled to dive into the exciting world of transformer models for natural language processing (NLP) and demonstrate the process of building an end-to-end NLP model. My interests revolve around NLP and its applications in the scientific domain, specifically text mining of biomedical literature, predicting the impact of research outputs, knowledge graphs, and graph-based models.
Why are Transformer Models Crucial in NLP Tasks?
Transformer models have gained popularity mainly in natural language processing and computer vision. They are deep learning models grounded in the concept of self-attention, which computes the representation of a given word by weighing how relevant every other word in the input sequence is to it.
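As a rough illustration of this idea, here is a minimal sketch of scaled dot-product self-attention in PyTorch (a toy, single-head version with random, unlearned projection matrices; real transformer layers learn these projections and use multiple heads and masking):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of word vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project into queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)               # how much each word attends to every other word
    return weights @ v                                # context-aware representation of each word

# A toy "sentence" of 6 words, each represented by a 16-dimensional vector.
x = torch.randn(6, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)         # torch.Size([6, 16])
```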
Transformer models were first introduced in 2017 and quickly became state of the art for NLP tasks. They stand out from Recurrent Neural Networks (RNNs), the preceding state-of-the-art NLP models, because they process an input sequence all at once rather than one word at a time, which allows for more parallelization and reduces training times.
Popular models like BERT, GPT, T5, RoBERTa, ELECTRA, ALBERT, and ERNIE, among others, have been developed based on the transformer architecture, demonstrating high performance on a variety of tasks such as named entity recognition, question answering, summarization, and machine translation. A crucial element behind their success is the application of transfer learning - fine-tuning pre-trained models to suit the task at hand.
Constructing a Transformer-Based NLP Model, End-To-End
In creating our transformer-based NLP model at the Chan Zuckerberg Initiative, our objective was to extract mentions of datasets and experimental methods appearing in biomedical research articles using a technique called Named Entity Recognition (NER) - a sketch of what NER annotations look like follows the list below. Our demonstration of building this model revolves around three main stages:
- Creating the training data set
- Initiating the model development
- Performing model evaluation
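As referenced above, here is a minimal sketch of what a single NER training example can look like under the common BIO tagging scheme (the DATASET and METHOD label names and the example sentence are invented for illustration, not taken from the actual CZI corpus):

```python
# Each token is paired with a BIO tag: B- marks the beginning of an entity,
# I- marks its continuation, and O marks tokens outside any entity.
tokens = ["Raw", "reads", "were", "deposited", "under", "GSE12345", "after", "single-cell", "RNA-seq", "."]
tags   = ["O",   "O",     "O",    "O",         "O",     "B-DATASET", "O",    "B-METHOD",    "I-METHOD", "O"]

# A training example is simply the aligned (token, tag) pairs.
print(list(zip(tokens, tags)))
```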
1. Creating the Training Data Set
Constructing our training dataset began with the question of what to include. We outlined clear definitions of our terminology, particularly of datasets and experimental methods, and these definitions went on to shape the dataset. This step is crucial, especially in specialized domains or tasks where boundaries are not clearly defined.
When creating the dataset, two primary options were available to us - using an openly available dataset, or creating one from scratch. Because our task was highly specialized, we ended up combining the two: we started from openly available annotations and further curated them, with assistance from our bio-curation team, to fit our definitions - an approach that is time-consuming and potentially expensive, but one that suited our needs.
2. Initiating the Model Development
At this stage, we returned to transformer models and the role of transfer learning, which involves taking the weights of a pre-trained model, using them as the initialization for a new model, and fine-tuning that model on the task at hand. For specialized domains like the biomedical field, we highlighted the importance of using models that are pre-trained on text from that domain.
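To make this fine-tuning step concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries (the SciBERT checkpoint, the label scheme, and the two toy training examples are illustrative assumptions, not the actual CZI setup):

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Placeholder label scheme for dataset / experimental-method mentions.
labels = ["O", "B-DATASET", "I-DATASET", "B-METHOD", "I-METHOD"]

checkpoint = "allenai/scibert_scivocab_uncased"  # a science-domain pre-trained encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

def encode(tokens, tags):
    """Tokenize a pre-split sentence and align word-level tags to subwords."""
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    word_ids = enc.word_ids()
    # Label only the first subword of each word; ignore special tokens and continuations.
    enc["labels"] = [
        -100 if w is None or (i > 0 and word_ids[i - 1] == w) else tags[w]
        for i, w in enumerate(word_ids)
    ]
    return dict(enc)

# Two toy examples; a real run would use the curated training corpus instead.
examples = [
    (["Data", "are", "deposited", "under", "GSE12345", "."], [0, 0, 0, 0, 1, 0]),
    (["We", "profiled", "cells", "with", "RNA-seq", "."], [0, 0, 0, 0, 3, 0]),
]
train_dataset = Dataset.from_list([encode(t, y) for t, y in examples])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-demo", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The pre-trained encoder weights serve as the initialization; training then adjusts them (and the new classification head) on the task-specific data.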
3. Performing Model Evaluation - Quantitative versus Qualitative Evaluation
The evaluation stage for our NLP model saw us delve into both quantitative measures, including precision, recall, F1 score, and accuracy, and human evaluation, a crucial ‘sanity check’. Collaborating with our biomedical curators provided vital feedback that significantly improved our model’s performance.
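As a quick illustration of the quantitative side, here is a minimal sketch that computes token-level precision, recall, F1, and accuracy with scikit-learn (the toy gold and predicted labels are invented for illustration):

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Flattened gold and predicted token labels for a handful of tokens.
y_true = ["O", "B-DATASET", "I-DATASET", "O", "B-METHOD", "O"]
y_pred = ["O", "B-DATASET", "O",         "O", "B-METHOD", "O"]

# Micro-averaged precision / recall / F1 over the entity labels (ignoring "O").
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", labels=[l for l in set(y_true) if l != "O"]
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("accuracy:", accuracy_score(y_true, y_pred))
```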
Conclusion
Our dissection of transformer models, together with the procedure of constructing a transformer-based NLP model from the ground up, highlighted the critical role of understanding and addressing the specific needs of the task at hand. This involved defining the task in detail, choosing the appropriate dataset, understanding model development, and incorporating comprehensive evaluation methods.
To learn more about our work at the Chan Zuckerberg Initiative, feel free to check out our initiatives and have an enlightening exploration. Thank you for your time!
Video Transcription
All right. Hi everyone, and thank you so much for being here. I am really excited to talk to you today about transformer models for natural language processing and building an NLP model end to end. First of all, hi, I am really happy to be here today. My name is Ana-Maria Istrate, or Ana. I am a senior research scientist at the Chan Zuckerberg Initiative, and my interests lie at the intersection of NLP and applications of NLP in the scientific domain, in particular text mining biomedical literature, predicting the impact of research outputs, knowledge graphs, and graph-based models.
This is the outline of what we will be covering today. I would like to share with you a little bit of background information about transformer models, in particular how they relate to natural language processing tasks, and then we will go into the process of building a transformer-based NLP model end to end. We will talk about the training dataset and the importance of having good definitions of what we include in the training dataset, and we will touch on the options of choosing an openly available dataset versus curating one from scratch. We will then move into model development and talk about transfer learning - the idea of fine-tuning a model versus pre-training from scratch.
And we will cover some of the challenges that come with specialized corpora. We will end with model evaluation and talk about quantitative and qualitative evaluation. For the second part of the talk, we will use the example of building a named entity recognition model to mine different types of entities from biomedical text. But if you're not interested in the biomedical domain, or you feel you don't have domain knowledge of the biomedical field, that's totally fine - it's not very relevant to our talk today, and you'll still be able to follow along. Awesome. So with that in mind, I am excited to get started and talk about transformer models in natural language processing. Transformer models have become really popular, mainly in natural language processing and computer vision, and they are essentially deep learning models that are based on this idea of self-attention. Self-attention is a concept that basically helps us understand the representation of the current word given its context in the input sequence.
So for instance, we have this example of a sentence on the slide: "the animal didn't cross the street because it was too wide". If we want to compute a representation for the word "it", as humans we might know that "it" refers to the street and not to the animal. However, we need to instill this sort of information in machine learning models as well, and this is exactly what self-attention aims to do. Basically, when computing a representation for a particular word, we want to know what other words in the sequence we should attend to - what other words are relevant to our current representation. The notion of self-attention is at the core of the transformer architecture, which was first introduced in 2017. Once introduced, transformer-based natural language processing models pretty much became state of the art in natural language processing.
And before that, the state of the art for NLP tasks were recurrent neural networks, or RNNs. The difference between RNNs and transformer models is that in RNNs the input gets processed sequentially, one word at a time, whereas in transformer architectures we are able to process the input all at once. This means that transformers allow for more parallelization than recurrent neural networks, and it also means that we get reduced training times. As an example of how this happens, we have this diagram of a recurrent neural network at the bottom: when we're feeding input into a recurrent neural network, we're basically reading the input one word at a time. Because transformer models don't need to follow this sort of pattern, we get reduced training times. This led to the development of a really large number of pre-trained language models based on the transformer architecture. So maybe you've heard of things such as BERT, which stands for Bidirectional Encoder Representations from Transformers, GPT, which stands for Generative Pre-trained Transformer, T5, RoBERTa, ELECTRA, ALBERT, ERNIE - and the list can go on and on.
There are a lot of really high-performance models based on the transformer architecture, and the reason why these models perform really well on a variety of tasks such as named entity recognition, question answering, summarization, or machine translation is that they are pre-trained on huge amounts of text.
So they learn to capture complexities in natural language from the training corpus really well. The really cool thing is that if you want to use one of these models, you don't have to train them from scratch every time - you can use them by fine-tuning them on your own particular task. This is the concept behind transfer learning: fine-tuning means you're basically taking the weights of a model that has already been pre-trained and using those weights as initialization for a new model. For instance, we have the BERT model, which has been pre-trained on the BooksCorpus and English Wikipedia datasets, and we can take this model architecture and fine-tune it on our own particular task, whether this is named entity recognition, question answering, or sentence classification.
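As a rough illustration of this idea, here is a minimal sketch using the Hugging Face transformers library, where the same pre-trained checkpoint initializes different task-specific heads (the checkpoint name and label counts are placeholders, not the ones used in the talk):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoModelForSequenceClassification,
    AutoModelForQuestionAnswering,
)

# The same pre-trained checkpoint (here, BERT trained on BooksCorpus + English
# Wikipedia) can initialize the encoder for several downstream tasks.
checkpoint = "bert-base-uncased"

# Named entity recognition: a token-level classification head.
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

# Sentence classification: a sequence-level classification head.
cls_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Question answering: a span-prediction head.
qa_model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
```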
All right. So hopefully that gave us a little bit of an understanding of transformer models and how they are being used in natural language processing. The next thing I want to go over is building a transformer-based NLP model end to end. As mentioned, we will be talking about the training dataset, model development, and model evaluation, and for this part of the talk we will focus on the example of building a named entity recognition model that mines different types of entities from biomedical text.
We built this model at the Chan Zuckerberg Initiative. Our task with this model was to extract mentions of datasets and experimental methods from biomedical research articles. So essentially, we have a piece of text that happens to come from a biomedical research article.
And we have an example of such a paragraph on this slide. Our task here is to be able to extract datasets and experimental methods from this piece of text. Datasets can look like identifiers in a database or repository - basically these GSE mentions that we see here on the slide - and experimental methods can look like the method found in this paragraph, gipsy, which is a sequencing method. The way we end up extracting these mentions of datasets and experimental methods from biomedical research articles is through a named entity recognition model that basically learns to recognize these entity types from text. If you're not familiar with the named entity recognition task, it aims to locate and classify named entities in unstructured text into predefined categories. In our case, our predefined categories are datasets and experimental methods, and we want to be able to locate and classify all of the named entities in this piece of text that fall into either one of these two categories. So the first thing I wanted to talk about in this process is building the training dataset. The first step we took in building our training dataset was asking ourselves what gets included, and I do want to emphasize the importance of having entity definitions - for us, it was very important to know what exactly we mean by datasets and experimental methods.
What exactly gets included in each of these predefined categories? Definitions are particularly important, especially for specialized domains or tasks where boundaries are not very clear. For us, we wanted to be able to identify experimental methods, but it wasn't very clear whether we should include things like software, equipment, regions, or other types of entities. Luckily, we do have a team of biomedical experts at CZI that we can collaborate with and that can give us domain expertise, so we collaborated with them to come up with these definitions. Once we have good definitions for what we want to include in our dataset, the next step is actually creating the dataset, and here we usually have two main options. One, we can use an openly available dataset - and there are lots of datasets that are openly available; there are more than 5,000 datasets available through the Hugging Face API alone, and these are easy to load and use with just a few lines of code. The more general your task is, the more likely it is that you will be able to find an openly available dataset for your particular task. The other option is to create a dataset from scratch.
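As a quick aside on that first option, loading an openly available dataset really does take just a few lines of code; here is a minimal sketch using the Hugging Face datasets library (the CoNLL-2003 dataset is just a well-known public NER example, not one used in the talk):

```python
from datasets import load_dataset

# Load an openly available NER dataset from the Hugging Face Hub.
dataset = load_dataset("conll2003")

print(dataset)              # train / validation / test splits
print(dataset["train"][0])  # one example: tokens plus NER tags
```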
So the more niche or specialized your domain or task is, the more likely you are to have to create your dataset from scratch. In our case, we not only dealt with a specialized domain like the biomedical field, but we also had a very specialized task of identifying particular mentions of datasets and experimental methods. Creating a dataset from scratch comes with its own set of challenges, and it usually involves some type of curation. Maybe you have product or user data and you want to be able to build a model that learns from it - maybe you have things like user logs or user interaction data, which is likely to be very noisy, and it is very likely you'll have to clean up this data in order to build a high-quality training dataset for a machine learning model. Or maybe you don't have product or user data and you just want to build your dataset from scratch. In our case, we have the bio-curation team that I mentioned, which we can work with to create these datasets from scratch. If you don't have access to that type of resource, there are other options, such as Mechanical Turk, which uses crowdsourcing, and there are also curation companies that specialize in building high-quality training datasets.
And some of them also have specialized expertise - we've been using some of them for the biomedical field as well. Creating a dataset from scratch, again, comes with its own set of challenges, because it is time-consuming, can be expensive, and might involve specialized domain knowledge. In our case, we used a combination of both: we used openly available annotations - there is a platform for the biomedical field called Europe PMC that provides openly available annotations from biomedical research articles - however, their annotations did not exactly fit our own definitions, so we took this openly available data and further curated it with the help of our bio-curation team. For us, that meant excluding terms that did not fit our own definitions. Cool. So we've talked about the training dataset, the importance of having good definitions, and the options of using an openly available dataset versus curating one from scratch. The next part of our talk today will focus on model development. So, going back to this notion of transformer models and transfer learning, where you can use one of these pre-trained language models on your own particular task by fine-tuning them, without having to train from scratch.
We define fine-tuning, again, as taking the weights of a trained model and using them as initialization for a new model. What does this look like in practice? In practice, we took a pre-trained model such as BERT - this is the blue module that you see on this diagram. We initialize the model architecture from a pre-trained language model with the frozen weights from that model, and then we add a linear classification layer on top. This is essentially the fine-tuning part - this is what gets trained. In our case, the linear classification layer was in charge of classifying each of the tokens as being part of either a method or a dataset. I do want to talk a little bit about specialized domains and mention the fact that specialized domains have different word distributions than general domains. That's because specialized domains usually contain specialized terms - for the biomedical field, this could be genes, cell lines, or diseases.
And there are other specialized domains, such as finance or law, that have the same types of challenges. The idea here is that representations learned by models from general-domain text might not be able to capture all of these complexities for specialized domains. The solution, then, is to use models that are pre-trained on your own specialized domain, because these tend to give better performance on your specialized task. For the biomedical field, we have models such as BioBERT, SciBERT, and PubMedBERT, and these are essentially models that have the same architecture as BERT - the only difference between them and BERT is the corpora they have been trained on. Some of these models have been trained on biomedical research articles or scientific articles, and there are similar models for other types of domains, such as FinBERT for the financial field and Legal-BERT for law, and I'm sure there are others for other domains. Cool. So we went over the training dataset and talked about definitions and using an openly available dataset versus curating from scratch. We touched on model development and the idea of transfer learning, fine-tuning versus pre-training a model, and some of the challenges of specialized corpora.
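To make the "frozen pre-trained encoder plus linear classification layer" idea from above concrete, here is a minimal PyTorch sketch (the PubMedBERT checkpoint name, the label count, and the example sentence are illustrative assumptions, not the exact configuration used at CZI):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A domain-specific pre-trained encoder; swapping the checkpoint name is the
# only change needed to move from general-domain BERT to a biomedical model.
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

# Freeze the pre-trained weights; only the new head below gets trained.
for param in encoder.parameters():
    param.requires_grad = False

# Token-level labels: O, B-DATASET, I-DATASET, B-METHOD, I-METHOD (illustrative).
num_labels = 5
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)

# Forward pass: encode the tokens, then classify each token representation.
inputs = tokenizer("Raw reads are available under GSE12345 .", return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state   # (batch, seq_len, hidden)
logits = classifier(hidden_states)                     # (batch, seq_len, num_labels)
print(logits.shape)
```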
The last thing I want to talk about is model evaluation, and in particular quantitative versus qualitative evaluation. So the first thing you do when evaluating a machine learning model is look at quantitative measures such as precision, recall, F1 score, accuracy, or BLEU score - whatever your task is, there's usually some widely accepted quantitative metric that you can use to evaluate your machine learning model.
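For named entity recognition specifically, these metrics are usually computed at the entity level rather than the token level; here is a minimal sketch using the seqeval library (the label names and toy predictions are invented for illustration):

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted tag sequences for two toy sentences.
y_true = [["O", "B-DATASET", "O"], ["B-METHOD", "I-METHOD", "O", "O"]]
y_pred = [["O", "B-DATASET", "O"], ["B-METHOD", "O",        "O", "O"]]

# Entity-level precision, recall, and F1 per class.
print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))
```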
One very important thing that I do want to advocate for is human evaluation - just sanity checking the output. More often than not, you can get really high metrics, such as 95% accuracy or a 99% F1 score, and when you actually look at the output of the model, it doesn't make sense, or maybe it's not exactly what we would expect. Without sanity checking the output, we would not be able to catch that and go back and fix it. You don't really need to have domain expertise for that; however, sometimes domain expertise is helpful to have. For us, again, we have this team of biomedical curators that we are able to consult at different stages of model development. More often than not, we take the output of some of our trained machine learning models and we put it in front of them, and they are able to give us feedback on where the model is failing, what the model is not picking up, and where the model could be improved. This feedback is very valuable for us, because we can then go back and incorporate it into model development. So this is all I wanted to share with you today. As a recap of what we went over: we talked about transformer models for natural language processing.
We then talked about the process of building a transformer-based NLP model end to end. We talked about the training dataset and the importance of having good definitions of what to include and what not to include. We talked about using an openly available dataset or curating one from scratch. Then we touched on model development and talked about transfer learning, the idea of fine-tuning versus pre-training a model, and some of the challenges of specialized corpora. We ended by looking at model evaluation and talked about quantitative versus qualitative evaluation.
Thank you so much for listening. If you want to check us out, please check out the Chan Zuckerberg Initiative or CZI handles, and also check out the Science initiative within CZI. Thank you so much for being here. I hope you have a great rest of the conference and a great rest of the day. Thank you.