Demonstrating the Machine Learning Cycle
Modernizing Financial Services with Arius' Data Flywheel
Hello and welcome to our blog on the role of Arius' data flywheel in revolutionizing financial services. We are thrilled to share insights into how we tackle complex problems at Arius using machine learning (ML) and data.
Introducing the Arius Team
Today, three members of our team will share the details of this exciting approach: Evan, a senior technical product manager working on core client and API platforms; Kirsten, a technical product manager focusing on document classification and data capture products; and Sylvia, a senior machine learning engineer.
Resolving Financial Services' Challenges with Arius
The lending industry’s process is heavily manual, especially documentation. Invariably, this method is both time-consuming and susceptible to errors. We asked, "How can we streamline this while maintaining high accuracy levels?". Enter Arius, a fintech infrastructure firm that processes financial documents and outputs structured data with over 99% accuracy. This advancement facilitates quality decisions for financial services companies without compromising efficiency.
Our Contribution with PPP
One of our most significant impacts was on the Paycheck Protection Program (PPP). We helped save over 5 million jobs by processing 8.7 million PPP supporting documents, which ultimately enabled over 3.6 million small businesses to receive loans.
The Power of a Data Flywheel
At Arius, our product revolves around a data flywheel concept. The data flywheel is the principle that the more documents we process, the more data we gather, allowing us to build better machine learning models and products. Our human verification activity enhances these models, leading to more users, more documents, and ultimately more data, creating a continuous improvement loop. We have generated 750 million data labels by verifying over 80 million financial documents with our team of 700 data verification employees; all of this feeds back into the data flywheel, continually enhancing our products.
Machine Learning at Arius
Machine learning is central to automating our document handling, which is essential for processing millions of documents. Sylvia explains how we apply machine learning to data entry and document classification to keep operations at Arius efficient.
Value of Ensemble Models
Because we have the advantage of both image and text data, we construct ensemble models that combine multiple machine learning models. This technique improves overall accuracy in classifying documents, verifying that the correct documents were uploaded, confirming document completeness, and extracting crucial information.
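To make the idea concrete, here is a minimal sketch of probability-level ensembling in Python. The class probabilities, weights, and function name are illustrative rather than Arius' production code:

```python
import numpy as np

def ensemble_predict(image_probs: np.ndarray, text_probs: np.ndarray,
                     image_weight: float = 0.5) -> int:
    """Combine class probabilities from an image model and a text model.

    image_probs / text_probs: arrays of shape (n_classes,) that sum to 1.
    A weighted average is one of the simplest ensembling strategies;
    the weight would be tuned on a validation set.
    """
    combined = image_weight * image_probs + (1 - image_weight) * text_probs
    return int(np.argmax(combined))

# Hypothetical outputs for a 3-class document classifier
# (e.g., bank statement, pay stub, tax form):
image_probs = np.array([0.70, 0.20, 0.10])  # image model alone
text_probs = np.array([0.40, 0.50, 0.10])   # text model alone
print(ensemble_predict(image_probs, text_probs))  # -> 0 (bank statement)
```

Either model on its own leans toward a different answer here; averaging their probabilities lets the stronger signal win, which is the intuition behind combining image and text models.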
Blending Machine Learning with Human Verification
But the story does not end with machine learning. Kirsten explains how an interactive and streamlined interface between humans and machine learning models leads to optimal performance from both sides. This approach makes error spotting and correction not only easy but also speedy.
Validating Accuracy
The next challenge is measuring our system's accuracy, especially when a machine learning model is in the loop. By processing the same document through our system repeatedly, comparing the extracted data labels against one another, and then comparing them against an expert opinion, we can confidently assess the system's accuracy.
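A toy sketch of that measurement, assuming each pass over a document yields a dictionary of extracted fields (all field names and values here are hypothetical):

```python
def field_accuracy(runs: list, ground_truth: dict) -> float:
    """Estimate per-field accuracy by replaying the same document
    several times and comparing each extracted field to an expert label.

    runs: one dict of {field_name: extracted_value} per pass.
    ground_truth: the expert-verified values for the same fields.
    """
    correct = total = 0
    for run in runs:
        for field, expected in ground_truth.items():
            total += 1
            correct += (run.get(field) == expected)
    return correct / total if total else 0.0

runs = [
    {"account_number": "12345678", "begin_date": "2021-01-01"},
    {"account_number": "12345678", "begin_date": "2021-01-01"},
    {"account_number": "12345878", "begin_date": "2021-01-01"},  # one miss
]
truth = {"account_number": "12345678", "begin_date": "2021-01-01"}
print(f"{field_accuracy(runs, truth):.1%}")  # -> 83.3%
```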
Introducing Automation
The ultimate result of high accuracy and confidence is the progressive introduction of automation for document handling. Our system efficiently identifies which document pages can be processed solely by machine learning models and which still require human verification. The ability to automate aspects of our process and simultaneously maintain high accuracy is truly transformative for our product.
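As a rough illustration, a confidence-threshold router might look like the sketch below; the threshold and page data are placeholders, and a production system would calibrate them per document type:

```python
def route_pages(pages, confidence_threshold=0.98):
    """Split pages into fully automated vs. human-verified queues.

    pages: list of (page_id, model_confidence) pairs. The threshold
    here is a placeholder; in practice it would be calibrated against
    historical accuracy for each document type.
    """
    automated, needs_human = [], []
    for page_id, confidence in pages:
        if confidence >= confidence_threshold:
            automated.append(page_id)
        else:
            needs_human.append(page_id)
    return automated, needs_human

pages = [("p1", 0.999), ("p2", 0.995), ("p3", 0.91), ("p4", 0.999)]
auto, manual = route_pages(pages)
print(auto)    # ['p1', 'p2', 'p4'] -> skip human review
print(manual)  # ['p3']             -> send to a verifier
```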
The Impact of Machine Learning and Automation
Finally, we can improve the efficiency of our processes by embracing the data flywheel concept and machine learning. We optimize human-machine interaction, continually improve our models, enhance our user interfaces, and improve customer interactions with our API.
Conclusion
The financial services sector is witnessing a radical transformation through companies like Arius providing technological solutions to age-old problems. Our data flywheel approach facilitates continuous improvement, helping us stay ahead in the evolving fintech landscape. The blend of machine learning with human verification ensures high efficiency, accuracy, and speed. Ultimately, our technology infrastructure is bridging the gap in loan underwriting efficiency, revolutionizing financial services.
Video Transcription
Hi, everyone and welcome to our presentation on how Arius uses a data flywheel to modernize financial services. We're really grateful for the chance to be here today and to share some of the problems that we're solving at Arius and how machine learning and data play a critical role.
Here's our agenda for today. We'll first introduce ourselves and then move into an overview of the problems facing financial services and how Arius fits in. After that, we'll give an overview of machine learning at Arius and how we enable humans and machines to work together to build a better product. For introductions, my name is Evan, and I'm joined by my teammates Kirsten and Sylvia today. I'm a senior technical product manager at Arius, building out our core client and API platforms.
Hi, everyone. I'm Kirsten. I'm a technical product manager here, and I work on our document classification and pay stub data capture products.
Hi, everyone. I'm Sylvia, a senior machine learning engineer at Arius.
Thanks, Kirsten and Sylvia. So to jump right in: the lending industry is full of manual work, especially when it comes to processing documents. Loan applications are made up of multiple documents that come from several different sources, and different areas of lending require different sets of documents that are specific to the borrowers or the lending products. So for example, you'll need to give your lender different documents depending on whether you're applying for a mortgage or a small business loan.
But regardless of the type of lending, lenders need to review these documents throughout several steps of their process, and there are lots of inefficiencies here. There's a lot of back and forth between the lender and the applicant over email to get all the documents in one place.
Lenders are then manually checking whether the applications have all the required documents, in case something was incorrectly uploaded by accident. From there, they're looking at the data on these documents and typing each data point into their internal systems.
And finally, any kind of analytics or data verification is done with back-of-the-napkin calculations or by cross-checking values from one PDF to another. So overall, it's safe to say that the process is time consuming and error prone, and it comes down to the fact that these documents are fundamentally paper based, while lenders have moved their workflows and processes online. So here's our problem statement: how can we make document processing more efficient while maintaining high levels of accuracy? That's where Arius comes in. Arius is a fintech infrastructure company that takes in financial documents and returns structured data on what these documents are, whether it's a bank statement, a pay stub, a tax form, et cetera, and also what data is on these documents, with over 99% accuracy. This allows financial services companies to make high-quality decisions with trusted data without compromising on efficiency. We've been able to cut down the time it takes to underwrite loans and process applications for clients like Bluevine, Brex, Square, SoFi, and PayPal. One area of impact that we're really proud of is our work with the Paycheck Protection Program, or PPP. PPP is a program that was launched last year in the US by the Small Business Administration to offer loans to small businesses to help keep their workers on payroll during the COVID-19 pandemic. Here are some really cool stats that I can share with everyone today.
Over 3.6 million small businesses received PPP loans that were processed by Arius lenders. This means we helped save more than 5 million estimated jobs through these loans by processing over 8.7 million PPP supporting documents. Now, how did we do that? Our product is based on the idea of a data flywheel. You might be familiar with what a regular flywheel is: essentially a really heavy wheel that takes a lot of force to get it to start spinning. And just as the flywheel needs a lot of force to start it off, it also needs a lot of force to make it stop. As a result, when it's spinning at high speed, it tends to keep spinning because it has a ton of momentum pushing it forward. The data flywheel is similar. It's the idea that the more documents you process, the more data you have, so you can build better machine learning models and, with the help of human verification processes, ultimately a better product that will lead you to more users, more documents, more data, and the cycle continues. And if you don't know what machine learning is, Sylvia will go into this in a bit more detail in just a minute.
At Arius, we have over 700 data verification employees who, along with our machine learning models, have verified over 80 million financial documents to generate 750 million data labels, all of which go back through our data flywheel for a continuous improvement loop. Throughout this presentation, we'll be talking about each component or step of the data flywheel. To kick off, this first step is pretty straightforward: the more users we have, the more data we have. And now I'll hand it over to Sylvia to talk about how we can use that data to build machine learning models.
Thanks, Evan. So now that we have a sense of what Arius does, I'm going to be talking specifically about machine learning at Arius. Machine learning is an essential step in automating document processing for millions of documents. But before we talk about machine learning at Arius, I wanted to start by reviewing what machine learning is. Machine learning refers to algorithms that use lots of data to make predictions. In the last few years, machine learning has become a big part of all of our lives: it's used to predict what shows we'll like, what we should buy, what words we'll type next, and what our Google searches should return. The recent boom in machine learning comes from a boom in data, big data. As technology has advanced, the amount of data we're able to store has increased exponentially; this graph shows the total amount of data in the world over the last 10 years. As we'll discuss, big data is needed for most machine learning algorithms to work well. Meanwhile, without machine learning, a lot of data isn't necessarily very valuable. Arius is like a dream come true for a machine learning engineer, because we have a lot of very interesting problems to solve and an ever-growing amount of high-quality, hand-labeled data. At Arius, we use machine learning for many different types of problems.
But two of the problems mentioned are particularly well suited to it. The first is that lenders often upload incorrect documents, and the second is that data entry is error prone. The first problem, identifying document type, is what's known as a classification problem: given a few thousand pictures of cats and dogs, a model can learn to classify an unseen image as a cat or a dog. Image models use image data to classify images, and likewise, text models can use text data to classify text. An example you might be familiar with is email, which is often classified into spam or not spam.
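For readers who want to see a classifier in miniature, here is a toy text classifier in the spirit of the spam example, built with scikit-learn; it is purely illustrative and not Arius' stack:

```python
# A toy text classifier: real document classification would use far
# more data and richer models than four example sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here",
    "meeting notes attached", "see you at lunch tomorrow",
]
labels = ["spam", "spam", "not spam", "not spam"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["claim your free offer"]))  # -> ['spam']
```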
So let's look a little closer at this image classification example. As humans, it might seem easy to distinguish a cat from a dog, but in these photos we can see how tricky the problem is: we have cats with humans, multiple cats, different backgrounds, cropping, different breeds and ages, and the same for dogs. This example illustrates why these models need a lot of data and why the quality of the data is important. Imagine if we only had images of gray cats, or if they were all sitting in the same position, or if some pictures of cats were mislabeled as dogs. For a model to learn the abstract rules of what a cat looks like, it needs to see diverse data that's representative. Lots of high-quality data is mandatory for a good model. In the previous slide, we talked about image classification and text classification. At Arius, we have a unique advantage in that we have both image data and text data. We use an algorithm called OCR, which stands for optical character recognition, that can get all of the text from an image. OCR isn't perfect, but it's a great starting point, and using our large team of over 700 verifiers, we can get perfectly clean data for our clients and for our models. Kirsten will go into more detail on our unique human-in-the-loop approach shortly.
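For readers curious what OCR looks like in practice, a minimal example using the open-source Tesseract engine via pytesseract; this is one of many OCR options, not necessarily the engine used at Arius, and the file name is hypothetical:

```python
# Minimal OCR sketch using the open-source Tesseract engine.
from PIL import Image
import pytesseract

image = Image.open("pay_stub.png")  # hypothetical scanned page
text = pytesseract.image_to_string(image)
print(text)  # raw OCR output -- imperfect, but a strong starting point
```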
So because we have image data and text data, we can build what's called an ensemble model that combines multiple machine learning models into one, allowing us to use the best model for text (one called BERT) and the best for images (EfficientNet) together, getting better accuracy than either model gives individually. With ensemble classification algorithms, we can classify documents, verify that the correct document type was uploaded, verify that all required documents have been uploaded, and handle trickier problems, like when a user uploads all of their documents in one file: we can split them up and label each of them. We can also extract key information, using machine learning to pull specific data from the page, such as an account number or begin date, so no one needs to do tedious data entry. In addition to high accuracy, our models are dynamic: we can train specific models for a specific form or a specific client, or we can leverage a diverse data set. We continually integrate new data so that the models never fall behind if a new form or template is introduced. And finally, hand labeling allows us to create feedback for our models to further improve accuracy.
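To make the key-information extraction step concrete, here is a deliberately simple sketch that pulls an account number out of OCR'd text with a regular expression; a production system would rely on learned models plus human verification rather than a single pattern:

```python
import re

# Toy extraction of an account number from OCR'd text. The statement
# text below is made up for illustration.
ocr_text = """
First National Bank   Statement Period: 01/01/2021 - 01/31/2021
Account Number: 12345678
"""

match = re.search(r"Account Number:\s*(\d+)", ocr_text)
if match:
    print(match.group(1))  # -> 12345678
```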
Now that we've gotten into some of the details of how we use machine learning at Arius, Kirsten is going to guide us through where we take it from there.

Thanks, Sylvia. That's very exciting stuff in our machine learning space. What I'll be talking about next is what we actually do once we have a machine learning model in hand, and how we optimize the way humans and computers interact to be as productive as possible. By productive, we mean as accurate and efficient as possible. One of the first questions we ask ourselves with an ML model in hand is: can we make humans more accurate? Our minimum threshold is 99% accuracy, and we really do reach over 99%. One of the things we want to continue to do is have machine learning help sustain that accuracy. Taking a look at this image, there are actually two machine learning models plugged in. The first machine learning model answers "where is the data on this pay stub?" and the second answers "what does the actual text say in that location?". We've tried to make it as easy as possible for a human to interact with these machine learning models, to correct anything, and to get suggestions for things that might be wrong.
So for example, first, the location model draws a line between the bounding box, or location, on the left and where the actual text is populated on the right. A person can see at a glance whether the location looks right, and if not, they can correct it. The second piece is the optical character recognition that Sylvia was talking about: we have the actual text in that box, and we're reading "Tyrion Lannister". A human can edit the text right underneath the bounding box on the left, so they don't have to move their eyes back and forth across the page. That makes it easier to spot errors, easier to correct them, and, lastly, faster to correct them.
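One way to picture the data behind that interface is a small record pairing the location model's bounding box with the OCR model's reading and an optional human correction. This sketch is hypothetical, not Arius' actual schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FieldPrediction:
    """One extracted field as a verifier would see it: where the
    location model thinks the value is, and what the OCR model read."""
    name: str
    bounding_box: Tuple[int, int, int, int]  # (x, y, width, height) from the location model
    ocr_text: str                            # the OCR model's reading, shown beside the box
    corrected_text: Optional[str] = None     # set only if a human verifier edits it

    @property
    def final_text(self) -> str:
        # A human correction wins; otherwise trust the model's reading.
        return self.corrected_text if self.corrected_text is not None else self.ocr_text

field = FieldPrediction("employee_name", (120, 80, 200, 24), "Tyrion Lannistor")
field.corrected_text = "Tyrion Lannister"  # the verifier fixes the OCR slip
print(field.final_text)  # -> Tyrion Lannister
```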
So that's the first component: how do we make humans and computers interact? The second component is making sure that our accuracy is where we say it is. The first idea we needed to sort out as a company was how to actually measure accuracy in a combined human-and-machine system, and how the two tie out together when it comes to building automation infrastructure. So how do we do it? We drop the same document through our system over and over again and have multiple people answer the document, and by "answer the document" I mean extract the data from it. We compare those labels against each other, then compare them to an expert opinion, which is what we call ground truth in house, and that sets a baseline for what our accuracy should be. Once we get to a point where we have a machine learning model plugged in, we can compare how well the machine learning model does against those human labels and see at what point the model beats the average human.
And when we do that, it gives us the ability to do something really cool, which is starting to skip some human steps, or all of them. Once we figure out how we're doing on accuracy, we can do a bit of analysis. Here I'm referencing a classification task where I know the first four pages have been historically extremely accurate, and we're also getting extremely high confidence from our machine learning models. Very high precision, very high confidence, and historically very high accuracy basically means that for a four-page document where these factors hold, we can pull out those four pages and have no humans touch them for this step. The other pages still have to go to a human, but what we're doing is pulling apart the problem and figuring out what we can automate first, so we can start to move a lot more efficiently for our customers.
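A rough sketch of that page-level skip decision, combining historical accuracy with model confidence; the thresholds and history table are illustrative, not Arius' actual values:

```python
# A page is auto-processed only when both the model's confidence on this
# page and the historical accuracy for that page type clear a high bar.
HISTORICAL_ACCURACY = {"page_1": 0.999, "page_2": 0.998, "page_5": 0.92}

def can_skip_human(page_type: str, model_confidence: float,
                   min_history: float = 0.995, min_confidence: float = 0.99) -> bool:
    history = HISTORICAL_ACCURACY.get(page_type, 0.0)  # unknown pages go to humans
    return history >= min_history and model_confidence >= min_confidence

print(can_skip_human("page_1", 0.997))  # True: automate this page
print(can_skip_human("page_5", 0.997))  # False: send it to a verifier
```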
So what does this actually look like in terms of data? The result of the flywheel over the course of four years is honestly a series of really pretty graphs. We've reached over 120 document types, and if you look at that bottom graph, we have gone from about an hour and a half from upload to returning data down to about 10 minutes, which means that what other companies do is significantly slower than what we do, all with that same over-99% accuracy. This efficiency is truly the most disruptive part of our product. The most important part of our product is obviously accuracy, because people are making decisions on this data, but that efficiency makes the product very hard to say no to. It gives us a huge competitive advantage, and it brings me back around to what Evan was talking about: more users. More and more people are getting convinced that automation is a very good way to go for document processing, and our company is playing a pretty critical role in that, which is pretty awesome. So this is the cool thing about working in fintech at Arius: there's a huge gap in loan underwriting efficiency that's largely based on getting data from paper documents, and we can help solve these real-world problems with our technology infrastructure, which provides a systematized, step-by-step approach to continuous improvement. Overall, our theme is iteration.
We improve our ML models regularly, we improve our accuracy constantly, we improve our user interfaces, and we even improve the way that clients interact with our API.