Laveena Ramchandani: My journey of testing a data science model


Video Transcription

Welcome to my session. Just let me check one second that I've unmuted myself. Yes, I have. OK. So if everyone can mute yourselves as well, because I can hear some noise in the... yeah, perfect. OK. So let me share my screen, right? So, welcome to today's session on testing a data science model. These are two completely different worlds, and it's quite interesting. There's quite a lot to learn from, and it's an emerging practice. What I mean by emerging is that I've done some research: I've asked some testers around the world if they've actually been involved with testing a data science model. Most of them came back saying that they've heard of these teams but they've never actually been directly in touch with them, so they just test the product at the end. When they came back with that information, I thought there's a lot of opportunity for us to go in there and actually test the results these models produce. So in today's session, I'll give you an overview of what I do, some processes and useful tips, and hopefully I can encourage and inspire you to actually go and explore this emerging area.

So, a little bit about myself. I graduated with a degree in computer science. I fell into my first job as a graduate, and it was to do with testing. I had no idea what testing was. Today, I've been testing and doing quality assurance roles for over seven years, and I thoroughly enjoy it. There's a lot to learn, and loads of people you can learn from and give advice to. It's an interesting area, and I've worked in various industries, as you can see. Some of my hobbies include dancing, yoga, teaching, and baking. So let me take you over the contents of today's session. Today I'll be giving you an overview of what a data science model is, some background and base knowledge, the process of actually testing a model, and some takeaway tips and complexities that I've seen and how we could overcome those. So what do I do on a day-to-day basis? QA means quality assurance. It's making sure that a project is conforming to the expectations of the stakeholders, basically the requirements that we get from our clients, making sure that we stick to those requirements and don't derail ourselves. And testing actually works hand in hand with that.

Testing is the process of actually exploring the system. So when the developers implement the expectations of the stakeholders, how are we actually making sure that something's working as intended? Many people think of testers as those people who go and find bugs. In a way it sounds a bit negative, but it's positive too: isn't it better that we find bugs before the end user finds them? So, in a way, it works well. I'm not sure, maybe some of you have come across this little scenario here. It's an edge case. What I mean by edge case is a boundary-level test. So imagine I'm ordering a coffee from an app, from Deliveroo, and I test ordering a coffee. I'm expecting a positive response. I'm not expecting any errors here; I will get a coffee. Then I try and order zero coffees. I'm not sure if any of you have ordered zero coffees, but it definitely won't work. You should expect an error message. Then you order tons of coffee, and obviously I'm expecting an error here as well, because I'm sure Starbucks is not going to come to my doorstep and deliver all of that coffee for me. Then what I would do is order a lizard.

I've never heard of a coffee shop that sells lizards, so it wouldn't work. I expect a negative test here, and I expect an alert. Then I try to order minus one coffee, which doesn't exist either, so I expect some sort of error there. Then I would try in a different language, so "un café": does the application even support different languages? That's another test. And then order some gibberish, so obviously I'm going to expect an alert. This is the kind of world testers live in. We don't necessarily do this all the time, but it's good that when we go into a new application, we actually try and explore it. So let's go to the next slide and see some stats that I found. Currently, 80% of UK businesses are looking to hire data scientists or seek data consultancy. IBM anticipates that data science will account for 28% of all digital jobs by 2020. So this is all good news. It means there's more opportunity for us as QA people or testers to go in there, understand that data, understand what the models are actually doing, and make sure that the results look decent enough, right? Now, what is data science? This is the base information that will help you test properly in your projects or for your clients. Data science is a collection of tools and algorithms to find hidden patterns in raw data.
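As an aside, the coffee-ordering edge cases from the start of this part of the talk could be written down as a small table of inputs and expected outcomes. This is only a sketch: `validate_order`, the one-item menu, and the quantity limit of 20 are hypothetical stand-ins, not anything from the app mentioned in the talk.

```python
def validate_order(item, quantity, menu=("coffee",), max_quantity=20):
    """Hypothetical order validator: returns (accepted, message)."""
    if not isinstance(item, str) or item.strip().lower() not in menu:
        return False, "unknown item"
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        return False, "quantity must be a whole number"
    if quantity < 1:
        return False, "quantity must be at least 1"
    if quantity > max_quantity:
        return False, "quantity exceeds the limit"
    return True, "order accepted"


# (item, quantity, should the order be accepted?)
EDGE_CASES = [
    ("coffee", 1, True),        # happy path
    ("coffee", 0, False),       # zero coffees
    ("coffee", 999999, False),  # tons of coffee
    ("lizard", 1, False),       # not on the menu
    ("coffee", -1, False),      # negative quantity
    ("un café", 1, False),      # another language, unsupported in this sketch
    ("asdf;lkj", 1, False),     # gibberish
]

for item, quantity, expected in EDGE_CASES:
    accepted, message = validate_order(item, quantity)
    assert accepted == expected, (item, quantity, message)
```

Keeping the cases in one table makes it easy to add the next odd input someone thinks of during exploratory testing.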

So it's basically finding trends, and then it allows decisions and predictions to be made. What I mean by predictions is not that if you use a certain set of configurations for the model, your business is going to boom in two days. No, it's giving you some sort of guidelines: if you use these kinds of parameters, you should expect better sales, for example. Basically, the aim is to make informed decisions. So data science is more of a forward-looking approach. It's an exploratory way to focus on analyzing past and current results. And there's a buzzword that I'm going to give you now: predictive analytics. This is your buzzword number one of the session. Organizations use predictive analytics to look into patterns contained within data in order to detect risks and opportunities for their clients or within a project. So this was a little bit about data science. What is a model? Let me give you an analogy. Imagine you are commuting to work in the morning. You get into your car, you sit down, and you start driving. You've got your air con on, you might have music on; that's your experience. So that's your data of driving. Then you might come across some traffic in the morning and you might decide, oh, let me take the first right or the first left to get out of this traffic, because otherwise I'll be late to work.

So you're basically finding different driving patterns to learn what works best for you. That is your computer side of things. What happens next is the target value; that's the model side of things. The target value in this case would be: how long did it take me to get to work? If you arrived on time, that's good. If you're getting there late, the model is not performing well. But the sooner you get there and the better the routes you find, the more you're training your model and getting optimized results. Let me give you another example. In a manufacturing firm, you could check what they sold last week and, based on the same assumptions you input into the model, you could potentially see what an acceptable sale would be for next week. So this is basically improving a production line for a manufacturing firm. In a nutshell, models are designed to discover relationships between different behaviors. A model assesses a set of conditions and guides you to make informed decisions.

Something that I'd like you to bear in mind is that a model is a statistical black box. What I mean by this is that it uses all these nice funky algorithms to actually give you useful guidelines. Models also use probability distributions. What I mean by this is they use some sort of graphs to understand whether the data looks good, whether there are any outliers, and whether those outliers are acceptable or not. Also, the model is more of a what-if situation. For example, if a customer uses the parameters the model suggested, then what would their business look like? It's not "this is definitely what it would look like"; it's more "this is what it could look like". So your buzzword number two is coming now: genetic algorithms. What are genetic algorithms? Genetic algorithms reflect the process of natural selection, where the fittest individuals are selected. This is inspired by Charles Darwin. With genetic algorithms, there's some stochasticity and randomness in the results, which doesn't necessarily mean negative results, but I'll share some more knowledge on this in the next few slides. So in a nutshell, a model produces some indications that are used as guidelines by your clients or within your project. And what do testers do here?

Basically, as we are expecting the model to produce some sort of results, we are testing towards those expectations and making sure that we align to them, right? So let's go to the next slide: the benefits of a data science model. It helps you craft better decision-making skills.

You can identify newer opportunities, so something that a company didn't know about, data science modeling could potentially help uncover. Then actions based on trends to help define goals, so finding the best patterns using predictive analytics. And then data-driven evidence.

So in this kind of area, you definitely need data. Then you have the identification and refining of target results. As I said, you have to train your model; you have to make it better so it gives you better results every time. And it's a very interesting area because it mimics real-life scenarios, so you're actually using a lot of technical knowledge as well as learning business skills from this, right? As I said, a model needs data. Data can be divided into these four areas: structured, unstructured, semi-structured, and metadata. Let me give you a little overview of each of them. Structured data adheres to predefined data models. That means you can easily aggregate data from various locations and databases, if you're using databases, and it's easy to store, process, and access. So things like Excel files. Then you have unstructured data, which is more like videos and audio. It isn't organized in a predefined manner and it has irregularities, so it's quite ambiguous. This is the kind of data I use: unstructured. What we tend to do is provide a data schema to our clients, to say, could you please give us your data in this manner, so we can actually run it through the model without any ambiguities.

The next one is semi-structured data, which is basically self-describing data. It uses tags and markers, as in JSON files and XML files. And lastly, you've got metadata, which is additional information about a data set. For example, with photographs you can tell from the metadata where a photo was taken and when it was taken. In a way, metadata is a type of structured data set as well. So this was all about data, and for a model you definitely need to input some data, right? This is an architecture that I created, but to go through it, let's have a scenario in place. The scenario is going to be: say you're providing money on credit; then the probability of customers making future credit payments on time is a matter of concern for you, right? So let's use that idea and see how we could build a model out of this scenario. You would potentially have databases and logs, and you basically get your raw data out of them. Then you have features. What I mean by features is that new implementations and new requirements keep coming in.

Clients want specific things for their businesses. So your raw data could be used for the new features, to see what the data looks like. And then you have your learning algorithm. The learning algorithm is the data science bit, which uses predictive analytics. Let me give you another recap of what that was and why we use it. It uses statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. Organizations use this to sift through current and historical data to detect trends and forecast events and conditions that should occur at a specific time, based on supplied parameters.

So parameters are like configuration settings that help the model run. Then you would push that through to the model, and you get your genetic algorithms working there, so some sort of randomness is generated. And then you get your futuristic guideline data set that you can provide to your clients. So let's finish the scenario. Here you could build a model which performs predictive analytics on the payment history of a customer to predict whether future payments will be on time or not. Other companies that use a lot of data science models are the likes of Uber, Netflix, and Amazon Prime. When you're watching something, it suggests what you could watch next, because it's learning about your trends and the patterns in the genres you're watching. And then, as I just said, credit card fraud, payments, and financial areas as well. So let's go to the next slide. Now, right before we start testing in a team, we have to be careful of the elephant in the room.

What I mean by elephant in the room is that when you have requirements coming in, all your team members should be in the same boat. None of you should derail, because it's essential to stick to the point. It's important for all of your team members to have the requirements. After the requirements, have some planning sessions and understand what it is we're trying to do. After that, you could have some design sessions where you design a solution and see what the most viable solution is. Then you implement it: the developers go and code it, and you can test it as well. Try and keep this incremental. Don't leave testing to the end. Make sure testers are involved in each and every area, because we try to understand how a new change to the model or application would impact testing, and whether it's testable or not. Something that I've come across in an agile team is the three amigos session. Three amigos means the three friends: the business analysts or product owners, who bring the requirements; the quality assurance and testing people; and the developers. They sit down together, go over the requirements, and ask: what is the problem that I'm trying to solve for my client?

What is the solution that could best solve this situation for them? How could we make it better? How could we make the customer's life easier? And the testers think about: how would I test this? Is it testable or not? These are the kinds of questions you can cover in the three amigos session quite easily. So this is something of a pre-testing stage. Now let's get started on the testing stage. Models exist in two styles. You have off-the-shelf, so you buy it from somewhere, or you can have it custom-made in your team. In my team, for example, we have it custom-made. When you decide what model you're going to use, you need to have some processes and strategies around that in terms of testing it, and in other important areas in your team. Other considerations are around training your model. How often are you going to train your model? That means, how often are you going to run it to provide better results? Every time you run it, you get even better results. And obviously, optimizing the model means you get much, much better results. So those are the two things you should think about. And then the randomness, the stochastic element, is super important, because as a team you need to decide what kind of discrepancies are acceptable discrepancies in your model results. So let me give you another analogy.

Imagine five plus three, right? Five plus three for me and you is eight. But for a model, this might be 8.000001 or 8.0003. When we look at this without some knowledge of stochastic or genetic algorithms, we think that we've just got a completely wrong answer, but it's not wrong. We need to decide in our team what an acceptable deviation would be. For example, in my team, a deviation of 3 to 5% is acceptable. But other teams who need to be 100% precise may need to keep their thresholds lower, so maybe up to 1% is OK, but nothing more than that. I'll touch more on this in the next few slides for you to understand better. OK, so the testing process. You don't have to follow this step by step, but it would be valuable if you could, because it helps you understand the product better.
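The five-plus-three analogy, together with a team-agreed acceptable deviation, could be checked along these lines. This is a minimal sketch: the function name `within_deviation` and the choice to measure the deviation relative to the expected value are assumptions for illustration, and the 5% and 1% figures simply echo the thresholds mentioned in the talk.

```python
def within_deviation(actual, expected, max_rel_dev=0.05):
    """True if `actual` differs from `expected` by at most `max_rel_dev`,
    expressed as a fraction of the expected value (0.05 means 5%)."""
    if expected == 0:
        # A relative deviation is undefined against zero; require an exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) <= max_rel_dev


# The model's "eight" is not exactly 8, but it is well inside the threshold.
print(within_deviation(8.000001, 8))
print(within_deviation(8.0003, 8))

# The same result can pass for one team and fail for a stricter one.
print(within_deviation(8.3, 8, max_rel_dev=0.05))
print(within_deviation(8.3, 8, max_rel_dev=0.01))
```

The threshold is a parameter rather than a constant because, as the talk says, it is something each team has to decide for itself.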

First, you have to verify the data set to use for the model: understand what the client is trying to do, understand the data, and make sure you can input these files into the model and the model accepts them. Next, you validate the input versus the output data results.

You need to make sure that whatever I input actually matches the output in those areas which should give you an exact match, and that my columns look fine when I'm comparing them. Then there's the stochastic element you might get in some columns. Another thing you could do is test in areas you're 100% certain about, because you're not expecting any errors there, but also test the uncertain areas, where you might find you need some thresholds. It's always good to have happy path testing and unhappy path testing, just to have that variety. Then you make sure that the data pipeline works as expected. It's a very data-heavy area, so you need to make sure that no data is dropped and no duplicates are formed, because that leaves you with uncertain guidelines for your clients. Then you could test using different parameters. What I mean by parameters, and I've mentioned it earlier as well, is configuration settings which help your model run. These parameters will vary per client.
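The input-versus-output and pipeline checks described above (exact-match columns, no dropped data, no duplicates) might be sketched like this. Everything here is illustrative: the row layout, the `id` key, the `name` column, and the sample rows are hypothetical, and a real pipeline would more likely be checked against database tables or files.

```python
from collections import Counter


def check_pipeline(input_rows, output_rows, key="id", exact_fields=("name",)):
    """Compare model input and output rows and report pipeline problems."""
    input_keys = [row[key] for row in input_rows]
    output_keys = [row[key] for row in output_rows]

    # Rows that went in but never came out.
    dropped = sorted(set(input_keys) - set(output_keys))
    # Keys that appear more than once in the output.
    duplicates = sorted(k for k, n in Counter(output_keys).items() if n > 1)

    # Fields that should survive the model unchanged.
    output_by_key = {row[key]: row for row in output_rows}
    mismatched = []
    for row in input_rows:
        out = output_by_key.get(row[key])
        if out is None:
            continue  # already reported as dropped
        for field in exact_fields:
            if out.get(field) != row.get(field):
                mismatched.append((row[key], field))

    return {"dropped": dropped, "duplicates": duplicates, "mismatched": mismatched}


# Hypothetical sample data: p3 is dropped, p2 is duplicated and renamed.
input_rows = [
    {"id": "p1", "name": "bolt"},
    {"id": "p2", "name": "nut"},
    {"id": "p3", "name": "washer"},
]
output_rows = [
    {"id": "p1", "name": "bolt", "forecast": 10.2},
    {"id": "p2", "name": "NUT", "forecast": 4.9},
    {"id": "p2", "name": "NUT", "forecast": 4.9},
]
report = check_pipeline(input_rows, output_rows)
print(report)
```

Columns that are expected to vary because of the stochastic element would be left out of `exact_fields` and compared against a threshold instead.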

What I do here is edge case testing as well. In a field where only characters are accepted, I'll try and put numbers in to see whether those integers are accepted or not. I'd also do some cross-browser testing to see if the parameters look fine: for example, if it's fine in Chrome on my Mac, is it OK in Chrome on Windows as well? Those are the different kinds of testing that I do. And then lastly, explore the model. Go in and test. Give yourself half an hour to an hour with no agenda, and just try and see if you can find something that doesn't look right to you, or something you've learned just by yourself. It's very important to do exploratory testing, because you never know, you might find a critical error which we couldn't find earlier. So these are the testing processes for you. Now, some more on testing. In the team that I'm in, I've experienced mainly two types of testing: testing when new features come in, and regression testing. What do I mean by this? When a new feature comes in, you don't have anything to compare this new functionality with. You don't have any historic run with the new implementation in there; you just have old runs, which didn't have it.

What you could do is come up with a test plan with a data scientist, and even have your own plans on the side. You don't have to depend on them all the time. Also, since it's a new implementation, you are bound to find some issues. I'm not saying you will definitely find an issue, but if you do, make sure these bugs are resolved and are not repeatable, because you can't have much confidence in an application with repeatable bugs. So make sure you tick that off. Then, when you've got your results, consult the data scientist about them for more clarity and to understand whether your results look good or not, and if they don't look good, deep dive further. The logic of a new feature is tested in unit tests by the data scientist. You could sit down with the data scientist and understand what he or she has unit tested. Next, what you could try and do, which I do as well, is run the model prior to a new implementation and then run it afterwards, to see what is new and how it has improved the model. And there's no one stopping you from going in and doing some exploratory testing.

Now, in terms of regression testing: what is regression testing? Regression testing confirms that any recent program or code change has not adversely affected existing features or functionality. And when you're doing regression testing, make sure that any bugs you found in your new feature testing are definitely fixed.

In terms of having confidence in your model when doing regression tests, what you do is compare against your historic run. Maybe you did a regression test after a new release two or three weeks ago. You use that baseline test versus a new test and see whether the current results look better, whether they've improved or been optimized, and try to spot any trends. Remember, some models can have a degree of randomness because of the genetic algorithms involved, and no equation for the best answer exists, so rely on a good enough and fast enough result. So what do we get from this? Basically, we provide accurate predictions or guidelines and insights that can be used to enable critical and strategic business decisions. I know some of you might be thinking about automation. Automation is really hot in testing, and many people like automating. But in terms of data science models, I'm definitely saying it's not something that we could actually do, because it needs extensive maintenance and human intervention. What do I mean by human intervention? A model works according to how you train it, how you are optimizing it, and the algorithms you are giving it. An automated test won't be able to understand that understanding between you and the model, or the data scientist and the model.

So the parts you could automate would be, for example: I have a front end which drives the model. So I could automate adding configurations on my graphical user interface and make sure that all the configurations are there. I could press the button to run the model, and then the model runs in the back end. These are the kinds of areas you could potentially automate, right? Some useful questions that I've been thinking about, and that would be useful for you to think about, are: What is an acceptable test? Has a run completed, and did nothing break within the various stages of the model? Have we found any anomalies? Did we drop some data? How do we know what we produced is the right result? How accurate are my results, and what is an acceptable deviation? For these kinds of questions, as I said, decide within your team, as a process, what the minimum threshold you can accept is. For example, if I see a difference above 5%, that is alarming to me. I straight away raise that with the team, and we try and dig deeper as to why it happened. Was it a configuration that didn't work? Do we need to optimize something? Is there a bug in the model? These are the kinds of steps you can take if you have processes for thresholds when working with data science models.

In terms of how we know what we produced is the right result: if you have a delta run, a baseline run that you've run in the past, that should give you enough confidence that my model looks good enough to me, and I can release it into the next stage and add more new features. As I said, you have to look into historical runs to give guidelines for future runs. So why don't we go and practice this? I've given you a very, very simple example. These are some locations, A1 to D6, and then we've got the number of birds in these locations. Some questions you could think about would be: How many sites are there? How many birds were counted at the seventh site? What is the total number of birds counted across all sites? What is the average number of birds seen at a site? Then I could do more of the predictive analytics and genetic algorithms side of things, as to how many more or fewer birds I expect in each of these locations. It's not necessarily the case that every day site A1 is going to have 28 birds; it might have 10 tomorrow, you never know. So we could push this data into the model, and the things that would definitely match are the locations.

You should definitely see these matching in your dashboard, or wherever you're checking the information, such as a database table, but the numbers of birds would vary. So it's something you could work on in your team, deciding what the maximum threshold we can accept is. Questions you need to think about are: have we got a good understanding of what the model has provided to us? Are the predictive analytics and genetic algorithms working as expected? It's good to sit down with a data scientist and understand that. And then, does the shape of my data look good? What do I mean by shape? I don't mean triangles, circles, or squares. I mean histograms, graphs, standard deviations, means, the absolute difference. These are the kinds of things that help you understand the shape of your data. Complexities that I've seen are around client data and anonymization. The easiest option, if you have few clients, maybe one or two, would be to have synthetic data. What I mean by that is mimic the client data: don't copy it, just mimic it.
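Going back to the bird-count example and the "shape" of the data, a baseline run and a new run could be compared with a few summary statistics. The slide data is not in the transcript, so apart from site A1 having 28 birds, every site name and count below is made up for illustration, and the helper names are hypothetical.

```python
import statistics

# Hypothetical counts per site; only A1 = 28 comes from the talk.
baseline_run = {"A1": 28, "B2": 14, "C3": 21, "D6": 9}
new_run = {"A1": 27, "B2": 14, "C3": 30, "D6": 9}


def summarize(counts):
    """Basic shape of a run: number of sites, total, mean and spread."""
    values = list(counts.values())
    return {
        "sites": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def absolute_differences(baseline, new):
    """Per-site absolute difference; the locations themselves must match exactly."""
    if set(baseline) != set(new):
        raise ValueError("locations differ between the baseline and the new run")
    return {site: abs(new[site] - baseline[site]) for site in baseline}


print(summarize(baseline_run))
print(summarize(new_run))
print(absolute_differences(baseline_run, new_run))
```

Whether a jump like the one at the hypothetical site C3 is acceptable is exactly the threshold conversation to have with the team and the data scientist.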

But I've got quite a few clients, so mimicking each client's data is very time-consuming. What we do instead is anonymize the client data and use that. The complexity I saw around anonymization is that the more we anonymize and the more new features we bring in, the more complexities arise around the data set. What happened once was that a client had a description field with around 50 characters describing each product, and I went and anonymized it to "description 12345". Clearly, my file was much smaller than the client's actual data file. So the consultants who were actually doing this run to provide some results to the client got an error saying that the file size was too big. We dug a bit deeper and asked the developers, and one of them said, oh, I've put a limit on the file size that we can accept into the model. But this was not communicated well enough. So we had to increase the limit and accept a bigger file size. These are the kinds of areas you should look out for. Then, data policies: these are super important.
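One way to avoid the size mismatch in the description-field story is to anonymize each value into a token of the same length, so the anonymized file stays close in size to the real one. This is a sketch of that idea only, not the approach the team used: `anonymize_text`, the salt, and the sample description are hypothetical, and hashing like this is a simple pseudonymization that would still need checking against your own data policies.

```python
import hashlib


def anonymize_text(value, salt="example-salt"):
    """Replace `value` with a hex token that has the same number of characters."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    # Repeat the 64-character digest if the original text is longer than that.
    repeated = digest * (len(value) // len(digest) + 1)
    return repeated[: len(value)]


original = "Stainless steel water bottle, 750 ml, matte blue lid"
token = anonymize_text(original)

print(len(original), len(token))
print(token)
```

The token is made of ASCII hex characters, so if the original text contains non-ASCII characters the byte size can still differ slightly even though the character count matches.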

You definitely shouldn't be using client data in a raw manner. You have to either mimic it or anonymize it, to avoid data issues with the client and to protect the data. Then model complexity. As I said earlier, a model is a black box full of funky statistical algorithms. I'm not telling you that you need to understand 100% of the model. Just have a good understanding of what it is trying to do. And then silos. What I mean by silos is that some people in the team might understand something well, but others might not. So be more vocal, communicate, and write it on your communication platforms, where all of you can see the information. Keep it open for everyone. And then dependencies. For example, I have a graphical user interface to run my model in the back end. Something was changed in the model, like a little setting or a configuration, and the front-end developers were not aware of it. So when I came to test on the front end, something didn't look right to me. This was because the model was doing the right thing, but the front end hadn't captured the latest change. So as I said, let's avoid the elephant in the room.

All of us should be on board with the same knowledge and try and work towards that, right? Some tips for you to take away: understand the data from a business point of view, understand the model, and understand what your team is trying to provide. Don't derail, as I said. Understand the consumer requirements. Loads of assumptions are going to exist. There are quite a lot, and what happens with assumptions is that we can't realistically pick up each one, which creates more complexities and stops you from delivering on time.

Next would be to review and analyze your results and document any important findings, because something that you found useful may be useful for someone else as well. Understand the specifications, define your tests, have scenarios in place, report any issues you find, and pair up with your developers and data scientists to understand things better. Next, remember that your model mimics real-world scenarios with tons of assumptions, so be ready for discrepancies. Even though discrepancy sounds like a negative word, it's not really a negative word. We should actually embrace the uncertainty, because it's helping us understand what we are trying to provide to the client. So the use of thresholds would be quite useful. Also, by providing statistics and facts to your peers in your team, what you're doing at the end of the day is creating a smart and savvy team that uses insights to drive more business. So that's really positive. Once you have processes in your team, try and stick to them, and optimize your team as well. It's not just about optimizing a model and training it.

You also have to look at your team and become better at things. Also, be present at all vital meetings. It's OK not to know everything; there's tons in this field. Just have a good understanding, and that should be good. Pairing is super useful. What I mean by pairing is sit down one day with a data scientist and understand what they're trying to do with a new feature that's coming in, or sit down with your developers and see what they are trying to do. Share some knowledge, and get some knowledge from them. It's a two-way thing. Then raise your concerns, be vocal, and make sure people are aware of the things you are finding. And if you have new features, for example if 10 new features have come in this month, make sure you have iterative development and make sure you're releasing them, so that they can be tested and clients can have a feel of them and see what they think. With this, you're actually staying on top of things, and it also helps the team make sure that any bugs you found previously are not being repeated, because if the clients are testing it, we would know whether a bug that we had previously is repeated or not. So what does this leave us with?

This leaves us with a hybrid tester, because you came in with your testing and quality assurance skills and you gained data science and modeling skills. So it's great: you've got two skills. Now, some useful tools that I found for myself and that could be handy for you all. First would be SQL: if you're using databases, it's quite useful to have some SQL knowledge. Then some statistics. You don't have to be a pro at this, but some idea would be useful, and you can essentially use some statistics already in Excel or Google Sheets. Also, developer tools would be super handy, because if you're using a graphical user interface to run your model, there might be some discrepancies in between, to do with API endpoints and things. So it's useful to have developer tools open on the side. And then a Miro dashboard. As I said about the three amigos session, a Miro dashboard is quite useful because you can use sticky notes and share all your concerns, assumptions, and ambiguities, and all of you can put them in the right kind of columns and work out how best to address them. So this was all. Let me give you a little summary of what we went through today.
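As an illustration of the kind of SQL a tester might run against a model's input data, here is a small sketch using Python's built-in `sqlite3` module. The `shipments` table, its columns, and the sample rows are all hypothetical; the talk does not name a database or schema.

```python
import sqlite3


def data_quality_counts(conn: sqlite3.Connection) -> dict:
    """Run a few basic sanity queries and return the number of offending rows per check."""
    checks = {
        "null_quantity": "SELECT COUNT(*) FROM shipments WHERE quantity IS NULL",
        "negative_quantity": "SELECT COUNT(*) FROM shipments WHERE quantity < 0",
        "duplicate_ids": (
            "SELECT COUNT(*) FROM "
            "(SELECT id FROM shipments GROUP BY id HAVING COUNT(*) > 1)"
        ),
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# In-memory database with made-up rows, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?)",
    [(1, 10), (2, None), (2, 5), (3, -4)],
)

counts = data_quality_counts(conn)
print(counts)
```

Any count above zero is a prompt for a conversation with the data engineer or data scientist, not necessarily a bug.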

So it's quite important for you to obviously do some edge case testing and have some exploring around the model. Make sure the model is providing accurate results with precision metrics, and make sure it can perform at scale even when new features come in, and that the model is maintainable.
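The talk does not say which precision metric is meant. As one possible reading, the sketch below assumes the classification sense of precision (true positives divided by all predicted positives) and also handles two edge cases: empty input and a model that predicts no positives at all. The function name and sample labels are illustrative only.

```python
def precision(predicted: list, actual: list) -> float:
    """Precision = true positives / (true positives + false positives).

    Returns 0.0 when the model predicts no positives (including empty input),
    which avoids a division by zero.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


# Made-up labels: 1 = positive, 0 = negative.
score = precision([1, 1, 0, 1], [1, 0, 0, 1])
print(round(score, 3))
```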

So this was all. I hope you enjoyed it. Feel free to connect with me via LinkedIn, and I'm happy to have a coffee catch-up with you. And yeah, if you have any questions, please feel free to ask. Now, let me just unshare my screen. So I don't see any questions, or let me scroll up. So I can see Sharon Leach's question: can't we use machine learning to automate? Well, I've not used machine learning directly, but I'm sure if you try, there is a way of automating things. The thing with machine learning is that it's something a human is giving the machine to understand, right? So you need to be careful, because my understanding is not exactly what the automated test might understand. So you need to be slightly careful there. Another question, OK. This is Qt Kushla: how do you use testing to identify and eliminate biases in your data model? So how do we use testing? Obviously, as I said, you've got quite a few assumptions and whatnot. It's important to stick to what your processes are in your team and understand what you're going to test. There are loads of types of testing.

So make sure you try and capture as much as you can, and try not to get too derailed from things, because that takes the focus off the importance of delivering your requirement. So I would suggest that as a solution, KT. I hope that helped. Any more questions? OK, I don't think there are any more questions. Ah, OK, I've got one last one here, from Alexandra: if models fail on production or give wrong results, do you have a backup model to fall back on? Yes, we do. So we tend to use DevOps. DevOps basically would trigger the build. So if you're releasing, we always have a kind of safety, what shall I call it, a safety model version that we've saved. So if anything goes wrong in the pre-production or production environment, we can revert back to it. So DevOps is quite useful here. I tend to do a lot of smoke testing in pre-production and production environments. Until I have given a thumbs up myself on the results and what I've tested, I don't let the consultants or the clients know that we have released a new release.
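The "smoke test, then revert to the saved version if anything fails" idea can be sketched as below. This is not the speaker's pipeline and does not use any DevOps tooling; the models are stand-in Python functions, and `choose_model`, the check names, and the sample values are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]


def choose_model(
    candidate: Model, backup: Model, smoke_checks: Dict[str, Callable[[Model], bool]]
) -> Tuple[Model, List[str]]:
    """Run every smoke check on the candidate; fall back to the backup if any fails."""
    failures = []
    for name, check in smoke_checks.items():
        try:
            passed = check(candidate)
        except Exception:
            # A crash during a smoke check counts as a failure.
            passed = False
        if not passed:
            failures.append(name)
    return (backup if failures else candidate), failures


def backup_model(demand: float) -> float:
    """Previously released, known-good version (stand-in)."""
    return demand * 1.1


def broken_candidate(demand: float) -> float:
    """New release with an obvious defect: forecasts are negative."""
    return -1.0


smoke_checks = {
    "non_negative_forecast": lambda model: model(100.0) >= 0,
    "returns_float": lambda model: isinstance(model(1.0), float),
}

active, failures = choose_model(broken_candidate, backup_model, smoke_checks)
print(active.__name__, failures)
```

Only once `failures` is empty would the release be announced to consultants or clients.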

So it's super important to have a backup and make sure you have that handy, just in case things go wrong. Obviously, you never know what might happen, so just keep your backup with you. And I think there's one more, from Geeta: what advice would you give to a fresh undergrad student willing to get into the data science world? My advice would be to obviously have a good understanding of what data science is. Over here, I didn't cover everything about data science. I gave you the overview, and I hope that it gave you kind of a baseline understanding of it. And what I would suggest is, actually I've written some names down here. I think a few people are speaking about data science in a few hours. There is one called Bora Anjum, who will talk about evolution of data science; the new Hassani, who will be speaking on first steps to data science, so Geeta, that would actually help you; and Simran Yada, who would be talking about getting started in data science. So again, that would be helpful. And I would also like to give a shout-out to Pin Khan, who I've actually worked with, and she will be talking about pairing and learning.

So as I mentioned in my session, pairing with your team members is quite important, especially with your data scientists and the developers. So that should be helpful. Obviously, attend all the other talks, because they are quite useful and quite interesting. I've got another question, from, actually, Tanya: how do you approach unpredictable data sets? For instance, data sets collected from COVID have been quite unpredictable, so how do you deal with them? So obviously, you potentially have data engineers or data scientists in your team who can understand the unstructured data that's coming in. What you tend to do then is have a data schema prepared and give that schema to whoever your client is. Have this as a process in your team: have a data schema to provide to clients, get structured information back, and then push this to your data science models and tools, so you understand your data better and have it in a much more organized way. Because I can understand it's too time-consuming to sit and organize data, right?
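To show what "give the client a schema and only accept data that matches it" could look like in practice, here is a minimal sketch in plain Python. The field names (`product_id`, `quantity`, `ship_date`) are invented for a supply-chain flavour and are not from the talk; a real team might use a dedicated schema tool instead.

```python
# Hypothetical schema shared with the client: field name -> expected Python type.
SCHEMA = {"product_id": str, "quantity": int, "ship_date": str}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(row[field]).__name__}"
            )
    for field in row:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


good_row = {"product_id": "A-17", "quantity": 40, "ship_date": "2021-03-01"}
bad_row = {"product_id": "A-18", "quantity": "forty", "notes": "urgent"}

print(validate_row(good_row, SCHEMA))
print(validate_row(bad_row, SCHEMA))
```

Rejecting or flagging non-conforming rows at the door is what removes the need for guesswork later, since the model never sees data whose meaning had to be assumed.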

I think I've got another question, from ST A: can you expand on the ambiguity and unstructured data, and how a schema helps avoid the ambiguity, with an example? So, for example, I work in supply chain, and in terms of unstructured data, as I said, we use data schemas to help us understand how the model will pick up this data. Obviously, we cannot just push in unstructured data that comes from the client, because that creates a lot of ambiguities. So I don't have an example, but I would definitely say have a process for the data schema and the data you can accept from your clients, because that helps you massively and saves a lot of time. I've got a question from B ANA: is it possible to use agile techniques when working with data projects? Because there are many dependencies when working with data, how do you create a sprint in such cases and provide incremental deliveries to the client? So, BNA, I'm actually working in an agile team. My sprints are two-weekly sprints, and we basically bring in a set of requirements and new features, and we decide requirements according to the importance of what needs to be delivered.

First, obviously, it's not just new implementations. We have bugs as well, and some of them are critical. So we need to get the critical bugs out of the way first before working on any implementations. So that's how we do it. And obviously, we use Azure DevOps to do incremental development. And as a member suggested earlier about being careful of crashing in pre-prod or prod environments, make sure you have a backup option for your model as well. I've got another question, from Magdolna. I hope I've pronounced your name accurately. Do you do data wrangling once you get the data? We used to do data wrangling when we didn't have a schema in place. So we had a data engineer who would understand the data coming from the client, and he would actually go and do the wrangling, but it was way too time-consuming. So the best solution we came up with, again, to shed more light on it: have a data schema, provide it to your clients via your consultants, or have a face-to-face kind of meeting to help them understand that this is the data we are going to accept. Hopefully this makes things smoother for you.

So yeah, I think I don't have any more questions. Oh, what is data wrangling, from Sharon. Data wrangling is trying to make the data right according to what you think. But then again, this creates assumptions. So obviously, you need someone who's quite a professional in this area, who's worked with these kinds of clients, to understand how to, how shall I say, wrangling could be another word for, like, imitating the client data, basically understanding how you can potentially resolve an ambiguity there for them. But yeah, that is all then. I hope all of you enjoyed it. Please feel free to connect with me on LinkedIn. I'd be happy to help you and give you some ideas of how you could test better. And yeah, enjoy your day, and hope to connect with you soon. Thank you. Bye.