Career Path for Data Science - How to be a Data Scientist? by Isha Shah and Claire Zhang
Video Transcription
Good afternoon, everyone. Thank you for joining this session today. My name is Isha Shah. I'll be presenting with my colleague Claire on the career path for data science: how to become a data scientist. We are very excited to share our journey into data science, and I hope people who want to start their career in data science, or who are looking to transition from data analyst to the data science track, can benefit from this session. So let's start with the introductions. I did my master's in information systems at the University of Illinois Chicago, and my first job was as a data analyst. I worked as a data analyst for a couple of years, and after that I switched to the data science track. Currently I'm a senior data scientist at TikTok. Our team focuses on building new features for our content creators to make sure that we keep our platform safe. So that's about me. I'll pass it on to Claire.
Thanks, Isha. Hi everyone. My name is Claire. I'm so excited to be here today with everyone. Same as Isha, I work for TikTok Trust and Safety. My day-to-day job is basically to find this kind of bad content, especially regarding harassment and bullying online, to better protect our users and make sure they are having a great time on our platform. Before TikTok, I worked for LinkedIn, but there I actually worked in a completely different domain, in go-to-market. So I pretty much moved from the front end to the back, to protect more users. OK, so this is pretty much about me.
For today's agenda, we're going to divide everything into three sections. The first one will be a quick introduction to the data scientist role. The second one will be the main part: how to become a data scientist. In the third one, we are going to share some projects and useful resources for your reference. For the first one, as I said: who is a data scientist? We will also try to help people understand the value of a data scientist to the business.
I have to be 100% honest with everyone: data science, and especially data scientist, is actually one of the most inflated or even ill-defined job titles in this industry. You can see "scientist", a very fancy word, in the title, so people think, OK, great, these people seem smart and are probably very top talent, the same as the other scientists in the lab. So to most people who are not familiar with data scientists, this is more of a buzzword. And even for people who are familiar with or working with data scientists, for example in my daily work, whether it's my PM or my stakeholders, I am always asked the same question: Claire, should I trust you? Should I trust the data, or should I follow my instinct? So these are the two types of questions we always find people have, whether they are familiar or unfamiliar with data science. Today we'll try to give you different perspectives on what data scientists do and, by sharing what we do, help you better understand what we are.
So the question comes to what we do. I generally divide the job responsibilities into eight sections. The first one is data pipeline applications. But forget about these fancy words and listen to me. Let's take a very easy metaphor. Let's say we have a huge amount of data, and we can divide that data into small cubes. For example, there's one cube about users, one cube about location, one cube about behavior, and so on.
So you will have many, many cubes. And if you are a popular platform like TikTok, you're going to have a seriously large scale, tons of data every day. In this case, if you just want to find the data for one tiny use case, do you really need to go through all these cubes to find it? That's kind of impossible, right? Data scientists are actually smart, so that's why we build a second layer on top of these small cubes, using different combinations of the cubes, to make sure we only give a small chunk of data to cater to each specific use case. In the future, no matter how big your business grows or how much more data you get, these small chunks of data will probably grow a little bit as well, but it is still much faster and more efficient to access a small chunk of data than to go through all the cubes hunting for that specific use case's data. So this is what we're doing: designing an efficient way to extract, transform, and load the data. This is basically what we call a data pipeline application, the first responsibility. The second one is the analytics framework.
A lot of times my manager will come to me with a very, very vague question. For example: hey Claire, can you tell me how many bad guys there are on our platform? Great. OK. When we get this kind of very vague question, we try to understand the goal behind it and transform it into a mathematical question, something we can quantify and help them with. After we understand the question, we also need to come up with a logical structure to decompose it and provide solutions from all related dimensions, to make sure the business stakeholder, or any other stakeholder working with us, can understand.
So this is what we generally call the analytics framework. The third one is the interactive dashboard. This one is very easy to understand, right? Instead of me sitting there answering everyone's same question again and again, day by day, how about I just put everything into a dashboard, and this dashboard is automatically updated every day with the new data flowing in. So next time you don't even need to come to me. You can just look at the dashboard and see: oh, we have this many users today, we have this many new users today. This is a great and easy way for everyone to keep track of everything. The fourth one is A/B testing. This one is even easier to understand. For example, my PM comes to me: Claire, do you think the TikTok logo should be a darker color or a lighter color, or do you think we can change the logo to a different one? How can I know? We can conduct an A/B test and use the data, in a statistically significant and scientific way, to tell the PM and everyone else which is the correct, or at least the better, answer. And the fifth one is data research.
For example, there may be a lot of new models and a lot of recent papers that uncover the latest technology that can be applied in this industry. To keep our own work at the top of the industry level, and to make sure we can always leverage the best technology, data research is one of the directions we need to keep in mind and keep in the loop in our daily routine. The sixth one, I think, is very easy for everyone, because it's almost another buzzword: machine learning. Instead of us as humans trying to learn and understand all this huge amount of data, how about we just let the machine learn it, let the machine understand it, and let the machine predict everything to help us make further decisions. So machine learning is definitely one of the main areas. Next, the seventh one, is programming languages. This one is actually very obvious as well. We use a lot of programming tools like SQL, Python, and R, depending on the use case and the tool we are using every day. And the eighth and final one is causal inference. Sometimes we don't have enough data, for example, to set up or design an A/B test, but we still need to understand the root cause behind something. At that point we'll probably turn to causal inference, which is more sophisticated but also really powerful in this industry.
Great. Now that everyone understands what a data scientist does, it will be easier to understand our value. As I said, after we extract the data, we give the insights to all these stakeholders. So we empower management with more information and a way to make smarter decisions. By giving them the analytical framework, we leverage data to help them break down their business problems. With the interactive dashboard, it is much easier for our cross-functional partners to monitor their metrics and take action if anything looks unusual. Through A/B tests and their results, we can scientifically test multiple ideas and make the best decision out of them. And through data research, we can apply the latest technology and methods in the industry to make sure we always keep up to date and always use the most efficient method.
By leveraging machine learning, we can empower the business with the latest AI technology to be efficient as well. And, as I said, we build these data pipeline applications because we have tons of data, so this is a way for us to efficiently process millions of records. We also translate business problems into clear mathematical and analytical models, not only through machine learning models but also through causal inference settings. So this is pretty much what we do and what our value is, to give everyone an idea of who a data scientist is. Great. Now I'm going to hand over to Isha to help us understand how to become one.
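The "small chunk of data per use case" idea from the data pipeline discussion can be sketched in a few lines. This is only an illustrative sketch, not the speakers' actual pipeline: the event fields, the sample rows, and the function name `build_daily_report_summary` are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical raw event log: one row per user action (illustrative data only).
raw_events = [
    {"date": "2024-01-01", "user_id": 1, "country": "US", "action": "post"},
    {"date": "2024-01-01", "user_id": 2, "country": "US", "action": "report"},
    {"date": "2024-01-01", "user_id": 3, "country": "JP", "action": "report"},
    {"date": "2024-01-02", "user_id": 1, "country": "US", "action": "report"},
]


def build_daily_report_summary(events):
    """Build a small per-day, per-country count of report events.

    Extract: scan the raw events once.
    Transform: keep only the rows this use case needs and aggregate them.
    Load: return the compact summary that downstream users would query.
    """
    summary = defaultdict(int)
    for event in events:
        if event["action"] == "report":
            summary[(event["date"], event["country"])] += 1
    return dict(summary)


daily_reports = build_daily_report_summary(raw_events)
print(daily_reports)
```

A dashboard or analyst would then read `daily_reports` instead of rescanning every raw event, which is the efficiency gain described in the talk.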
Thank you so much, Claire. Let me walk you through the different data roles that we have in the industry. When you apply for a job, you'll see there are different data roles. Can we go to the next slide, please? Next one. Yes, thank you. As I said, each role has different responsibilities, so it's very important to know the difference between the roles, and depending on your skill set and experience, you can apply for the right job. We have data analyst, data scientist, data engineer, and ML engineer, so there are four different categories of data roles. Starting with the data analyst, I've just highlighted some of the key responsibilities here.
Obviously, there's a lot more to it. But looking at the first one, data analysts are mostly responsible for gathering and cleaning data, providing valuable insights, focusing on the analysis portion, and sharing it with the stakeholders. Next, we have data scientists. As Claire already mentioned, what we do covers most of the functions she described. These are all complex data problems, and the expertise is in mathematics, statistics, machine learning, and data analysis. The third one is data engineering. This is a little different. Data engineers build, test, and maintain data pipelines and provide quality data for our machine learning models, so the skill set required is a little different from data analyst and data scientist. The last one is the machine learning engineer. This is more focused on building and deploying ML models into production. If you look at the skill sets, a data analyst requires proficiency in SQL, building dashboards, and storytelling. Data science requires all that plus mathematics and statistics, A/B experiment design, metrics monitoring, and some scripting languages like Python or SQL. For data engineering, the focus is more on ETL development.
So it's more focused on scripting skills and, obviously, proficiency in SQL. The last one is the machine learning engineer. For this, you need a solid programming background. It is mostly equivalent to a software engineer, but focused on machine learning models. So let's go to the next slide, please. Now we know the different roles. In this session, we'll be focusing on the data scientist track and the skills required to be a data scientist. Next slide, please. OK. When you interview for a data scientist position, there are a lot of interview rounds. For example, there will be four on-site rounds. The first would be the product round, which we also call the analytical framework. The second would be stats, more focused on statistics. The third would be the technical round, SQL or Python. And then you'll also have a behavioral round. On top of that, some companies have a machine learning round as well. So we'll also focus on some of the resources and things you can use to crack this interview.
So let's start with the first one, the product round, or what we call product metrics. This is one of the typical interview rounds in data science. With product interview questions, the company wants to understand how you would work on the whole life cycle of a product. In practice, you'll work with a product manager, engineers, and other data scientists, and you'll work together on building a feature and understanding how to measure the performance of the product. We divided this into four categories here.
These are the general categories the questions come from. The first is investigating metrics. The second is measuring the success of the product. The third is feature changes, and the fourth is the metrics trade-off. Starting with the first one, investigating metrics, the question you'll be asked would be: why is feature X dropping by Y percent? To answer that, it is first important to clarify the question and gather context. They want to know why there is a drop, so we want to understand the cause. Define the high-level reasons, define hypotheses, and then for each hypothesis, explain your theory and how you would fix the problem. So there is a framework, and you follow the framework to answer each question. The framework is different depending on the question asked. Going to the second one, measuring success, a typical question would be: how do you measure the success of, let's say, TikTok LIVE, or how do you measure the success of Facebook Marketplace? It's a very broad question, but it's mostly focused on the success of the feature. If we launch this feature, how will the company know whether it was successful or not? So identify clear goals and define clear, correct metrics, which is very important.
What would be the core metrics to measure the success of the product? Understand how users are behaving when using the product. One way to do that is to test it, so we run an A/B test and measure the impact of the new feature. Understand how the framework works and apply it to the question. Similarly, the third one is feature changes. This is more focused on a single feature: we want to add, update, or improve a feature, so what metrics would you use to track that? Again, it's a similar process: follow the framework for those types of questions.
The last one is a little different: the metrics trade-off. This is generally asked when we want to decide between two features. We have feature X and feature Y; which one should we launch first? For that, understanding each feature is very important. We want to frame the trade-off. What are some of the unifying metrics between the features? What is the impact of launching each feature on the company's high-level goal? The higher the impact, the better the feature. Then understand the trade-off, run an A/B test, and see how your feature is doing. That's one way to go through the metrics trade-off question.
Great, let's go to the next part. Here is the prep material. I personally like StellarPeers. They have a lot of examples of the product-side questions we just talked about, and they also show how you can walk through the framework. There are other materials attached too, so feel free to go through them. They will really help you in the interview process. Next slide, please.
Great. For stats and probability, we have attached some popular concepts here, like mean, median, variance, and standard deviation. Very popular interview questions would be: what is a p-value, a confidence interval, a confidence level?
The last part is mostly focused on the A/B test. If you get a question on A/B testing, it's very important to know hypothesis testing, the z-test, the t-test, sample size estimation, ANOVA, all the different tests you can do. Let's move on. Next, the prep material is attached. Feel free to click on the links and go through the material attached here. We can move on to the next one. Great. So, A/B experiments. This is another important interview round that is generally asked. It is a common method we use whenever we want to launch a new feature. The main idea behind an A/B test is to split the users into different groups: a control group and a treatment group. The control group gets the existing feature, and the treatment group gets the new feature that we want to launch or are planning to launch. The idea behind splitting is to understand how users behave in the different groups, and then to evaluate the metrics and test whether the difference between the treatment and control groups is statistically significant. The first step is always to start with the hypothesis.
Understand the problem, and define hypotheses that are measurable and valuable. The null hypothesis is always that there is no difference between the treatment and the control group. After formulating the hypothesis, we collect the data. That's the main part. Here you decide: what is the sample size? How long do we want to run the experiment? What are the statistical power and the alpha value? A common value for statistical power is 80 percent, so 0.8, and for alpha it is 5 percent, so 0.05. These are the common values that we use in the industry, but depending on the use case, you can adjust them. If you want enough statistical power, you need to make sure you have a large enough sample size, so that's very important, along with how long you want to run the experiment. Another thing is, obviously, to start with the success metrics: what are your core metrics, and what are the guardrail metrics? Once we have that, we launch the experiment, analyze the results, and decide whether to launch or not depending on the core metrics: did we see a statistically significant difference in the metrics we are looking at? Great. We can move on in the interest of time. Yeah, can you? Yes.
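The sample size and significance steps described above can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the speakers' internal tooling: it assumes the core metric is a conversion rate, uses the standard normal-approximation formula for two proportions, and the baseline rate, expected lift, and counts are invented for illustration. The defaults `alpha=0.05` and `power=0.8` are the common values mentioned in the talk.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed in each group to detect the given difference
    between two conversion rates with a two-sided test."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_power = normal.inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided z-test of the null hypothesis that both groups convert
    at the same rate. Returns (z statistic, p-value)."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z_stat = (p_treatment - p_control) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


# Hypothetical plan: 10% baseline conversion, hoping to detect a lift to 12%.
needed = sample_size_per_group(0.10, 0.12)
print("users needed per group:", needed)

# Hypothetical results: 400/4000 conversions in control, 480/4000 in treatment.
z_stat, p_value = two_proportion_z_test(400, 4000, 480, 4000)
print("z =", round(z_stat, 2), "p-value =", round(p_value, 4))
print("reject null at alpha=0.05:", p_value < 0.05)
```

Lowering alpha or raising power increases the required sample size, which is why the talk ties statistical power to sample size and experiment duration.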
For the A/B experiment design prep material, there are some courses you can take, and there are some good industry examples that would be really helpful for learning how A/B experiments work. Great. Thank you. Yes, for the machine learning part, there are some common topics. It's very important to know linear regression, logistic regression, decision trees, and random forests, so all the machine learning models and how they work. For advanced topics, you can look at imbalanced classification, dimensionality reduction, and feature engineering. Another thing that's very important and very commonly asked in interviews is underfitting and overfitting, and evaluation metrics: what is precision, what is recall, F1 score, accuracy? Let's move on to the next slide, please. Yes. So there's the prep material.
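The evaluation metrics named above can be computed by hand from a confusion matrix, which is a common interview exercise. The sketch below is illustrative only: the labels and predictions are made-up data for a binary classifier where 1 is the rare positive class, and the function name is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Precision: of everything flagged positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much was caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical imbalanced data: 3 positives out of 10 examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data, accuracy comes out higher than precision and recall, which is one reason accuracy alone can be misleading for the imbalanced classification problems mentioned in the talk.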
StatQuest is another great YouTube channel, by Josh Starmer. It has all good material, not just on machine learning but on statistics too. Next slide, please. Yes. Last, we have SQL and dashboards. SQL is another very common interview round. For that, LeetCode is a great way to practice, and there's prep material with more analytics and SQL exercises you can do. The advanced topics would be window functions, unions, joins, and the other advanced functions that we use. Make sure you learn those and then practice on LeetCode. For dashboards, we can use Tableau, which is pretty common in the industry. Many companies use Tableau for building dashboards, so it's good to know the basics of how to use it. Great. We have about two minutes, so if anyone has any questions, we are happy to answer them. Yes, we are going to send the slides. I think the process is that we have to upload the slides and then we can share the link. Feel free to type in the chat too if you have any questions.
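As a small illustration of the window functions mentioned as an advanced SQL topic, the sketch below runs a query through Python's built-in `sqlite3` module. The `daily_views` table and its rows are hypothetical, and the example assumes the bundled SQLite is version 3.25 or newer, which is when window function support was added.

```python
import sqlite3

# In-memory database with a hypothetical table of daily views per creator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_views (creator TEXT, day TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO daily_views VALUES (?, ?, ?)",
    [
        ("a", "2024-01-01", 100),
        ("a", "2024-01-02", 150),
        ("a", "2024-01-03", 120),
        ("b", "2024-01-01", 80),
        ("b", "2024-01-02", 60),
    ],
)

# Window functions compute a value per row over a related set of rows
# (the partition) without collapsing the rows the way GROUP BY does.
query = """
SELECT creator,
       day,
       views,
       SUM(views) OVER (PARTITION BY creator ORDER BY day) AS running_views,
       RANK() OVER (PARTITION BY creator ORDER BY views DESC) AS view_rank
FROM daily_views
ORDER BY creator, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Each output row keeps its original detail and gains a running total and a per-creator rank, which is the typical shape of window function interview questions.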
Actually, I was thinking, can we just share the Google Doc link with them, or do we have to go through the system?
Yeah, we can share it here. But I'm sure there is a system to share that. Yeah, sure.
Yeah. Actually, I have a question: what is the most difficult part of your role? Hmm. Isha, do you want to take it?
Yeah, sure. For me, the difficult part is when we are working with product managers. Let's say we have a new feature, we have launched it, and we have to make our case based on the data. Sometimes things don't go well based on the metrics, but there's pressure to launch the feature, and it's very hard for us to convince them that the data doesn't support it. So one of the things that generally happens with me is having to convince them that we have to follow the data and not just follow our instincts, as Claire said. That's very hard, but that's the process we have to follow. So that's one of the things that comes to my mind.
Yeah. For me, as I said, I'm working on bullying and harassment and trying to protect everyone who uses our platform from it. So a difficult part of my role is that sometimes I have to face this kind of firsthand information, these kinds of escalations, where I see people being bullied. This puts a lot of pressure on me, but I try to turn this pressure into motivation: how can I better protect our users and make sure they stay far away from being harmed on our platform? And I see Alisa has a question: any suggestions for a PM without a technical background working with a development team? Hmm.
Yeah, go ahead. For a PM without a technical background, if they at least have a good sense of the product, that matters: it's very important to know what the product is and what the goal of the product is. On the technical side, I think as data scientists we can help the product manager have that mindset of using our data, and help them as much as possible on the technical side. But stick to the goal here, stick to what your goal and impact are. I have seen that most product managers are very good; they have good product sense. It's very important to have that good collaboration with the product manager.
Yeah. To add on to Isha, I think PM is actually also a very vague title. By PM here, do you mean project manager or program manager? What specific title and responsibilities do you fall into, and for those responsibilities, is a technical background that important? Also, if you work in a different industry, probably away from the tech industry, would that still really be a prerequisite? I probably don't think so. Yeah, project manager. For project managers too, different companies will have different responsibilities. I do know some project managers who work more on helping projects deliver on track and managing all the different stakeholders. In that kind of case, a technical background is not that important for them. I also see other, more technical PMs who really have to have a very strong technical background. So it all depends on the responsibilities. And if you don't have that much confidence in the beginning, start with a role that has a lighter technical foundation, then gradually move on and pick up more of the work until you're comfortable with more technical work in the overall work environment.
Yeah, so that's all from me, I think. Hope this helps. If you have any other questions or anything you would like to discuss, or even want to know more about me, about Isha, about TikTok, or about any other companies we have worked at before, feel free to contact us via LinkedIn. We are happy to connect with everyone and look forward to talking with everyone. I think that's pretty much it for today's session. Thanks, everyone, for attending.
Thank you, everyone and we'll share the presentation link. Ok, thanks.
Bye. Great bye.