Bushra Anjum The Evolution of Data Science and Advanced Analytics since Feb 2020
Understanding Data Science, Machine Learning and Artificial Intelligence
The world today has experienced rapid changes, especially in the field of data science. There have been several alterations since February 2020, which coincides with the spread of COVID-19. This blog aims to shed light on the evolution of data science amidst the pandemic.
One common question, particularly from people who are new to the field, is how to differentiate between data science, machine learning, and artificial intelligence. These terms often overlap, which makes them hard to tell apart. One efficient way to differentiate between them is to consider the end goal of each area.
Data Science
Data science aims to provide value to a business or organization, which could be monetary or goodwill – any value that a business cares about.
Machine Learning
Machine learning, on the other hand, is about learning to predict. Its end goal is to enable better predictions and inferences based on data.
Artificial Intelligence
The end goal of artificial intelligence is to provide humanlike thinking to digital agents. AI strives to create systems that can function independently and mimic human intelligence.
These three areas overlap since they all aim to provide value, make better predictions, or impart humanlike intelligence. However, the end goal of a project determines its categorization.
How Technology is Being Leveraged during the Pandemic
One example of leveraging technology during the pandemic is through companies like Doximity. Doximity, a start-up founded in 2011, has created a closed, secure medical network where clinicians can exchange patient information, make case referrals, and discuss specialty and subspecialty related topics. The platform is HIPAA compliant, making it a safe and accessible platform for clinicians.
Doximity also launched a new feature in response to the COVID-19 pandemic: video calling in its Dialer app. Dialer already let doctors call patients from their own phones while showing the medical facility's caller ID, which protects the doctor's privacy. Video was added swiftly in response to the growing need for telehealth services, and it lets patients consult their doctors without installations, sign-ins, or registrations.
Data Science and the Fight against COVID-19
Data science has proven to be instrumental in various aspects related to understanding and combating COVID-19.
- Prediction of COVID-19 Spread: Data science has helped us understand and predict the spread of COVID-19 more comprehensively. This includes determining the number of cases and effectively managing resources.
- Development of Effective Treatments: Computational biology and gene sequencing play a significant role in the quest to develop effective treatments. An immense volume of data has been made freely available for this purpose.
- Planning Resumption of Economic and Social Activities: Data science is also assisting in planning the resumption of economic and social activities. This involves figuring out how to implement effective social distancing practices and how to manage the behavioral aftereffects of the pandemic.
Amidst this pandemic, there has been an unfortunate increase in hate speech and misinformation. As such, data science is playing a critical role in understanding and combating this menace. One notable project addressing this issue is the "Waves of Hoaxes" initiative by the International Fact-Checking Network (IFCN). This initiative involves housing a single database comprising fact-checks related to COVID-19, thus providing a reliable platform for debunking misinformation.
Beginning your Data Journey
Finding credible data sources for research or project purposes can be daunting. Several downloadable data sets are available for coronavirus-specific research from Johns Hopkins University, The New York Times, and the European Centre for Disease Prevention and Control.
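As a first step with a downloaded case file, a minimal Python sketch of turning cumulative counts into daily new cases may help. Case files of this kind often report running totals, but the column names (`date`, `state`, `cases`) and the sample rows below are illustrative assumptions, not values taken from any of the sources named above; check the header of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in a "date,state,cases" layout with cumulative counts.
# Real files may use different column names; adjust the keys below to match.
SAMPLE_CSV = """date,state,cases
2020-03-01,A,2
2020-03-02,A,5
2020-03-03,A,9
2020-03-01,B,1
2020-03-02,B,1
2020-03-03,B,4
"""


def daily_new_cases(csv_text):
    """Return {state: [(date, new_cases), ...]} from cumulative counts."""
    cumulative = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cumulative[row["state"]].append((row["date"], int(row["cases"])))

    result = {}
    for state, series in cumulative.items():
        series.sort()  # ISO dates sort chronologically as plain strings
        previous = 0
        daily = []
        for date, total in series:
            # Clamp at zero: cumulative series are sometimes revised downward.
            daily.append((date, max(total - previous, 0)))
            previous = total
        result[state] = daily
    return result


new_cases = daily_new_cases(SAMPLE_CSV)
```

To use it on a real download, read the file's text and pass it to `daily_new_cases`, after renaming the keys to match that file's header.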
The key here is open science and improved access to relevant information, which will undoubtedly prove crucial in tackling the ongoing pandemic. The future of data science seems bright and promising, with endless opportunities to leverage and explore.
Video Transcription
All right, it's almost 2:50. Welcome, Antonia. Welcome, Giuliano Toronto. Welcome, welcome. All right. Let's be respectful of everyone's time; I know we have a very short duration. So let's dive right into the topic: how data science has evolved since February 2020, which, as you can see, aligned with the spread of COVID-19 and its classification as a pandemic. Since we have only 20 minutes here, we might not have a lot of time for Q&A, so you are very welcome to reach out to me after this talk via LinkedIn, via Twitter, or through my website. I'll share the contact details with you shortly. Before we get into the thick of things, something that comes up quite often, especially for people who are new to the field or are looking at it from the outside, is this: what really is the difference between data science, machine learning, and artificial intelligence? Is there even a difference? Is it possible to differentiate between them? There are a few ways to look at this, but one way that I would invite you to consider is to look at these terms or areas beyond the tools and technologies that they use.
So, for example, a clustering algorithm may be part of data science, it may be part of machine learning, and it may be part of artificial intelligence. Heck, it may even be part of several of them at the same time. Tools and technologies evolve. I would like you to consider the end goal of each of these areas; it's easier to differentiate between them that way. What do I mean by the end goals? Data science is about providing value to a business or an organization. That value could be monetary, that value could be goodwill; whatever value a business cares about, data science aims at providing that value to the business. Second is machine learning. Machine learning is about learning to predict. That's its end goal: you have a bunch of data, and you want to learn how to make better predictions and better inferences based on that data. And artificial intelligence: its end goal is to provide humanlike thinking to digital agents. Now, these three terms, as you can imagine, coincide with each other quite interestingly. For example, in order to provide value to a company, you might have to have better prediction models. Or in order to work on good prediction models, you may need to know which ones will actually add more value to the company. Or while you are working on providing humanlike intelligence to a digital agent, you may want to add some prediction models to it. So they overlap, but it's really the end goal, what you are trying to achieve with the project, that determines how your project should be categorized: are you trying to achieve better prediction, are you trying to impart humanlike intelligence, or are you working to provide value to an organization?
So with this clarity in mind, let's move forward. What do I do?
I'll keep it very brief because our time is very limited. I have a PhD in computer science. My area was performance evaluation and queuing theory, which naturally lent itself to data science when the field became more prevalent. I worked for Amazon as an SDE II on their Prime program for about four years and then joined the start-up Doximity. About two years ago, I joined as their senior data analyst, and then about a year ago I got promoted to lead their data science department as the data analytics manager. I do quite a bit of volunteer work, as much as my time permits. I'm the ACM Women standing committee's chair, I also serve as a senior editor for their magazine, and for AnitaB.org I lead a special project called the Technical Leaders Monthly Call. Talking specifically about the company that I work at right now: Doximity is a start-up that was founded in about 2011, and its main focus was to provide a medical professional network for doctors. I would say clinicians, because later we expanded our definition to include doctors, nurses, and fourth-year medical students. So it is basically a closed, secure medical network, or social network if you will, where doctors and nurses can exchange patient information if they want to, can make case referrals, and can discuss many specialty- and subspecialty-related topics. As you may know, because of HIPAA rules, doctors and nurses cannot just have that information or that chat over an email or a chat message.
This platform, because it's closed and because of other privacy-related enhancements, is HIPAA compliant. That gives clinicians a networking opportunity. We have over 70% of U.S. doctors as members. It is U.S.-specific for now. Because we have over 70% of doctors, it allows us in the data department to do some really interesting things; for example, it allows us to conduct national-level research surveys, which we have conducted, and you're welcome to search for them online. We are also partnering with U.S. News and work with them on their annual U.S. hospital rankings. One project that I would like to spend a few seconds on is the Dialer video call. That project was not on our road map at the beginning of this year, but since March we have prioritized it. At Doximity, we already have an app called Dialer. The opportunity it provides is that a doctor can call his or her patient from his or her own phone but mask the caller ID. So if you want to check in on your patient after, say, a surgery and want to know how they are doing, you don't have to stay late at the medical facility to make that call.
You can make that call from the comfort of your home or your car, and the patient will still see the caller ID of the medical facility. So when COVID-19 happened, we immediately sprang into action to add video capabilities to Dialer. The beta was launched in April, and the product was launched in May. Our main focus was that it should be as easy as possible for patients to have a video consult with the doctor: no installations, no sign-ins, no registrations. And that's what we achieved. Whenever a doctor wants to have a video call with a patient, the patient receives a text message on their phone, and they just need to click the link in that text message. It automatically opens up a video connection with the doctor. So we took all the complexity away from patients. It is HIPAA compliant: the calls are never recorded, and the calls are encrypted. In our beta, we had over 1 million video calls made just in the month of April from about 100,000 doctors across the country. Now it has been launched, and it has made it onto the top 10 medical apps list. It's a product that I'm really proud of being a part of. As for the data department going forward, we do have open positions available, both in our data department and in other departments too.
If you are interested, have a look at workat.doximity.com/positions. So, having talked about what we have been doing for COVID-19, as far as data science and data value are concerned, what are some of the other areas that have really gained traction? Well, one is, as you might imagine, understanding and predicting the spread of COVID-19 better. It requires a lot of ensemble models to make better predictions: the number of new cases, how effectively we are managing scarce resources, where hospital beds and ventilators are most needed, et cetera.
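To make the idea of ensembling concrete, here is a minimal Python sketch that combines the day-by-day forecasts of several models by taking their mean or median. It is an illustration under stated assumptions rather than any method described in the talk: the model names and the forecast numbers are made up, and real ensembles often weight models by past accuracy instead of treating them equally.

```python
from statistics import mean, median


def ensemble_forecast(forecasts, method="mean"):
    """Combine per-model forecasts (lists of equal length) day by day."""
    lengths = {len(series) for series in forecasts.values()}
    if len(lengths) != 1:
        raise ValueError("all models must forecast the same number of days")
    combine = mean if method == "mean" else median
    # zip(*...) groups the predictions of every model for the same day.
    return [combine(day) for day in zip(*forecasts.values())]


# Hypothetical new-case forecasts from three made-up models for three days.
forecasts = {
    "model_a": [100, 120, 150],
    "model_b": [110, 130, 140],
    "model_c": [90, 170, 160],
}
combined_mean = ensemble_forecast(forecasts)
combined_median = ensemble_forecast(forecasts, method="median")
```

The median is less sensitive than the mean to a single model that overshoots, which is visible on the second day of the sample data.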
Then, of course, there is the whole area of finding effective treatments. It might not be immediately obvious how this is a data science problem, but in order to design or devise the optimal drug, you have to understand the biology first. And when you talk about biology, you talk about genes, you talk about gene sequences, you talk about antibodies. So the whole discipline of computational biology and gene sequencing plays a major role in this area. A lot of data sets have been made freely available too, and I'll get to that in the next slide. The third area is: how can we resume economic and other social activities? How can we have good social distancing practices going forward? There are a number of psychological and behavioral aftereffects that we will be dealing with for a very long time to come; how do we take care of them? Then there are the age-old classic SIR and SEIR epidemiological models. You may have heard of them: SIR is the susceptible, infected, recovered model. Advanced modeling techniques based on those are being used to figure out the best way to open up societies. The spatial distribution of COVID-19 kind of links to the previous point, but it also helps detect which societies are the most vulnerable, both to the primary impact of the disease and to the aftereffects, which may even cause a humanitarian crisis.
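For readers who have not met the susceptible, infected, recovered model mentioned above, the following Python sketch integrates its three equations with a simple forward-Euler step. The parameter values (`beta`, `gamma`, the population size, and the number of initial infections) are illustrative assumptions and are not fitted to any real outbreak; the function name is likewise made up for this example.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the SIR equations with a forward-Euler scheme.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]  # one (S, I, R) tuple per day, starting at day 0
    for _day in range(days):
        for _step in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative parameters only (beta / gamma = 3 plays the role of R0 here).
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=160)
# Day on which the number of currently infected people is largest.
peak_day = max(range(len(history)), key=lambda day: history[day][1])
```

An SEIR variant adds an "exposed" compartment between S and I; the structure of the loop stays the same with one more state variable and one more rate.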
How can we help local communities be more informed about the resources that are available to them, like free food or shelter, maybe? There are a couple of very interesting projects going on. You might have heard the news that Apple and Google recently came together and made their exposure notification software available to government organizations, so that they can create mobile apps based on it to do contact tracing. Similarly, there are other projects.
One is going on at Duke University, I believe: research on digital biomarkers, where your mobile phone or your smart device can keep track of your heart rate, sleep patterns, or exercise, and can then make better predictions that, OK, you are more susceptible to getting COVID-19 or any other infection.
How do we keep track of those? The final one is as interesting as it is unfortunate: how hate speech and misinformation have risen as a result of this pandemic. It's extremely unfortunate, but the early results of many research studies indicate that there has been a meaningful increase in xenophobic language, hate words, and paranoia around that. And it's not just prevalent on fringe web communities; quite a bit of the time it's part of our mainstream web communities too, like Twitter. So how are they spreading? What works effectively against them? The same goes for misinformation. And talking about misinformation, this is one project I wanted to bring to your attention, which I refer to quite a bit. It's called Waves of Hoaxes. In early January, I believe, 85 or so news organizations from around the world came together and decided to have a single database into which they would pour all their fact-checking related to COVID-19. This database is currently maintained by the International Fact-Checking Network (IFCN), which is part of the Poynter Institute.
If you go to their website, they have these wonderful visualizations where all the misinformation is sorted into different categories. The headlines and the comments that have been made are part of that data, and the data is also freely available for any further research you may want to do on it.
That's an excellent resource covering over 70 countries and at least 40 languages, a resource for you either to do further research on or just to check once in a while if you want to know the validity of a claim. Talking about data, what are some of the other good, freely available data sources that you can perhaps start playing with today?
Two of them have been referred to quite a bit in a lot of different projects. One is by Johns Hopkins University, their Center for Systems Science and Engineering, and the other one is by The New York Times. Both of these data sources are available on GitHub, and I have posted the names of their GitHub repositories here. The New York Times one is aggregate data from state and local governments and from various health organizations and departments across the United States.
And if you want to look at similar data outside of the United States, another wonderful resource is the European Centre for Disease Prevention and Control. If you go to their website, ecdc.europa.eu, and go to the section called Publications and Data, tons of good data are freely available for you to download and play with. These three data sets are more focused on the spread of the disease itself: the number of infected, the number of recovered, the areas most hit by it. Then there are a few other very interesting data sets that can lead to solving interesting problems. One is the ACAPS COVID, excuse me, it should be COVID-19, forgive my title here, Government Measures data set. This is basically a repository of the initial measures adopted by governments worldwide as they were responding to COVID-19. So what were the measures taken? What were the social protocols that were in place? Which ones are still in place, and which ones have been lifted? So it allows a chronological analysis of the introduction and phase-out of pandemic measures, if you will. Another interesting data source is the INFORM COVID-19 Risk Index. This risk index supports both global and regional risk-informed resource allocation.
So, for example, it can be used to support any prioritization or preparedness effort to meet the primary needs of the disease in an area, or it can identify the countries where, as I was mentioning earlier, the secondary impacts are likely to cause havoc and a humanitarian crisis. So which areas, which localities, which countries are most at risk, both as we go through this pandemic and after it? An excellent data set. Then the last two, NCBI SARS-CoV-2 and Nextstrain: these are massive dumps from sequencing projects, human genome sequencing projects. So there is data on genomics there, there's big data on genetics. These are used primarily by many research companies, like Cerner and others, in order to design effective drugs, and they're open to the public if they want to play with them and find additional insights. Having said that, there are tons of data available; having free data available is really not an issue. Rather, we run into another issue: there are so many data resources that it becomes difficult to figure out where to start. And for that, I would like to bring two meta projects to your notice. One is the COVID-19 open source project recommender.
GitHub has over 30,000 open source repos that are working on different projects related to COVID-19, and this GitHub project is built on top of those other projects. Based on your experience, keywords, and the language of your choice, it gives you recommendations of which projects you might be interested in working on. It's a meta project, a project based on projects, to provide you with some guidance. Another really interesting project is TM COVID.
It's more for academicians and the research community. As far as the scientific literature is concerned, there are 10 to 20 new publications on COVID-19 every day. So how does one keep track of all this, if that's what they want to do? TM COVID is a project that scrapes PubMed Central daily, which is the repository for full-text scientific articles. It then does some processing on top of it, based on machine learning and natural language processing, presents a summarized table with valid tags on top of the articles, and puts the articles in certain categories. So, for example, it could identify or tag articles based on gene names, chemicals, or any other diseases or disorders that might be co-mentioned with COVID-19. So if someone wants to dive deeper into the scientific, academic literature related to COVID-19, this is a great starting point, and from there you can dig as deep as you want. Public data set programs: I talked about a few data sets that were available. And I see there is a question in the chat: will these slides and presentation be available? They will be, and you are most welcome to reach out to me directly.
If you look, my website is mentioned on the slide, bushraanjum.info; there's a contact page there that goes directly to my email. If you send me a message there, I'll personally send you a copy of these slides. So, coming back: data sets are available, but what exactly do you want to do with them, and how do you start today? Well, one way could be, of course, that you go to the data repositories, you download the files in CSV or JSON format or any other format, you load them up into your favorite database, and you start querying or analyzing them. But can it be done faster? Can you get your hands dirty in the data right there and then, without doing any downloads or installations? One resource I want to call out here is the Google BigQuery open data set initiative. Microsoft, Amazon, and Google have all stepped up, alongside other companies, to provide more literature around the topic to help researchers and the tech community. Azure has made available some scientific literature, about 40,000 or so scientific articles, free of charge.
Amazon Web Services has some similar scientific literature and a few other data repositories. I believe they also have the New York Times repository that I mentioned earlier, which is freely available on S3. There is no charge to access it, but you do need an AWS account. And if you use any additional services to analyze that data, like the Amazon Athena service, that will cost you money, but accessing the data will not. Google BigQuery, at least today, tops both of these in its initiative. It has multiple data sets available. It has the Johns Hopkins University data set that I mentioned earlier, and some global health data from the World Bank and OpenStreetMap. The data storage is free, and the data querying provided by BigQuery is also free. So not only do they host the data for free, they are also providing you with the capabilities to analyze it. You don't have to download anything. You don't have to make any account. You just have to set up a BigQuery sandbox, which you can search for online. It hardly takes 10 minutes. And using the web platform, you can start analyzing the data today using all the power of BigQuery.
This analysis capability is available free of charge till September 15th. It may get extended, but at least for now, for the next three months, it's available to you: loaded, curated data sets, available for you to analyze. Having said all that, we're almost at the top of the hour. Again, here's my contact information for any questions: my website and contact page, Twitter, LinkedIn. One final thought: what do we really need at this time, as far as data science is concerned? I will leave you with two thoughts. One is that ensembling is required. We do have a variety of data, but we don't have enough data, because there has only been a single season of COVID-19 and it's been less than a year. So we need to combine various parametric models to come up with good insights. And two, we really need open science and improved access to relevant information, where multiple people from multiple parts of the world and multiple disciplines can come together and help with drug discovery, diagnostics, screening, and patient care. It's a great opportunity for new data scientists. The truth and the data are out there. Please go get them. Thank you so much for your time, and I hope to connect with many of you in the virtual world. Thank you.