Observability In Data Ingestion by Nancy Chauhan
Understanding Observability in Data Ingestion with Nancy Chauhan
Welcome to a virtual session exploring the concept of observability in data ingestion, guided by developer and tech writer Nancy Chauhan. Delving into how data ingestion works and why it is crucial for businesses, this session promises valuable insights. It will help the audience understand the role observability plays in improving the data ingestion process, data pipeline operations, and overall business analytics operations.
Components of the Session
- Purpose: Learn why data is an essential part of all organizations, and how data ingestion, transformation, and storage are crucial for business analytics operations.
- Defining Observability: Understand observability’s role in monitoring, finding, and rectifying errors, resulting in efficient storage and organization of data.
- Speaker: Get to know Nancy Chauhan, a developer and tech writer who contributes her knowledge and skills to Gitpod, an open-source remote development software tool.
Understanding Data Ingestion
Data ingestion, at its core, is the process of bringing together data from a variety of sources in the required format and quality. Nancy explains this process with a well-illustrated representation, stressing the concept of a 'single customer view', which aggregates user profiles into one overview: a holistic representation of customer data that businesses can leverage to interact better with customers and tailor marketing strategies.
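To make the single customer view concrete, here is a minimal Python sketch (not from the session) that merges per-source profile fragments into one record; the sources and field names are illustrative.

```python
# Minimal sketch: merge per-source profile fragments into a single
# customer view. All sources and field names here are illustrative.
def build_single_customer_view(fragments):
    """Merge profile fragments (dicts) that share a customer ID."""
    view = {}
    for fragment in fragments:
        for key, value in fragment.items():
            # Last non-empty value wins; real CDPs apply richer merge rules.
            if value not in (None, ""):
                view[key] = value
    return view

website = {"customer_id": "c-42", "email": "a@example.com", "last_page": "/pricing"}
mobile_app = {"customer_id": "c-42", "device": "android", "email": ""}
crm = {"customer_id": "c-42", "name": "Asha", "plan": "pro"}

print(build_single_customer_view([website, mobile_app, crm]))
```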
Diving into Observability
Nancy explains how observability helps answer critical system questions, thus simplifying problem-solving. Focusing on its importance in managing distributed, complex systems with numerous microservices, Nancy emphasizes how observability aids in questioning and understanding system behavior. The practice of utilizing observability in software development increases transparency and provides numerous benefits like enhanced visibility, effective alerting, streamlined workflows, reduced time in meetings, and accelerated developer velocity.
Role of Observability in Data Ingestion
Nancy emphasizes that a proper process to track failures and successes is important for business analytics operations. With data streaming at high volumes through data pipelines, maintaining a high error-free rate is challenging. Irregularities can lead to incorrect data processing, wrong business decisions and significant losses. Hence, having an observability process helps track failures, maintain transparency, and reduce potential losses.
Process to Achieve Observability
By inspecting critical parameters like data freshness, volume, lineage, distribution, and schema, businesses can develop a custom solution for observability. Nancy proposed several metrics for assessment, including the volume of data processed per pipeline run, the total number of records processed, the error rate, and a compilation of error messages. These metrics can be displayed in a unified dashboard for easy monitoring and efficient operation.
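The metrics above lend themselves to a small sketch. The following Python is illustrative, not code from the session; the record and error inputs are placeholders for whatever a real pipeline run produces.

```python
from collections import Counter

# Illustrative sketch: compute the per-run metrics listed above.
def summarize_run(records, errors):
    """records: successfully processed records; errors: error message strings."""
    total = len(records) + len(errors)
    return {
        "records_processed": len(records),
        "error_count": len(errors),
        "error_rate": len(errors) / total if total else 0.0,
        # Compile error messages so a dashboard can show the top offenders.
        "top_errors": Counter(errors).most_common(3),
    }

metrics = summarize_run(records=[{"id": 1}, {"id": 2}], errors=["bad schema", "bad schema"])
print(metrics)  # one record per run, ready for a unified dashboard
```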
Conclusion
Nancy positively wraps up the session, offering her blog posts and demos on GitHub for those who want to explore observability in data ingestion further. She encourages proactive learning and communication, inviting attendees to interact with her on social media with their thoughts and queries.
Video Transcription
Hey everyone. This is Nancy Chauhan. Welcome to my session; it's about observability in data ingestion. I'm so excited to have you here. I'll just start my screen sharing. Awesome. So today we are going to talk about observability in data ingestion. Now, before we start, why should you attend this talk? Well, data has become an essential part of almost all organizations, and almost everyone deals with the ingestion, transformation, and storage of huge volumes of data, which is very critical for all business analytics operations.
So this talk is about data ingestion, and when we have data ingestion, we want some control system by which we can determine the accuracy, efficiency, and consistency of our system, so that we can catch and identify the failures. That's why we need observability. Observability will help you determine the completeness, accuracy, efficiency, and consistency of any data which is uploaded through either batch or streaming sources, and it also reduces the time the tech team invests in investigating errors, because you will always have the errors in front of you. A little bit about me: I am Nancy, a developer and tech writer. I'm currently contributing at Gitpod; Gitpod is an open-source remote development software tool, and remote is the future, which is how we are here, doing the conference remotely. Previously I worked at Grofers and ZO up, and I was developing solutions for software reliability. Last year I did a talk on monitoring, and this year I'm doing a talk on observability. I love open source, I keep contributing, and I really love talking about tech through blogs, so you can check out my blogs at nancychauhan.in. OK, so let's get started. What is data ingestion?
Well, data ingestion is the process of collecting a wide variety of data and delivering it where it needs to be, in the required format and quality. This may be a storage medium or an application for further processing. So basically, we are pulling a lot of data from different sources, then we are going to remap it, and then we are going to store it in the target database. We can say the focus of data ingestion is to get data into a system that requires the data in a particular format for other operational uses, which is later used mainly for business analytics operations. Let's understand this with a very nice illustration, which has been done by me. As you can see, there's a user, and we can collect various information from various sources, which can be websites, the mobile applications you keep surfing on your phone, or CRMs. The data is collected through various methods: for example, we have APIs, we can have event trackers (for example, JavaScript tags and SDKs), or maybe server-to-server integrations, or it can be manual imports as well.
Now, when we collect all this data from the various sources, we are going to remap it, because we want the data in a particular format, and then in the end we are going to store this data in the target database. Now, this can be further used to create a single customer view, which is basically what CDPs (customer data platforms) do: they aggregate your user profiles into a single customer view, which means you can view all the data of one individual customer in one place. This is often a holistic representation of your customer data, and businesses can leverage this to learn how to better interact with customers or tailor marketing messages. All the recommendation systems that you see on Instagram or other social media apps are basically a part of this; it all evolves from this.
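As a rough sketch of that pull, remap, and store flow (the field mapping and the in-memory 'database' are stand-ins, not Nancy's implementation):

```python
# Rough sketch of the pull -> remap -> store flow described above.
# The field mapping and the in-memory "database" are stand-ins.
FIELD_MAP = {"emailAddress": "email", "userName": "name"}  # source -> target schema

def remap(raw_record):
    """Rename source fields into the format the target database expects."""
    return {FIELD_MAP.get(key, key): value for key, value in raw_record.items()}

def ingest(source_records, target_db):
    for raw in source_records:
        target_db.append(remap(raw))  # stand-in for a real database write

target_db = []
ingest([{"emailAddress": "a@example.com", "userName": "Asha"}], target_db)
print(target_db)  # [{'email': 'a@example.com', 'name': 'Asha'}]
```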
Then, moving on: what is observability? My favorite topic, because I've been working on this for two years, and I feel it's super important to understand it and adopt it in all our software. Observability simply answers why X is broken, and when we get the answers to those questions, things become simple. It's that easy: if something is broken and we know why it is broken, or your website is not working, the latency is very bad, or maybe your website is kind of slow, and you know why it is happening, then you actually know how to fix it, and you will fix it immediately. That's what observability really does. It's very simple. Now, simple systems have fewer moving parts, which makes them easier to manage; if you monitor CPU, memory, database, and other networking conditions, that is usually enough to understand these systems and apply the appropriate fix to a problem. But I will say that observability is better suited for the unpredictability of distributed systems, when we have complex systems with a lot of microservices, because it allows you to ask questions about your system's behavior as issues arise. You can simply ask why X is broken, or why it is causing latency right now.
So these are the few questions which observability answers. Observability is a practice which we have to adopt: we will have to modify our software and adopt certain practices to get observability of our system. If we talk about data ingestion, one of the benefits of observability is that you are handling the data of customers or clients, and clients need to know what is happening when the data is moving through the data pipelines, or why it is failing. So there is a kind of transparency which we can ensure by adopting observability in data ingestion, which we will talk about in more detail in the later slides. But yes, observability provides transparency: it tells you about all the failures, so that you can quickly mend them. There are different benefits of observability, which I've talked about in this slide. There's better visibility: it's quite visible where your data is when it is moving through the data pipeline, what's happening, whether it is failing or passing, what the success rate is, and what the failure rate is.
There's better visibility, and better alerting: if something is breaking, you get alerted. For example, if there's a lot of traffic, or a lot of data is suddenly uploaded into your data pipeline, and your infrastructure is not capable enough to hold that much data, then you will quickly get alerted, because now you know the memory is filling up. So, better alerting, so that you can quickly fix it.
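A toy version of that volume alert could look like the following; the threshold and the alert hook are assumptions for illustration.

```python
# Toy volume alert: fire when a pipeline run sees far more records than
# the infrastructure is provisioned for. Threshold and hook are assumptions.
EXPECTED_MAX_RECORDS = 100_000

def check_volume(run_record_count, alert):
    if run_record_count > EXPECTED_MAX_RECORDS:
        alert(f"Volume spike: {run_record_count} records exceeds "
              f"the capacity threshold of {EXPECTED_MAX_RECORDS}")

check_volume(250_000, alert=print)  # swap print for a real pager or chat hook
```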
Then there's a better workflow: observability allows developers to see a request's end-to-end journey, along with relevant contextualized data about the particular issue, which in turn streamlines the whole investigation and debugging process. Next, less time in meetings.
If we have a process, then the time definitely gets reduced. It's quite easy; it applies in normal life as well, it applies everywhere. So observability provides you with a process. Then there's accelerated developer velocity: if you don't have to waste so much time investigating, and you have a single place where you can see what's happening across your whole software journey, then it becomes easy, right? You can focus on building features, and if something is broken, you can quickly fix it. So it's basically accelerating your developer velocity. Then, according to Twitter's blog (and I really love their blogs), there are four pillars of observability: metrics, alerting, traces, and logs. Together they provide deep visibility into distributed systems, allow teams to get to the root cause of a multitude of issues, and eventually improve your system performance. And there is one statement worth repeating: observability of your data helps you determine the efficiency and consistency of any data which is uploaded. Now, this is quite important.
Why is observability important in data ingestion? We could figure out by now why it is important, but let's actually discuss it in detail. There's a large volume of data which will run through your data pipeline, and it's obvious that it cannot be 100% error free; with a large volume of data, that's just not possible. So we need to ensure that incorrect data does not affect your correct data, and that if there is any failure, your data pipeline is able to handle it gracefully: your errors should not affect the whole process. So we should log, so that we can take quick actions. All this is a part of observability, and this shows why it is so important to have this process. Maybe we can understand this with a better example. If we are talking about data, we are talking about business, and if our data ingestion is delayed due to some XYZ problem, maybe your system is not working, your infrastructure is broken, or it could be anything, then it's going to cause problems in the business as well, because your business analytics operations will be waiting for the data ingestion output. Hence, we can see that it's quite important to focus on this. I've already talked about how it can affect your business, eventually affecting your revenue. Also very important: if your data is incorrectly processed, you can take wrong decisions. Let's understand it this way. For example, let's say you're doing the data ingestion for a payment system, your company's payment system, and you have data from different cities or different states. Let's say your data is not correctly processed for one city, and there is some decision-making which involves the data for all the cities, and one of them is corrupted. Then you're definitely going to produce wrong analytics if your data is incorrect for a whole city, and this can eventually lead to a big loss.
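One way to sketch that 'incorrect data should not affect correct data' idea is to quarantine failing records instead of aborting the run. This is a minimal Python sketch, not Nancy's code; the transform and the records are placeholders.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Illustrative sketch: bad records are logged and quarantined instead of
# aborting the run, so incorrect data never blocks the correct data.
def process_batch(records, transform):
    ok, failed = [], []
    for record in records:
        try:
            ok.append(transform(record))
        except Exception as exc:  # quarantine anything that breaks
            log.error("record %r failed: %s", record, exc)
            failed.append((record, str(exc)))
    return ok, failed

good, bad = process_batch(["1", "2", "oops"], transform=int)
print(good, bad)  # [1, 2] and the quarantined record with its error message
```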
So we need to ensure that there is a proper process to track all these failures, and this shows how important it is, for business revenue and for making the right decisions, to have a proper process for observability. These are some newspaper clippings; I just wanted to show that even in Forbes it has been featured that data observability is quite important. Forbes defined it as a set of tools that allows data and analytics teams to track the health of enterprise data systems, identifying, troubleshooting, and fixing problems when things go wrong. In other words, data observability refers to an organization's ability to maintain a constant pulse on their data systems by tracking, monitoring, and troubleshooting incidents, to minimize and eventually prevent data issues and downtime, and basically to improve your overall data quality. Cool.
So how can we achieve it? Let's talk about how we can actually achieve it. Before building any solution, we need to ask a few questions. We need to ask about freshness: is your data arriving on time, and is it up to date? We need to ask about volume: did all the data arrive, did any of it get corrupted, and are all the data tables complete? Then distribution: are the delivered data values within the accepted range? Then lineage: what are the downstream ingestors and upstream sources of the given data? And schema: is the data in the correct format? Because if it's in the wrong format, it might cause failures in your whole data pipeline. Then, in the end, we have to build a custom solution. What I see as a solution, and what I've worked on in my previous organizations as well, is having one dashboard where you can see all the errors, all the failures, and all the error messages, whatever is happening.
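Those five questions translate naturally into automated checks. Here is a minimal sketch of two of them, schema and freshness; the expected schema and the staleness window are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Minimal sketch of two of the five checks above. The expected schema
# and the freshness window are assumptions for illustration.
EXPECTED_SCHEMA = {"customer_id": str, "amount": float}
MAX_STALENESS = timedelta(hours=1)

def check_schema(record):
    """Schema: is the data in the correct format?"""
    return all(isinstance(record.get(field), kind) for field, kind in EXPECTED_SCHEMA.items())

def check_freshness(last_arrival):
    """Freshness: did data arrive recently enough?"""
    return datetime.now(timezone.utc) - last_arrival <= MAX_STALENESS

record = {"customer_id": "c-42", "amount": 19.99}
print(check_schema(record), check_freshness(datetime.now(timezone.utc)))
```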
When your data is on its journey through the data pipeline, if you have one dashboard where you can track all of this, then it's good: it saves you time, and it also creates transparency in your data pipeline. You can also enable alerting and quickly take action. Some of the metrics which I feel are quite important, and which I've worked on, are the volume of data processed per single run of the pipeline and the total number of records processed. You have a lot of data, and if you can calculate the total number of records at the source, and then calculate the total number of records after the whole data ingestion process has happened, then you can calculate the efficiency of the whole process. And if many records have been dropped within the data pipeline, within the whole process of remapping and ingesting into the target database, if something is getting lost, then something is wrong, and we need to take action. So this will give you your efficiency, and when you're applying remedies, you can track through the metrics whether things are improving or not.
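That source-versus-target comparison is easy to state as code. A sketch, with both record counts as stand-ins for real counters:

```python
# Sketch of the efficiency metric described above: compare record counts
# at the source with counts in the target after ingestion.
def ingestion_efficiency(records_at_source, records_in_target):
    if records_at_source == 0:
        return 1.0  # nothing to ingest counts as fully efficient
    return records_in_target / records_at_source

efficiency = ingestion_efficiency(records_at_source=10_000, records_in_target=9_870)
print(f"{efficiency:.2%}")  # 98.70%: 130 dropped records to investigate
```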
These are some of the metrics which you can go through; we can also have the error rate and error messages. We can collect all the error messages and show them in our one dashboard. Cool. So we are almost done, we have just five minutes left, and I just want to explain one metric. I would have loved to do a demo, but we are short of time, so I will paste all the links in the comments section, including the GitHub link of the demo. The best part is that I've made a blog post as well, so if you have any questions, you can just ask in the comments section of that blog post. So yes, one of the metrics which I worked on: there was a use case with a streaming source, and I wanted to calculate the total number of records for streaming sources. For this I used Stackdriver. My whole data was in Google Cloud, and Stackdriver is basically the monitoring service provided by Google Cloud.
From this diagram, we can see that a user wants to track the number of Pub/Sub messages coming from the streaming sources. They will quickly open the Stackdriver dashboard, which is there in Google Cloud, and Stackdriver will ask Pub/Sub, 'Hey, please send me the metrics.' Then you can see all your streaming sources, all the Pub/Sub messages in real time, all the data which is getting ingested into the data pipeline in real time. You can eventually achieve this programmatically using the Stackdriver API, and I've talked about this in detail in my blog.
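For reference, querying a Pub/Sub metric through Stackdriver (now Google Cloud Monitoring) with the google-cloud-monitoring Python client looks roughly like the sketch below. The project ID and the exact metric type are assumptions on my part; Nancy's blog and demo show her actual setup.

```python
import time
from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

# Rough sketch: read a Pub/Sub message-count metric for the last hour.
# The project ID and metric type are assumptions, not from the talk.
client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-gcp-project"  # hypothetical project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "pubsub.googleapis.com/topic/send_message_operation_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```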
I've also done a demo, which is here, and you can open it there. I'll just quickly share the example with you. Yes, I hope it is there. So yes, this is the example: you can check out this Stackdriver example on GitHub, and you can open it in Gitpod. Basically, Gitpod is an open-source developer platform for remote development, so you will get everything prepared: you won't have to install anything, you'll get it all out of the box, so you can quickly open this in Gitpod and get your work done. I have put all the instructions there. It will be cool to see all your real-time messages; you can track the metrics of your real-time messages, and it will give you a feel for how you can achieve different metrics for observability. So yes, this is it for the talk. Thank you so much, everyone. I really enjoyed this talk. If you have any questions, please ping me personally on Twitter or LinkedIn.