Data Mesh and the Modern Data Stack by Colleen Tartow, Ph.D.
Understanding the Modern Data Stack and Data Mesh
Hello, everyone! I'm Colleen Tartow from Starburst, and today I'll explain two significant parts of a modern data strategy: the modern data stack and data mesh. Businesses worldwide use both to make data-driven decisions that improve competitiveness and growth.
The Importance of Data Management
Data management ought to be a top concern for every modern enterprise. Data is growing exponentially, and modern businesses are expected to make decisions from data rather than from anecdotes or opinions. But how does a company transform this sea of data into actionable insights?
Data management is the answer. It means organizing the data itself and everything around it well: the people who produce, curate, and analyze the data, and the technologies and architectures used to move, transform, and analyze it.
Types of Companies That Need Data Management
Data management matters most at large, enterprise-scale companies, where data volumes are huge and handling them keeps getting more complex. As the business grows, so do the volume and complexity of its data and the number of people who produce, curate, and analyze it, which makes managing that data well considerably harder.
Modern Data Stack and Data Mesh: The Key Concepts
The modern data stack and data mesh are two terms you will hear constantly if you are a data enthusiast. Let's look at each in turn to understand its principles, function, and use.
What the Modern Data Stack Really Is
The modern data stack refers to the technologies a company uses to extract value from data. It covers the complete process, from taking data out of a source and curating it to running analytics on it for decision-making. In its simplest form it comprises a data pipeline, a storage target such as a data warehouse or data lake, and an analytics tool.
Is the "Modern Data Stack" Really Modern?
If you look back at how data stack architecture developed, you may notice the irony: the "modern data stack" is a cloud- and SaaS-based update of the 40-year-old legacy data stack, with better analytics tools. The legacy stack was designed around the hardware limitations of the transactional systems of its time. Modernization has brought the option of ELT (Extract, Load, Transform) in place of ETL (Extract, Transform, Load), separation of storage and compute, managed and hosted software as a service, cloud computing, and usage-based pricing models.
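To make the ETL versus ELT distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not something shown in the talk: an in-memory SQLite database stands in for a cloud data warehouse, and the sample rows, table names, and function names are all hypothetical. The only difference between the two functions is where the transform step runs.

```python
import sqlite3

# Hypothetical raw records from a source system (illustrative data only).
RAW_ORDERS = [
    ("o-1", "  Alice ", "19.99"),
    ("o-2", "BOB", "5.00"),
    ("o-3", "alice", "0.01"),
]


def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in the pipeline, then load only the curated rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id TEXT, customer TEXT, amount REAL)"
    )
    cleaned = [
        (order_id, name.strip().lower(), float(amount))
        for order_id, name, amount in RAW_ORDERS
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows as-is, then transform with SQL in the warehouse."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               lower(trim(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
        """
    )


def totals(conn: sqlite3.Connection, table: str):
    """Analytics step: total order amount per customer."""
    query = (
        f"SELECT customer, ROUND(SUM(amount), 2) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    etl(warehouse)
    elt(warehouse)
    print(totals(warehouse, "orders_etl"))
    print(totals(warehouse, "orders_elt"))
```

Both paths produce the same curated totals. The ELT path also keeps the raw rows in the warehouse, so a later transform can be rerun without going back to the source, at the cost of storing another copy of the data.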
Data Mesh: A Strategy for Large-Scale Business Operations
Now let's turn to data mesh. Data mesh is a data management strategy, a term coined in 2019, aimed at the data-handling problems that appear at large scale. In many ways it is not new: it is a new way of packaging and thinking about established ideas in data management, such as domain-driven design, that makes handling data more streamlined and organized.
Key Pillars of Data Mesh
- Domain-oriented decentralized data ownership: The people who produce the data understand it best. Those people, the domains, should ultimately own it.
- Data as a product: The heart of data mesh lies in treating data as a high-quality, consumable product.
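- Self-service data infrastructure: A central IT organization provides the tools the domains need to curate and serve their data products, so the domains do not also have to become infrastructure experts.
- Federated computational governance: Some aspects of governance are controlled globally by the business while others are owned by the domains, and parts of it, such as access control and data quality, are enforced by automation.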
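The "data as a product" and "federated computational governance" pillars can be sketched in a few lines of Python. This is a hypothetical illustration, not an API from any data mesh tool: the `DataProduct` class, the forbidden-column rule, and the sample check are all assumptions chosen to show global rules and domain-owned rules being enforced automatically.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]

# Global policy, owned centrally (hypothetical rule for illustration):
# no data product may expose these columns.
FORBIDDEN_COLUMNS = {"ssn", "password"}


def no_forbidden_columns(rows: List[Row]) -> bool:
    """Global check applied to every data product."""
    return all(FORBIDDEN_COLUMNS.isdisjoint(row.keys()) for row in rows)


@dataclass
class DataProduct:
    """A dataset published by a domain, with its own quality checks."""

    name: str
    owner_domain: str
    rows: List[Row]
    domain_checks: List[Check] = field(default_factory=list)

    def violations(self) -> List[str]:
        """Run global checks first, then the domain's own checks."""
        failed = []
        if not self.owner_domain:
            failed.append("global: missing owner domain")
        if not no_forbidden_columns(self.rows):
            failed.append("global: forbidden column exposed")
        for check in self.domain_checks:
            if not check(self.rows):
                failed.append(f"domain: {check.__name__}")
        return failed


def amounts_are_non_negative(rows: List[Row]) -> bool:
    """Domain-owned quality rule for a hypothetical sales domain."""
    return all(row["amount"] >= 0 for row in rows)


orders = DataProduct(
    name="orders",
    owner_domain="sales",
    rows=[{"order_id": "o-1", "amount": 19.99}],
    domain_checks=[amounts_are_non_negative],
)
print(orders.violations())  # an empty list means every check passed
```

The split mirrors the federated model: the central organization maintains the global checks once, each domain adds checks that only it understands, and both run automatically before a product is served to consumers.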
Connecting Modern Data Stack and Data Mesh
The modern data stack and data mesh, although they seem distinct, are not opposing ideas. They are stops on a journey from startup to large-scale business.
The modern data stack suits companies at the start of that journey, with less complex data, where a centralized data strategy works. By contrast, larger and more complex businesses may benefit from data mesh's decentralized data strategy.
Ultimately, the idea is to sustainably and proactively manage data, leading to strategic business advantages and growth.
If you’re interested in discussing these topics further, please reach out to me at [email protected]. Don’t forget, Starburst is hiring, and we would love to hear from you!
Video Transcription
Hi, folks. Thanks for joining me today. I'm Colleen Tartow from Starburst. And today I'm gonna talk about the modern data stack and data mesh, what each of these is and how they relate to each other as important parts of a modern data strategy. So I'm going to start with a confession of sorts. I think a lot about data management. It is such an interesting problem and it's a really integral part of how a modern company uses data to its advantage. We all know that data is growing exponentially, as this graph shows. I'm sure that's not news to anyone here. And the name of the game in modern business is to use data to make decisions rather than relying on anecdotal information or just opinions, right? So doing things like organizing your data really well, and organizing the people and technologies and architectures that manage and transform and analyze your data, that becomes just an absolutely huge first-class problem, especially as the amount of data in your organization grows.
So what works at a small company will very quickly become something that enterprise-scale companies really need to focus on evolving as they grow. As the business grows, the amount of data grows and complexity increases, and the number of people involved in producing, curating, and analyzing data grows.
And so this is where you really get into what used to be referred to as big data a few years ago. And the point is that as data grows larger and more complex every day, the opportunity and the value inherent in that data become more complex as well. But on the flip side, the good news is that data can be a real competitive advantage if you manage it well, and if you're proactive about managing the people, processes, and technologies around your data. So let's get back to the two main ideas that I really want to think about today, that I think about a lot: the modern data stack and data mesh. If you're a data nerd like me, you probably hear these two terms all the time. I did a quick Google search of each of them and there are hundreds of millions of hits on Google. So many articles about each. And what's fun is that in terms of the modern data stack, literally every data analytics technology vendor has a blog about why they're the key to the modern data stack, including the one I work for. I may have written that article.
It's incredibly vendor-driven, though, and in just a minute I'll talk about what that means and what the modern data stack really is. But the point here is that the modern data stack is very popular. It's been around for a few years now. And then we've got the new hotness of data management, which is all over the technology zeitgeist right now. And that's data mesh. Now, data mesh is a data management strategy. The term was coined three years ago, in 2019. So again, you see a ton of articles about data mesh and how this is the miracle data management strategy that solves all the world's problems. And every vendor is, again, the key to that solution. So what I'm gonna do today is walk you through each of these and then talk about how they connect and relate to each other. So first let's get back to that modern data stack and really think about what defines it. So the term modern data stack has been used to describe the technologies that are invoked to really help a company get value from data. The real idea is to take data from a source, curate it, and do analytics on it. Now, the analytics could be anything from reporting to machine learning to AI. So the end state of the data needs to be quite flexible.
And over the past few years, this term modern data stack has entered the vernacular of the data world, and it describes this standardized cloud-based data and analytics environment built around some classic technologies. And in its very simplest form, it looks like this.
You've got a data pipeline, which is either ETL or ELT, extract, transform, load or extract, load, transform. And that's moving data from its source into an analytics-focused environment. Then you've got a target data warehouse or a data lake, and then you've got an analytics tool for creating business value out of the data. And that's it. And this technology stack at its simplest form is based on the fundamental idea that you have to move data away from the source to a centralized location in order to gain value from it. So one thing to note, however, is that I always chuckle when we talk about the modern data stack, because it's really just a re-envisioned cloud and SaaS version of the legacy data stack from 40 years ago, but with better analytics tools. So originally this entire paradigm was created because of hardware limitations within transactional systems. So instead of being hardware-driven, now we have all these really amazing technological advances in the past 20-plus years. And what started out as a stack with a database, an ETL tool, analytics-focused storage, and a reporting system became a modern version of the exact same functional process. So don't get me wrong. There have been a lot of great updates. We can swap ELT for ETL.
We have exciting developments like separation of storage and compute architectures. We've got managed and hosted software as a service, usage-based pricing models. And of course, there's this whole cloud computing thing that we all know and love at this point. But even with all of these advances, I would still argue that the modern data stack really isn't that modern from an architectural perspective. So for folks like me who have been working in data longer than I will admit here, that doesn't seem so revolutionary. We're still taking data and we're moving it from one place to another, copying it, curating it, and doing some analytics, just with fewer hardware concerns and better tooling, which is really nothing fundamentally new. And the reason I'm pointing this out is because we still have the same pain points that we had 40 years ago, because we haven't really changed the story enough. First, there's an inherent latency that you get by introducing this paradigm, because you are copying data around.
And second, this looks very simple. I've got one arrow between each thing. But in reality, there's a lot of complexity that's introduced when you add in various data sources. And you end up with really complex pipelines and a variety of data sources and formats. And it's a huge headache. And third, you end up with a lot of vendor lock-in that comes with this modern data stack. For example, once you've chosen your cloud data warehouse, it's a huge headache if you want to change things around, and it's very expensive. So as an aside, what does a truly modern data stack look like? Well, the name of the game here is really to focus on speed to value for data, which means that we should shorten this path between the source data and its analytical or strategic value as much as possible. So one way of doing this is to use a cutting-edge, modern query engine instead of copying the data. And what this would do is alleviate some of the pain points of the modern data stack. There would be no or very little latency introduced in the process, and the data would be closer to real time. And there are fewer copies of the data, fewer pipelines. The process is a lot leaner.
And then by focusing on simplicity and scalability, you can rely on more common tools like SQL for data access. And so in the end, you get less vendor lock-in. Of course, this is just one idea. But my recommendation is that before falling into the same old so-called modern data stack, folks take a moment to reflect on the ultimate business goal and therefore shorten and simplify the path to the value for data as much as possible. So that all said, let's think about who actually uses the modern data stack. In my mind, the simplest version of the stack is the most useful when the company is really small. For instance, in the early stages of a startup, when you have very few data sources, not a huge user base, not a lot of consumers. And this is straightforward: you take data from the sources, combine it in a common place, you can curate it, you combine it, and you can do your analytics. But what happens as that data grows? You end up with a lot more data sources, more and more complex pipelines, and the vendors that made sense when you had smaller data volumes and velocity are gonna become really expensive as your data grows.
So you'll probably end up having to rearchitect things, hire data engineers to put full time eyes on the data pipelines and you'll end up serving your downstream data customers like analysts and data scientists through those data engineers and then things will grow even more.
And this is getting to be one of the ugliest pictures I've ever drawn. And your data and analytics team will both get bigger and more complex and, you know, your company is going to try to go public, it's going to acquire another company, you're going to designate new business units. All of those things may have their own data stacks with their own centralized data teams, and maybe you try to combine them to keep with the centralized strategy in parallel to the modern data stack. But yes, I am absolutely underrepresenting the amount of Excel in there. Thank you for that comment, Monica. But in my experience, it's at this point that the centralized data strategy just breaks down and falls apart. And that's largely because when an organization matures and grows and its data matures and grows exponentially, you get to a point where a single team can't possibly understand the breadth and complexity of the organization's data. It's just too much for a single team.
And at this point, various architectures and technologies are going to be brought in to try and address the complexity, to once again find that path from the data to the value. But the bottom line is that that centralized data strategy doesn't work at a global scale. And any challenges present at smaller scales are gonna get hugely exacerbated at the enterprise scale. So when you move the data away from the people who are experts in it, you're creating this bottleneck. So now is the part where we get to talk about data mesh. Data mesh in a lot of ways isn't anything new, but it's a new way of packaging and thinking about some really cool and beneficial ideas around data management. As an example, people have been talking about domain-driven design practices in engineering for years, but until recently, it wasn't really applied to data with a hyper-functional and really clearly defined architecture, and also without accounting for the people and organization side of the business transformation.
So data mesh is based on four pillars. First and foremost is the idea of domain-oriented architecture and ownership, which is a mouthful. This simply means that the people who produce the data are the people who understand it best. We call those the domains, and those should be the folks who ultimately own the data. The next pillar is treating data like a product. And this pillar is the heart of the data mesh: that data is a first-class, high-quality product that you need to treat like any other product in your business domain. And these data products are served to downstream consumers in a way that makes data really easily consumable.
The third pillar gets to the idea of a self-service data infrastructure. So the domains don't also need to become experts in infrastructure, but you can have some sort of central IT organization, which is standard in enterprise, and that IT organization will provide the tools that the domains need to curate and serve the data products.
And lastly, it's another mouthful: federated computational governance. And what this means is that some aspects of governance are controlled globally by the business and others will be owned by the domains. And there's simply a contract delineating that federated ownership. And there are various aspects of that ownership that are implemented by automation, such as access control or data quality. And so that's the computational part. But the thing that makes data mesh unique among other data strategies is its focus on data products, as well as its acknowledgment that there's this organizational element that is key to the success of a data strategy. That is, people need to be organized in an optimal way to enable these data products.
And there's that cultural element as well. So this goal of domains producing high-quality data products on the self-service data infrastructure, available cross-functionally and following that federated governance model, this is just the next evolutionary phase of data management.
And so data mesh has seen a lot of press and buzz over the data world for the last couple of years. There's a lot of Twitter wars over it. And I think that's because it codifies and lays out a strategic plan for how to treat data as a first-class product. And people have a lot of opinions about data. But the reason why data mesh really resonates with me personally is that I have led data and analytics organizations that are central at growing companies. And so I have felt these pain points myself. I have seen the challenges of the bottlenecks that you get in a centralized data team. And I've seen how there's this real struggle to take that centralized strategy and adjust it to account for, like, the rapidly changing growth in a startup, for example. So now that we understand both of these things, data mesh and the modern data stack, the next thing I think about is how do they play off of each other? I talked about this a little bit earlier, in that the modern data stack is ideal for a startup or a company with less complex data. And it largely is gonna lend itself to that centralized data strategy.
But on the other hand, large and complex companies can really benefit from that decentralized data strategy of the data mesh. So this is really an evolution. It's a journey. I don't think you'll ever get there, right? I think you should always be iterating on this to try to shorten and consistently improve that path between the data and the strategy that comes from it. But in my mind, it's sort of a spectrum. On the one hand, you have smaller organizations with smaller data volumes, a centralized data team, a fixed and vendor-driven data stack. And you've got low complexity of your data-driven decision making. And on the other end of the spectrum, you have enterprises with incredibly complex data, decentralized data teams, dynamic and varied data stacks, and then very intricate layers of data-driven decision making. So organizationally and architecturally, data mesh marries the ideas of the truly modern data stack with the concept of data as the top-tier product at a very large scale. So data producers are treating data consumers as a first-class stakeholder in their work. And then that consolidation of the technologies for data consumption gives you this revolutionary simplicity of the data and analytics model at scale. And I think that's why people are really glomming onto the data mesh with its pillars, its guiding tenets.
Data mesh is really cementing its place as the future of the business data ecosystem. And these aren't opposing ideas, the modern data stack and the data mesh, but they're stops along a journey, and most enterprises won't benefit from a modern data stack as much as a startup would. And then on the flip side, a data mesh would be overkill for most startups because it's more appropriate for very large scales. One thing I think is interesting, and I think this will be the next big challenge that we face, is this transition from one to the other. How do we get to a more mature data strategy? And how do you transition from centralized to decentralized? At what point does it happen and how does it happen? So one thing data mesh helps us articulate is the people side of that strategy, and that's needed to really support the transition. And it includes, like, articulating the roles and responsibilities of people involved and things like moving data engineering from a centralized team into the domains. And a truly modern data stack would help ease this transition because you would end up with less vendor lock-in, less early complexity.
And a focus on shortening that data path between the source and the value. So a data mesh will also help this transition because it gives us a road map of what success looks like and what the resulting strategy should be. So as the slide says, what does this all mean? I've covered a lot today and these are incredibly complex. I've said very little about these topics. There's a lot to say. I've covered a lot, but I want to leave you with three takeaways. First, when you hear about the modern data stack, think of it as a suggestion. Every vendor will say we are the key to the modern data stack. And there's an incredibly rich landscape of tools and capabilities that can be part of that modern data stack. This is the latest data and analytics vendor landscape and it is absolutely overwhelming. I mean, knowing where to start is a challenge. Knowing where to stop is a challenge. So when you're building a company's data ecosystem, make sure you focus on optionality, speed to value, and avoiding complexity, because as you grow, those are only gonna increase. So remember that the end goal is to optimize the path to value for data. The next thing to think about is to be forward-thinking in your data organizational strategy and make sure that the data producers, from day one, understand that the data that they're producing is a product that they are responsible for and that they own early in your evolution.
I think that it eases that transition from small to large. And make data ownership part of the culture, even if you have a centralized data team. This will lead to fewer struggles later on when the inevitable transformation to that decentralized strategy takes place. And third, the amount of change that I've seen in the last however many years in this field has been incredible. Data mesh is really interesting and it's gaining this huge foothold in the enterprise world. And a streamlined modern data stack can be a quick way to spin up an analytics function. But it helps to remember that it's not about the vendors, it's not about the hottest trends. The focus needs to remain on getting strategic business value from data. And I think that can be something that gets lost in the mix. So that all said, thank you very much for listening. I've been Colleen Tartow. Please reach out if you're interested in discussing data mesh or the modern data stack. And I should let you know also that my company, Starburst, is hiring. I lead our enterprise engineering organization, and not only are we the key to the data mesh and the key to the modern data stack, but we get to do cool things like talk about these things at conferences like this one.
So if you're interested, I'd love to hear from you and you can reach me at Colleen at Starburst dot IO. Thanks very much.