Preethi Baskaran - API Design and Architecture


Video Transcription

Thanks for joining. Before we get started, I just wanted to give an introduction. This session is about API design. I will be sharing some of the key takeaways and lessons I have learned, both academically and at work. What are the things you need to keep in mind when designing an API? How do you put yourself in your consumer's shoes while designing an API?

I'm Preethi Baskaran. I've been working as a software engineer with Uber Freight. It is very similar to Uber Rides and Eats. My work here is mostly focused on APIs which deal with search optimization and improving the booking experience for our users. Before that, I was a software engineer at Navis, where I worked on software optimization for shipping terminals. So most of my experience has been on the back-end side of things: designing and writing APIs. How do we improve the latency? How do we make sure that our consumers are happy with the APIs we provide? That's been my focus area for the past few years, and I thought I have learned a few lessons which I can share with folks, and I can also learn from you. It looks like everybody is able to hear fine, so I can proceed. Good.

I hope you can also see my slides; I had not checked that before. OK, sounds good. So today we'll start with the big picture: what this session is all about and what kind of examples we'll be walking through. Then we'll look back at outages and how outages and API design are related, and then we'll dive into the key concepts you need to keep in mind while designing or updating an API. Finally, I hope there will be time for some questions and answers, but please do keep your questions coming in the chat. I'll probably take them as they come.

So the big picture: what is API design? Before that, what is an API? API stands for application programming interface. It's an expression of functionality. Your service could do so many different things, so how do you express it to the outside world? This is what my service does, this is what you should give as input, and this is what I can give you as output. How do you express that functionality in a clear, concise form, so that your consumers are not surprised by what you are suddenly giving them, or left saying, "I don't know how to use your API"? Keeping that communication clear is what API design is all about.

Throughout the session, I will be talking through some examples. To start with, I wanted to keep it very generic. Let's take a ticket booking app. Assume that I'm someone who owns a theater, and I have a tech support team which designs APIs like book ticket, cancel ticket, update ticket, et cetera. And there are third-party platforms like BookMyShow, Fandango, or whatever it is in your country: different platforms through which people can book tickets. These platforms will be leveraging my APIs to book or update tickets. In other words, I am the producer, or the service owner, for these APIs, and there are a bunch of consumers for them. It is very important to keep this in mind: I'm a producer, and there are a bunch of consumers.

With that high-level example, how do I represent an API? Like I mentioned before, an API is nothing but a software intermediary which allows two applications to talk to each other. Every time you use your phone to send an instant message on Facebook or to check the weather, the application on your phone makes an API request to different services to get the information.

So that's the interface it talks to. This API itself can be represented in different formats, like Apache Thrift IDL or JSON. There are tons of different ways to represent this interface, just like you have so many ways to write your code. We'll not go into the internals of these languages, and none of the concepts I'm going to talk about are language specific. I want to keep it at a very high level. At the same time, I want to give a high-level overview of Apache Thrift, because the examples I'll be talking through have some references to it, and I want to make sure you don't get surprised by it. Apache Thrift is a lightweight, language-independent software stack. For example, here you have a multiplication service. This is my service, and the API I provide is multiply. I say that I take two integer inputs and the output is also an integer. That's the kind of contract I establish with my consumers. Apache Thrift can also take in structures, like request structs and response structs, which say: these are the request parameters I can take in, and these are the response parameters I can give out. Similarly, there will also be exceptions which your API needs to provide when things go wrong.
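Since the slide itself is not part of this transcript, here is a minimal sketch of the same multiplication contract in Python rather than Thrift IDL. The names `MultiplyRequest`, `MultiplyResponse`, `InvalidOperation`, and `LocalMultiplicationService` are illustrative assumptions, not the identifiers used in the talk, and the 32-bit operand limit is only an assumed rule chosen to give the exception something to do.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class MultiplyRequest:
    """Request struct: the inputs the service accepts."""
    first: int
    second: int


@dataclass(frozen=True)
class MultiplyResponse:
    """Response struct: what the service promises to return."""
    product: int


class InvalidOperation(Exception):
    """Declared exception: part of the contract for when things go wrong."""


class MultiplicationService(Protocol):
    """The interface that consumers code against."""

    def multiply(self, request: MultiplyRequest) -> MultiplyResponse: ...


class LocalMultiplicationService:
    """One possible producer-side implementation of the contract."""

    MAX_OPERAND = 2**31 - 1  # assumed limit, similar to a 32-bit integer field

    def multiply(self, request: MultiplyRequest) -> MultiplyResponse:
        for operand in (request.first, request.second):
            if abs(operand) > self.MAX_OPERAND:
                raise InvalidOperation(f"operand out of range: {operand}")
        return MultiplyResponse(product=request.first * request.second)


service: MultiplicationService = LocalMultiplicationService()
print(service.multiply(MultiplyRequest(6, 7)).product)  # 42
```

The point of the sketch is that inputs, outputs, and failure modes are all visible in the interface, so a consumer does not have to read the implementation to know what to expect.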

We'll talk about that more later. But I want to add the caveat that if Thrift is new to you, that's OK, because this presentation is more about the concepts to keep in mind regardless of which interface definition language you choose.

So, looking back: exploring outages through the lens of API design. I want to call out that, for your consumers, an API can cause a lot of confusion and misunderstanding, and that can lead to outages. An API is not just your concern as a producer; you also need to think about what problems it can cause for your consumers. For example, when an API keeps changing, or is inconsistent or unintuitive for your consumers, there are many ways in which outages can be introduced. If you look at your services and think about how many outages have happened before because of API differences, I'm pretty sure you'll be able to find a few outages that happened because the API was not used the way it was meant to be, not necessarily because it was performing poorly. So you can measure that and figure out how important API design is. On the other side, what is not measurable is the effort you put into understanding these APIs.

Good API design will help your fellow engineers debug problems, and it keeps the API consistent so that your consumers as well as the producers have a safe life. The other aspect is that, as a consumer, you might realize that there are a lot of migrations happening on your producers' side, and the API you used yesterday may not be the right API today. For example, there's a lot of complexity involved. If there is a new feature to be added, there's a high probability that the producer will start adding more and more data to the existing API, and as a consumer you might be confused: why is it now behaving in a different way? The second thing is reliability. Suddenly your service might start failing and you might wonder: I did not deploy anything, it has been the same configuration, so why is it suddenly failing? It could be mostly because the API changed something on their end, and it's causing an outage on your end. And then latency. Your API used to be blazingly fast and has suddenly become slower. Again, that's because the producer of the API is adding more and more data to it, or they have done something internally which causes latency issues for you as a consumer.

So these migrations are always a pain, but we have to deal with them. At the same time, we have to think about different ways to reduce the complexity and make sure the API is latency efficient and reliable for your consumers. We talked about different problems; now let's dive into the tools and concepts that promote good API design. All the negative examples aside, the three main focus areas for good API design are clear contracts, establishing a clear way to expose your data, and having some kind of service level agreement about what you should expect in terms of latency and throughput from the APIs.

To start with, clear contracts. It's very important for a consumer to have a very clear signature. For example, earlier we talked about the update ticket endpoint. Say your update ticket endpoint is used to update the movie, date, time, available seats, movie name, metadata, everything. So you have a version one endpoint which allows the entire ticket to be updated. It takes a ticket ID and then an entire ticket info. For me as a consumer, it's not clear what this endpoint is allowed to update. Does it update the movie date and time as well, or does it update only the seats? It's not clear to the user. On the other hand, there is a version two endpoint which is more granular. I say that this is the ticket ID, movie name, movie time, and movie seat number. So I know for a fact that these are the parameters this particular endpoint is allowed to update. Having that kind of clear, granular definition helps your consumers know what to expect from your APIs.
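To make the contrast between the two signatures concrete, here is a rough Python sketch. The `Ticket` fields, the in-memory `TICKETS` store, and the function names are assumptions made for illustration; the granular fields are also assumed to default to "leave unchanged" so that a caller can update any subset of them.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    movie_name: str
    movie_time: str
    seat_number: int


# Hypothetical in-memory store standing in for the theater's database.
TICKETS: Dict[str, Ticket] = {
    "T1": Ticket("T1", "Inception", "2024-05-01T19:00", 12),
}


# Version 1: an opaque blob. The signature does not tell the consumer
# which fields the endpoint will actually change.
def update_ticket_v1(ticket_id: str, ticket_info: Dict[str, Any]) -> Ticket:
    TICKETS[ticket_id] = replace(TICKETS[ticket_id], **ticket_info)
    return TICKETS[ticket_id]


# Version 2: every updatable field is named in the signature.
def update_ticket_v2(
    ticket_id: str,
    *,
    movie_name: Optional[str] = None,
    movie_time: Optional[str] = None,
    seat_number: Optional[int] = None,
) -> Ticket:
    requested = {
        "movie_name": movie_name,
        "movie_time": movie_time,
        "seat_number": seat_number,
    }
    # Only fields the caller actually passed are changed.
    changes = {name: value for name, value in requested.items() if value is not None}
    TICKETS[ticket_id] = replace(TICKETS[ticket_id], **changes)
    return TICKETS[ticket_id]


print(update_ticket_v2("T1", seat_number=14).seat_number)  # 14
```

With the second version, a consumer can tell from the signature alone that the ticket ID itself is never modified, which the first version cannot promise.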

The other aspect of clear contracts is error handling. Errors are one of the most important forms of feedback to callers. This is sometimes overlooked: we don't return correct error messages in our APIs. For example, if you have an application error, like an authorization error, a ticket-already-booked error, or an invalid-ticket error, it tells the consumer that there is something wrong with this ticket, that it is already booked, or that this user is not authorized to update or book the ticket.

So they wouldn't keep retrying on their end. That gives them useful feedback about what they need to do next. On the other hand, there could be internal errors, like an error writing to the database or some kind of network error. That shows the caller that something went wrong on the producer's side: let me retry, and it will probably succeed. That kind of feedback helps consumers take the right action, so as a producer of the APIs it's very important to return the right error code for the right scenario.

The next aspect is optional fields. This is also a very overlooked aspect of clear contracts: making sure what is optional and what is required. For example, in the update ticket endpoint, of course you need a ticket ID, without which you cannot update a ticket. So that is a required parameter. Whereas movie name, time, or seat number is optional, because a user can update any of these parameters. So as a user, I know which fields are required, which I need to pass so that the request succeeds, and which are OK to omit. Optional fields give your consumers a lot of flexibility.

I think the last aspect of clear contracts is API versioning. APIs keep changing all the time.

You suddenly have some new requirements coming in and you enhance the API. So a lot of times you need to version the API. Just like you have software versions, you also need different versions for your API. The pros of having these versions are that you can build out new interfaces without impacting the existing clients, and you can also evolve the new interface as you wish, because there are no clients on this new version yet. This gives you a lot of flexibility without affecting your existing customers. But of course there are cons as well. At some point you need to migrate existing clients to the new interface, because it's not realistic to keep maintaining two different versions all the time. So there will be some overhead until all your clients move to the new version. But API versioning helps you make sure that you're not overloading the existing endpoint with too much functionality, and your end users are not suddenly surprised about why things have changed. So when do you do API versioning? When you have to add or remove one or more fields.

For example, say that when booking a ticket, the last name of the user was optional for some reason, and now you suddenly want it to be a required field. It's not realistic to expect your consumers to suddenly always pass it, because it was optional before. That would break the API for them. So you cannot change an optional field to a required field in an existing API. Or there could be field type changes. Say your movie seat number used to be numeric and now you change the format to alphanumeric. You cannot expect your users to suddenly send the new data type. So if you're introducing any such non-backward-compatible change, and you want your clients to absolutely pass those fields, then you have to consider API versioning.

So how do you do it? Unfortunately, IDLs, that is, interface definition languages, will not support versioning directly. So you need separate interface files, like my_service_v2.thrift or my_service_v2.json, where you have the new interface and the new APIs.
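As an illustration of the two breaking changes just described (an optional last name becoming required, and a numeric seat number becoming alphanumeric), here is a small Python sketch that keeps the two versions side by side. The class and function names are hypothetical; in a Thrift setup each version would live in its own interface file.

```python
from dataclasses import dataclass
from typing import Optional


# Version 1 interface: existing clients keep using this unchanged.
@dataclass(frozen=True)
class BookTicketRequestV1:
    first_name: str
    seat_number: int                 # numeric seat
    last_name: Optional[str] = None  # optional in version 1


# Version 2 interface: defined separately, so version 1 is untouched.
@dataclass(frozen=True)
class BookTicketRequestV2:
    first_name: str
    last_name: str                   # now required
    seat_number: str                 # now alphanumeric, for example "A12"


def book_ticket_v1(request: BookTicketRequestV1) -> str:
    return f"booked seat {request.seat_number} for {request.first_name}"


def book_ticket_v2(request: BookTicketRequestV2) -> str:
    return (
        f"booked seat {request.seat_number} "
        f"for {request.first_name} {request.last_name}"
    )


# An existing client that never sent a last name still works on version 1.
print(book_ticket_v1(BookTicketRequestV1("Asha", 12)))
# A migrated client must supply the new required field and the new seat format.
print(book_ticket_v2(BookTicketRequestV2("Asha", "Rao", "A12")))
```

Had the last name been made required inside the version 1 request instead, the first call would start failing for a client that changed nothing on its side.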

As I mentioned before, maintain at most two different versions so that you don't have a lot of overhead. Once you have version two ready, make sure that you help your consumers move to version two before you move on to version three, and so forth. Before I move on to exposing data, I just want to pause here and see if anybody has any questions about establishing clear contracts. If not, I think I can move on to the next topic: exposing data through your APIs.

So how does a domain expose data? Like we talked about earlier, we have endpoints. This could be an RPC endpoint, a REST endpoint, or some other kind of endpoint; these are all different ways to establish your endpoint. You can have an endpoint like update ticket, which updates the ticket and gives you the updated data back. Or you can use data streams like RabbitMQ or Kafka, which are queues and are not real time. We'll talk about a more realistic example later. You can use data streams to expose your data, or you can have data warehouses like Hive, relational or non-relational. There are so many options, and you can also access the database directly to get the data. So these are different ways to expose data. One is RPC, another is a data stream. RPC stands for remote procedure call. You use this when you want a real-time response.

For example, if the user says book a ticket, the consumer wants the response back in real time: the ticket is booked. It's not realistic for a user to keep waiting for the response. So any instant response you get from an endpoint comes via an RPC call. Data streams, on the other hand, are for asynchronous operations. I'll take the same ticket example again. When you're booking a ticket, you get a response back that the ticket is booked, but you also need to send an email. Again, it's not practical to make your user keep waiting on the same mobile or web screen for the email. So you process that asynchronously, through a data stream. That's more of an offline approach. So you can expose data with both an RPC approach and a data stream approach. This could be too many concepts and it might go over your head, so I would like to talk it through with the help of an example. We take a simple flow: a user books a ticket through the app, and we want to send them a confirmation email saying that this ticket is booked for this movie.

Let's assume that we're only talking about the flow where the user taps the book button and the endpoint returns success: the ticket is booked. We are not considering edge cases like concurrency or the ticket already being booked. Let's not go into that; just look at the success scenario and realize how many issues can come up even here. The goal is that the email should be sent within minutes of booking. There is an email service, and if that service goes down, or there's a downstream outage, it's expected that the pending emails will be sent once the service recovers. We should never have a booking without an email eventually being sent. So this is the goal for a simple user booking scenario. Let's look at the different ways we can design this and the different kinds of issues you need to consider.

Let's take the first RPC approach. Your consumer calls your book ticket API, you update the status in the database to say the ticket is booked, and you also send the confirmation email by calling the email service. Both of these are real-time remote procedure calls. So what is the issue here? What would happen if your service fails before you send the email?

That is, you update the status in the database and then your service fails. That would cause a problem: you won't be sending the email. One solution here is to add retries, so you keep checking whether the call to the email service succeeded or not. But what happens if the service crashes again while you're retrying? There are multiple ways this can be addressed, but you need to keep in mind that a service can crash at any time, and remote procedure calls need some mechanism to keep track of that. So that's the issue with this remote procedure call approach: you cannot guarantee that the email gets sent.

Let's take another approach, the reverse way. Your consumer calls book ticket on your service, you call the email service first, and then you update the status in the database. In this scenario, the issue, and I think you would have expected this, is that you send the email, but if your service crashes after sending it, your status is not updated. That has a lot of potential liability. It might cause a bad customer experience, because you sent the email but the ticket is not actually booked. So you basically need both of these operations to succeed atomically, or both to fail.
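The two orderings can be simulated in a few lines of Python. This is only a toy model under stated assumptions: the `World` class stands in for the database and the email service, and the `Crash` exception stands in for the service process dying between the two steps.

```python
class Crash(Exception):
    """Stands in for the service process dying mid-request."""


class World:
    def __init__(self):
        self.booked = set()   # stand-in for the database
        self.emailed = set()  # stand-in for the email service


def book_db_then_email(world, ticket_id, crash_between=False):
    world.booked.add(ticket_id)      # step 1: update status in the database
    if crash_between:
        raise Crash()                # the service dies before step 2
    world.emailed.add(ticket_id)     # step 2: call the email service


def book_email_then_db(world, ticket_id, crash_between=False):
    world.emailed.add(ticket_id)     # step 1: call the email service
    if crash_between:
        raise Crash()                # the service dies before step 2
    world.booked.add(ticket_id)      # step 2: update status in the database


def run_with_crash(flow):
    world = World()
    try:
        flow(world, "T1", crash_between=True)
    except Crash:
        pass
    return world


first = run_with_crash(book_db_then_email)
second = run_with_crash(book_email_then_db)
print(first.booked, first.emailed)    # {'T1'} set()
print(second.booked, second.emailed)  # set() {'T1'}
```

Either ordering leaves the system in an inconsistent state after a crash: a booking with no email, or an email with no booking.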

So an RPC call either way, doing one first and the other second, will not help in either of these cases. Let's take a slightly different approach with async. Book ticket is called on your service and you update the status in the database. Then, rather than making a remote procedure call to the email service, you trigger a background flow for your email. That is, you publish a message to RabbitMQ or to a Kafka queue, and the email service reads from this queue. So what is the problem here? I think it's very similar to the first RPC approach. What happens if the service fails after you update the status in the database and you don't publish the message to the Kafka queue? Again, there is no way for your email to be sent. Just because we made this asynchronous doesn't automatically get rid of the fact that there can be service crashes, that you need to retry, and that there are certain issues to consider there as well.

Now let's think of the async email approach in a different way. You first send the Kafka message to the queue, which will trigger the email, and then you update the status. Basically, you're reversing the async approach you saw earlier.

The issue is pretty similar here as well. What happens if the email goes out and then the service crashes? There are some workarounds, in that before sending the email you can do a double check: has the status been updated in the database? If not, don't send the email. You can do some kind of logic over there. But that is something you need to be wary of: just making it async will not solve the issue.

The third approach with async is still not so great. You update the ticket in the database, and after the database write succeeds, you emit DB events to the Kafka queue. This will be your one domain, or one microservice, and there's another microservice in another domain which reads off this Kafka queue and sends the email. But what are the downsides here? What would happen if you have a sudden database backfill, where you are doing a mass update of all the ticket names or ticket timings? Every update would produce a DB update event, and you're just overloading your email service with all these events. And you are potentially exposing your internal DB events to your outside consumers, which is not great.

So this is still not a super efficient approach, but it's better than the previous approaches because you're dividing the responsibility between two different domains. This can solve the problem to an extent. The last approach might be, in my opinion, better than any of the others. It's very similar to the one before: your book ticket service updates the status in the database and sends out DB events, but there's also a create-external-events module which will only send out events that are relevant to your consumers.

You're not exposing all the DB events to your outside consumers. You're only creating specific events which are applicable for this use case, and those you publish into the Kafka queue, which domain B, or another microservice, reads from to send the email. The advantage in this case is that if you are doing some mass backfill to your database, your create-external-events controller or handler would be able to filter out those events and send only what is necessary. You have some control over what you want the outside world to see.
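A minimal sketch of such a filtering step might look like the following. Everything here is an assumption made for illustration: the event shapes, the `source` tag used to recognize backfills, and the `create_external_events` function name. A real system would read from and publish to a queue rather than work on in-memory lists.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class DbEvent:
    """Raw change event emitted after a database write (internal only)."""
    ticket_id: str
    column: str
    new_value: str
    source: str  # hypothetical tag: "user_request" or "backfill"


@dataclass(frozen=True)
class TicketBookedEvent:
    """The only event shape that outside consumers get to see."""
    ticket_id: str


def create_external_events(db_events: Iterable[DbEvent]) -> Iterator[TicketBookedEvent]:
    for event in db_events:
        if event.source == "backfill":
            continue  # mass updates never reach the email service
        if event.column == "status" and event.new_value == "BOOKED":
            yield TicketBookedEvent(ticket_id=event.ticket_id)


db_events = [
    DbEvent("T1", "status", "BOOKED", "user_request"),
    DbEvent("T2", "movie_time", "21:00", "backfill"),
    DbEvent("T3", "status", "BOOKED", "backfill"),
]
external = list(create_external_events(db_events))
print(external)  # [TicketBookedEvent(ticket_id='T1')]
```

The email service only ever sees `TicketBookedEvent`, so internal column names and backfill traffic stay inside the producing domain.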

So this helps, to an extent, to keep things separate and have separation of concerns while exposing the data safely. So the question now is: is this the right solution for all our use cases? The main takeaway I want to point out is that no, it isn't. The point is to highlight the different issues we need to think about. How do you want to handle a service crash or restart? What would happen if a retry cannot succeed? How do I handle a database backfill, and what are the potential issues a backfill can cause for your service? What would happen if there is an outage? These are the different scenarios to think about and have backup plans for. The other thing is that it might sound like all RPC calls are bad, but that's not the point here. I'm just trying to convey that there are some shortcomings which we need to understand, and we should make sure that we use the right tool for the right job.
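The retry question ties back to the earlier point about returning the right error for the right scenario: a caller can only retry safely if the producer distinguishes application errors from transient internal ones. Here is a hedged Python sketch of that idea. The exception names and the `call_with_retry` helper are hypothetical, and a real client would normally wait between attempts, which is left out to keep the example short.

```python
class ApplicationError(Exception):
    """The request can never succeed as-is, so retrying is pointless."""


class TicketAlreadyBookedError(ApplicationError):
    pass


class NotAuthorizedError(ApplicationError):
    pass


class InternalError(Exception):
    """Transient producer-side failure (database, network); a retry may succeed."""


def call_with_retry(operation, max_attempts=3):
    """Return (result, attempts_used), retrying only on InternalError."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except ApplicationError:
            raise  # retrying would not change the outcome
        except InternalError:
            if attempts >= max_attempts:
                raise  # give up; the caller needs a backup plan


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise InternalError("database write failed")
        return "booked"

    return operation


print(call_with_retry(make_flaky(2)))  # ('booked', 3)
```

If the producer returned one generic error for everything, the caller would have to choose between retrying requests that can never succeed and giving up on ones that would have.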

I will pause here as well, in case there are any questions. "What is the difference between using RPC and REST in terms of synchronous blocking communication?" There is not much of a difference. RPC and REST are very similar in that they're both real time. I would just say they are different ways to implement a real-time endpoint. As you say, both are synchronous and can block the communication. It's just that some people prefer RPC and some prefer REST, but there's not much difference as far as I know.

OK, next. I think we are a little short on time, so let's go to SLAs and latency. What is a service level agreement? Latency impacts your network throughput, so you need to have some kind of agreement that says: this is the acceptable latency for my service. If you don't set that expectation right, it could cause a lot of timeouts, and you end up with frustrated users and a shrinking number of consumers. So I would like to highlight the things to consider when you have a high-latency endpoint. How do you debug it, and what are the things you need to avoid? The first thing to avoid is depending on a slow downstream with no timeouts. Say, for example, you depend on an email service which takes forever.

You need to have some timeout configured to make sure that your main endpoint is not affected by that. So it's very important to identify your downstream services and their latency. Then there's parallelization and async work. You need to figure out what parts of the API can be done in parallel. Not every action in an API needs to happen one after the other. Figuring out what can be asynchronous and what can be parallelized goes a long way in addressing most issues. Another thing: it is very common that your endpoints are doing much more than they're supposed to, because as your functionality evolves, it's very easy to add to the existing API. It's just easier; you already have the infrastructure built.

But if you think from your consumer's perspective, the latency might go up a lot, which is not good for them. Another thing is storage. You might suddenly include a new column in your table without adding the corresponding index, so the query pattern might take a lot of time. Every time you introduce a change to your database, you need to make sure that you're adding the right indexes, and the right type of index, when you're adding new fields to your table. Another thing is redundant calls. It's very common, when you have a lot of things happening in your API, that you're doing the same kind of validation in two different places. So you need to audit what your API does. If it's loaded with so much data and information, figure out whether you're doing the same operation twice in two different places, and how you can cut that out. Next, and I think this is not a common problem these days, is having no replicas for reads. I think with AWS and auto scaling this is kind of taken care of, but it's very important to have replicas for reads.

Because cross-data-center queries can take a lot of time, forcing the reads to the primary is pretty expensive. Another thing is caching. I think this is also highly overlooked, even though caching is a well-known concept to many of us. You have to figure out what data your service owns and how you can cache it. For example, if the data is owned by another service, service B, you read from that service first, and if service B fails or a certain timeout passes, you read from the cache instead. This works if you're not super concerned about data accuracy. If eventual consistency is something you are OK with, this works instead of waiting forever for service B to return the data. Similarly, if it's data your service owns, you can read directly from the cache first. If there is a cache miss, you read from the database and then update the cache. This also helps you not overload the database with reads, so using caching wisely can be helpful.
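Both caching patterns described above can be sketched in Python with plain dictionaries standing in for the cache and the database. The names `TicketReader` and `read_remote` are hypothetical, and the sketch leaves out expiry and invalidation entirely.

```python
from typing import Callable, Dict


class TicketReader:
    """Data this service owns: read the cache first, the database on a miss."""

    def __init__(self, database: Dict[str, str]):
        self.database = database          # stand-in for the real database
        self.cache: Dict[str, str] = {}   # stand-in for a real cache
        self.db_reads = 0

    def get(self, ticket_id: str) -> str:
        if ticket_id in self.cache:
            return self.cache[ticket_id]  # cache hit: no database read
        self.db_reads += 1
        value = self.database[ticket_id]  # cache miss: read from the database
        self.cache[ticket_id] = value     # then update the cache
        return value


def read_remote(
    key: str,
    fetch_from_service_b: Callable[[str], str],
    stale_cache: Dict[str, str],
) -> str:
    """Data owned by service B: ask it first, and fall back to a possibly
    stale cached copy if the call times out."""
    try:
        value = fetch_from_service_b(key)
    except TimeoutError:
        # Eventual consistency is accepted here. This raises KeyError if
        # the key was never fetched successfully before.
        return stale_cache[key]
    stale_cache[key] = value
    return value


reader = TicketReader({"T1": "Inception, seat 12"})
reader.get("T1")
reader.get("T1")
print(reader.db_reads)  # 1
```

The second read is served from the cache, so the database sees one read instead of two.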

Of course, caching can sometimes be difficult to implement: you need to make sure that you invalidate the cache on writes and update it. So designing a caching solution also has its own little caveats, but caching the right way can definitely help your API latency go down.

So, coming back to these concepts, how do I fix a high-latency endpoint? First, your endpoint can only be as fast as its slowest downstream. You need to do an audit of everything your API does and what other services it calls. You need to do some math and figure out what you can do about each particular call. Should I set a timeout? Should I parallelize it? Should I make it asynchronous? Just doing the audit can give you a lot of different options. The second thing is: can the endpoint be split into simpler, smaller, and faster endpoints? Like I mentioned earlier, it's very easy to bloat an endpoint with too much functionality because it is just easier and faster to develop. But we have to make a call: should I create a new endpoint for this, or can I do it in the existing endpoint? And if I do it in the existing endpoint, what are the side effects for my consumers?

Just having that thought will help you make sure that your service level agreements are met. The third thing is parallelizing and caching wisely. We have to parallelize calls whenever possible. A lot of times your API does too many different things, and figuring out where parallel calls make sense can help a lot. Even if you don't go into caching solutions, parallelizing alone can make a lot of difference. Then there's introducing a caching layer and removing the redundant calls. I think a lot of this has to do with auditing your services and figuring out, for each of the problematic areas, what you can do. So that's all I had on latency. Thank you for listening to the session. I hope everybody had one or two takeaways from it.
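As a closing illustration of the timeout and parallelization advice, here is a small sketch using Python's `asyncio`. The three downstream calls (`fetch_pricing`, `fetch_seat_map`, `fetch_recommendations`) are made-up stand-ins simulated with sleeps, and the timeout value is arbitrary.

```python
import asyncio
import time


async def fetch_pricing():
    await asyncio.sleep(0.1)   # simulated fast downstream
    return "pricing"


async def fetch_seat_map():
    await asyncio.sleep(0.1)   # simulated fast downstream
    return "seat_map"


async def fetch_recommendations():
    await asyncio.sleep(5)     # simulated slow downstream
    return "recommendations"


async def with_timeout(coro, seconds, default):
    """Bound a downstream call so it cannot hold up the whole endpoint."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return default


async def handle_request():
    # The three calls do not depend on each other, so run them in parallel.
    return await asyncio.gather(
        fetch_pricing(),
        fetch_seat_map(),
        with_timeout(fetch_recommendations(), 0.2, default=None),
    )


start = time.perf_counter()
results = asyncio.run(handle_request())
elapsed = time.perf_counter() - start
print(results)  # ['pricing', 'seat_map', None]
```

Run one after the other with no timeout, these calls would take a little over five seconds. Run in parallel with the slow call bounded, the request finishes in roughly the length of the timeout, at the cost of returning without the recommendations.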