Building and evaluating robust conversational interfaces using LLMs

Automatic Summary

Revolutionizing Conversational AI Interfaces with Large Language Models

Welcome to my session, where I'll share insights about how the evolution of generative AI models, specifically large language models, can be harnessed to build conversational interfaces that are not only realistic and natural but also actually get the job done.

Why Focus on Naturalness in Conversational AI Interfaces?

The main goal of creating better conversational AI interfaces is to enhance the customer experience in scenarios like sales, hospitality, and retail, places where natural conversation and an effective understanding of human feedback are crucial. These interfaces should not simply serve as question-answering chatbots: they should facilitate an enhanced customer experience, replicate human interaction, and project a personality that keeps the conversational tone casual, and even fun, depending on the use case.

Defining a Good Conversational AI Interface or Service

So how do you define a good conversational AI interface or service, and what should be the driving force behind it? Let's delve into it.

  • Customer Experience: The ultimate goal of these interfaces should be a superior customer experience, not just answering questions. You want your chatbot to make customers feel acknowledged and valued.
  • Natural Interaction: A conversational AI should provide a free-flowing, back-and-forth interaction. It should handle out-of-scope queries gracefully and avoid the rigidity of form-filling AI models that ask a fixed series of questions just to populate a user profile.
  • Customizable Model: Your AI should offer high customizability, addressing diverse use cases and accommodating complex customer data models. This involves building an architecture that suits your specific needs rather than relying on general no-code or low-code platforms.
  • Evaluation Pipelines: For production use cases, defining thorough evaluation pipelines is integral. Especially in B2B scenarios, you need strong evaluation metrics to support versioning and to iterate toward the most effective strategies.

Building a Good Conversational AI System

Building a good conversational AI system goes beyond simply answering queries. Here's how (a component-level sketch in code follows the list):

  • Reducing Rigidity: Create a framework that can be deployed across multiple fields such as retail, sales, hospitality, etc. The system should reduce the rigidity in the conversation while giving you control over the flow.
  • Effective Tonality: Incorporate components that create a human-like persona, and accommodate tonal variations based on specific parameters to ensure the conversation sounds as natural as possible.
  • Reducing Latency: Aim to minimize the response time from large models. Quick responses increase customer satisfaction in your service.
  • Improved User Profiling: Continuously improve how well you understand user requests so that you can profile users more effectively.
  • Data Integration: Curate a system that can ingest data from varied dynamic and static sources in a customizable manner.
  • Iterative Improvement: Finally, keep assessing your system based on offline and online metrics to improve it continuously.
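
To make the component-level decomposition concrete, here is a minimal sketch in Python; the class and component names are hypothetical illustrations, not the system described in the talk:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Tracks what the agent has learned about the user so far."""
    turn_count: int = 0
    user_profile: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

class ConversationalPipeline:
    """Each stage is a separate, independently testable component."""

    def __init__(self, nlu, retriever, generator, paraphraser):
        self.nlu = nlu                  # user understanding (intent, sentiment, slots)
        self.retriever = retriever      # static + dynamic data sources
        self.generator = generator      # LLM response generation
        self.paraphraser = paraphraser  # tonality / persona layer

    def respond(self, state: AgentState, user_message: str) -> str:
        understanding = self.nlu(user_message, state)   # update profile implicitly
        state.user_profile.update(understanding)
        context = self.retriever(understanding, state)  # fetch relevant data
        draft = self.generator(user_message, context, state)
        reply = self.paraphraser(draft, tone="casual")  # apply brand tonality
        state.history.append((user_message, reply))
        state.turn_count += 1
        return reply
```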

Evaluating the System

The crucial part of building a conversational AI system is evaluating it effectively. Consider the following points (a small scoring sketch follows the list):

  • Create an Evaluation Dataset: A gold dataset of representative questions and ideal responses forms the basis for benchmarking the system at each point of iteration.
  • Thoughtful Metrics: Design metrics that assess both online and offline performance, at the individual-component level and at the full-system level.
  • Align with Business Metrics: Always align your metrics with business goals to check if your AI system is aiding the business’s growth.
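
As an illustration of these evaluation ideas, here is a small sketch of semantic answer similarity scoring against a gold dataset, assuming the sentence-transformers library; the dataset contents and model choice are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative gold dataset: a question, the ideal answer, and the system's answer.
eval_set = [
    {
        "question": "What should I buy my husband for his 30th birthday?",
        "gold": "Acknowledge the milestone, then probe for his interests.",
        "system": "A milestone birthday! What does he usually like?",
    },
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is arbitrary

def semantic_answer_similarity(gold: str, system: str) -> float:
    """Cosine similarity between the embeddings of the two answers."""
    gold_emb, system_emb = model.encode([gold, system])
    return float(util.cos_sim(gold_emb, system_emb))

for example in eval_set:
    score = semantic_answer_similarity(example["gold"], example["system"])
    print(f"SAS = {score:.3f} for: {example['question']}")
```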

Conclusion

Designing effective conversational AI interfaces is a challenging task involving a multitude of factors. Nonetheless, the evolution of GenAI models and the advent of large language models make it quite achievable. With a custom approach that addresses these interfaces' limitations, it's possible to reach a solution that supports natural user interaction and ultimately enhances the user experience. For any questions or suggestions, feel free to connect. Thanks for joining me in this session!


Video Transcription

Hi, everyone. Good morning; it's morning where I am. Welcome to this talk. What I'll be talking about is really using GenAI models, specifically large language models, to build conversational interfaces that are more natural, more realistic, and actually get the job done. The domain of this talk, and of the conversational AI interfaces we're talking about, is scenarios like sales, hospitality, and retail, places where naturalness of conversation, but really a better understanding of human feedback, is extremely important.

So let's dive into the session. The theme is, essentially, how not to sound like ChatGPT. The agenda for this session is, first: what do you define as a good conversational AI interface or service, and what is the motivation? Is it just being able to answer questions, or is it more than that? Then I'll present an outline of how to customize it and fit it to your use cases. This is not about using a no-code or low-code platform; rather, if you're in a conversational AI or natural language understanding team where real customizability is required and custom data models are essential, how do you actually structure such an architecture?

Then, what's most important in my opinion, is defining evaluation pipelines, especially for production use cases. Say you're in a B2B setting, building a conversational agent that's going to be used by an external business, an enterprise that's your customer, so it's really B2B2C. You still need thorough evaluation metrics to do versioning and to iterate over the best versions and changes, both at a component level in your pipeline and end to end, evaluating the entire architecture of the system. Then, if we have time, we'll focus a little on retrieval systems for retrieving the relevant data, and on how you actually evaluate those. So, who should care about this talk?

This talk is especially relevant if you're in the space of natural language understanding and processing, and potentially conversational AI, or if you're a senior ML or technical architect who wants to build a system that can ingest data from different sources and produce an agent that does a good job at whatever the business task at hand is. But why should we care about making these conversational interfaces robust and more natural? The main reason is that you want these agents to not just be question-answering chatbots; you want them to actually enhance the customer experience. If, at a senior level, your technical strategy is to add these in to improve the customer experience and make customers feel heard, then these AI systems need to be able to imitate that at the very least, if not truly replicate it.

You don't want them to just answer questions with a straight face. The agent shouldn't be a thin wrapper on another LLM; it should have personality. There should be candidness, and the tone should be casual and potentially even fun, depending on the use case. You want a back-and-forth interaction; the conversation should be free-flowing. If anything goes out of scope, the agent shouldn't just say, "Sorry, I can't handle that"; it should fail gracefully in those cases. However, these rules should still be configurable based on business logic. And we don't want users to fill out a form.

When you have a sales or retail agent, you don't want it to ask questions that are effectively just filling out a form for a potential user or customer. You don't want it to go, "Okay, so what do you like about this? What do you like about that?" You want to implicitly understand preferences and make a recommendation. Everything being softer and more implicit is really the key here. And we want a pipeline that's accessible at every component: if you have one module focusing on user understanding, with a couple of machine learning models, and then components focusing on retrieval of data and components focusing on generation, each of them should be accessible so that you can iterate on them in a deterministic manner.

So the goal here is that we want to be able to build an actual sales or hospitality or marketing agent and measure how effective it is at its actual job, which is doing the sales, or being a good marketer. This example should give you some insight. On the left you have a conversation, which is roughly what you get if you just ask GPT-3.5 or GPT-4, or many of the off-the-shelf LLMs that are available. A customer says, "What should I buy my husband as a gift for his 30th birthday?" and the agent just replies with a long, wordy list of things that would work well. Sure, there are times when the behavior of the model will be better and you'll get a probing question coming back.

But what we actually want is deterministic behavior: a series of actions that can be taken by the agent. When the customer says something like this, there's potentially an acknowledgement action, something like, "Oh, a milestone birthday! Sounds like you really want to make it special. What does he usually like?" So this is acknowledgement and then probe. Based on both of these actions, certain states get filled in, implicitly: we basically need to understand, from this response, that the person is looking for something that fits a certain profile. This agent-driven questioning and acknowledgement behavior is a research topic as well.
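
A minimal sketch of this kind of deterministic, action-driven policy, with hypothetical action and slot names, might look like the following:

```python
from enum import Enum, auto

class Action(Enum):
    ACKNOWLEDGE = auto()
    PROBE = auto()
    RECOMMEND = auto()

# Slots the agent tries to fill implicitly from conversation (illustrative).
REQUIRED_SLOTS = {"occasion", "recipient", "interests"}

def choose_actions(filled_slots: dict) -> list:
    """Deterministic policy: always acknowledge, then probe until converged."""
    actions = [Action.ACKNOWLEDGE]
    if REQUIRED_SLOTS - filled_slots.keys():
        actions.append(Action.PROBE)      # profile incomplete: keep probing
    else:
        actions.append(Action.RECOMMEND)  # enough of a profile: recommend
    return actions

# The customer's first message fills two slots implicitly.
slots = {"occasion": "30th birthday", "recipient": "husband"}
print(choose_actions(slots))  # [ACKNOWLEDGE, PROBE] -> ask what he likes
```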

But right now, I don't think large language models can implicitly do this deterministically, so we need a framework that actually defines components that can handle it. So how can we actually build a good conversational AI system? We first define a framework that can be deployed in retail, sales, and hospitality; that's what we're trying to do here. What we want in this case is to reduce rigidity in the conversation while having more control over the flow, which is what I was alluding to in the previous example. We want to improve the tonality, so there has to be some sort of tonality component that makes the agent sound more human, or adds a distinct textual or conversational style.

Obviously, we want to reduce latency in responses from larger models. You can't just use the largest LLM available, because it takes time to generate responses, and in a sales or retail setting it's not acceptable to wait four seconds, potentially ten seconds, for a response. We want to understand user requests and profile the users better. And we want to build a generalizable and customizable data integration pipeline: different B2B customers will have different data sources; in the retail setting, different product catalogs, different price books, different systems that can directly feed in inventory information, and so on. How do you ingest all of that in a system that's generalizable? And, like I said, we should be able to improve it iteratively based on measurable online and offline metrics. So how do we actually do this?

What we need, basically, is an improved NLU component. That means understanding what the intent of the user is, and understanding the sentiment behind each important attribute that's relevant. This comes from the classical dialogue-systems framework, but now, with the advent of large language models, we can potentially use better models and better methods in a multi-turn, more contextual setting. We also need to understand when to stop asking questions. That's important too: like I said earlier, we don't want to fill forms; it's annoying for users and it's not the most natural experience. What we want is what we do as human beings: we understand when we've understood enough about the person we're having a conversation with. We call that convergence.
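
As a rough sketch of what an LLM-backed, contextual NLU call could look like (the prompt, field names, and the `llm_call` stand-in are all assumptions, not a specific API):

```python
import json

NLU_PROMPT = """Given the conversation history and the latest user message,
return JSON with: "intent", per-attribute "sentiment", implicitly stated
"preferences", and "converged" (true if we know enough to stop probing).

History: {history}
User message: {message}
JSON:"""

def contextual_nlu(history, message, llm_call):
    """llm_call stands in for any chat-completion client (an assumption).

    Example output:
    {"intent": "gift_search",
     "sentiment": {"price": "neutral"},
     "preferences": {"recipient": "husband", "occasion": "30th birthday"},
     "converged": false}
    """
    raw = llm_call(NLU_PROMPT.format(history=history, message=message))
    return json.loads(raw)
```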

Convergence means understanding when you've actually built a sufficient picture of a user, when you've converged to some state of their requirements. That brings me to the next point, which is better agent state, context, and memory handling. The state is the current state of the conversation, the current state of the user, and wherever you are in the flow of the conversation: whether you've made a recommendation, whether you've filled in enough information about the user's profile, and so on. This state is also contextual. Making a recommendation in response to an FAQ-style question, for example, ruins the experience; it's the difference between the user actually engaging end to end versus always requesting a human agent at the end.

Different things in different contexts can mean completely different things, so state tracking also has to be contextual. Then there's behavior: how much to acknowledge, how much to probe, how many back-and-forth turns of conversation you should allow before you realize that this user is clearly frustrated, or doesn't know what to do.

At that point you need to hand off to somebody else, and so on. How you actually learn that, or make it configurable, is another problem. We should be able to provide configurability: config files, high-level definition files, or some sort of tooling that allows this configurability to be pretty flexible. And then, to learn this both heuristically and through machine-learning-based approaches, you can use things like prior logs from sales agents. That data is a little hard to get; it's easier to get data about what we do on the internet than about what happens in real life, in person-to-person interactions. But there are call center logs, telemarketing logs, and so on.
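
A sketch of what such a behavior config might look like; the knobs, names, and defaults are entirely illustrative:

```python
from dataclasses import dataclass

@dataclass
class AgentBehaviorConfig:
    """Illustrative per-business knobs; names and defaults are assumptions."""
    acknowledge_first: bool = True   # acknowledge before probing
    max_probe_turns: int = 3         # probes allowed before recommending anyway
    handoff_after_turns: int = 6     # frustration threshold -> human handoff
    tone: str = "casual"             # brand-specific conversational style
    allow_chitchat: bool = True      # graceful handling of out-of-scope turns

# Different B2B customers ship different configs, with no code changes.
retail_config = AgentBehaviorConfig(max_probe_turns=2, tone="upbeat")
hospitality_config = AgentBehaviorConfig(handoff_after_turns=4, tone="warm")
```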

We also want to enhance context from static and dynamic sources using a common data ingestion pipeline. Like I was mentioning, we want to use sources like product catalogs, textual descriptions, and marketing information, all the stuff a sales associate really has access to and that makes them a good sales associate. And then there are things like plugins that connect with various APIs for pricing, ads, or promotions; those are the more dynamic sources. We use a combination of both of these to perform RAG, retrieval-augmented generation, for LLMs: the system understands the user and the state of the conversation, retrieves the relevant information, and then feeds that into the context of a generation LLM. That's basically what we'd like to do.
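
A minimal sketch of that retrieve-then-generate loop; `retriever` and `llm_call` are generic stand-ins rather than a specific library:

```python
def rag_respond(message, user_profile, retriever, llm_call):
    """Retrieval-augmented generation as described above (illustrative)."""
    # 1. Build a query from the current understanding of the user.
    query = f"{user_profile} {message}"
    # 2. Retrieve from static (catalog) and dynamic (pricing API) sources alike.
    documents = retriever(query, top_k=5)
    # 3. Feed the retrieved snippets into the generation LLM's context.
    context = "\n".join(doc["text"] for doc in documents)
    prompt = (f"Context:\n{context}\n\n"
              f"User profile: {user_profile}\n"
              f"User: {message}\nAgent:")
    return llm_call(prompt)
```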

Then there's the tonality piece I mentioned, which is about making sure the conversation is natural and non-rigid and that brand-specific styles are available. Potentially you could have a tonality component or a paraphrase component that replicates the tonality a business wants in its agents, so that everything doesn't sound standard. And of course there's all the other stuff, like logging, observability, and metric accumulation once you've deployed the system, which is more standard from a machine-learning monitoring and software-systems monitoring perspective. But that information is essential, because it can also be used for fine-tuning and as implicit feedback. Now let's look at this high-level diagram of what I've discussed so far. There's a user, and what we have here is a user-contextual NLU: a model, or a set of models, that understands the user's latest response and the multiple responses so far.

Based on that, it feeds into a set of data sources and config files: static data sources; dynamic data sources, which are the API-based sources; and configurations that define agent behavior. All of that is assimilated and fed into a context-creation template, which uses the agent behavior configs, such as how much to ask and whether to acknowledge, and then decides, based on context, whether we should recommend, probe more, and so on, or just chit-chat.
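
A sketch of such a context-creation template; the template text and field names are assumptions for illustration:

```python
CONTEXT_TEMPLATE = """You are a sales associate. Tone: {tone}.
What we know about the user: {user_profile}
Behavior: acknowledge={acknowledge}, next_action={next_action}
Retrieved data:
{retrieved}
Reply to the user accordingly."""

def build_context(config, user_profile, next_action, retrieved_docs):
    """Assembles the prompt fed to the generation LLM."""
    return CONTEXT_TEMPLATE.format(
        tone=config.tone,                     # from the behavior config
        user_profile=user_profile,            # from the contextual NLU
        acknowledge=config.acknowledge_first,
        next_action=next_action,              # "probe", "recommend", "chitchat"
        retrieved="\n".join(retrieved_docs),  # static + dynamic sources
    )
```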

That context is then used to generate a response from a large language model, followed by a paraphraser. The response is then added, in the typical dialogue-systems fashion, to the memory of the conversation, because that can be used going forward, and so it continues in this loop. I'm going to stop here and ask if anyone has any questions or comments, anything they'd like to ask, before we wrap this up. Okay. So now let's go into the next part, which is how to evaluate these systems. I'll keep this at a fairly high level, but what we absolutely need to do is start with building out an evaluation dataset.

That's extremely important, because you want to be able to benchmark your entire system at every point of iteration. We know this from software; we know this from traditional ML algorithms. Even for generative models it's really essential; it's just that how we measure becomes slightly different. We also want to design thoughtful metrics that actually assess the performance of the various components, as well as online and offline performance, and we want to evaluate at a component level as well as at an end-to-end system level. So we evaluate the NLU components, the RAG components, and the generation model individually, but also the entire pipeline end to end. Creating a gold dataset is super important: it holds the relevant questions and the most ideal responses.
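
The shape of such a gold dataset might look like this; the contents are invented for illustration:

```python
# Each entry pins down the input, the ideal response, the expected agent
# action, and which documents the retriever should have surfaced.
gold_dataset = [
    {
        "conversation": ["What should I buy my husband for his 30th?"],
        "ideal_response": "A milestone birthday! What does he usually like?",
        "expected_action": "acknowledge+probe",
        "relevant_docs": ["gift_guide_milestones"],
    },
    {
        "conversation": ["He loves hiking.", "Budget is around $200."],
        "ideal_response": "Recommend two or three hiking gifts under $200.",
        "expected_action": "recommend",
        "relevant_docs": ["catalog_outdoor", "pricebook_current"],
    },
]
```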

You could also use AI-assisted evaluation to score the responses that come out of the system, using a larger model like GPT-4 or something that's much better. One way of assessing responses is semantic answer similarity: a model that understands the context of a question and an expected answer, where the answer generated by your system is measured against the expected one, basically using cosine similarity. Then, for evaluating retrieval systems, there are various other criteria to compute metrics on: whether the retrieved data is actually relevant to the question that's asked, whether we're able to recall all the information relevant to a particular question, and so on.
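
For the retrieval side, standard rank metrics such as recall@k and precision@k capture exactly those two questions; a small self-contained sketch:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k results that are actually relevant."""
    top = retrieved_ids[:k]
    return len(set(top) & set(relevant_ids)) / len(top) if top else 0.0

# Hypothetical retriever output scored against a gold entry.
retrieved = ["catalog_outdoor", "faq_returns", "pricebook_current"]
relevant = ["catalog_outdoor", "pricebook_current"]
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
print(precision_at_k(retrieved, relevant, k=3))  # ~0.667
```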

And how do you evaluate online? Always by tying it to a business metric: is this agent actually leading to an increase in sales? Is it actually improving engagement with customers? Assessing the sentiment of conversations after the fact is important as well. And with that, I'm reaching the end of my session. If anyone has any questions, comments, or suggestions, please feel free to write them in the session chat; you can reach out to me on LinkedIn as well. Okay, there's one question: what's your opinion on red-teaming your AI solutions? So, with red teaming, I mean assessing for things like toxicity, hallucinations, and so on; these are obviously open research topics, and we're working on them as well. But there are scores, like hallucination scores, that can be used. I think right now, for toxicity, guardrailing on the input side as well as on the output side becomes important: post-generation verification and filtering.

On the input side, that means guardrailing against toxic comments and so on. There are a lot of different approaches; however, each business has its own definition of what toxicity is, or what is disallowed. So there still has to be a dedicated set of people, or some sort of data collection effort, to ensure that everything that is disallowed is represented in some form somewhere, whether that comes from people actively working to understand the shortcomings of the system, or from building out datasets and models that can effectively identify these cases. For hallucination, there are also ways to reduce it: what we've found is that improving the retrieved context that's being fed into the models really works well, and reducing the length of the context works really well too.
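
The input- and output-side guardrailing pattern described here can be sketched as follows; the filters are trivial stand-ins for real toxicity or policy classifiers:

```python
def guarded_respond(message, generate, input_filter, output_filter):
    """Wrap generation with pre- and post-checks (all arguments are stand-ins)."""
    if not input_filter(message):       # e.g. a toxicity classifier
        return "Sorry, I can't help with that."
    draft = generate(message)
    if not output_filter(draft):        # post-generation verification
        return "Let me connect you with a human agent."
    return draft

# Trivial keyword filter standing in for a real, business-specific model.
BLOCKLIST = {"offensive_term"}
is_allowed = lambda text: not (BLOCKLIST & set(text.lower().split()))

reply = guarded_respond("Any gift ideas?", lambda m: "How about a watch?",
                        is_allowed, is_allowed)
print(reply)  # "How about a watch?"
```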

But for very important things, like business logic or business-specific terms, filtering on the output side is always a good check to have: identifying the various slots and making sure they're filled with the relevant values, for example for pricing and such. So I think it's a combination of both: there's research and development, but there's also a lot of accounting for mistakes and being proactively cautious. I hope I was able to answer the questions. Alright, thank you, everyone. That's the end of my session. If you have any other questions, please feel free to connect. It's been really nice presenting today to this set of amazing women. Thanks a lot. Bye.