Podcast

LLM Observability: Insights from Traceloop's Gal Kleinman

With

Gal Kleinman

29 Apr 2025

Episode Description

Join Simon Maple as he interviews Gal Kleinman, co-founder of Traceloop, to explore the complexities of LLM observability. Kleinman discusses the significance of evaluation suites and the unique challenges posed by LLM applications. With his extensive background in engineering, Kleinman shares practical solutions and best practices, including the use of OpenLLMetry, to optimize observability and performance in AI systems. This episode is a must-listen for developers seeking to enhance their expertise in LLM applications.

Overview

Introduction

In this episode of the AI Native Dev podcast from Tessl, host Simon Maple sits down with Gal Kleinman, CTO and co-founder of Traceloop, to explore the intricacies of LLM (Large Language Model) observability. Kleinman shares his insights on the challenges and best practices of building effective evaluation suites for LLM applications, highlighting the unique hurdles in this domain. With a background leading machine learning platform and data infrastructure teams at Fiverr, Kleinman provides a deep dive into the world of LLM observability, offering practical advice to developers navigating this complex landscape.

Building Effective Evaluation Suites

Gal Kleinman emphasizes the importance of constructing robust evaluation suites to get valid insights from LLM applications. His advice is to start with simple proxy metrics, such as the word count of a generated summary, which can be naive but still useful, while accepting that "there is no silver bullet": generic NLP metrics rarely capture the quality criteria of a specific application, since evaluation is highly dependent on context. From there, teams should manually annotate real examples, decide which criteria they actually want to evaluate, and confirm that human evaluators can apply those criteria consistently before automating them. Building a diverse set of test cases matters too, and the more tightly an application's flows are scoped, the easier it becomes to achieve meaningful coverage.
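As an illustration of these ideas, here is a minimal sketch of an offline evaluation suite. The call_llm helper and the checks are hypothetical stand-ins for a real model client and application-specific criteria, not anything from Traceloop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    checks: list[Callable[[str], bool]]  # each check returns True if the response passes


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model call (OpenAI, Anthropic, a local model, etc.).
    return "You can request a refund within 30 days of purchase for unused orders."


def run_suite(cases: list[EvalCase]) -> None:
    for case in cases:
        response = call_llm(case.prompt)
        passed = sum(check(response) for check in case.checks)
        print(f"{case.name}: {passed}/{len(case.checks)} checks passed")


# Example cases covering different scenarios, with simple proxy checks.
cases = [
    EvalCase(
        name="refund_policy_question",
        prompt="A customer asks: how do I get a refund for an unused order?",
        checks=[
            lambda r: "refund" in r.lower(),       # stays on topic
            lambda r: 5 <= len(r.split()) <= 150,  # length proxy, echoing the word-count idea
        ],
    ),
]

if __name__ == "__main__":
    run_suite(cases)
```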

The Journey of Traceloop

Kleinman discusses the motivation behind founding Traceloop. During their Y Combinator batch (Winter '23), he and his co-founder Nir were building a very different product: autonomous agents intended to handle end-to-end tasks for distributed systems. The working MVP was unreliable; Kleinman estimates it worked about 30% of the time, while his more optimistic co-founder puts it at 60 to 70%. Needing to understand what was happening to users in production, the team built an internal observability solution on top of OpenTelemetry. Kleinman jokes that he could tell the "beautiful story" of launching the MVP and improving accuracy from 30% to 90% in a week or two, but the truth is that the original MVP was never launched. "We fell in love with the idea of observability for LLM applications," he reflects, and Traceloop has been building that solution ever since.

Challenges in LLM Observability

Monitoring LLM-based flows presents unique challenges, particularly in mapping all potential input scenarios. Kleinman explains that the open-ended nature of LLMs makes it difficult to create exhaustive test coverage. Unlike deterministic systems, LLMs can receive an effectively endless variety of inputs, and the notion of a "correct" output is often subjective rather than binary. "It's much harder in general to map all the options... because LLMs by their definition are quite open-ended," Kleinman states. His practical advice is to scope the application to specific flows, for example constraining what a support chatbot will handle, which narrows the input space enough to build meaningful coverage.

OpenLLMetry and Its Impact

Kleinman introduces OpenLLMetry, an open-source project that extends OpenTelemetry with instrumentations for the building blocks of LLM applications. It covers three pillars: foundation model SDKs (OpenAI, Anthropic, Bedrock, Gemini and Vertex AI, among others), vector databases, and frameworks such as LangChain, LlamaIndex, CrewAI and Haystack, with around 40 instrumentations in total. Once the SDK is initialized, traces capture prompts, responses, token usage, cost per call, retrieved documents and framework-internal steps, and because the data is standard OpenTelemetry, it can be sent to Traceloop or to any other observability platform that supports it. Kleinman sees OpenTelemetry as "a great tool for the purpose of observing agents, observing LLM applications," since agent pipelines closely resemble the microservice architectures it was designed to trace.
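For readers who want to try it, a minimal sketch of wiring up OpenLLMetry via the Traceloop Python SDK is shown below. It assumes the traceloop-sdk and openai packages and an API key in the environment; exact initialization options may differ, so check the Traceloop documentation.

```python
from traceloop.sdk import Traceloop
from openai import OpenAI

# One init call installs the OpenLLMetry instrumentations; traces can go to
# Traceloop or, via standard OpenTelemetry exporters, to another backend.
# The app_name value here is just an example.
Traceloop.init(app_name="support-bot")

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# This call is now traced: prompt, response, token usage and latency are
# captured as OpenTelemetry spans without further code changes.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)
print(response.choices[0].message.content)
```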

Best Practices for LLM Observability

Getting started with LLM observability requires little configuration: initializing the SDK yields traces plus the classic, deterministic metrics such as token usage and latency, on which developers can configure alerts and monitors. Kleinman calls this "a good step towards the direction of being monitored and being covered," but stresses that it is not enough: the heavier lift is defining production evaluations that score the actual content of responses rather than the surrounding metadata. He recommends starting from that baseline and gradually refining the observability and evaluation setup based on real-world data, since scanning traces by hand stops scaling once traffic grows.
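As a rough illustration of that baseline step (not a Traceloop API), the sketch below computes simple thresholds over a handful of hypothetical trace records.

```python
from statistics import quantiles

# Hypothetical trace records; in practice these would come from your observability backend.
traces = [
    {"latency_ms": 820, "total_tokens": 410, "error": False},
    {"latency_ms": 4350, "total_tokens": 390, "error": False},
    {"latency_ms": 910, "total_tokens": 2200, "error": True},
]


def check_thresholds(traces, p95_latency_ms=3000, max_tokens=2000, max_error_rate=0.02):
    latencies = sorted(t["latency_ms"] for t in traces)
    # 95th-percentile latency (falls back to the single value for tiny samples).
    p95 = quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else latencies[0]
    error_rate = sum(t["error"] for t in traces) / len(traces)

    alerts = []
    if p95 > p95_latency_ms:
        alerts.append(f"p95 latency {p95:.0f}ms exceeds {p95_latency_ms}ms")
    if any(t["total_tokens"] > max_tokens for t in traces):
        alerts.append(f"a call exceeded {max_tokens} tokens")
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.0%}")
    return alerts


for alert in check_thresholds(traces):
    print("ALERT:", alert)
```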

Privacy and Evaluation Consistency

Kleinman believes in building evaluations from real production data, while acknowledging that this can be difficult for companies with strict privacy constraints, since "everyone respects the privacy of their clients and their customers." Where it is possible, real-world examples can be used to train evaluators that are then reused during development. He also stresses evaluation consistency: because judgments about LLM output are subjective, the common practice is to use more than one human annotator per example, and criteria should be defined clearly enough that different evaluators reach consistent results.
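A simple way to check that consistency is to measure agreement between annotators before automating a criterion; the sketch below uses made-up labels.

```python
# Two human annotators label the same set of responses as good or bad.
annotator_a = ["good", "bad", "good", "good", "bad", "good"]
annotator_b = ["good", "bad", "good", "bad", "bad", "good"]

# Raw agreement: the fraction of examples where both annotators gave the same label.
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Raw agreement: {agreement:.0%}")

# If agreement is low, the criterion is probably too subjective or underspecified;
# tighten the rubric before building automated evaluators on top of it.
if agreement < 0.8:
    print("Refine the evaluation rubric before automating this criterion.")
```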

Summary/Conclusion

In conclusion, Gal Kleinman provides a comprehensive overview of the challenges and solutions in the realm of LLM observability. Key takeaways for developers include:

  • Start with a strong evaluation suite to gain meaningful insights.

  • Understand the unique challenges of LLM flows and test coverage.

  • Leverage tools like OpenLLMetry for enhanced observability.

  • Implement best practices in setting alerts and thresholds.

  • Ensure privacy and consistency in evaluation processes.

Resources

Visit Traceloop's Website: https://www.traceloop.com/

Chapters

0:00 - Episode highlight: LLM Observability challenges

1:00 - Intro to guest and their role in the industry

5:00 - Building effective evaluation suites for LLMs

10:00 - The journey and success of Traceloop

15:00 - Challenges in monitoring LLM flows

20:00 - Introduction to OpenLLMetry

25:00 - Best practices for LLM observability

30:00 - Ensuring privacy and evaluation consistency

35:00 - Key takeaways and conclusion

Full Script

You are listening to the AI Native Dev brought to you by Tessl.

Hello and welcome to another episode of the AI Native Dev. My name is Simon Maple, and joining me today is Gal Kleinman, who is the CTO and co-founder of [00:01:00] Traceloop. Welcome Gal. How are you?

Gal Kleinman: Thank you. I'm great. How are you?

Simon Maple: I'm doing very well, thank you. Yeah, not too bad. It's just turning, not summer, but we're just turning into spring in the UK, so finally we're getting a little bit of heat back in our day.

I'm all the better for a little bit of sun.

Gal Kleinman: Cool.

In the place I'm located it's already summer, like it's 25 degrees outside.

Simon Maple: Stop. Stop. You're hurting me. Yeah. So Gal, tell us a little bit about your story, your journey before Traceloop, first of all.

Gal Kleinman: Before Traceloop, I will tell you briefly about my last role before starting Traceloop. I was a group manager at Fiverr, leading a group developing the machine learning platform for the data science teams at Fiverr, and also another team dealing with the data infrastructure.

That's it. Then, once I left Fiverr, together with Nir, my co-founder and the CEO of Traceloop, we got accepted to Y [00:02:00] Combinator, our batch being Winter '23. And the rest is history; from that point Traceloop started.

Simon Maple: Yeah. Awesome. And so Traceloop's in and around the space of LLM observability, and we'll talk about that generally, but also a little bit about Traceloop itself as well.

So first of all, what's the problem, I guess, that Traceloop is trying to solve? What space is it in?

Gal Kleinman: So we are in the space, the famous space, of LLM observability, which is a derivative of general observability of applications. What we actually do and solve, in addition to classic observability, is the evaluation part, both in real time, in terms of monitoring your LLM application, and in helping you improve your LLM application in the development cycle.

Using mostly offline evaluations. So that's [00:03:00] the problem definition, or, put more in terms of the solution: we help you observe and improve your LLM application using the findings from production and tracing.

Simon Maple: Yeah. Awesome. And what was your inspiration, your motivation almost, to start Traceloop up?

Was there a pain point you had previously?

Gal Kleinman: Yeah, actually during the Y Combinator batch, we'd been developing a totally different product. It was just the beginning of the LLM and generative AI era. OpenAI had just released ChatGPT and all of the GPT models.

We'd been developing a solution which was based on a multi-agent architecture. It was like rocket science in those days. Today every other startup is developing autonomous agents, but we'd been developing autonomous agents which were supposed to solve the generation of end-to-end tasks [00:04:00] for distributed systems. I know, a lot of buzzwords, but this is what we were dealing with at that time. And we got to an MVP somewhere close to the end of the Y Combinator batch. We got to a working MVP of that architecture. And it wasn't stable. And it wasn't reliable at all.

There's always the argument between me and Nir, my co-founder, about whether it was working. I claim that it was working 30% of the time; Nir, as the CEO, who is supposed to be the optimistic one between us, claims that it was somewhere between 60 to 70% working. But it wasn't reliable at all, and we were afraid of going out of beta with this product.

At that point, we thought, okay, we've got to have some observability solution. We're a startup, it's fine if it's scrappy in the first days of [00:05:00] our product, but we've got to somehow observe what happens to our users in production, and using those observations we have to improve our product continuously and get to a point where it works.

I wouldn't say perfectly, because probably 100% is not achievable with AI, but to a point where we feel it's 90% working, and it's hard work. Then we started developing internally our own observability solution based on OpenTelemetry, which I guess we will talk about later in this podcast.

And the rest is history.

Simon Maple: Yeah, it's really interesting, as a startup, to understand when to go to market and at what time. And I think there's a couple of important things. One is meeting the expectations of potential users, or first onboarding those first users.

There's probably an amount of that which is in and around messaging, in terms of, we don't wanna promise too much necessarily, but we want to [00:06:00] gauge, we want to level expectations. But then I guess the other important thing is what we call at Tessl developing in a lab.

You don't wanna be developing in a lab. You want to get that external, real user feedback, and get that real user feedback in early. I'm curious, before we jump into a little bit more around LLM observability: did you find that once you pushed your product to market and you started getting that user feedback, was that an accelerator to that percentage going from whether it was 30 or 60 to much higher, or was it a steady increase, still similar to how you were developing in the lab before?

Gal Kleinman: I can tell you the beautiful story about, yeah, we launched the MVP and then we improved our accuracy from 30% to 90% in, I dunno, a week or two. But I prefer to tell you the truth, which is that we fell in love with the idea of observability for LLM applications, and that MVP was [00:07:00] never launched. Since then we've been working on the LLM observability solution.

Simon Maple: Yeah. Gotcha. Gotcha. Awesome. Observability. Then let's talk a little bit about, for developers who are using LLMs today, the need for LLM observability.

I guess, what are the biggest misconceptions that developers have when they're trying to build with an LLM? What are the biggest misconceptions or falsehoods that developers have about LLMs, whereby LLMs can really bite them if they don't know about them?

Gal Kleinman: I think everyone you'll probably talk to will say the same thing. It's quite easy to get to a sort of working POC, and most developers think, because we think we are better than others, okay, it's a super simple prompt and a super simple pipeline, and I know how it will behave.

Yeah, we will develop that POC and quite quickly it'll get into a [00:08:00] production-grade application. It is a misconception. The truth is that the vast amount of time is spent taking that POC, which is working, as I mentioned in our case, probably 30% of the time, and turning it into a 90% working application.

This is what you spend most of the time on, and this is a great misconception. Another misconception is the idea that, okay, if I just have a way to see what happens in production, or have some sort of simple tracing or simple logging capabilities,

this is what will help me get to a production-grade application, that 90% dream. That also, I think, is a misconception. You need to be much more sophisticated, as this solution does not [00:09:00] scale: you can't go, for example, trace by trace and monitor your production traffic that way and by that improve your application.

Simon Maple: Let's jump into some of the differences between this black box of an LLM existing in our app and our traditional code. Even simple things like debugging, for example, right? It's easy for me to step through my code and understand why things happen and how I get an answer out of that.

With an LLM, it's obviously completely different. We throw a prompt in, we get a response, and it can be very much: this is my input, this is my output. And actually if I give that same input 10 times to the same LLM, without context, I will get 10 different answers. How do we even think about that? What are the fundamental differences that we need to be aware of when debugging or doing anything in detail with a traditional application in comparison to an LLM?

Gal Kleinman: So just as you mentioned, the fact that it's not deterministic makes it [00:10:00] much harder to debug, because even if I copy the exact same example I saw in production and try to mock it in my dev environment, I will probably not get the same results, which makes everything harder to debug, test and fix.

That's in terms of debugging. But in terms of observability, the big difference is the fact that in classic observability, we have the common metrics and ways to understand which flows didn't work as we expected.

For example, we can track errors by looking for HTTP status codes, like looking for a 500 status code, or just looking for exceptions, or looking for other metrics like latency and stuff like that. These are all things that as engineers we are familiar [00:11:00] with, and a lot of best practices were written about tracking those metrics and by that understanding which flows are malfunctioning in your application. When we get to LLM-based flows, we have a whole different space of problems, and the most freaky stuff is understanding which actual flows didn't work as we expected.

Why is it so hard? Because the whole notion of good or bad, of success versus failure, is not that clear. It's not a binary result. And even if we look at the same result, let's imagine you and I look at the same response for a prompt we gave to ChatGPT. You can think it's a good answer for the question, and I can think it's a bad answer for the question.

Even when you look at manual [00:12:00] annotators in companies we are working with, the practice is usually taking more than one annotator for each example, because it is subjective and there is no clear answer for anything. Then you have to set up some metrics or way to evaluate what really happens in your production and evaluate whether it's good or bad. And this is a hard task, which we don't have in classic observability.

Simon Maple: Yeah. Yeah. It's so frustrating, right? In order to determine whether something is right or wrong, or, like you say, it's not even right or wrong.

There's a scale of: do you know what, for most people this will probably be right, but actually there's a group there for which it will be wrong. That's so hard to ascertain. I guess the second thing is, in terms of being able to test, would you say that it's harder to mock up production-[00:13:00]quality tests, actual user flows, compared to traditional input and output?

Or do you think that's reasonable as well? Because I guess you have to care about not just, here's a flow, here's some input, and we get some expected output, but the intent of the user, and is that user satisfied with the result they get. So my question is, from the point of view of actually being able to mock those tests, from an observability point of view, is it harder as well in development testing scenarios to create tests that are representative of production-quality flows, and then actually work out if it's correct or not?

In terms of responses?

Gal Kleinman: Yeah, it's much harder in general to map all the options, all the possible examples of input. And as a result, it's much harder to create good coverage or a good set [00:14:00] of tests, because LLMs by their definition are quite open-ended, as long as you are not scoping your application to super specific flows. For example, we can think of some support chatbot. If I just expose the LLM interface to my user and tell them, okay, I'm a support chatbot, ask me whatever you want,

then the amount of possibilities for the input is endless. But if I scope you, and this is a good practice in general for such products, if I scope you to stuff that the AI chatbot can do, then I can scope the variations of the input, and then it will be a little bit easier to create good coverage for those use cases.

Simon Maple: Yeah, I love that. Structured inputs really reduce the level of variance that you could receive. When we think about the challenges that we face [00:15:00] in the real world today, we talked about some of the problems that exist there from a developer point of view.

Some of the scenarios that they might see as potential issues. Hallucinations is the obvious one, I guess, when the LLMs just make up shit that's false but really needs some level of citation. What are the other things that we as developers or end users need to be worried about?

Things, I guess, that we need to know are candidates we can identify from an observability point of view.

Gal Kleinman: Just what we talked about earlier: the deterministic behavior which we are used to does not exist in this world. And this is an issue, because we can deploy some code to production which involves some prompt or some pipeline or some LLM-based application.

And what we saw, for example with one of our customers, is that the performance or the quality of the [00:16:00] answers from that LLM can degrade or drift over time, and this is something to be aware of. OpenAI, for example, even if we're using the exact same model revision, with the date of the release of that model,

they are doing a lot of changes, not to the foundation model itself, but usually to, let's say, the inference architecture. They have a complex architecture which handles the inference requests, and they are making those changes frequently, which can result in degradation in the performance of your LLM calls.

For example, we saw this with one of our customers: for one of their summary tasks, or summary prompts, they track the number of words in the summary as a [00:17:00] proxy metric for the quality of the summary.

And some random day they realized that the median number of words in the summary had just decreased dramatically. I guess it was around 20 words, which is quite significant in their case, because the whole summary was about between 50 to 150 words, as far as I remember.

Don't quote me on the exact numbers, but a lot of changes can happen: even if you tested your code and everything is working, someday everything can change and dramatically affect your product. Obviously there is the case of hallucinations, which I guess is so widely discussed.

But I dunno if I can add a lot more in these terms.
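To make the word-count example above concrete, here is a minimal, illustrative sketch (not from the episode) of tracking the median summary length and flagging a sudden drop; the numbers and thresholds are made up.

```python
from statistics import median

# Median words per summary, per day, for a hypothetical baseline week and for today.
last_week = [96, 110, 88, 102, 95, 99, 105]
today = [71, 68, 75, 70, 73]

baseline = median(last_week)
current = median(today)

# Alert if the median drops by more than, say, 20% against the baseline.
if current < 0.8 * baseline:
    print(f"Summary length drift: median {current} words vs baseline {baseline}")
```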

Simon Maple: Yeah. No, that makes sense. So let's jump into some of the observability terms then. We mentioned OpenTelemetry at the start, and [00:18:00] OpenLLMetry, which I think was maybe, was that coined by Patrick Debois?

Gal Kleinman: Yeah, exactly.

Simon Maple: It was. He loves naming things, doesn't he? So first of all, why don't you just remind us what OpenTelemetry is?

Gal Kleinman: OpenTelemetry is a CNCF project, totally open source, which is formalizing, standardizing and structuring the way applications, in our era of distributed systems and complex SaaS architectures, should report metrics, logs, and tracing data.

And they provide a lot of SDKs and tools around it, like collectors, which we can discuss later if you want; I'm not sure how important that is at this point. They provide a lot of tools for reporting data and instrumenting client libraries. When I'm talking about client libraries, I'm talking about SDKs and stuff like that.

Mostly for the backend. Today it [00:19:00] also exists for the client and even for mobile, I think. And you can report structured tracing data to any destination, since it's standardized. So every destination and every observability platform which supports OpenTelemetry and exposes an OpenTelemetry endpoint can consume, visualize and present it in their application.

I hope it makes sense. I'm not sure it was clear at all, but

Simon Maple: No, it's good. It is good. OpenLLMetry, now, that sits on top of OpenTelemetry. Tell us a little bit about what that does and, I guess, the differences.

Gal Kleinman: So as I mentioned, OpenTelemetry, they have their contrib repo, and also I think in the main repo, they provide a lot of instrumentations. An instrumentation is actually a library or a sub-library; let's take the example of Python. It's a package that's [00:20:00] wrapping a specific SDK. For example, if in Python you are working with Postgres and you use SQLAlchemy as your client or ORM,

they have an OpenTelemetry instrumentation for SQLAlchemy, which is wrapping the library and knows how to extract the data in some magic way out of the SDK methods and report it in a standard way as OpenTelemetry logs, traces, and metrics. What we did with OpenLLMetry: back then we were looking for a way to report, let's call it observability data, which is actually traces.

We saw it as traces because we thought, oh, we've been developing agents, and it reminded us of a microservices architecture where you [00:21:00] have an orchestrator, which is the agent, or a lot of agents communicating with each other. You have the orchestrator, and then you have a sort of microservice, which is an agent doing something as a tool for another agent. So we thought, oh, OpenTelemetry can make a lot of sense in these terms as well. Both me and Nir, my co-founder, are quite experienced engineers.

And as you become an experienced engineer, you realize that you don't have to reinvent the wheel for every task you are doing, and the more you adopt already-developed or out-of-the-box solutions, the better an engineer you become, at least in my opinion.

So we thought, okay, OpenTelemetry could be [00:22:00] a great tool for the purpose of observing agents, observing LLM applications, because even if you are not developing agents, the pipelines were getting more and more complex at that time and involved API calls and stuff like that, which is super native to OpenTelemetry, because HTTP calls are being instrumented over there.

So we thought, okay, let's adopt OpenTelemetry. Back then, once we realized it, we said, okay, let's create a repository, a project, which extends OpenTelemetry. What we added is a set of instrumentations for most of the relevant SDKs and libraries used during the development of LLM-based applications.

It's made up of three different types of instrumentations. One is for foundation models: you can find there OpenAI, Anthropic models, Bedrock, Gemini and [00:23:00] Vertex AI and so on. We have 40 instrumentations today, or something around that; I don't even count them anymore.

We have vector database instrumentations there, and the third pillar is frameworks like LangChain, LlamaIndex, CrewAI, Haystack, and other friends. That way we provide instrumentations for most of the building blocks of LLM applications. Combining this with the already existing instrumentations provided by OpenTelemetry, and using the mechanisms of OpenTelemetry, we are able to report or ingest OpenTelemetry data both to our platform but also to any other observability platform supporting OpenTelemetry.

And you can observe and see great [00:24:00] traces which contain the prompts, the responses, and the metadata around them, like the tokens used and the cost for every call, the, let's say, top-K documents you've retrieved from your vector database, and the transform tasks inside the frameworks like LangChain and LlamaIndex.

And a lot more data you can get for free. You just have to initialize our SDK, the Traceloop SDK, and you get it out of the box.

Simon Maple: So what would the best practices be for someone starting with LLM observability? Is there an amount that would need to be set up in terms of, obviously, does everything get switched on and then you apply alerts and the kind of thresholds which you care about?

How much configuration does there need to be per application based on subtle differences between implementations?

Gal Kleinman: So if you are just working with the basic tracing and basic [00:25:00] metrics that are provided by the SDK, it's only the raw data.

Then you'll have the classic observability metrics, let's say the deterministic ones: token usage and latency and stuff like that. You can set and configure alerts and monitors based on that. So as a first step, it's a good step towards the direction of being monitored and being covered.

But it's not enough, to be honest. As we talked about before, what's more important is defining evaluations in production which use the real content of the response and try to evaluate and score the response itself, and not the metadata metrics.

Simon Maple: Yeah. And is that one of the kind of hardest things? If you think about the lift that a user needs in terms of getting good observability data back, is creating the [00:26:00] right evals, good evals, the lift from the user point of view?

Gal Kleinman: Yeah. This one is much harder. You can start with the Traceloop SDK and report the OpenTelemetry tracing data to either Traceloop or any other vendor; this one is easy. But once it gets to defining and configuring, or building, the right evals, then it gets much harder.

And this is the heavy lifting of monitoring and observing LLM applications.

Simon Maple: To actually get valid insights back. What would you say then as advice to people who are trying to build eval suites? What are the gotchas, what should people look out for, and where should they start?

Gal Kleinman: From what we said, I will say a good starting point can be working with some proxy metrics, which can be slightly naive, like the word count I mentioned and stuff like that. It can be a good proxy for some cases, but if you wanna take it one step further, then from what we saw, at least, it is super dependent on the context of your application. There is no silver bullet, and general metrics which tend to be used in classic NLP, ROUGE, BLEU, they're all great metrics, they just usually didn't provide the expected result. You'll have to find your evals and provide the context of your application. It requires a lot of work: you first have to manually annotate data and understand, as a human annotating or evaluating the responses, which criteria I want to evaluate, and make sure that as [00:28:00] a human I can evaluate it and get consistent results with other humans, for example.

Simon Maple: Yeah, and I think that's really interesting actually, because I think it's one of those hard ones where you don't really know how good your eval is. It could provide you with some results, say certain models are better than others, say certain changes are better than others, but you don't really know until it's in production, right? Because it's only when you get users that are happier or less happy based on the results that you actually know whether your LLMs are providing good value.

Is that fair in all cases? I think maybe with structured inputs and structured outputs there's probably a little bit more measurability without having to rely on users' sentiment. But does it feel like sometimes we are guessing a little bit with evals, and really we need to truly get that production-level user feedback?

Gal Kleinman: To be honest, as far as I see it, user feedback, when you ask for it explicitly, tends to [00:29:00] be biased. It's not a good metric. If you wanna take something from production, then I would look for implicit scoring by users. Let's say I'm working on a product that suggests completions.

Then if the user actually accepted my completion, I can count it as a thumbs up. But if I proactively ask for their feedback, it usually tends to be biased, because most of the time you will get only the angry users. Even if you have 90% of them which are actually satisfied, you will see only the ones who were angry. As for setting up evaluations: we believe, by the way, in using real production data and examples. It can be hard in terms of privacy of data for some companies, as you know [00:30:00] everyone respects the privacy of their clients and their customers.

So not everyone can do it, but if you can do it and take examples from real-world use cases within your product, then you can use them to train evaluators, which you can then use during your development process.
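As an aside, here is a small sketch of the implicit-feedback idea Gal describes: count accepted completions as positive signal rather than asking users for explicit ratings. The event shape is hypothetical.

```python
# Hypothetical events recording whether a suggested completion was accepted by the user.
events = [
    {"completion_id": "c1", "accepted": True},
    {"completion_id": "c2", "accepted": False},
    {"completion_id": "c3", "accepted": True},
]

# Acceptance rate as an implicit quality signal.
acceptance_rate = sum(e["accepted"] for e in events) / len(events)
print(f"Implicit acceptance rate: {acceptance_rate:.0%}")

# A falling acceptance rate is a less biased signal than explicit thumbs-up/down,
# which tends to over-represent unhappy users.
```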

Simon Maple: Yeah. Yeah. Makes sense. From a trace data point of view and an alerts point of view, are there ever problems around the signal-to-noise ratio between the traces?

Or is it hard to create alerts that are meaningful?

Gal Kleinman: Yeah, definitely, because at the beginning you can go trace by trace and it's fine. You do it manually. Most of the companies are doing it, even the biggest ones, because they start with a really small rollout of the LLM feature they are launching.

So it's fine, you can go trace by trace. [00:31:00] Then you use, as I mentioned, the basic metrics of latency and stuff like that, and you think it solves your issue. But once you get to a significant scale, you can't go trace by trace. So if you don't have the capability or the right evaluators set up and developed beforehand, then you find yourself with, let's say, I dunno, a million generations a day.

It just doesn't scale. Even if you wanna spend a lot of money on the problem and put a lot of human resources on it, it feels super primitive going over a million examples every day and tagging them. So you gotta set up the right evaluations,

and the right evaluators, to score the traces in production and that way get only the interesting traces, [00:32:00] and investigate and debug them in order to improve your application.
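For illustration, a minimal sketch of scoring production traces with an evaluator and surfacing only the low scorers for investigation. The evaluate_response function is a hypothetical stand-in for an application-specific evaluator, which in practice might be an LLM-as-judge or a trained model rather than a keyword check.

```python
def evaluate_response(trace: dict) -> float:
    """Return a quality score in [0, 1]. This naive keyword check is a placeholder
    for a real evaluator tuned to the application's criteria."""
    response = trace["response"].lower()
    return 1.0 if "refund" in response else 0.3


# Hypothetical production traces.
traces = [
    {"id": "t1", "response": "You can request a refund within 30 days."},
    {"id": "t2", "response": "I'm not sure, please contact someone."},
]

THRESHOLD = 0.5
flagged = [t["id"] for t in traces if evaluate_response(t) < THRESHOLD]
print("Traces worth investigating:", flagged)  # only the low scorers
```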

Simon Maple: Yeah. And in terms of who the users typically are for this kind of thing, how much is the spread of these types of activities done by developers, say, versus folks who are more on the operations side?

Gal Kleinman: The whole space of AI and ML development has shifted towards backend developers, and even frontend developers today are developing those features and applications. When it was data scientists developing these applications, they were used to having some evaluation system and practice, and they worked in a more methodical way, because they were trained to. But today, backend engineers are not used to doing it. So a lot of the effort is [00:33:00] shifting to the operations side. You can find a lot of teams which existed even before, in classic data science teams, for example; in the past they tended to use annotation teams,

which are creating and curating datasets of labeled data. Today product managers can do it, or annotation teams, or domain experts, which makes sense actually, because if I'm building, for example, let's imagine, a medical doctor based on generative AI, then it makes much more sense that a doctor will rank and score and label the responses of the LLM instead of the backend developer. I have some understanding in health, just because I was trained by my family, but the common [00:34:00] developer usually does not understand anything in health. So it's better if a doctor will be the one to validate and score the responses.

Simon Maple: Yeah. Cool. Let's look a little bit forward now and think about what's happening in the future.

Obviously agentic AI is something that's absolutely flying right now, with everyone wanting to be more and more agentic. There's talk of everyone being close to AGI, but let's assume that's a few years off; I think that's a fair assumption. With those types of things, where LLMs are users of LLMs as well, and you have orchestrators of agents, does this make things harder from an observability point of view, as data and things are being passed between agents and between various flavors of LLMs, various vendors of LLMs, within the same application? Does it make it harder, and what's the importance of observability in that space?

Gal Kleinman: Yeah, definitely. As the flows and the use cases of LLM [00:35:00] applications get more and more complex, observability will play a much bigger role in that game. And why? Because today, imagine a flow which is only based on a single prompt to the same model every time. If one of my users has an issue, it is super simple, or quite simple, to understand why, because the flow is super simple.

But once it gets much more complex, and there are more prompts involved, more models involved, more databases, more tools (today you can access the whole world of MCPs, which we didn't cover at all; you can access Google, you can do a lot of stuff), then you gotta somehow understand and trace what happened along the way in order to understand and debug [00:36:00] both.

If you have an existing ticket from a customer and you wanna debug it, but also, and this is more the space we are dealing with, understanding the health and the quality of the generation of your system. So you wanna break it down. If I get some general score, for example, I dunno, it's 50% working, but I dunno how to break it down,

and when the flow gets complex, it's much harder to break it down, or you have a lot more components, and you wanna build some dashboard and understand which is the leaking component that makes me only 50% at the end. It is possible that all the rest, all of the components, are functioning at, I dunno, 90% quality, let's say, and one is leaking, malfunctioning and having bad performance.

I wanna be [00:37:00] able to track this one and understand which exact component is responsible for it.

Simon Maple: Yeah. Yeah, no, that makes complete sense. Otherwise the black box just gets bigger and then you have no idea which part is at fault. That makes a lot of sense. Let's wrap up, Gal. From a developer point of view, is there a takeaway around managing observability across apps that are using LLMs that you would say: this is the thing that a developer needs to know about or needs to do, this is the most important thing that they should know? What would that be?

Gal Kleinman: Quite cliche, but there is that saying: what is not measured cannot be improved.

And what is not improved always degrades. It makes a lot of sense in this space, and actually this is the problem we are dealing with: how do you measure the actual quality of your responses? And once you are able to measure it, you can improve it, and you can make sure it isn't [00:38:00] degraded.

This is what I would expect every developer to have in mind. And this is, I think the main or the key takeaway I'm going out with from this space. Yeah.

Simon Maple: Which is quite a classic observability takeaway as well. So it's almost: don't discount your LLM from an observability point of view.

It needs to be held as accountable as every other part of your app, from an observability point of view, and you absolutely need to know what's happening. Don't just leave this piece of magic that exists in your application without measuring it. Which is excellent advice. Gal, thank you very much. This has been super relevant to our audience. I think it's super important that when we do introduce these things, and I remember years ago, I've said this to a few folks on the podcast, we used to scrutinize these LLMs that we add into our apps much, much more than we do today.

It's amazing how mainstream LLMs in applications, or in our workflows, have become, and how prevalent and [00:39:00] accepted they are. But I think we need to apply the same rigorous measures and practices across LLMs as we do across our traditional code.

So yeah, I loved this session on observability, and some really nice practical insights there. So Gal, very much appreciate your time. Thank you for your insights.

Gal Kleinman: Thank you. I had a great time.

Simon Maple: Awesome. And thanks all for listening. If you enjoyed the session, make sure you subscribe.

Give us a thumbs up on YouTube as well. And thanks for tuning in and we'll see you next time. Bye for now.

Subscribe to our podcasts here

Welcome to the AI Native Dev Podcast, hosted by Guy Podjarny and Simon Maple. If you're a developer or dev leader, join us as we explore and help shape the future of software development in the AI era.
