AI agent evaluation in customer service

How to evaluate an AI agent before buying: which tests to run and which metrics to require


Bruno Cecatto


There's no shortage of options. Almost every week, a new AI solution for customer service appears, promising automation above 80%, cost reduction, and a better customer experience. The problem is when that remains just a promise.

Many solutions impress in the demo, respond smoothly in a controlled environment, and seem ready to scale, but when they enter real operations, the story changes. Resolution rates drop, edge cases start to appear, and the team realizes important questions were not asked before signing. 

For CX leaders, this risk is even greater, because the pressure comes later, in the form of cost, results, and internal accountability. That's why the safest way to evaluate an AI agent today is very simple: test it as if it were already in production and you were being held accountable for the results.

Before evaluating any AI agent, understand your customer service profile

Before comparing solutions, it's worth understanding what kind of support you want to automate, where the highest volumes are, and how much of your demand can actually be handled by an agent without creating more friction than relief.

This picture starts with simple questions, but ones that greatly change the quality of the evaluation:

  • How much volume comes in per month?

  • Which channels concentrate the most support requests?

  • Which part of the demand is informational (recurring questions, status updates, guidance), and which part requires action, context, or integration with systems?

It's also worth looking at the foundation that will support this agent. If the knowledge base is scattered, outdated, or incomplete, that already affects the test from the start. And there is another point that many people underestimate: after deployment, who will monitor adjustments, review responses, and keep the agent evolving within operations?

Without this diagnosis, the chance of making a mistake remains high even when the solution seems good. Evaluating a vendor before understanding your own support operation usually leads to a weak comparison, because the team starts judging tools without enough clarity about the problem it wants to solve.

What tests should you run before buying an AI agent?

A demo helps explain the product proposition, but it isn't enough to decide to buy. What really makes a difference is putting the agent in situations that resemble your operation and observing how it behaves outside the sales script.

Test 1 — Real resolution rate (or at least a reliable estimate)

The first metric worth testing is also the most important: how much the agent can truly resolve in your context. Not in a benchmark, in a presentation, or in a generic dataset, but in YOUR operation.

A simple way to do this is to separate the 50 most frequent tickets from last month and use that sample in the evaluation. The point here is not just to measure whether the agent responds, but how many cases it resolves without needing to escalate to a human. 

This test gives a much more useful readout than any loose promise of automation, because it brings the evaluation closer to what really matters after the contract is signed: real results on top of real volume.
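The test above boils down to simple arithmetic. A minimal sketch of how you might score the sample, assuming each ticket is tagged by hand with whether the agent escalated it and whether the customer had to reopen it (the field names `escalated` and `reopened` are illustrative, not from any vendor's API):

```python
def resolution_rate(tickets):
    """Share of tickets the agent closed without human escalation
    and without the customer reopening the case afterwards."""
    resolved = [
        t for t in tickets
        if not t["escalated"] and not t["reopened"]
    ]
    return len(resolved) / len(tickets)

# Example: a sample of last month's most frequent tickets,
# condensed here to four hand-labeled records.
sample = [
    {"escalated": False, "reopened": False},
    {"escalated": False, "reopened": False},
    {"escalated": True,  "reopened": False},
    {"escalated": False, "reopened": True},
]
print(f"Real resolution rate: {resolution_rate(sample):.0%}")  # 50%
```

Counting reopened tickets as unresolved is the important design choice here: a closed conversation that comes back is rework, not a resolution.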

Test 2 — Behavior in edge cases

This is where many solutions start to lose steam. Edge cases almost never show up in the demo, but they do show up in the operation. Does the agent make up an answer when it doesn't know? Does it recognize that it needs to escalate? Does it understand signs of frustration? Can it change course when it realizes it's taking the conversation in the wrong direction?

This test helps a lot in measuring the safety of automation. An agent may do well in simple cases and still create problems when it encounters an exception, a poorly phrased question, or an upset customer. It's better to find that out before signing.

Test 3 — Integration with the current stack

An AI agent shouldn't operate as a parallel operation. It needs to coexist with the stack you already have, with the current helpdesk, with the workflows the team already uses, and with the way support is tracked day to day. 

It's important to understand early whether the solution works well in the current environment, whether it requires major process changes, whether it depends on a deeper team reconfiguration, or whether it creates additional effort that no one is accounting for. In many cases, the problem isn't the agent's quality, but the operational cost of integrating it into your operation.

Here, the diagnosis makes a big difference. At Cloud Humans, for example, we use AI itself to analyze a real sample of your tickets and estimate how many of them it could resolve before the contract is signed. That way, you decide based on your own scenario, not on a generic benchmark.

What metrics should you require from an AI agent?

After the tests, it's worth focusing on the metrics that really help evaluate performance and return. This is the stage where the comparison starts to become more objective and also where many hasty decisions can be avoided.

  • Real resolution rate: look at the support cases that were resolved without human escalation and with enough quality not to create rework later.

  • Cost per resolution: compare it with the cost of a human ticket. What drives the result is not the lowest price per resolution, but the combination of resolution rate and quality.

  • Incorrect escalation rate: measure how many times the agent passed to a human something it could have resolved on its own or, conversely, held back a case that should already have been escalated.

  • CSAT for AI interactions: evaluate this metric separately from overall CSAT. Customers tend to judge AI support more rigorously, so comparing it directly against human CSAT without that context can be misleading.

On the other hand, there are metrics that stand out, but help little when they appear on their own. Deflection rate without a quality criterion can inflate the perception of results. Overall NPS mixes too many variables and comes too late to evaluate this kind of decision. Number of conversations handled shows volume, not value. In the end, closing a conversation is not the same as resolving it well.
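The core metrics above can be sketched in a few lines. All numbers and variable names below are illustrative assumptions for the comparison, not figures from any vendor:

```python
def cost_per_resolution(monthly_fee, resolved_tickets):
    """Monthly spend divided by tickets resolved without escalation."""
    return monthly_fee / resolved_tickets

def incorrect_escalation_rate(needless_escalations, missed_escalations, total):
    """Cases escalated that the agent could have resolved, plus cases
    held back that should have been escalated, over all tickets."""
    return (needless_escalations + missed_escalations) / total

# Example: 2,000 AI-resolved tickets on a $1,500/month plan,
# compared with an assumed $6.00 human cost per ticket.
ai_cost = cost_per_resolution(1500, 2000)
print(f"AI: ${ai_cost:.2f} vs human: $6.00 per resolution")   # $0.75
print(f"Incorrect escalations: "
      f"{incorrect_escalation_rate(40, 10, 2000):.1%}")       # 2.5%
```

Note that the denominator in `cost_per_resolution` is resolved tickets, not handled conversations; using total volume instead is exactly the deflection-rate trap described below.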

What should I ask the supplier before signing?

Some questions can significantly change the quality of the evaluation. They help you understand not only how the agent works, but also how it will behave in your operation after the contract is signed.

  • How do you charge: per resolution, per use, or by volume?
    This answer changes how ROI is assessed. Usage-based models, such as messages or API calls, can make comparison difficult because they do not always reflect whether the customer actually received a satisfactory answer.

  • How does the agent search for information and avoid responding based on outdated content?
    This question helps determine whether the solution uses dynamic search in the right sources or relies on a more rigid knowledge base that loses quality over time.

  • How long does it take to go into production?
    It is also worth understanding whether there is a training period before going live, who needs to be involved, and how much internal effort goes into this stage. Setup time varies quite a bit from one vendor to another.

  • How are customer data handled, and which model processes the messages?
    Security, privacy, and governance need to be part of the evaluation from the start, not only when the decision is already at an advanced stage.

  • What reports will I receive to track results and prove ROI?
    Analytical capabilities vary quite a bit between vendors. It is worth clearly understanding how you will track resolution, cost, escalation, and quality once the agent is up and running.

A decision is only good when you can stand by it

In the end, the agent that looks best in the demo is not always the one that will make the most sense for your operation. What will matter later is how much it actually resolves, how much it costs, how much it demands from the team, and how clearly you can show all of that when the bill comes due.

If you lead CX, this point matters even more. When the decision is made with care and backed by clear data, the conversation changes. You stop defending a technology bet and start supporting a business decision with visible impact on cost, efficiency, and experience.

Want to look at a real sample of your support interactions and understand how much of it an AI agent could truly resolve? Request your free assessment.

About the Author

Feb 11, 2026

Bruno Cecatto

Founder @ Cloud Humans - I help fast-growing companies scale their customer support with fewer resources.

LinkedIn