Confident Nonsense: Is your AI “working,” or is it just turned on?

Reality check: If you don’t own evaluation, you don’t own outcomes. You own activity, which looks great right up until it doesn’t. Vendor dashboards and “model quality” metrics are not the same thing as operational performance across real workflows.

A telecom company rolled out an AI assistant inside the agent desktop. The pitch was reasonable: summarize calls, propose next steps, reduce after-call work, and help agents sound like they’d read the policy docs instead of guessing.

The pilot looked clean. Handle time dropped. Wrap time dropped. Supervisors reported “better notes.” Someone presented a dashboard with an upward arrow, which is how you know the budget gods were pleased.

Then rollout happened.

The assistant expanded from one site to five. It moved from a curated slice of billing calls into device support, plan changes, retention, and the part of the contact center where the customer opens with, “So I saw on Reddit…” Reality, in other words.

Two months in, a VP asked the obvious question: “Is it working?”

They got three answers.

  • The vendor dashboard said the system was performing well based on internal quality scores.

  • The data science team cited improvements in summarization “accuracy,” but struggled to translate that into operational impact.

  • Operations showed productivity gains in one site, increased escalations in another, and a rise in recontacts in a third. Supervisors had started telling agents to “edit the summaries,” quietly adding time back and defeating the point.

No one was lying. Everyone was measuring what they personally owned. The system wasn’t simply “working” or “not working.” It was producing uneven outcomes across contexts, and the organization didn’t have a shared ruler to measure it.

That is how “it seems fine” becomes a standard.

Why nobody owns the ruler

This usually isn’t a failure of intelligence. It’s a failure of ownership.

Enterprises can build an AI capability faster than they can agree on what “good” means, how to prove it, and who is responsible for keeping it true after rollout.

A few patterns show up repeatedly.

Evaluation is everyone’s job, so it becomes no one’s job.
IT owns the platform and uptime. Data science owns models and offline metrics. The business owns “outcomes,” which often means the slide that got the program funded. Operations owns daily workflow success but rarely owns the data, the model, or the release process. Ask “who owns evaluation?” and you’ll get sincere answers that don’t match each other.

Incentives reward shipping, not proving.
Programs get funded and celebrated on go-live dates and adoption numbers. Quality is slower, noisier, and more expensive to measure. So teams default to what is easy: sessions, prompts, summaries generated, and a vague “quality score.”

Pilots hide where the system struggles.
Pilots are biased toward cleaner intents and motivated agents. Rollout includes policy exceptions, seasonal spikes, new products, angry customers, and edge cases that aren’t actually “edge” in volume. Your aggregate metrics can stay flat while your worst-case outcomes become the customer’s entire memory of your brand.

Proxy metrics are comforting, and comfort is a trap.
Containment goes up while first contact resolution stays flat. Handle time drops while recontacts rise. Agents learn to accept suggestions quickly. Supervisors learn to discourage escalations. The enterprise accidentally optimizes for a spreadsheet instead of customer outcomes.

No one wants to be the person who slows things down.
Owning evaluation means owning bad news. It means writing down thresholds, running tests that can fail, and telling a senior leader that a shiny program should pause. Without explicit sponsorship, “it seems fine” becomes socially efficient.

How to spot dashboard theater in your organization

The first sign is not an incident. It’s ambiguity.

You should worry if:

  • You don’t have a short document that defines evaluation criteria, acceptance thresholds, and who signs off.

  • “Evaluation” equals a vendor dashboard or a one-time pilot readout.

  • Test plans are created after deployment, or they exist but don’t influence releases.

  • “Quality” is referenced constantly, but no one can define it operationally.

A simple diagnostic question:
“Show me the last recorded failure from this system, how it was classified, and what changed afterward.”
If the room goes quiet or you get anecdotes instead of evidence, you’ve found the gap.

What to put in place so you actually own evaluation

The fix isn’t “more metrics.” It’s an evaluation operating model.

1) Create an evaluation contract per use case.
Two pages, max:

  • What the system is allowed to do (and what it is not)

  • What “good” means in operational terms

  • What evidence is required before expanding scope

  • Who signs off for quality, risk, and readiness

If you can’t point to this quickly, you don’t have governance. You have hope.
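To make the contract concrete rather than aspirational, it helps to capture it as data instead of a slide. Here is a minimal sketch in Python; the field names, thresholds, and sign-off roles are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass
class EvaluationContract:
    use_case: str
    allowed_actions: list[str]                # what the system may do
    prohibited_actions: list[str]             # what it must not do
    definition_of_good: dict[str, float]      # operational metric -> acceptance threshold
    evidence_to_expand_scope: list[str]       # what must be true before rollout grows
    sign_offs: dict[str, str]                 # dimension -> accountable owner


# Illustrative example; every value below is a placeholder, not a recommendation.
billing_summaries = EvaluationContract(
    use_case="billing call summarization",
    allowed_actions=["summarize the call", "propose next steps for the agent"],
    prohibited_actions=["quote prices", "promise credits", "send anything to the customer"],
    definition_of_good={
        "supervisor_qa_pass_rate": 0.95,
        "agent_edit_rate_max": 0.20,
        "recontact_rate_max": 0.12,
    },
    evidence_to_expand_scope=[
        "four consecutive weeks above thresholds at current sites",
        "no open defects classified as risky or brand-damaging",
    ],
    sign_offs={"quality": "Ops QA lead", "risk": "Compliance", "readiness": "Site director"},
)
```

A contract in this form can be versioned, diffed, and checked against production numbers, which a slide cannot.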

2) Use a layered scorecard.
A single KPI invites nonsense. Balance “how much” with “how well.”

Include:

  • Business outcomes: first contact resolution, recontact rate, time-to-resolution, complaint rate

  • Workflow quality: supervisor QA, agent edit rate, escalation correctness

  • Safety/compliance: disclosure accuracy, PII handling, policy adherence

  • Model behavior: factual consistency, grounding/citation rate (if using retrieval)

  • Operational health: latency, tool-call failure rate, fallback rate, cost per interaction

This is how you avoid winning on efficiency while losing the customer.
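As an illustration of how a layered scorecard might gate a release, here is a minimal sketch; the layers mirror the list above, while the specific metrics and thresholds are assumptions you would replace with your own.

```python
SCORECARD = {
    # layer -> {metric: (comparison, threshold)}; every number here is a placeholder
    "business_outcomes":  {"first_contact_resolution": (">=", 0.72), "recontact_rate": ("<=", 0.12)},
    "workflow_quality":   {"supervisor_qa_pass_rate": (">=", 0.95), "agent_edit_rate": ("<=", 0.20)},
    "safety_compliance":  {"disclosure_accuracy": (">=", 0.99), "pii_incidents": ("<=", 0)},
    "model_behavior":     {"grounded_answer_rate": (">=", 0.90)},
    "operational_health": {"p95_latency_seconds": ("<=", 3.0), "tool_call_failure_rate": ("<=", 0.02)},
}


def score(observed: dict[str, float]) -> dict[str, bool]:
    """Pass/fail per layer; a missing or out-of-threshold metric fails the whole layer."""
    results = {}
    for layer, metrics in SCORECARD.items():
        layer_ok = True
        for metric, (op, threshold) in metrics.items():
            value = observed.get(metric)
            if value is None:
                layer_ok = False          # unmeasured counts as a failure, not a pass
            elif op == ">=" and value < threshold:
                layer_ok = False
            elif op == "<=" and value > threshold:
                layer_ok = False
        results[layer] = layer_ok
    return results


weekly = {"first_contact_resolution": 0.74, "recontact_rate": 0.14, "supervisor_qa_pass_rate": 0.96}
ship = all(score(weekly).values())   # efficiency alone cannot carry a failing quality or safety layer
```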

3) Build real test sets, not vibes.
You need more than a single “golden” set. Create:

  • A representative set of common interactions

  • An adversarial set of known failure modes and policy exceptions

  • A regulated set for compliance-heavy topics

  • A seasonal set for call-driver shifts

Treat these like assets. Version them. Refresh them. Gate releases on them.
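A minimal sketch of what “gate releases on them” can look like, assuming each test set lives in a versioned JSONL file and your harness supplies a pass/fail check per case; the file paths and pass rates are illustrative.

```python
import json
from pathlib import Path

# Versioned test sets; bump the version when you refresh a set so results stay comparable.
TEST_SETS = {
    "representative": {"path": "eval/representative_v3.jsonl", "min_pass_rate": 0.95},
    "adversarial":    {"path": "eval/adversarial_v7.jsonl",    "min_pass_rate": 0.90},
    "regulated":      {"path": "eval/regulated_v2.jsonl",      "min_pass_rate": 1.00},
    "seasonal":       {"path": "eval/seasonal_q4_v1.jsonl",    "min_pass_rate": 0.92},
}


def load_cases(path: str) -> list[dict]:
    """One JSON object per line, e.g. {'input': ..., 'expected': ...}."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]


def gate_release(run_case) -> bool:
    """run_case(case) -> bool comes from your eval harness; any set below threshold blocks the release."""
    for name, spec in TEST_SETS.items():
        cases = load_cases(spec["path"])
        pass_rate = sum(run_case(c) for c in cases) / max(len(cases), 1)
        print(f"{name}: {pass_rate:.1%} (need {spec['min_pass_rate']:.0%})")
        if pass_rate < spec["min_pass_rate"]:
            return False
    return True
```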

4) Establish a sampling and review cadence with a closed loop.
Production QA can’t be “when we have time.”

  • Weekly sample review by intent and channel

  • Stratify by risk (regulated topics, high-value customers, escalation-prone intents)

  • Maintain an error taxonomy (annoying, costly, risky, brand-damaging)

  • Track fixes to closure (prompt changes, retrieval updates, routing rules, agent guidance)

The important part isn’t the meeting. It’s the corrective action trail.
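Here is a minimal sketch of a stratified weekly sample with an error taxonomy and a corrective-action trail; the strata, sample counts, and field names are assumptions for illustration.

```python
import random

ERROR_TAXONOMY = ("annoying", "costly", "risky", "brand_damaging")

# Oversample risky strata instead of sampling uniformly across all traffic.
SAMPLE_PLAN = {
    "regulated_topics": 60,
    "high_value_customers": 40,
    "escalation_prone_intents": 40,
    "general": 60,
}


def draw_weekly_sample(interactions: list[dict], seed: int = 7) -> list[dict]:
    """Each interaction carries a 'stratum' field; draw the planned count from each stratum."""
    rng = random.Random(seed)
    sample = []
    for stratum, n in SAMPLE_PLAN.items():
        pool = [i for i in interactions if i.get("stratum") == stratum]
        sample.extend(rng.sample(pool, min(n, len(pool))))
    return sample


def log_finding(findings: list[dict], interaction_id: str, error_class: str, fix: str) -> None:
    """Every finding gets a classification and a named fix; it stays open until the fix ships."""
    if error_class not in ERROR_TAXONOMY:
        raise ValueError(f"unknown error class: {error_class}")
    findings.append({"id": interaction_id, "class": error_class, "fix": fix, "status": "open"})
```

The review is only done when every finding’s status flips from open to closed.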

5) Assign owners who can act.
If the “owner” can’t pause rollout or block a release, they’re not an owner. They’re a spectator with a calendar invite.

What this means for leaders

If you’re asking “Is it working?” and getting three answers, your AI program is telling you something: evaluation is missing, fragmented, or performative.

Own the ruler. Put a contract in place, define thresholds, build test sets that include the ugly cases, and operationalize QA as a recurring discipline. Your AI outcomes will improve, your risk profile will shrink, and your frontline teams will spend less time compensating for the system’s uncertainty with informal workarounds.

And yes, you will still have dashboards. They just won’t be the only thing standing between you and reality.
