I've recently been working on improving some customer support agents, and one thing that gave me an "aha" moment was the design of the intents that route the agents to the right path. The intents came directly from the existing customer support system; the LLM tried to identify the intent of each ticket, map it to a category, and route it to the matching workflow.
There were a few categories/intents: "track delivery", "reschedule", "cancel order", "request invoice", and so on. Anything that didn't fit went to a bucket called "Other", which routed to a human. It worked, in the sense that the demo worked. Then I had the idea to use embeddings to cluster similar tickets and see which category was the most dominant, so I could tackle it first.
A classifier is a function from a query to a fixed set of labels. That is the whole contract. It can't return a label we never defined, which means the categories we got wrong weren't flagged as errors, they were invisible. A ticket about a driver who left a parcel with the wrong neighbor didn't come back as "you forgot a category". It came back as "track delivery", with reasonable confidence, because that was the closest box on the shelf. That is a real problem: as the business evolves and new categories of problems emerge, the predefined categories go stale, and nothing in the system says so.
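To make that contract concrete, here is a minimal sketch of a fixed-label router. It is not the system described above: the keyword descriptions in `LABELS`, the bag-of-words scoring, and the `classify` helper are all assumptions made for illustration, standing in for whatever the LLM does. The point is only that the return type has no way to say "none of these".

```python
import math
from collections import Counter

# Hypothetical label set: each intent gets a few illustrative keywords.
LABELS = {
    "track delivery": "where is my parcel delivery tracking status",
    "reschedule": "reschedule change delivery date slot",
    "cancel order": "cancel my order refund",
    "request invoice": "invoice receipt billing copy",
}

def bow(text):
    """Toy stand-in for a real model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query):
    """Always returns one of LABELS; there is no 'you forgot a category' output."""
    q = bow(query)
    scores = {label: cosine(q, bow(desc)) for label, desc in LABELS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

label, score = classify("driver left my parcel with the wrong neighbor")
print(label, round(score, 2))
```

In this toy version the wrong-neighbor ticket lands in "track delivery" simply because it shares the words "my" and "parcel" with that label's description. Nothing in the output distinguishes it from a genuine tracking question.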
In this case the router always produced an answer. Every ticket got a label, the dashboard filled in, and the whole time the thing I most needed to know, what we had failed to model, was the one thing the system structurally couldn't report. Our blind spots got folded into the best-matching category and counted as successes.
"Other" was where it hid. That bucket wasn't noise. It was a stack of intents we hadn't discovered yet, compressed into a single uninformative label so it stopped bothering me.
The corpus already knows the answer
Here is the inversion. I had the data already. We had tens of thousands of resolved tickets, each one a real thing a real customer needed, each one already resolved by a human who left a trail of what it took. The intents were sitting in that pile the entire time. We had simply decided what they were before we read them.
So I went back and read them first, in the order I should have followed from the start:
- Take the whole corpus.
- Embed it and find the structure that's actually there: reduce, cluster, look.
- Name the groups that emerge, and let the long tail stay a long tail.
- Then, and only then, freeze a router over the intents found.
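The steps above can be sketched in a few lines. This is a toy stand-in, not the pipeline I actually ran: `embed` is a bag-of-words counter rather than a real embedding model, the "reduce" step is skipped, and `discover` uses a simple greedy clustering with an assumed similarity `threshold`. The function names and sample tickets are hypothetical. What it does show is the shape of the pass: groups are found first and named afterwards, and a ticket that matches nothing stays as its own small group instead of being forced into a box.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "my", "is", "was", "but", "it", "at", "by", "to", "and", "of"}

def embed(text):
    """Toy embedding: word counts minus stopwords (assumption, not a real model)."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def discover(tickets, threshold=0.3):
    """Greedy clustering: join the most similar existing group, else start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [ticket text]}
    for text in tickets:
        vec = embed(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [text]})
        else:
            best["centroid"].update(vec)  # fold the ticket into the running centroid
            best["members"].append(text)
    # Largest groups first; singletons remain as the long tail.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)

def name(cluster, k=3):
    """Name a group by its most frequent terms; a human would refine this."""
    return " / ".join(term for term, _ in cluster["centroid"].most_common(k))

tickets = [
    "carrier marked delivered but customer disputes delivery",
    "marked delivered by carrier but parcel never arrived",
    "label printed at wrong size",
    "shipping label printed too small wrong size",
    "reschedule blocked by failed payment retry",
]
groups = discover(tickets)
for g in groups:
    print(len(g["members"]), name(g))
```

On a real corpus the embedding, reduction, and clustering choices matter a great deal. The sketch only fixes the order of operations.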
What came out was not the handful of categories we had declared. The clustering pulled out dozens of distinct recurring jobs, and the most interesting ones were things no one in that original meeting would have written down: "carrier marked delivered but customer disputes it", "label printed at the wrong size", "reschedule blocked by a failed payment retry". None of those were on our list. Every one of them had been quietly sitting inside "Other", and a couple of them were big enough that they should have been the first workflows we automated. The category I was looking for, the dominant one worth tackling first, was not even one of the names we had given the system.
Discovery and serving aren't the same
Discovery and serving aren't the same layer. Discovery is an offline pass over the full corpus: slow, thorough, run on a cadence, allowed to take an hour. Serving is the hot path: a fast classifier that answers in milliseconds. I wasn't going to run clustering per request any more than I'd retrain a model per request. I ran the discovery offline, then distilled what it found into the cheap online router.
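As a rough illustration of the split, the online side can be as small as a nearest-centroid lookup over whatever the offline pass produced. Everything here is assumed for the example: the `CENTROIDS` table stands in for the frozen output of a discovery run, the intent names are made up, and `route` is a hypothetical helper, not the router I shipped.

```python
import math
from collections import Counter

# Hypothetical frozen output of an offline discovery run:
# one term-weight centroid per named intent.
CENTROIDS = {
    "delivered but disputed": Counter({"carrier": 2, "marked": 2, "delivered": 2, "disputes": 1}),
    "label wrong size": Counter({"label": 2, "printed": 2, "wrong": 2, "size": 2}),
    "reschedule blocked by payment": Counter({"reschedule": 1, "blocked": 1, "payment": 1, "retry": 1}),
}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(query, centroids=CENTROIDS):
    """Hot path: one pass over a small, fixed table. No clustering happens here."""
    vec = Counter(query.lower().split())
    scores = {intent: cosine(vec, c) for intent, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

intent, score = route("the label was printed at the wrong size")
print(intent, round(score, 2))
```

The expensive work lives entirely in building the table. Swapping in a new table after the next discovery run changes the router's behavior without touching its code.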
The serving layer still ended up being a classifier with a fixed list, which felt like I'd gone in a circle. I hadn't. The list was the same shape, but now it was derived from the corpus instead of from a guess, and I can re-derive it on a schedule because the corpus moves.
Learning: The taxonomy has a half-life
What I learned is that even a correctly discovered taxonomy decays, because the world it describes keeps moving. New products show up, new failure modes appear, a carrier changes a policy and suddenly there's a class of ticket that didn't exist last month. A frozen list would route all of it into yesterday's best match and my "Other" bucket would quietly refill, which is how I got here the first time. The fix was to treat induction as a loop, not a launch: re-run discovery on a schedule and diff the new groups against the old ones.
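One way to picture the diff step, as a minimal sketch under assumptions: each discovery run is summarized as a mapping from group name to a set of characteristic terms, and groups are matched across runs by Jaccard overlap with an assumed `min_overlap` cutoff. The `diff_taxonomies` helper, the cutoff, and the sample groups (including the "customs fee surprise" one) are hypothetical. A real comparison would more likely match on embedding centroids or shared ticket IDs.

```python
def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diff_taxonomies(old, new, min_overlap=0.5):
    """Compare two discovery runs, each a dict of group name -> set of terms.

    Returns (emerged, vanished): new groups with no close match in the old run,
    and old groups that nothing in the new run matched.
    """
    matched_old = set()
    emerged = []
    for name, terms in new.items():
        best = max(old, key=lambda o: jaccard(terms, old[o]), default=None)
        if best is not None and jaccard(terms, old[best]) >= min_overlap:
            matched_old.add(best)
        else:
            emerged.append(name)
    vanished = [o for o in old if o not in matched_old]
    return emerged, vanished

last_month = {
    "delivered but disputed": {"carrier", "marked", "delivered", "disputes"},
    "label wrong size": {"label", "printed", "wrong", "size"},
}
this_month = {
    "delivered but disputed": {"carrier", "marked", "delivered", "customer"},
    "customs fee surprise": {"customs", "fee", "charged", "import"},
}
emerged, vanished = diff_taxonomies(last_month, this_month)
print("emerged:", emerged)
print("vanished:", vanished)
```

Emerged groups are candidates for new intents, and vanished ones are candidates for retirement. Either list being non-empty is the signal that the frozen router is drifting from the corpus.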
Before writing down a single intent, ask one question: am I describing the corpus, or my assumptions about it? Discover the intents, then declare them. Not the other way around.
And even after I looked and clustered properly, the clusters lied to me in a second, subtler way, but that is a different post.