In the last post I bragged that I had stopped declaring intents and started letting the corpus speak. I clustered tens of thousands of support tickets, and the algorithm handed me 91 groups. I treated that 91 like a fossil I had dug up, something that was simply there in the data and mine to read off.
It wasn't. 91 was a dial I had turned without noticing. The same tickets would have just as happily been 12 intents, or 400. That is the second, subtler way the clusters lied to me. They presented a count as if it were a fact about the world, when it was a choice I had made and then forgotten making.
Clustering gives you whatever resolution you ask for
Run K-means and you pass it k. The algorithm does not find the number of intents. You assert the number, and it draws that many boundaries around whatever is there. Ask for 12 and you get 12 confident, reasonable-looking clusters. Ask for 400 and you get 400 of those too.
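To make that concrete, here is a minimal sketch of Lloyd's algorithm on one-dimensional points. The function name kMeans1D, the evenly spaced initialization, and the toy numbers are all invented for illustration; real tickets would be high-dimensional embeddings. The only point is that the number of groups coming out matches the k going in.

```typescript
// Toy 1-D k-means (Lloyd's algorithm). Returns one cluster label per point.
function kMeans1D(points: number[], k: number, maxIters = 100): number[] {
  const sorted = [...points].sort((a, b) => a - b);
  // Deterministic init (an assumption of this sketch): k evenly spaced values from the sorted data.
  let centroids = Array.from({ length: k }, (_, i) => sorted[Math.floor((i * sorted.length) / k)]);
  let labels: number[] = new Array(points.length).fill(-1);

  for (let iter = 0; iter < maxIters; iter++) {
    // Assignment step: each point joins its nearest centroid (ties go to the lower index).
    const next = points.map((p) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      }
      return best;
    });
    if (next.every((label, i) => label === labels[i])) break; // nothing moved: converged
    labels = next;
    // Update step: each centroid moves to the mean of its members.
    // An empty cluster keeps its old centroid rather than dividing by zero.
    centroids = centroids.map((old, c) => {
      const mine = points.filter((_, i) => labels[i] === c);
      return mine.length ? mine.reduce((s, p) => s + p, 0) / mine.length : old;
    });
  }
  return labels;
}

// Made-up data with four visually obvious clumps.
const values = [1, 2, 3, 10, 11, 12, 20, 21, 22, 30, 31, 32];
for (const k of [2, 4, 6]) {
  const found = new Set(kMeans1D(values, k)).size;
  console.log(`asked for ${k}, got ${found}`);
}
```

Even on data with four obvious clumps, asking for 2 or 6 comes back with 2 or 6 non-empty groups. Nothing in the output flags either request as wrong.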
The methods that claim to pick the number for you only move the choice somewhere quieter. The elbow method asks you to eyeball where a curve bends. The silhouette score depends on the distance metric and on the range of k you thought to try. HDBSCAN drops k and asks for min_cluster_size instead. Change any of those and the answer changes. You are still choosing. You have just hidden the choosing inside a hyperparameter so it feels like the data decided.
Hierarchical clustering is the honest one, because it refuses to pick at all. It gives you a tree. A tree has no single correct place to cut. Slide the cut up and you get fewer, broader intents. Slide it down and you get more, narrower ones. Every height is a valid clustering. The dendrogram is telling you the truth, which is that "how many intents" has no answer until you supply one.
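A small sketch of what sliding the cut looks like. The Dendro type, the helper names, and the merge heights are all assumptions made up for this example; a real agglomerative run would supply the heights.

```typescript
// A dendrogram node. `height` is the distance at which its children merged (0 for a leaf).
type Dendro = { height: number; items: string[]; children: Dendro[] };

const leaf = (text: string): Dendro => ({ height: 0, items: [text], children: [] });
const merge = (height: number, ...children: Dendro[]): Dendro => ({
  height,
  items: children.flatMap((c) => c.items),
  children,
});

// Cut the tree at height h: any merge made at or below h stays together.
function cutAtHeight(node: Dendro, h: number): string[][] {
  if (node.height <= h || node.children.length === 0) return [node.items];
  return node.children.flatMap((child) => cutAtHeight(child, h));
}

// Toy tree with invented merge heights.
const cancelBranch = merge(
  0.5,
  merge(0.2, leaf("cancel before dispatch"), leaf("cancel after dispatch")),
  merge(0.3, leaf("cancel subscription"), leaf("cancel, item arrived damaged"))
);
const deliveryBranch = merge(0.25, leaf("where is my parcel"), leaf("parcel is late"));
const dendrogram = merge(0.9, cancelBranch, deliveryBranch);

for (const h of [1.0, 0.6, 0.27, 0]) {
  console.log(`cut at ${h}: ${cutAtHeight(dendrogram, h).length} clusters`);
}
```

The same six tickets come out as one, two, four, or six clusters depending only on h. The tree has not changed between calls.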
The same tickets are one intent, or they are four
Take "cancel order". At a coarse cut it is one clean intent. Drop one level down the tree and it shatters:
- cancel before dispatch, which is a flag flip in the database
- cancel after dispatch, which means someone has to call the carrier and recall a parcel
- cancel a subscription, which is a billing operation
- cancel because the item showed up damaged, which is a refund plus a return label
Is that one intent or four? There is no fact of the matter in the embeddings. Those four are close enough to merge and distinct enough to split, and the geometry supports both readings without complaint. The vectors will not break the tie, because the tie is not theirs to break.
Sameness is a routing question, not a distance question
The thing that actually decides it is not in the text at all. Two utterances belong to the same intent when they should trigger the same downstream handling. Different action, different intent. Same action, same intent, no matter how differently they are worded.
"Cancel before dispatch" and "cancel after dispatch" read almost identically and sit right on top of each other in embedding space. They are different intents, because one flips a boolean and the other pages a human to phone a logistics company. The distance between them is tiny and completely irrelevant.
It runs the other way too. A terse "where's my money" and a three-paragraph apology-laden essay about a broken blender can land far apart in vector space and still be the same intent, because both end in the same place: issue a refund.
So granularity is not set by a silhouette score. It is set by divergence in handling. The right place to cut the tree is wherever the branches below it would route to different code.
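The mismatch between distance and handling can be sketched in a few lines. The three-dimensional vectors below are invented stand-ins for real embeddings, and the handler strings are hypothetical; only the relationships between the numbers matter.

```typescript
type Utterance = { text: string; vector: number[]; handler: string };

// Plain cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Intent identity is decided by handling, not by proximity.
const sameIntent = (a: Utterance, b: Utterance): boolean => a.handler === b.handler;

// Invented vectors: the two cancellations nearly coincide, the two refunds do not.
const beforeDispatch: Utterance = {
  text: "cancel my order, it hasn't shipped",
  vector: [0.9, 0.1, 0.0],
  handler: "auto-cancel",
};
const afterDispatch: Utterance = {
  text: "cancel my order, it already shipped",
  vector: [0.88, 0.12, 0.02],
  handler: "carrier-recall",
};
const terseRefund: Utterance = {
  text: "where's my money",
  vector: [0.1, 0.9, 0.1],
  handler: "refund-and-return",
};
const essayRefund: Utterance = {
  text: "I'm so sorry to bother you, but the blender arrived broken...",
  vector: [0.1, 0.2, 0.95],
  handler: "refund-and-return",
};

console.log(cosine(beforeDispatch.vector, afterDispatch.vector), sameIntent(beforeDispatch, afterDispatch));
console.log(cosine(terseRefund.vector, essayRefund.vector), sameIntent(terseRefund, essayRefund));
```

Any similarity threshold that merges the first pair and splits the second gets both decisions backwards. The handler field gets both right without looking at the vectors.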
The benchmarks quietly assume a truth that isn't there
This is also where the academic framing of the problem gets slippery. Open up almost any new-intent-discovery paper and the results table is scored with normalized mutual information (NMI), adjusted Rand index (ARI), or clustering accuracy. Every one of those metrics compares your clusters against a labeled ground truth, which means every one of them assumes there is a single correct granularity: the annotator's.
Under that assumption your clustering is "wrong" if it splits an intent the annotator merged, even when your split sends the two halves to two different workflows and theirs sent them to one. A method can score badly on the benchmark and be exactly right for your system. It can ace the benchmark and be useless for your router. The "estimate the number of clusters" sub-field that the papers chase as if it were a statistics problem is, in production, not a statistics problem at all. It is a question about the shape of your own operations.
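Here is a worked instance of that penalty, using the standard adjusted Rand index formula on eight made-up tickets. The label arrays are invented: the annotator called four tickets "cancel", and the hypothetical handling-based clustering splits those four across two workflows.

```typescript
const choose2 = (n: number): number => (n * (n - 1)) / 2;

function countBy(keys: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const key of keys) counts.set(key, (counts.get(key) ?? 0) + 1);
  return counts;
}

// Adjusted Rand index: pair-counting agreement between two labelings, corrected for chance.
function adjustedRandIndex(truth: string[], predicted: string[]): number {
  if (truth.length !== predicted.length) throw new Error("label arrays must align");
  const sumPairs = (counts: Map<string, number>): number =>
    Array.from(counts.values()).reduce((s, n) => s + choose2(n), 0);

  // Pairs that share a cell of the contingency table, a truth class, or a predicted cluster.
  const joint = sumPairs(countBy(truth.map((t, i) => JSON.stringify([t, predicted[i]]))));
  const rows = sumPairs(countBy(truth));
  const cols = sumPairs(countBy(predicted));

  const expected = (rows * cols) / choose2(truth.length);
  const maxIndex = (rows + cols) / 2;
  // Degenerate case (e.g. everything in one cluster on both sides): treat as perfect agreement.
  return maxIndex === expected ? 1 : (joint - expected) / (maxIndex - expected);
}

const annotator = ["cancel", "cancel", "cancel", "cancel", "refund", "refund", "refund", "refund"];
const byHandling = [
  "auto-cancel", "auto-cancel", "carrier-recall", "carrier-recall",
  "refund", "refund", "refund", "refund",
];

console.log(adjustedRandIndex(annotator, annotator));
console.log(adjustedRandIndex(annotator, byHandling).toFixed(3));
```

Agreeing with the annotator scores 1. The handling-based split scores 16/23, roughly 0.70, even though in this toy setup it is the only one of the two that never puts differently handled tickets in the same cluster.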
Ship the tree, cut per consumer
The practical move is to stop freezing a flat list of 91 and to keep the hierarchy around. Then cut it differently for whoever is asking.
The router cuts where handling diverges. The analytics dashboard can cut far coarser, because the VP wants twelve buckets on a slide, not ninety-one. Same tree, two cuts, and neither reading is wrong.
For the router, the cut is not a number you tune. It is a function of what you would do with each ticket:
// Minimal ticket shape, assumed here so the snippet type-checks on its own.
type Ticket = { id: string; text: string };

// A node in the agglomerative tree. Leaves are tickets, internal nodes are merges.
type ClusterNode = {
  members: Ticket[]; // every ticket under this node
  children: ClusterNode[]; // empty for a leaf
};

// The only thing that decides a boundary: what would we actually do with this ticket?
type Handler = "auto-cancel" | "carrier-recall" | "billing" | "refund-and-return";

function cutByHandling(node: ClusterNode, route: (t: Ticket) => Handler): ClusterNode[] {
  const handlers = new Set(node.members.map(route));
  // Pure node: every ticket here is handled the same way. This is one intent.
  // A childless node is returned as well, so no ticket is silently dropped
  // if the tree was truncated and a bottom node is still mixed.
  if (handlers.size === 1 || node.children.length === 0) return [node];
  // Mixed node: the branch is hiding a routing fork. Keep descending.
  return node.children.flatMap((child) => cutByHandling(child, route));
}
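The cut stops being a hyperparameter and becomes a consequence of route. It is no longer even a single height: you descend each branch until every surviving node is handled exactly one way, so some branches stop high and others go deep. That is the coarsest cut the router can use without looking inside a node, which makes it the correct granularity for the router by construction. It also moves on its own the day you add a new workflow, because a once-pure node now contains two handlers and splits itself.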
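A usage sketch of that self-splitting behavior. The function is restated here, generic over the handler type, so the block runs on its own; the four tickets, the helper names leafOf and join, and both routing functions are hypothetical.

```typescript
type Ticket = { id: string; text: string };
type ClusterNode = { members: Ticket[]; children: ClusterNode[] };

// Same descent as cutByHandling, restated so this block is self-contained.
function cutByHandling<H>(node: ClusterNode, route: (t: Ticket) => H): ClusterNode[] {
  const handlers = new Set(node.members.map(route));
  if (handlers.size === 1 || node.children.length === 0) return [node];
  return node.children.flatMap((child) => cutByHandling(child, route));
}

const leafOf = (id: string, text: string): ClusterNode => ({ members: [{ id, text }], children: [] });
const join = (a: ClusterNode, b: ClusterNode): ClusterNode => ({
  members: [...a.members, ...b.members],
  children: [a, b],
});

// Toy tree: the three order cancellations merge first, the subscription joins last.
const orderCancels = join(
  join(leafOf("t1", "cancel, not shipped yet"), leafOf("t2", "please cancel before it ships")),
  leafOf("t3", "cancel, it already shipped")
);
const cancelTree = join(orderCancels, leafOf("t4", "cancel my subscription"));

// Hypothetical routing, before and after a carrier-recall workflow exists.
const routeV1 = (t: Ticket): string => (t.id === "t4" ? "billing" : "cancel-order");
const routeV2 = (t: Ticket): string =>
  t.id === "t4" ? "billing" : t.id === "t3" ? "carrier-recall" : "auto-cancel";

const intentsBefore = cutByHandling(cancelTree, routeV1);
const intentsAfter = cutByHandling(cancelTree, routeV2);
console.log(intentsBefore.length, intentsAfter.length);
```

The tree is identical in both calls. Only the routing changed, and the order-cancellation branch went from one intent to two without anyone touching a cluster count.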
The catch is that you need route, which is close to the thing you were trying to build in the first place. In practice you bootstrap it. Sample a handful of tickets per cluster and have a human tag the handler, or reuse the resolution notes the support agents already left behind, the same trail that made discovery possible in the last post. You are not labeling fifty thousand tickets. You are labeling enough to color the branches.
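One way the bootstrapping could look, as a rough sketch rather than a recipe. Both function names are invented, the sampling is a simple deterministic stride, and the cut rests on a strong assumption: untagged tickets in a node are taken to share the handler of their tagged neighbors.

```typescript
type Ticket = { id: string; text: string };
type ClusterNode = { members: Ticket[]; children: ClusterNode[] };

// Pick up to n tickets, evenly strided across the cluster, to hand to a human tagger.
function sampleForLabeling(node: ClusterNode, n: number): Ticket[] {
  const step = Math.max(1, Math.floor(node.members.length / n));
  return node.members.filter((_, i) => i % step === 0).slice(0, n);
}

// Cut using only the tickets tagged so far. Untagged tickets carry no vote.
function cutBySparseLabels(node: ClusterNode, tags: Map<string, string>): ClusterNode[] {
  const seen = new Set<string>();
  for (const t of node.members) {
    const tag = tags.get(t.id);
    if (tag !== undefined) seen.add(tag);
  }
  // Zero or one distinct tag: no evidence of a routing fork here, so stop descending.
  if (seen.size <= 1 || node.children.length === 0) return [node];
  return node.children.flatMap((child) => cutBySparseLabels(child, tags));
}

const tk = (n: number): ClusterNode => ({
  members: [{ id: `t${n}`, text: `ticket ${n}` }],
  children: [],
});
const pair = (a: ClusterNode, b: ClusterNode): ClusterNode => ({
  members: [...a.members, ...b.members],
  children: [a, b],
});

// Toy tree of eight tickets in two branches of four.
const orderSide = pair(pair(tk(1), tk(2)), pair(tk(3), tk(4)));
const billingSide = pair(pair(tk(5), tk(6)), pair(tk(7), tk(8)));
const fullTree = pair(orderSide, billingSide);

const toTag = sampleForLabeling(fullTree, 2); // the tickets a human would look at
// Pretend the human came back with these two tags.
const tags = new Map<string, string>([
  ["t1", "auto-cancel"],
  ["t5", "billing"],
]);
const intents = cutBySparseLabels(fullTree, tags);
console.log(toTag.map((t) => t.id), intents.length);
```

Two tags on eight tickets are enough to split the root into two intents here. The cost is that a fork hiding among untagged tickets stays invisible until someone samples it, so the sampling rate is a real tradeoff.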
Discover the shape, decide the resolution
The embeddings will cluster at any resolution you ask for and call every one of them valid. The thing that picks the resolution does not live in your vector store. It lives in your runbook.
So stop asking how many intents there are. Ask how many different things you actually do. Discover the shape from the corpus, and cut it with your own topology of actions.