Hypothetical Document Embeddings (HyDE)

June 9, 2026 · 7 min read

AI, RAG, retrieval, embeddings

I was reading about techniques for improving RAG answers and stumbled upon HyDE (Hypothetical Document Embeddings), so I decided to give it a try in a PoC.

One of the issues in RAG retrieval is that queries and answers usually have very different shapes: a query is a question, while the knowledge base is full of statements. For example, "How do I rotate an expired API key?" and "Key regeneration is performed from the project settings panel" are about the same thing, but they have very little in common structurally. Even with an embedding model, the question and its answer can end up far apart in the vector space, largely because they are different kinds of text.

One is short, interrogative, and written in the user's vocabulary. The other is long, declarative, and written in the documentation's vocabulary. We are asking the embedding model to bridge a gap it was not necessarily trained to care about. This is the core observation behind HyDE (Gao et al., 2022).

An embedding space drawn as two overlapping lobes. A blue lobe on the left holds question-shaped text, short and interrogative in the user's words, and a green lobe on the right holds answer-shaped text, long and declarative in the documentation's words. The query "how do I rotate an expired API key?" is a blue dot inside the question lobe, close to other question dots. The page that actually answers it, "Key regeneration is performed from the project settings panel", is a green dot in the answer lobe. A red dashed line spans the wide gap between them labeled far in cosine space, the match you wanted, illustrating that the query sits closer to other questions than to its own answer.

The fix to this problem is to stop comparing questions to answers, and start comparing answers to answers, or questions to questions.

Fake the answer at query time

HyDE (Hypothetical Document Embeddings) proposes the following: before retrieving anything, ask an LLM to hallucinate an answer to the query, then embed that answer and use it to search for similar answers in the knowledge base.

HyDE

The hypothetical document for that query might read "To rotate an API key, open the project settings, regenerate the key, and update your environment variables. The old key stops working immediately."

The hypothetical answer can be wrong and HyDE can still work, because you never show its content to anyone. It is used only as a way to find a better neighborhood, then retrieve real documents from there. The cost is that HyDE adds an LLM generation call to every single query, on top of the embedding call you would make anyway, which means extra latency and token cost each time.
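To make the flow concrete, here is a minimal sketch of the query path described above. It is not the code from my PoC: `toy_embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical stub that returns the canned answer quoted earlier, and the billing document is made up for contrast.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    corpus: dict[str, str],
    generate: Callable[[str], str],
    embed: Callable[[str], Counter] = toy_embed,
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Rank documents against a hypothetical answer instead of the raw query."""
    hypothetical = generate(query)   # extra LLM call, paid on every query
    probe = embed(hypothetical)      # embed the fake answer, not the question
    scored = [(doc_id, cosine(probe, embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


def fake_llm(query: str) -> str:
    """Hypothetical stub; a real system would prompt an LLM with the query."""
    return (
        "To rotate an API key, open the project settings, regenerate the key, "
        "and update your environment variables. The old key stops working immediately."
    )


corpus = {
    "regeneration-page": "Key regeneration is performed from the project settings panel.",
    "billing-faq": "Invoices are issued on the first day of each month.",  # made-up filler
}
results = hyde_search("How do I rotate an expired API key?", corpus, fake_llm)
print(results[0][0])  # regeneration-page
```

Note that the generated text is thrown away after embedding. Only the real documents it points to are returned.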

Reverse HyDE: fake the questions at index time

Reverse HyDE flips this around: instead of turning the query into answer-shaped text at search time, you turn each document into question-shaped text when you index it. For every chunk, you ask an LLM to write the questions that chunk could answer, embed those, and store them pointing back at the chunk/document.

Indexing Reverse HyDE

The regeneration page now has index entries for "how do I rotate an API key?", "what happens to my old key after I regenerate it?", and "where do I change my API credentials?".

When the real query arrives, it matches against question-shaped text written in the user's own register. There is no generation at query time. You embed the incoming query and search as usual.
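A minimal sketch of both halves, under the same assumptions as before: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `fake_question_llm` is a hypothetical stub returning canned questions (the billing ones are made up). Keeping only the best-scoring question per chunk is my own choice for collapsing duplicates, not something the technique prescribes.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_question_index(
    chunks: dict[str, str],
    generate_questions: Callable[[str], list[str]],
    embed: Callable[[str], Counter] = toy_embed,
) -> list[tuple[Counter, str]]:
    """Index time: one LLM call per chunk, one index entry per generated question."""
    index = []
    for chunk_id, text in chunks.items():
        for question in generate_questions(text):
            index.append((embed(question), chunk_id))  # entry points back at the chunk
    return index


def reverse_hyde_search(
    query: str,
    index: list[tuple[Counter, str]],
    embed: Callable[[str], Counter] = toy_embed,
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Query time: no LLM call, just embed the query and match question-to-question."""
    probe = embed(query)
    best: dict[str, float] = {}
    for vector, chunk_id in index:
        score = cosine(probe, vector)
        if score > best.get(chunk_id, -1.0):  # keep the best question per chunk
            best[chunk_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]


CANNED_QUESTIONS = {  # hypothetical LLM output, keyed by chunk text
    "Key regeneration is performed from the project settings panel.": [
        "how do I rotate an API key?",
        "what happens to my old key after I regenerate it?",
        "where do I change my API credentials?",
    ],
    "Invoices are issued on the first day of each month.": [
        "when are invoices issued?",
        "how do I download an invoice?",
    ],
}


def fake_question_llm(chunk_text: str) -> list[str]:
    """Hypothetical stub; a real system would prompt an LLM with the chunk."""
    return CANNED_QUESTIONS[chunk_text]


chunks = {
    "regeneration-page": "Key regeneration is performed from the project settings panel.",
    "billing-faq": "Invoices are issued on the first day of each month.",
}
index = build_question_index(chunks, fake_question_llm)
results = reverse_hyde_search("How do I rotate an expired API key?", index)
print(results[0][0])  # regeneration-page
```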

The actual difference is when

The difference between the two approaches is when the LLM does the work.

  • HyDE spends compute per query. Every search pays for one generation.
  • Reverse HyDE spends compute per document, once, at ingestion. Every search after that is cheap.

If the use case cannot afford extra latency, Reverse HyDE is the more suitable choice. On the other hand, if the knowledge base changes often and needs frequent re-indexing, HyDE is more suitable (given that latency is not an issue).
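A rough way to reason about the trade-off is to count LLM generations. The sketch below assumes exactly one generation per query for HyDE and one per chunk per full indexing pass for Reverse HyDE. It ignores token counts, incremental re-indexing, and latency, so treat it as back-of-the-envelope arithmetic. All function names and numbers are made up for illustration.

```python
def hyde_llm_calls(num_queries: int) -> int:
    """HyDE: one generation per search."""
    return num_queries


def reverse_hyde_llm_calls(num_chunks: int, index_passes: int = 1) -> int:
    """Reverse HyDE: one generation per chunk, repeated on every full re-index."""
    return num_chunks * index_passes


def fewer_llm_calls(num_queries: int, num_chunks: int, index_passes: int = 1) -> str:
    """Name the approach that needs fewer generations under these assumptions."""
    hyde = hyde_llm_calls(num_queries)
    reverse = reverse_hyde_llm_calls(num_chunks, index_passes)
    if hyde == reverse:
        return "tie"
    return "HyDE" if hyde < reverse else "Reverse HyDE"


# Stable corpus with heavy traffic: 1,000,000 calls vs 10,000.
print(fewer_llm_calls(num_queries=1_000_000, num_chunks=10_000))  # Reverse HyDE

# Volatile corpus re-indexed 50 times with light traffic: 5,000 calls vs 500,000.
print(fewer_llm_calls(num_queries=5_000, num_chunks=10_000, index_passes=50))  # HyDE
```

Call counts are only half of the picture. Even when HyDE needs fewer generations overall, each of them sits on the query path, so a latency-sensitive system may still prefer Reverse HyDE.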

Two pipelines stacked for comparison, showing where the LLM generation lands. The top row, HyDE, labeled compute on demand, pay per query, is a single red-bordered every-query strip: query, then a yellow LLM generate fake answer step, then embed, search, and docs. The LLM call sits on the hot query path. The bottom row, Reverse HyDE, labeled precompute, pay per document once, splits in two: a yellow once-at-index-time block of chunk, LLM generate questions, index, and a separate green every-query-no-LLM block of just query, embed, search. A footer reads: volatile corpus leans HyDE because nothing goes stale, stable corpus with heavy traffic leans Reverse HyDE because you amortize one generation across many searches.

My PoC

The diagrams above are drawn by hand. To check that the effect is real and not just a nice story, I built a small PoC using a free embedding model (all-MiniLM-L6-v2 via transformers.js).

Search the raw query against the corpus and the right page lands at rank 6, buried under documents that merely share surface words like "rotation policy" and "token expired" but do not answer the question.
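The baseline measurement boils down to "rank every document by cosine similarity to the raw query and see where the right page lands". Here is a minimal sketch of that measurement, not my actual PoC code: `toy_embed` is a bag-of-words stand-in for the real model, and the `token-expired-error` and `billing-faq` texts are made up to mimic a lexical look-alike and an unrelated page.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query: str, corpus: dict[str, str]) -> list[tuple[str, float]]:
    """Baseline retrieval: score every document against the raw query."""
    probe = toy_embed(query)
    scored = [(doc_id, cosine(probe, toy_embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def rank_of(doc_id: str, ranking: list[tuple[str, float]]) -> int:
    """1-based position of doc_id in a ranking."""
    return [d for d, _ in ranking].index(doc_id) + 1


corpus = {
    "regeneration-page": "Key regeneration is performed from the project settings panel.",
    # Made-up look-alike: shares surface words with the query but does not answer it.
    "token-expired-error": "An expired API key is rejected with a token expired error.",
    "billing-faq": "Invoices are issued on the first day of each month.",
}
ranking = rank_documents("How do I rotate an expired API key?", corpus)
print(ranking[0][0])                          # token-expired-error
print(rank_of("regeneration-page", ranking))  # 2
```

The same `rank_of` check can be rerun after swapping the raw query for a hypothetical answer, or the corpus for precomputed questions, to compare the three setups on equal terms.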

Horizontal bar chart titled Baseline: raw query vs corpus, ranking twelve documents by cosine similarity to the query. The top five are lexical look-alikes: expiry-policy at 0.763, rotation-reminder at 0.710, token-expired-error at 0.663, security-advice at 0.518, and create-key at 0.501. The regeneration-page, the answer we actually wanted, is highlighted in green at rank 6 with a score of 0.489, below all of them.

The asymmetry is easy to see. Compare the query against its own answer and against five other questions about API keys: 4 of the 5 questions score at least as high as the answer does (three clearly higher, one tied at three decimal places), even though none of them is the answer. That is expected, since the query is itself question-shaped.

Horizontal bar chart titled Asymmetry: query vs its answer vs same-topic questions. Four questions outrank the answer: how do I revoke an API key I no longer use at 0.695, what should I do if my API key is leaked at 0.593, how do I create a new API key at 0.586, and where do I find my API key in the dashboard at 0.489. The regeneration-page, its own answer, is highlighted in green tied at 0.489, with one more question, can I have more than one API key per project, just below at 0.456.

Switch to HyDE, embedding a fake answer instead of the query, and the right page climbs from rank 6 to rank 3. It does not reach the top: two documents that genuinely discuss key rotation and expiry still rank above it. The fake answer clears out some generic noise, but in this case it cannot beat documents that really are about the same thing.

Horizontal bar chart titled HyDE: hypothetical answer vs corpus. rotation-reminder leads at 0.704 and expiry-policy follows at 0.656. The regeneration-page, the answer we wanted, is highlighted in green at rank 3 with 0.626, well above the rest of the field which trails off from security-advice at 0.549 down to webhooks at 0.178.

Reverse HyDE takes it to rank 1 outright. The user's almost-exact phrasing was generated and indexed ahead of time, so the query matches question-to-question.

Horizontal bar chart titled Reverse HyDE: raw query vs precomputed questions. The regeneration-page, the answer we wanted, is highlighted in green at rank 1 with a commanding 0.865, far ahead of expiry-policy at 0.727 and create-key at 0.586, with auth-401, sso, billing-faq, and rate-limits trailing well below.

This is of course not a benchmark. My PoC corpus is tiny and the numbers can be noisy, but it shows that the technique can work and is a valid tool in the toolbox of anyone trying to optimize a RAG system.