Query Transformation

This is part of the AI Agents series. All code is at github.com/achintmehta/langchain.

The retrieval quality problem

The vector similarity search at the heart of RAG is only as good as the query you give it. The problem is that users do not phrase their questions the way documents are written. A user might ask "why won't my connection reset work?" while the relevant documentation says "TCP RST packet behaviour". The vectors for these are not particularly close, so naive retrieval returns poor results even if the perfect answer is sitting in the database.

Query transformation is the practice of rewriting, expanding, or routing the user's query before it hits the retrieval step, specifically to close this vocabulary and phrasing gap.

Rewrite-Retrieve-Read

The simplest transformation: ask the LLM to produce a better search query from the user's raw input. The intuition is that the LLM knows what kind of language appears in technical documents and can rephrase a colloquial question into something that matches document vocabulary more closely.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# `llm` (a LangChain chat model) and `retriever` are assumed to be defined already.

rewrite_prompt = ChatPromptTemplate.from_template(
    """You are an expert at improving search queries.
Rewrite the following question to make it a better query for retrieving
relevant documents from a technical knowledge base.
Output only the improved query, nothing else.

Original question: {question}
Improved query:"""
)

rewrite_chain = rewrite_prompt | llm | StrOutputParser()

original = "why won't my connection reset work?"
improved = rewrite_chain.invoke({"question": original})
print(improved)
# Example output (the exact wording varies between runs and models):
# "TCP RST packet transmission failure troubleshooting"

# Now use `improved` as the retrieval query instead of `original`
results = retriever.invoke(improved)

This is a small change, costing one extra LLM call, that can noticeably improve retrieval quality in user-facing applications where you cannot control how questions are phrased.
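The rewrite step can return an empty or unusable string, so it may be worth guarding the hand-off to retrieval. The following is a minimal sketch, not part of the series code: `rewrite_then_retrieve`, `fake_rewrite`, and `fake_retrieve` are hypothetical names, and the two stand-in functions exist only so the example runs without a model or vector store.

```python
from typing import Callable, List


def rewrite_then_retrieve(
    question: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Rewrite the question, then retrieve with the rewritten query.

    Falls back to the original question if the rewrite comes back blank.
    """
    improved = rewrite(question).strip()
    query = improved or question
    return retrieve(query)


# Hypothetical stand-ins so the sketch runs without an API key.
def fake_rewrite(question: str) -> str:
    return "TCP RST packet troubleshooting"


def fake_retrieve(query: str) -> List[str]:
    return [f"doc matching: {query}"]


question = "why won't my connection reset work?"
print(rewrite_then_retrieve(question, fake_rewrite, fake_retrieve))
# A blank rewrite falls back to the original question.
print(rewrite_then_retrieve(question, lambda q: "   ", fake_retrieve))
```

With the chain above, you would pass something like `lambda q: rewrite_chain.invoke({"question": q})` as `rewrite` and `retriever.invoke` as `retrieve`.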

Multi-Query Retrieval

A single query, even a well-written one, can only approach the topic from one angle. Multi-query retrieval generates several different versions of the question, retrieves for each, and combines the results. This casts a wider net and is especially effective when the question is genuinely multi-faceted.

multi_query_prompt = ChatPromptTemplate.from_template(
    """You are an AI assistant. Your task is to generate five different versions
of the given question to retrieve relevant documents from a vector database.
By generating multiple perspectives on the user question, your goal is to help
overcome some of the limitations of distance-based similarity search.

Provide five alternative questions separated by newlines.
Original question: {question}"""
)

# Parse the response into a list of questions
def parse_queries(text):
    return [q.strip() for q in text.strip().split("\n") if q.strip()]

multi_query_chain = multi_query_prompt | llm | StrOutputParser() | parse_queries

queries = multi_query_chain.invoke({"question": "What are the best practices for API authentication?"})

# Retrieve for each query and deduplicate by page content
seen = set()
all_docs = []
for query in queries:
    for doc in retriever.invoke(query):
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            all_docs.append(doc)

LangChain also ships a MultiQueryRetriever that does this automatically: it wraps an existing retriever, generates multiple queries internally, and returns the deduplicated union of all results.
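If you parse the queries yourself, note that models often return numbered or bulleted lines even when asked for plain newlines. A slightly more defensive parser might look like the sketch below; `parse_queries_robust` is a hypothetical name, and the set of list markers it strips is an assumption about typical model output, not an exhaustive list.

```python
import re
from typing import List

# Matches leading list markers such as "1.", "2)", "-", "*", or "•".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_queries_robust(text: str) -> List[str]:
    """Split LLM output into queries, dropping list markers, blanks, and repeats."""
    queries = []
    seen = set()
    for line in text.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        # Compare case-insensitively so trivially repeated queries are skipped.
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            queries.append(cleaned)
    return queries


sample = """1. How should an API authenticate clients?
2) What are recommended API authentication methods?

- How should an API authenticate clients?
"""
print(parse_queries_robust(sample))
```

Dropping repeated queries before retrieval also saves a retriever call for each duplicate.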

RAG-Fusion with Reciprocal Rank Fusion

Multi-query retrieval produces multiple ranked lists. Reciprocal Rank Fusion (RRF) is a simple and effective algorithm for combining them into a single ranked list that reflects which documents were highly ranked across multiple queries rather than just one.

A document scores 1 / (rank + k) for each list it appears in, where rank is its position in that list (counted from 0 in the code below) and k (typically 60) is a constant that dampens the advantage of top-ranked documents. Documents that rank highly across several query results accumulate the highest total score.

def reciprocal_rank_fusion(results_list, k=60):
    fused = {}
    for results in results_list:
        for rank, doc in enumerate(results):
            key = doc.page_content
            if key not in fused:
                fused[key] = {"doc": doc, "score": 0.0}
            fused[key]["score"] += 1.0 / (rank + k)
    sorted_docs = sorted(fused.values(), key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in sorted_docs]

all_results = [retriever.invoke(q) for q in queries]
fused_docs = reciprocal_rank_fusion(all_results)

The combination of multi-query generation plus RRF is called RAG-Fusion and is one of the most reliable off-the-shelf improvements over naive RAG.
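To make the scoring concrete, here is a small worked example on toy data. It is only a sketch: the document ids `a` to `d` and the three ranked lists are made up, plain strings stand in for LangChain documents, and `rrf_scores` is a hypothetical helper that uses the same 0-based rank as the function above.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_scores(ranked_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (rank + k) over every list a document id appears in."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank starts at 0
            scores[doc_id] += 1.0 / (rank + k)
    return dict(scores)


# Made-up results for three query variants.
ranked_lists = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]

scores = rrf_scores(ranked_lists)
order = sorted(scores, key=scores.get, reverse=True)
for doc_id in order:
    print(doc_id, round(scores[doc_id], 4))
# b: 1/61 + 1/60 + 1/60  (first in two lists, second in one)
# a: 1/60 + 1/61 + 1/62  (in all three lists, but first only once)
# c: 1/62 + 1/61
# d: 1/62
```

Document `b` ends up ahead of `a` even though both appear in every list, because `b` sits at the top of two of them.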

HyDE: Hypothetical Document Embeddings

HyDE takes a different angle. Rather than generating multiple queries, it generates a hypothetical document that would answer the question, then uses the embedding of that hypothetical document as the retrieval query.

The logic is that the embedding of a plausible answer looks more like the embeddings of real answer documents than the embedding of the bare question does. A question and its answer are about the same topic, but they land in different parts of embedding space because a question is short and interrogative while an answer is written in the declarative style and vocabulary of the documents.

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph (3-5 sentences) that would plausibly answer
the following question. It does not have to be factually correct,
just write something in the style of a document that would contain the answer.

Question: {question}
Hypothetical answer:"""
)

hyde_chain = hyde_prompt | llm | StrOutputParser()

hypothetical = hyde_chain.invoke({"question": "How does TCP handle packet loss?"})

# Retrieve using the hypothetical document as the query
results = retriever.invoke(hypothetical)

HyDE works best when the user query is very short or informal and the documents are long and formal. It adds one extra LLM call and therefore extra latency, so consider it for cases where retrieval quality matters more than speed.
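The intuition behind HyDE can be illustrated with a toy similarity measure. The sketch below uses a bag-of-words cosine similarity as a crude stand-in for a real embedding model, which is a loud simplification: real embeddings capture meaning, not just shared words. All three texts are made up for the example.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector: a toy stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up texts for illustration only.
document = (
    "tcp detects packet loss through duplicate acknowledgements and "
    "retransmission timeouts then retransmits the missing segment"
)
question = "how does tcp handle packet loss"
hypothetical = (
    "tcp handles packet loss by using acknowledgements and retransmission "
    "timeouts when a segment is lost the sender retransmits the missing segment"
)

sim_question = cosine(bow(question), bow(document))
sim_hypothetical = cosine(bow(hypothetical), bow(document))
print(f"question vs document:     {sim_question:.3f}")
print(f"hypothetical vs document: {sim_hypothetical:.3f}")
```

In this toy setup the hypothetical answer scores higher against the document than the bare question does, because it shares the document's vocabulary and phrasing.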

Logical Routing

So far, the examples have assumed a single vector database holding all your documents. Real applications often have multiple data sources: different document collections, a SQL database, an API. Routing classifies the query and sends it to the appropriate source.

LangChain's structured output feature makes this clean. You define the routing decision as a Pydantic model and ask the LLM to return one of its allowed values:

from pydantic import BaseModel
from typing import Literal

class RouteQuery(BaseModel):
    """Route a user query to the most relevant data source."""
    datasource: Literal["python_docs", "javascript_docs", "sql_database"]

structured_llm = llm.with_structured_output(RouteQuery)

routing_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert at routing user questions to the appropriate data source.
Based on the question topic, route it to the relevant source.
python_docs: Python language questions
javascript_docs: JavaScript and web frontend questions
sql_database: Questions about our product data, users, transactions"""),
    ("human", "{question}")
])

router = routing_prompt | structured_llm

decision = router.invoke({"question": "How do list comprehensions work?"})
print(decision.datasource)  # "python_docs"

Because the output is a typed Pydantic model, you get validation for free: the response is checked against the allowed values, and anything else raises a validation error that you can handle with retry logic.
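Once you have a routing decision, something still has to act on it. One common shape is a dispatch table keyed by the datasource name. The sketch below is an assumption about how that could look, not code from the series: the three `search_*` functions and `dispatch` are hypothetical placeholders for real retrievers or query functions.

```python
from typing import Callable, Dict, Literal, get_args

# Same allowed values as the RouteQuery model's datasource field.
Datasource = Literal["python_docs", "javascript_docs", "sql_database"]


# Hypothetical handlers; in a real pipeline these would call a retriever or run SQL.
def search_python_docs(question: str) -> str:
    return f"python_docs <- {question}"


def search_javascript_docs(question: str) -> str:
    return f"javascript_docs <- {question}"


def query_sql_database(question: str) -> str:
    return f"sql_database <- {question}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "python_docs": search_python_docs,
    "javascript_docs": search_javascript_docs,
    "sql_database": query_sql_database,
}


def dispatch(datasource: str, question: str) -> str:
    """Send the question to the handler registered for the chosen datasource."""
    if datasource not in get_args(Datasource):
        raise ValueError(f"Unknown datasource: {datasource!r}")
    return HANDLERS[datasource](question)


print(dispatch("python_docs", "How do list comprehensions work?"))
```

With the router above, the call would be `dispatch(decision.datasource, question)`.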

Combining transformations

These techniques are not mutually exclusive. A production pipeline might:

1. Rewrite the raw user query into a cleaner search query.
2. Generate three variants of the rewritten query.
3. Route each variant to the right data source.
4. Retrieve from each source and fuse the results with RRF.
5. Feed the top-k fused results to the LLM.
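The steps above can be sketched as a single function. This is an illustrative skeleton under heavy assumptions: `run_pipeline` and `fuse` are hypothetical names, each stage is passed in as a plain callable, documents are represented by string ids, and the lambdas in the usage part are stubs standing in for the LLM chains and retrievers.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion over lists of document ids (0-based rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)


def run_pipeline(
    question: str,
    rewrite: Callable[[str], str],
    make_variants: Callable[[str], List[str]],
    route: Callable[[str], str],
    retrievers: Dict[str, Callable[[str], List[str]]],
    top_k: int = 2,
) -> List[str]:
    rewritten = rewrite(question)            # 1. rewrite the raw query
    variants = make_variants(rewritten)      # 2. generate variants
    ranked_lists = [
        retrievers[route(variant)](variant)  # 3-4. route, then retrieve
        for variant in variants
    ]
    return fuse(ranked_lists)[:top_k]        # 4-5. fuse and keep the top k


# Stub stages so the sketch runs without a model or database.
retrievers = {
    "python_docs": lambda q: ["py-1", "py-2"],
    "sql_database": lambda q: ["sql-1", "py-1"],
}
top_docs = run_pipeline(
    "How do I query users from Python?",
    rewrite=lambda q: q.lower().rstrip("?"),
    make_variants=lambda q: [q, q + " examples", q + " sql"],
    route=lambda q: "sql_database" if "sql" in q else "python_docs",
    retrievers=retrievers,
)
print(top_docs)  # ['py-1', 'py-2']
```

Passing the stages in as callables keeps the skeleton testable with stubs and lets you swap in or remove a transformation without touching the rest.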

Each step adds latency, usually one more LLM call, in exchange for better retrieval quality. Start with the simplest approach (naive RAG), measure where it fails, and add transformations targeted at those failure modes rather than applying everything at once.

What's next

Once you have solid retrieval, the next layer is building agents: systems where the LLM has tools it can call and makes decisions about what to do next. The next part covers LangGraph's cognitive architectures: the ReAct loop, reflection, multi-agent supervisor patterns, human-in-the-loop, and more.