Query, Retrieve, and Generate
Write the query pipeline that embeds the user question, retrieves the top-k chunks, formats an augmented prompt, calls the LLM, and returns a cited answer.
The Query Pipeline: End to End
The query pipeline is the online half of RAG, the code that runs in real time when a user asks a question. It connects the components set up during indexing (the embedding model and the vector store) with the prompt template and the LLM. A well-implemented pipeline should complete its embedding and retrieval steps in under 500 ms for most workloads (the LLM call usually accounts for most of the remaining latency) and produce grounded, cited answers. In this lesson we build each step from scratch.
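Before building the individual steps, it can help to see the overall shape. The following is a minimal sketch of the flow described above, not the lesson's final implementation. The function name answer_question, the embed/search/generate callables, and the chunk layout (dicts with 'text' and 'source' keys) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[dict]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> dict:
    """Run the four online steps: embed, retrieve, build prompt, generate."""
    # 1. Embed the question (same model as used at indexing time).
    query_vector = embed(question)

    # 2. Retrieve the top-k most similar chunks from the vector store.
    #    Assumed chunk shape: {'text': ..., 'source': ...} (hypothetical).
    chunks = search(query_vector, top_k)

    # 3. Format an augmented prompt with numbered sources for citation.
    context = "\n\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 4. Call the LLM and return the answer together with its sources.
    answer = generate(prompt)
    return {"answer": answer, "sources": [chunk["source"] for chunk in chunks]}
```

Passing the three components in as callables keeps the sketch independent of any particular embedding provider, vector store, or LLM client, and makes each step easy to replace with a stub when testing.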
Step 1: Embed the User Query
The first step is to convert the user's natural language question into a vector embedding. Use the same model that was used during indexing, because vectors produced by different embedding models are not comparable. The embedding encodes the semantic meaning of the question and is compared against the document chunk embeddings in the vector store. Keep this step fast: use a lightweight model such as text-embedding-3-small, and cache embeddings for repeated identical queries.
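from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def embed_query(question: str) -> list[float]:
    # Must be the same embedding model that was used to index the chunks.
    response = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question]
    )
    # One input string was sent, so exactly one embedding comes back.
    return response.data[0].embedding

user_question = 'What is our remote work policy?'
query_vector = embed_query(user_question)
print(f'Query embedded: {len(query_vector)}-dim vector')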
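The caching advice above can be sketched with the standard library's functools.lru_cache. This is one possible approach under stated assumptions, not the lesson's prescribed method: the helper name make_cached_embedder is hypothetical, the cache is in-memory and per-process, and whitespace normalization is an illustrative choice.

```python
from functools import lru_cache
from typing import Callable, Sequence


def make_cached_embedder(
    embed_fn: Callable[[str], Sequence[float]],
    maxsize: int = 1024,
) -> Callable[[str], tuple]:
    """Wrap an embedding function with an in-memory LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> tuple:
        # Store an immutable tuple so callers cannot mutate the cached vector.
        return tuple(embed_fn(normalized))

    def cached_embed(question: str) -> tuple:
        # Trim and collapse whitespace so trivially different spellings of
        # the same query share one cache entry. Note that the normalized
        # text, not the raw input, is what gets embedded.
        return _cached(" ".join(question.split()))

    # Expose hit/miss counters for monitoring.
    cached_embed.cache_info = _cached.cache_info
    return cached_embed
```

With the embed_query function defined above, usage would look like cached_embed = make_cached_embedder(embed_query), after which a repeated call to cached_embed(user_question) is served from memory without an API request. A cache shared across processes (for example in an external key-value store) would need a different design.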