AI Engineering Academy · Lesson

Query, Retrieve, and Generate

Write the query pipeline that embeds the user question, retrieves the top-k chunks, formats an augmented prompt, calls the LLM, and returns a cited answer.

The Query Pipeline: End to End

The query pipeline is the online half of RAG: the code that runs in real time when a user asks a question. It connects all the components built during indexing: the embedding model, the vector store, the prompt template, and the LLM. A well-implemented query pipeline aims to complete in under 500ms for most workloads and to produce grounded, cited answers. In this lesson we build each step from scratch.
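Before building the individual steps, it can help to see how they fit together. The following is a minimal sketch of the overall flow, not the lesson's final implementation: `answer_question`, its parameters, the prompt wording, and the `text`/`source` chunk keys are all hypothetical names chosen for illustration. Each stage is passed in as a plain callable, so the real embedding model, vector store, and LLM can be swapped in later.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def answer_question(
    question: str,
    embed: Callable[[str], Vector],                  # hypothetical: query -> vector
    retrieve: Callable[[Vector, int], list[dict]],   # hypothetical: (vector, k) -> chunks
    generate: Callable[[str], str],                  # hypothetical: prompt -> answer text
    top_k: int = 3,
) -> dict:
    """Run the online steps in order and return the answer with its sources."""
    query_vector = embed(question)           # 1. embed the user question
    chunks = retrieve(query_vector, top_k)   # 2. fetch the top-k chunks

    # 3. Number each chunk so the model can cite it as [1], [2], ...
    context = "\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer using only the numbered context and cite sources like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    answer = generate(prompt)                # 4. call the LLM
    return {"answer": answer, "sources": [chunk["source"] for chunk in chunks]}
```

Because every stage is injected, the flow can be exercised with stub functions before any API key or vector store is available.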

Step 1: Embed the User Query

The first step is to convert the user's natural-language question into a vector embedding using the same model used during indexing. Vectors produced by different models live in different spaces and cannot be meaningfully compared, so a mismatch here silently degrades retrieval. The embedding encodes the semantic meaning of the question and will be compared against document chunk embeddings in the vector store. Keep this step fast: use a lightweight model like text-embedding-3-small and cache embeddings for repeated identical queries.

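from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def embed_query(question: str) -> list[float]:
    """Embed the question with the same model used at indexing time."""
    response = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question]
    )
    # One input string was sent, so the response holds exactly one embedding.
    return response.data[0].embedding

user_question = 'What is our remote work policy?'
query_vector = embed_query(user_question)
print(f'Query embedded: {len(query_vector)}-dim vector')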
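The caching advice above can be sketched as a thin wrapper around any embedding function. This is one possible approach under stated assumptions, not part of the original lesson code: `make_cached_embedder` and `max_size` are hypothetical names, the cache is in-memory and per-process, and stripping surrounding whitespace before lookup is an assumption about what counts as an "identical" query.

```python
from functools import lru_cache
from typing import Callable, Sequence


def make_cached_embedder(
    embed_fn: Callable[[str], Sequence[float]],
    max_size: int = 1024,  # hypothetical default; tune to your query volume
) -> Callable[[str], tuple]:
    """Wrap an embedding function with an in-memory LRU cache."""

    @lru_cache(maxsize=max_size)
    def _cached(normalized: str) -> tuple:
        # Store a tuple so callers cannot mutate the cached vector in place.
        return tuple(embed_fn(normalized))

    def embed(question: str) -> tuple:
        # Assumption: queries differing only in surrounding whitespace are
        # treated as identical, and the stripped text is what gets embedded.
        return _cached(question.strip())

    # Expose hit/miss counters from the underlying lru_cache.
    embed.cache_info = _cached.cache_info
    return embed
```

With the function defined earlier, usage would look like `cached_embed = make_cached_embedder(embed_query)`, after which repeated calls with the same question skip the API round trip. A shared cache such as Redis would be needed if several processes serve queries.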
