AI Engineering Academy · Lesson

Evaluation, Deployment, and Retrospective

Run the full evaluation harness (retrieval metrics, LLM-as-judge quality scores, and load tests), deploy to a cloud provider, and write a retrospective documenting lessons learned.

The Final Mile: Evaluation Before Shipping

A system is not ready to ship until it has been evaluated end-to-end against real-world conditions. The final evaluation phase combines all evaluation techniques learned in this track: retrieval metrics to verify the RAG pipeline finds the right chunks, LLM-as-judge scores to verify answer quality, load testing to verify latency SLAs, and security scanning to verify hardening. All must pass before deployment begins.
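The load-testing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's own harness: `run_load_test`, `fake_query`, and the concurrency value are hypothetical, and the stub should be swapped for a real client that calls the staging endpoint.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_load_test(
    send_query: Callable[[str], Awaitable[object]],
    questions: list[str],
    concurrency: int = 8,
) -> list[float]:
    """Send every question with at most `concurrency` requests in flight.

    Returns one latency per request, in milliseconds, in input order.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def timed_call(question: str) -> float:
        async with semaphore:
            # Time only the request itself, not the wait for a free slot.
            start = time.perf_counter()
            await send_query(question)
            return (time.perf_counter() - start) * 1000

    return list(await asyncio.gather(*(timed_call(q) for q in questions)))


async def fake_query(question: str) -> dict:
    """Hypothetical stand-in for the real system: sleeps, then returns a stub."""
    await asyncio.sleep(0.01)
    return {"answer": "stub", "sources": []}


if __name__ == "__main__":
    latencies = asyncio.run(run_load_test(fake_query, ["q"] * 40, concurrency=8))
    print(f"{len(latencies)} requests, slowest took {max(latencies):.1f} ms")
```

Latency measured one request at a time says little about behavior under load, so the sketch keeps several requests in flight and records each one separately. The resulting list can then be summarized as p50/p95/p99 and compared with the latency SLA.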

Running the Full Evaluation Harness

Execute the complete evaluation suite against your staging environment using a representative sample of production-like queries. Record all metrics: hit rate, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG) for retrieval; faithfulness and answer relevance for generation; per-query cost; and p50/p95/p99 latency. Compare every metric to the success criteria defined in the architecture phase. Do not ship until all hard minimums are met.
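For reference, the three retrieval metrics can be computed as in the sketch below. It assumes binary relevance (a document is either relevant or not) and that each ranked list contains unique document IDs. The function names and the `(ranked_ids, relevant_ids)` input shape are illustrative choices, not part of the course code.

```python
import math


def hit_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking over DCG of an ideal ranking."""
    dcg = 0.0
    for position, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant:
            dcg += 1.0 / math.log2(position + 1)
    # An ideal ranking puts every relevant document first (capped at k).
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def summarize_retrieval(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average each per-query metric over all queries."""
    if not runs:
        raise ValueError("no retrieval runs to summarize")
    n = len(runs)
    return {
        "hit_rate": sum(hit_at_k(ranked, rel, k) for ranked, rel in runs) / n,
        "mrr": sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n,
        "ndcg": sum(ndcg_at_k(ranked, rel, k) for ranked, rel in runs) / n,
    }
```

Hit rate only records whether a relevant chunk was retrieved at all. MRR and NDCG also reward ranking it near the top, which matters when only the first few chunks fit in the prompt.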

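import time

# load_test_set, query_system, judge, and compute_final_metrics are project
# helpers assumed to be defined elsewhere in the codebase.
async def final_evaluation(system_url: str, test_set_path: str) -> dict:
    """Run every test case against the system and aggregate the results."""
    test_cases = load_test_set(test_set_path)
    results = []
    for case in test_cases:
        # Time the full round trip to the system under test.
        start = time.perf_counter()
        response = await query_system(system_url, case['question'])
        latency_ms = (time.perf_counter() - start) * 1000
        # LLM-as-judge quality score; the reference answer is optional.
        judge_score = await judge(case['question'], response['answer'], case.get('reference'))
        # Retrieval hit: the expected document appears in at least one returned source.
        hit = any(case['relevant_doc'] in s for s in response.get('sources', []))
        results.append({'latency_ms': latency_ms, 'score': judge_score, 'hit': hit, 'cost': response.get('cost_usd', 0)})
    return compute_final_metrics(results)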
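The lesson does not show `compute_final_metrics`, so the following is only one possible shape for it. It assumes the judge returns a numeric score and uses the nearest-rank method for percentiles. The `failed_criteria` helper and its threshold dictionaries are hypothetical additions for comparing results against the success criteria.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_final_metrics(results: list[dict]) -> dict:
    """Aggregate per-query records with keys latency_ms, score, hit, and cost."""
    if not results:
        raise ValueError("no evaluation results to aggregate")
    n = len(results)
    latencies = [r["latency_ms"] for r in results]
    return {
        "hit_rate": sum(1 for r in results if r["hit"]) / n,
        "mean_judge_score": sum(r["score"] for r in results) / n,  # assumes numeric scores
        "mean_cost_usd": sum(r["cost"] for r in results) / n,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


def failed_criteria(metrics: dict, minimums: dict, maximums: dict) -> list[str]:
    """Names of metrics that fall below a hard minimum or exceed a hard maximum."""
    failures = [name for name, floor in minimums.items() if metrics[name] < floor]
    failures += [name for name, ceiling in maximums.items() if metrics[name] > ceiling]
    return failures
```

A release check could then call `failed_criteria(metrics, minimums, maximums)` with the thresholds from the architecture phase and block deployment whenever the returned list is non-empty. Quality metrics belong in `minimums`, while latency and cost belong in `maximums`.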
