AI Engineering Academy · Lesson

Why LLM Apps Are Hard to Debug

Understand why traditional logging is insufficient for LLM applications, what information you need to diagnose failures in RAG and agent pipelines, and the tracing data model.

The Unique Challenges of LLM Debugging

Traditional software is largely deterministic: given the same input and state, it produces the same output, and when it fails, a stack trace usually points to the line that failed. LLM applications break these assumptions. The same prompt can produce different outputs on different calls, failures are often silent (a wrong answer instead of an exception), and the cause may be buried in a prompt built five steps earlier in a chain. Standard logging and debugging tools were not designed for this.
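To make the "buried five steps earlier" problem concrete, here is a minimal sketch of recording every step of a pipeline under one shared trace ID, so that a bad final answer can be traced back to the step that caused it. The `Span` and `Trace` classes and their field names are hypothetical, chosen for illustration; real tracing tools define their own schemas.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One recorded step in a pipeline (hypothetical structure for illustration)."""
    trace_id: str                    # shared by every step of one request
    name: str                        # e.g. 'retrieve', 'generate'
    inputs: Any
    outputs: Any = None
    parent_id: Optional[str] = None  # span_id of the step that fed this one
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)


class Trace:
    """Collects the spans of a single request in the order they were recorded."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def record(self, name, inputs, outputs, parent_id=None):
        span = Span(self.trace_id, name, inputs, outputs, parent_id)
        self.spans.append(span)
        return span

    def find(self, name):
        """Return every recorded span with the given step name."""
        return [s for s in self.spans if s.name == name]


# Usage: a two-step RAG request where retrieval silently returned nothing.
trace = Trace()
retrieval = trace.record('retrieve', {'query': 'refund policy'}, {'docs': []})
trace.record('generate', {'prompt': 'Answer using: []'},
             {'answer': 'I am not sure.'}, parent_id=retrieval.span_id)
```

No exception was raised anywhere in this request, yet inspecting `trace.find('retrieve')` shows the empty document list that explains the weak answer.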

Non-Determinism Makes Reproduction Hard

LLM outputs are non-deterministic by default. Even with temperature=0, the same prompt may produce slightly different outputs because of request batching and floating-point precision on the serving side. This makes bugs intermittent: a prompt that fails 20% of the time will still pass a single test run 80% of the time. Reproducing a specific failure requires logging the exact input, the model parameters, and the output at the time of failure, not just the input.

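import json
import time
import uuid

def write_to_trace_store(log_entry, path='llm_trace.jsonl'):
    # Hypothetical stand-in for a real trace store:
    # append one JSON object per line to a local file.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(log_entry, default=str) + '\n')

def logged_llm_call(client, messages, model, temperature, **kwargs):
    # A UUID stays unique across processes and restarts;
    # id(messages) can be reused once the object is garbage-collected.
    request_id = uuid.uuid4().hex

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        **kwargs
    )

    # Log EVERYTHING needed to reproduce this exact call
    log_entry = {
        'request_id': request_id,
        'model': model,
        'temperature': temperature,
        'params': kwargs,  # extra arguments such as max_tokens also affect the output
        'messages': messages,
        'response': response.choices[0].message.content,
        'finish_reason': response.choices[0].finish_reason,
        # usage may be missing on some responses, so guard before dumping it
        'usage': response.usage.model_dump() if response.usage else None,
        'timestamp': time.time()
    }
    write_to_trace_store(log_entry)
    return response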
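
The intermittency claim above can be checked with a little arithmetic. The sketch below estimates a failure rate from repeated runs and computes how likely it is that a given number of runs all pass anyway. Both function names are hypothetical, `run_once` and `is_failure` are callables you would supply, and the calculation assumes each run fails independently with the same probability.

```python
def estimate_failure_rate(run_once, is_failure, n_runs=20):
    """Call run_once() n_runs times and return the fraction flagged by is_failure."""
    failures = sum(1 for _ in range(n_runs) if is_failure(run_once()))
    return failures / n_runs


def chance_of_missing(failure_rate, n_runs):
    """Probability that n_runs independent runs all pass despite the failure rate."""
    return (1 - failure_rate) ** n_runs
```

Under these assumptions, `chance_of_missing(0.2, 1)` is 0.8, so a single run misses the bug four times out of five, while `chance_of_missing(0.2, 10)` is about 0.107. Repeating a test case several times is therefore a cheap way to surface intermittent failures, at the cost of extra model calls.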
