Adding an LLM integration to our application turned debugging into a nightmare. Here’s how we fixed it in 5 days with zero new tools.
Startup Engineering Principle: Choose the Simple Solution over the Perfect One
The Problem:
A single user request triggers 200+ LLM calls. Which one failed? Where did context break? Our MTTR (mean time to resolution) climbed as high as 7 hours. As a startup, we can’t afford to put our reputation at stake.
The Startup Dilemma:
Everyone suggested: LangSmith, LangFuse, Helicone.
Evaluating tools = 2–3 weeks
Every new tool = learning curve + cost + vendor management
We don’t even know what we need yet
First-Principles Questions:
What exactly do we need to debug? Input sent, response received, post-processing results, and relationships between 200+ calls.
The insight: This is not a logging problem. It’s a tracing problem. Each LLM call is stateless; our application maintains the context. Our observability should mirror that.
The Solution:
We already had Grafana Loki, Tempo, and OpenTelemetry.
The breakthrough: Connect all LLM calls under one parent span.
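In OpenTelemetry terms, the idea looks roughly like this. This is a minimal Python sketch, not our production code; the tracer name, `FACTORS` list, and `call_llm` stub are hypothetical stand-ins:

```python
# Minimal sketch: one parent span per request, one child span per factor,
# one span per LLM call. Assumes the OpenTelemetry SDK is already configured
# to export spans to Tempo via OTLP.
from opentelemetry import trace

tracer = trace.get_tracer("pdf-analyzer")  # hypothetical instrumentation name

FACTORS = [f"factor_{n}" for n in range(1, 11)]  # e.g. factor_1 = "profit"

def call_llm(factor: str, document_text: str) -> str:
    return f"response for {factor}"  # stand-in for the real LLM client

def analyze_pdf(document_text: str) -> None:
    # Parent span: the whole "Analyze PDF" request.
    # Every child below inherits its trace_id.
    with tracer.start_as_current_span("analyze_pdf"):
        for factor in FACTORS:
            # One child span per factor...
            with tracer.start_as_current_span(f"factor.{factor}"):
                for i in range(20):  # ~20 LLM calls per factor
                    # ...and one span per stateless LLM call.
                    with tracer.start_as_current_span(f"llm_call.{i}") as span:
                        span.set_attribute("llm.call.index", i)
                        call_llm(factor, document_text)
```

Every span carries the same trace_id, so one query reconstructs the whole request.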
Architecture:
Application Layer
- OpenTelemetry emits logs and traces.
Logs (Grafana Loki)
- Captures each LLM interaction: trace_id & span_id
- Message sent & response received
Traces (Grafana Tempo)
- Parent span connects all related calls:
  “Analyze PDF” → Factor 1 (Profit) → … → Factor 10
  Each factor → LLM Call 1 … LLM Call 20 (10 factors × 20 calls ≈ 200 calls per request)
- Shows serial vs. parallel execution and timing.
Observability UI (Grafana)
- Query by span_id → Complete journey with all logs.
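The glue between the two backends is trace context stamped on every log line. A minimal sketch of what that might look like in Python; the function name and field names are our own convention, and a JSON log formatter is assumed so Loki can parse the fields:

```python
# Sketch of log correlation: attach the active trace_id/span_id to every
# LLM log line so Loki entries can be joined to Tempo spans.
# Assumes this runs inside an active span (see the tracing sketch above).
import logging

from opentelemetry import trace

logger = logging.getLogger("llm")

def log_llm_interaction(message_sent: str, response: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    logger.info(
        "llm interaction",
        extra={
            "trace_id": format(ctx.trace_id, "032x"),  # 128-bit trace id as hex
            "span_id": format(ctx.span_id, "016x"),    # 64-bit span id as hex
            "message_sent": message_sent,
            "response": response,
        },
    )
```

In Grafana, a LogQL query along the lines of `{app="pdf-analyzer"} | json | span_id="<id>"` then pulls every log line for a single call (assuming logs carry an `app` label and are shipped as JSON).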
The Results:
MTTR: 7 hours → 30 minutes
Timeline: 5 days (2 design + 2 dev + 1 infra)
Cost: $0 in new tools
Complexity: Zero new vendors
Key Design Decisions:
Why Traces Over Logs?
Logs show events. Traces show relationships. For 200+ interconnected calls, relationships matter.
Why Existing Stack Over New Tools?
Our existing stack solved 80% of the problem. New tools would solve 100% but cost 5× in time, money, and complexity.
Why Parent-Child Spans?
Mirrors how our application maintains state. One request → multiple factors → multiple LLM calls.
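One consequence: the structure survives fan-out. A sketch of the parallel case, assuming factors run on a thread pool (names hypothetical); in OpenTelemetry for Python, context must be propagated to worker threads by hand:

```python
# Sketch: child spans stay attached to the parent even when factors
# run in parallel. The thread pool is an illustration, not our exact setup.
from concurrent.futures import ThreadPoolExecutor

from opentelemetry import context, trace

tracer = trace.get_tracer("pdf-analyzer")

def analyze_factor(factor: str, parent_ctx: context.Context) -> None:
    token = context.attach(parent_ctx)  # adopt the request's context in this thread
    try:
        with tracer.start_as_current_span(f"factor.{factor}"):
            ...  # the ~20 LLM calls for this factor
    finally:
        context.detach(token)

def analyze_pdf_parallel(factors: list[str]) -> None:
    with tracer.start_as_current_span("analyze_pdf"):
        parent_ctx = context.get_current()  # capture while the parent span is active
        with ThreadPoolExecutor(max_workers=5) as pool:
            for factor in factors:
                pool.submit(analyze_factor, factor, parent_ctx)
```

The trace then shows which factors ran in parallel and which serially, straight from the span timings.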
Startup Engineering Lessons:
Don’t: Add new tools before understanding the problem.
Do: Define what you need first (this saved us 3 weeks).
Don’t: Evaluate 5 tools when you don’t know what you need.
Do: Use existing stack, learn what’s missing, choose deliberately.
Don’t: Wait for the “perfect” solution.
Do: Ship an 80% solution in 5 days, iterate based on production issues.
We’ll eventually need specialized AI observability tools. But not today. Tomorrow, we’ll know exactly which tool solves which problem.
That’s Startup Engineering: choose the simple over the perfect solution. Solve for 80%. Iterate to 100%.