Agent observability and lifecycle

Agents behave like black boxes. Identical code produces different outputs. A test passes locally but fails in CI/CD. A production incident happens, and tracing the root cause feels impossible without manually sifting through logs.

After exploring observability frameworks and thinking about how to maintain near-deterministic results and controlled costs, I came to ask myself:

How can we capture what agents actually do—not just what they output—and use that evidence to build better systems?

The problem: invisible agent behavior

Developers face three persistent challenges when building agents locally:

Non-deterministic outputs: Identical prompts and contexts yield different results, making reproduction and debugging nearly impossible.
Context switching overhead: Understanding an agent’s reasoning path requires jumping between IDE, logs, and external debugging tools.
Evidence gaps: When something fails—in development, CI/CD, or production—there’s no reliable trace showing why it happened.

Traditional observability tools built for synchronous services fall short here. Agents aren’t simple request-response systems. They make decisions, invoke tools, delegate to sub-agents, and navigate complex reasoning paths. Standard dashboards can’t capture this complexity.

Introducing Monocle: traces for generative AI

Monocle is an open-source observability framework specifically designed for AI applications. Built as an LF AI & Data project by Okahu, it extends OpenTelemetry (OTel) with a meta-model tailored for LLM and agentic workloads.

The framework embodies three principles:

Open-source standards: Fully OTel-compliant spans requiring minimal code changes. Your instrumentation integrates with industry standards, not proprietary black boxes.

Native extensibility: Add new AI components without Docker or cloud dependencies. Local tracing works out of the box.

Agnostic exporting: Ship traces where you need them—local files, cloud platforms, S3, or any OTel-compatible collector.

Enabling Monocle requires just two lines:

from monocle_apptrace import setup_monocle_telemetry
setup_monocle_telemetry(workflow_name='adk_travel_agent', monocle_exporters_list=['file', 'okahu'])

That’s it. You now have deep observability with minimal instrumentation overhead.

Why ecosystem compatibility matters

Monocle doesn’t exist in isolation. It integrates across the entire AI stack:

Agentic frameworks: LangGraph, LlamaIndex, Google ADK, OpenAI Agent SDK, CrewAI
LLM providers: OpenAI, Azure OpenAI, Anthropic, Google Vertex, AWS Bedrock, DeepSeek
Web frameworks: FastAPI, Flask, AWS Lambda, Vercel
LLM frameworks: Langchain, Haystack
Exporters: File, memory, S3, Azure Blob, GCS, Okahu Cloud

This breadth means you’re not locked into a specific vendor or tech stack. Use the tools you already have; Monocle wraps them with observability.

Phase 1: taming the local development black box

Instrument Monocle to export traces locally and to Okahu Cloud.
Install the Monocle Claude skill or MCP (Model Context Protocol).
Run your agent tests; Monocle generates trace JSON data.
If a test fails, feed the trace JSON and git history to Claude—let it regenerate the code based on evidence.
Once the test passes, commit.

The philosophy is radical in its simplicity: Your local trace is evidence that something worked in the past.

This flips the debugging model on its head. Instead of asking “why did this fail?” and wrestling with hypotheticals, you ask “what does the trace show?” and let AI tools help you reason about it.

Validation beyond outputs

Traces capture four critical validation areas:

Agentic response: Does the output match expectations? String comparison or semantic similarity.
Tool invocation: Were the correct tools called with proper inputs?
Agent delegation: Did sub-agents route correctly and hand off successfully?
Cost and performance: Token usage, latency, error states—the economics of your agent’s execution.

The trace JSON includes metadata: commit hash, test name, workflow name, session IDs, and specific assertions. When a tool wasn’t invoked or an agent hallucinated, the trace captures it as evidence.

Phase 2: CI/CD pipeline quality gates

The problem compounds in CI/CD. Agents fail without leaving breadcrumbs. Non-deterministic responses mean you can’t reliably reproduce failures, stretching RCA (root cause analysis) from minutes to hours.

Monocle enables native GitHub Actions integration. Agent quality checks happen at the commit level, before staging or production deployment.

Isolation: Test functional changes specifically in staging vs. production.
Filtering: Query AI insights by workflow, commit, or pipeline run to spot performance degradation.
Elimination: No more manual log-diving to figure out why an agent run broke.

Traces become artifacts of your pipeline, queryable and comparable across runs.

Phase 3: production observability and root cause analysis

Standard dashboards fail for agents. They show metrics and logs, but agents operate in complex “inner loops” with multiple decision points and tool invocations. A slow response might originate from model drift, latency cascades, or malformed tool inputs—all invisible to traditional metrics.

Monocle’s solution is automated investigation. Using the Okahu MCP within your IDE and specialized SRE agents, you query traces through analysis graphs that visualize service-to-service logic paths:

User Issue: Slow Response
  → Service A: API Gateway
  → Service B: AI Model Host
  → Root Cause: Model Drift / Latency

The Okahu cloud portal becomes an operations center for the “operate” phase:

Application insights: Trace breakdowns, error rates, performance patterns.
Kahu: An AI assistant querying recent errors and anomalies in natural language. “Show me APITimeoutError incidents from the last 24 hours.”

Closing the loop: evidence-based development

The framework’s strength lies in its philosophy: traces as evidence across three phases.

Code phase: Local traces during development, Claude skill / MCP helping you reason about failures.

Test phase: CI/CD evaluations creating trace artifacts for every run.

Operate phase: Production dashboards and SRE agents analyzing traces to prevent escalations.

This isn’t just logging. It’s treating traces as first-class artifacts—queryable, comparable, and actionable across your entire development lifecycle.

Practical takeaways

If you’re building agents at scale, consider:

Instrument early: Two lines of Monocle setup give you deep observability. The cost is negligible.
Capture local evidence: Develop with traces from day one. Make debugging data-driven, not intuition-driven.
Integrate pipeline checks: Push trace-based evaluations into CI/CD as quality gates.
Automate investigation: Let AI tools help you reason about failures using trace data as context.

The maturity of agentic applications depends on visibility. Agent observability removes the veil.

Resources

Demo portal: portal.okahu.co
MCP: mcp.okahu.co/mcp
OSS: monocle2ai.org