Technical Training Resource

Observe, debug, and govern production AI agents.

AgentOps is the operating discipline for autonomous AI agents: lifecycle management, observability, evaluation, governance, cost control, and runtime reliability. The first training track focuses on OpenTelemetry for agents.

Agent SDK → tool calls / MCP → traces · logs · metrics → OTLP → Grafana · Honeycomb · Langfuse · Datadog

What is AgentOps?

AgentOps is the emerging practice for lifecycle management of autonomous AI agents, bringing DevOps and MLOps-style management, monitoring, and improvement into agentic pipelines.

IBM defines AgentOps as an emerging discipline for managing, monitoring, and improving agentic development pipelines. Red Hat independently frames AgentOps around observing, evaluating, governing, and optimizing agentic systems. Academic literature uses AgentOps as the discipline name for tracing, monitoring, logging, and analytics for agent safety in production.

The phrase is not theoretical. Multiple enterprise vendors are naming it, tooling is being built around it, and the operational motions it describes (trace what the agent did, catch failures, measure cost, govern access) are already needed wherever agents run in production.

Lifecycle

Manage agents end-to-end

From prototype to production: deployment, versioning, rollback, and shutdown controls for autonomous agent systems.

Observability

See what agents do

Traces, logs, and metrics across agent loops, tool calls, model requests, and infrastructure — without modifying agent code.

Governance

Control what agents can do

Access controls, policy boundaries, human handoff triggers, audit trails, and incident review for production agent workloads.

Why agent observability matters

A production AI agent is a loop: model call → tool selection → tool execution → output → next model call. Every leg of that loop can fail, drift, or cost more than expected. Without observability, you cannot see where.
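The loop described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not any SDK's real agent loop: `StepRecord`, `timed_step`, `run_agent`, and `fake_model` are hypothetical names, and the records stand in for the spans an OpenTelemetry tracer would produce.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    """One timed leg of the agent loop (a stand-in for an OTel span)."""
    name: str
    duration_ms: float
    ok: bool
    attributes: dict = field(default_factory=dict)


def timed_step(records, name, fn, **attributes):
    """Run fn(), then record its duration and whether it raised."""
    start = time.perf_counter()
    ok = False
    try:
        result = fn()
        ok = True
        return result
    except Exception as exc:
        attributes["error"] = repr(exc)
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        records.append(StepRecord(name, elapsed_ms, ok, attributes))


def run_agent(call_model, tools, prompt, max_turns=5):
    """Model call -> tool selection -> tool execution -> next model call."""
    records = []
    message = prompt
    for _ in range(max_turns):
        decision = timed_step(records, "model_call", lambda: call_model(message))
        if decision.get("tool") is None:
            return decision["output"], records
        tool_name = decision["tool"]
        message = timed_step(
            records,
            "tool_execution",
            lambda: tools[tool_name](decision["input"]),
            tool=tool_name,
        )
    return None, records  # turn limit reached without a final answer


def fake_model(message):
    # Hypothetical model: request the "add" tool once, then answer.
    if isinstance(message, str):
        return {"tool": "add", "input": (2, 3)}
    return {"tool": None, "output": f"The sum is {message}"}


output, records = run_agent(fake_model, {"add": lambda args: sum(args)}, "What is 2 + 3?")
for record in records:
    print(record.name, record.ok, record.attributes)
```

Every leg of the loop leaves a record, including legs that raise, so a failed or slow step is visible even when the run never produces an output.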

The questions that matter in production:

Tools

Which tools did the agent call?

Which MCP tools, which sequences, which failed, which were called unexpectedly.

Latency

Where is time being spent?

Model request latency vs tool execution vs orchestration overhead. Which step is the bottleneck.

Cost

How much is each run costing?

Token spend per agent run, per tool call, per user. Where costs are growing unexpectedly.

Failures

Where did it break?

Errors, timeouts, unexpected tool outputs, policy violations, and loop termination events.
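All four questions can be answered from the same per-step records. The sketch below assumes each step of a run has already been captured as a dictionary. The field names, the sample data, and the per-token prices are illustrative assumptions, not any vendor's schema or pricing.

```python
from collections import Counter, defaultdict

# Hypothetical step records for one agent run. In practice these would be
# derived from exported spans.
steps = [
    {"kind": "model", "name": "model_call", "ms": 820.0,
     "input_tokens": 1200, "output_tokens": 150, "error": None},
    {"kind": "tool", "name": "search_docs", "ms": 340.0, "error": None},
    {"kind": "model", "name": "model_call", "ms": 910.0,
     "input_tokens": 1900, "output_tokens": 220, "error": None},
    {"kind": "tool", "name": "send_email", "ms": 5000.0, "error": "timeout"},
]

# Assumed USD prices per million tokens; substitute your provider's rates.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


def summarize_run(steps, prices=PRICE_PER_MILLION):
    """Answer: which tools, where time went, what it cost, what failed."""
    tool_calls = Counter(s["name"] for s in steps if s["kind"] == "tool")

    latency_ms = defaultdict(float)
    for s in steps:
        latency_ms[s["kind"]] += s["ms"]

    input_tokens = sum(s.get("input_tokens", 0) for s in steps)
    output_tokens = sum(s.get("output_tokens", 0) for s in steps)
    cost_usd = (
        input_tokens * prices["input"] + output_tokens * prices["output"]
    ) / 1_000_000

    failures = [(s["name"], s["error"]) for s in steps if s["error"]]

    return {
        "tool_calls": dict(tool_calls),
        "latency_ms_by_kind": dict(latency_ms),
        "bottleneck": max(latency_ms, key=latency_ms.get),
        "cost_usd": cost_usd,
        "failures": failures,
    }


summary = summarize_run(steps)
print(summary)
```

Grouping by user or by tool instead of by step kind is a matter of changing the grouping key, provided that attribute was recorded on each step in the first place.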

The OpenTelemetry bridge

OpenTelemetry (OTel) is the common language between agent behavior and production infrastructure. It is not specific to AI — it is the same telemetry standard that runs in distributed microservices at scale. AgentOps brings it to the agent layer.

Claude Agent SDK
Agent loop → tool calls / model requests → OTel traces · metrics · logs → OTLP export → Honeycomb · Datadog · Grafana · Langfuse
The Claude Agent SDK can export traces, metrics, and log events to any OTLP-compatible backend. That gives visibility into which tools agents called, model-request latency, token spend, and where failures occurred, without modifying the agent loop.
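OTLP exporters are commonly configured through environment variables defined by the OpenTelemetry specification. The sketch below builds those variables for a process that runs an agent. The helper `otlp_env`, the service name, and the header name are hypothetical, and whether the Claude Agent SDK needs additional switches to turn telemetry on is not covered here, so check its documentation before relying on this.

```python
import os


def otlp_env(endpoint, service_name, headers=None, protocol="http/protobuf"):
    """Build standard OpenTelemetry OTLP exporter environment variables."""
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_SERVICE_NAME": service_name,
    }
    if headers:
        # The spec expects comma-separated key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{key}={value}" for key, value in headers.items()
        )
    return env


# 4318 is the conventional OTLP/HTTP port of a local collector.
# The header name and value are placeholders for a backend's auth scheme.
env = otlp_env(
    "http://localhost:4318",
    "support-agent",
    headers={"x-api-key": "REPLACE_ME"},
)

# Merge with the current environment before launching the agent process.
child_env = {**os.environ, **env}
```

Pointing the endpoint at a local OpenTelemetry Collector rather than directly at a vendor keeps backend credentials out of the agent process and lets the same telemetry fan out to more than one backend.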
Cloudflare Workers Runtime
Worker handler → KV / R2 / Durable Objects, fetch calls → OTel traces · logs → Honeycomb · Grafana Cloud · Axiom · Sentry
Cloudflare Workers exports OpenTelemetry-compliant traces and logs automatically — no code changes. Request flows through Workers and connected services, binding operations, and handler invocations are all captured. This is the infrastructure layer of the AgentOps stack.
MCP Tool-Call Audit Trail
MCP tool invocation → input / output capture → trace span → OTLP → audit backend · SIEM · governance log
Model Context Protocol (MCP) tool calls are discrete, auditable events. Each tool invocation can be traced as a span: what was called, with what input, what it returned, how long it took. This is the governance layer of AgentOps.
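A tool call traced as a span might look like the sketch below. It uses only the Python standard library: the `audit_log` list stands in for an OTLP exporter or SIEM sink, and `tool_span`, `call_tool`, and the `get_weather` tool are hypothetical names, not part of MCP or of any SDK.

```python
import json
import time
import uuid
from contextlib import contextmanager

audit_log = []  # stand-in for an OTLP exporter or SIEM sink


@contextmanager
def tool_span(tool_name, tool_input):
    """Record what was called, with what input, the result, and duration."""
    span = {
        "span_id": uuid.uuid4().hex[:16],
        "tool": tool_name,
        "input": json.dumps(tool_input, sort_keys=True),
        "output": None,
        "status": "ok",
    }
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["status"] = "error"
        span["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so every call leaves a record.
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        audit_log.append(span)


def call_tool(tools, tool_name, tool_input):
    """Invoke a tool by name inside an audit span."""
    with tool_span(tool_name, tool_input) as span:
        result = tools[tool_name](**tool_input)
        span["output"] = json.dumps(result)
        return result


# Hypothetical tool registry for illustration.
tools = {"get_weather": lambda city: {"city": city, "temp_c": 18}}
result = call_tool(tools, "get_weather", {"city": "Lisbon"})
print(audit_log[-1]["tool"], audit_log[-1]["status"])
```

Because inputs and outputs are captured verbatim, an audit trail like this can hold sensitive data. Redacting or hashing selected fields before export is a common precaution.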

AgentOps Foundations

A practical training path for running AI agents in production. Modules focus on the operational skills required to instrument, monitor, debug, and govern autonomous agent systems.

M1
AgentOps fundamentals
What AgentOps is, why it emerged, how it relates to DevOps and MLOps, and the operating motions it requires.
Coming soon
M2
Agent traces, logs, and metrics
The three pillars of agent observability. What to instrument, what to collect, and what to ignore.
Coming soon
M3
OpenTelemetry and OTLP
OTel primitives — spans, traces, metrics, logs, context propagation — applied to the agent loop and tool-call chain.
Coming soon
M4
Claude Agent SDK observability
Exporting traces and metrics from the Claude Agent SDK via OTLP. Backend configuration for Honeycomb, Langfuse, Grafana, and Datadog.
Coming soon
M5
Cloudflare Workers observability
Runtime traces, logs, and metrics from Workers. OTel export configuration. Tracing agent-to-infrastructure request flows.
Coming soon
M6
MCP tool-call audit trails
Capturing and auditing Model Context Protocol tool invocations. Governance patterns, retention, and SIEM integration.
Coming soon
M7
Cost, latency, token, and failure dashboards
Building production dashboards for agent economics: token spend per run, latency by step, error rates, and cost-per-outcome.
Coming soon
M8
Human handoffs and incident review
Triggering human review, handoff protocols, escalation logic, and post-incident analysis for agent failures.
Coming soon
M9
Governance, access, and lifecycle controls
Policy boundaries, permission models, agent versioning, rollback, and end-of-lifecycle controls for production deployments.
Coming soon

Join the waitlist

For engineers, automation leads, AI ops teams, and operators building production agent workflows.

Early access to AgentOps Foundations

Get notified when training modules launch. No spam. Unsubscribe any time.

