What to ask your customers.

Pre-research synthesis from public signals. Interview guide, not product decisions.

Product: AI Agent Observability Platform | Job: Monitor AI agent operations

Executive Summary

🎯

Purpose

Pre-research synthesis identifying agent observability gaps to inform customer interviews. NOT product decisions.

🔍

Key Finding: 4 Critical Gaps

  • Prompt Injection Detection (9.6/10 pain) — 0 of 8 tools have it
  • Decision-Level Trace (9.2/10 pain) — All show WHAT, none show WHY
  • Behavioral Cost Anomaly (8.5/10 pain) — No real-time pattern detection
  • Compliance Audit Trail (6.5/10 pain) — No agent-specific trails
📋

Next Steps

  1. Review interview guide (Section 5)
  2. Interview 5-10 engineering leads
  3. Validate top 3 gaps are real
  4. Return with validated findings
100+ sources analyzed 8 competitors validated 75 structured quotes 4 critical gaps found
PRE-RESEARCH REPORT

Pre-Research Report — Directional Signal, Not Validated

This report synthesizes publicly available signals (GitHub issues, Reddit discussions,
developer forums, academic papers, product launches) to identify potential customer pain points.
It is NOT based on customer interviews or quantitative surveys.

Use this to inform an interview guide, NOT to make product decisions.

Confidence tiers:

Quality tiers:

Confidence Tiers

Strong Signal 5+ independent sources
Emerging Pattern 2-3 sources
Hypothesis Inferred, needs validation

Sourced Outcomes

Customer pain points identified from public signals (GitHub, Reddit, forums). Each outcome includes source citations.

Strong Signal High Priority Monitor

Minimize time spent diagnosing failed agent runs (decision-level debugging)

Evidence (10 sources):

HN Discussion 2026-01-08
"Debugging agents is painful - When your agent makes 20 tool calls and fails, good luck figuring out which decision was wrong. WatchLLM gives you a step-by-step timeline showing every decision, tool call, and model response with explanations for why the agent did what it did."
🔗 View source
HN Discussion 2026-03-01
"When agents fail, choose wrong tools, or blow cost budgets, there's no way to know why - usually just logs and guesswork. As agents move from demos to production with real SLAs and real users, this is not sustainable."
🔗 View source
HN Discussion 2025-12-15
"Most agent failures are silent. Most failures occur in components that showed zero issues during testing. Why? Because we treat agents as black boxes - query goes in, response comes out, and we have no idea what happened in between."
🔗 View source
HN Discussion 2023-07-10
"I cannot imagine spending extended time with a framework without knowing what the internals are doing. I do realize this isn't achievable on all levels with LLMs, but introducing more black boxes on top of existing ones isn't solving any problems."
🔗 View source
HN Discussion 2026-02-03
"Most tutorials and frameworks (LangChain, AutoGPT, etc.) felt like black boxes that added unnecessary layers of abstraction. Debugging a 'ReasoningEngine' when it hallucinated was a nightmare."
🔗 View source
Reddit Thread 2025-01-30
"logs are all over the place. whisper or deepgram for transcriptions, openai or rasa for intent classification, langchain traces for response generation, plus vector dbs like pinecone for memory. jumping between cloud dashboards or writing custom scripts just to debug one conversation is a pain."
🔗 View source
Reddit Thread 2023-12-10
"A lot of times I had to look at the sourcecode, or use a lot of debugging breakpoints to figure out what was going on. For example, the other day, I used the new OpenAI assistant feature and it was not clear from the docs how to get the response and the thread ID from the object returned by invoke."
🔗 View source
Twitter/X 2026-03-XX
"When something goes wrong in traditional software, you know what to do: check the error logs, look at the stack trace, find the line of code that failed. But AI agents have changed what we're debugging. When an agent takes 200 steps, repeatedly calls tools, updates state, and still produces the wrong result, there is no stack trace to inspect. Nothing crashed."
🔗 View source
Reddit Thread 2026-01-05
"As workflows get more complex (multi-step chains, agents, tool calls, retries), it gets hard to answer questions like: Where is latency coming from? How many tokens are we using per chain or user? Which tools, chains, or agents are invoked most? Where do errors, retries, or partial failures happen?"
🔗 View source
Article (Salesforce) 2026-XX-XX
"Agent observability focuses on 'unknown unknowns'. It seeks to answer complex questions: why did an agent choose a specific tool over another? Why did a reasoning loop fail to reach a conclusion?"
🔗 View source
Strong Signal High Priority Monitor

Increase confidence that cost anomalies will be detected early (real-time behavioral detection)

Evidence (8 sources):

HN Discussion 2026-01-08
"Agent costs spiral fast - Agents love getting stuck in loops or calling expensive tools repeatedly. WatchLLM tracks cost per step and flags anomalies like 'loop detected - same action repeated 3x, wasted $0.012' or 'high cost step - $0.08 exceeds threshold'."
🔗 View source
HN Discussion 2026-02-03
"I built AgentPulse because I kept getting surprise bills from my AI agents and had no idea which calls were burning money. The problem: You build an agent, it works great. Then you check your OpenAI bill: $400. Which agent? Which calls? No clue."
🔗 View source
HN Discussion 2026-03-05
"I built this because I had a $47 Tuesday. One Claude Code session, eight hours, no visibility into what was happening. By the time I checked the billing page the next morning, the damage was done."
🔗 View source
Reddit Thread 2025-10-13
"Last but not least, use the slowest but the best reasoning tool is your brain, fix the bug when the LLM can't quickly analyze the problem. yeah i got it, super expensive, i switched to GPT5 and Gemini 2.5 pro, from one prompt, Caude took almost 20 $"
🔗 View source
Reddit Thread 2026-01-19
"You describe an agent idea in plain English, and it outputs three implementation approaches (low / medium / high cost) with rough breakdowns for models, infra, and usage assumptions. The goal isn't 'accurate pricing'. It's helping people reason about feasibility and trade-offs earlier"
🔗 View source
Reddit Thread 2026-02-04
"We track AI costs per-user and per-feature, not just aggregate spend. The key is treating token usage like any other cloud resource, instrument it, track it, set alerts. For unit economics, we log every LLM call with: user_id, feature, model, tokens, cost."
🔗 View source
GitHub Repo 2025 (active)
"Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Comprehensive Observability: Track your AI agents' performance, user interactions, and API usage. Cost Control: Monitor and manage your spend on LLM and API calls."
🔗 View source
GitHub Repo 2025-2026
"Open source cost intelligence proxy for AI agents. Cut costs ~80% with smart model routing. Budget check (daily/hourly/per-request limits), Anomaly detection (velocity, cost spike, loops), Auto-downgrade (if budget threshold breached)"
🔗 View source
Strong Signal High Priority Confirm

Minimize time to validate changes didn't break existing workflows (automated regression detection)

Evidence (5 sources):

HN Discussion 2026-01-07
"Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models."
🔗 View source
ArXiv Paper 2026-03-XX
"Consider an enterprise deploying an AI agent for customer support ticket routing. On Monday, after a prompt refinement, the agent correctly routes 93% of tickets. By Wednesday, a model provider silently updates the underlying LLM, and the routing accuracy drops to 71%. No test caught the regression."
🔗 View source
Blog Post 2026-01-12
"These benchmarks run multiple times and aggregate results. Regression does not mean 'the output changed.' For testing our code's behavior given some reasonable model output, that's exactly what we want."
🔗 View source
Blog Post 2026-02-XX
"This is the minimum viable interface for agent regression testing. One command. One config file. Works in any CI system. No accounts. No dashboards. The reason agent testing is broken isn't technical. The tooling is straightforward to build. The reason is cultural."
🔗 View source
Blog Post 2025 (unknown month)
"In this tutorial, you will learn how to use different methods to test the quality of LLM outputs. The methods will work just the same as other LLM-powered use cases, from summarization to RAGs and agents."
🔗 View source
Strong Signal High Priority Protect

Minimize risk of prompt injection attacks (security & detection)

Evidence (6 sources):

GitHub Issue 2026-02-XX
"This is a complete security bypass. If a single-model agent with sudo or AWS API keys gets prompt-injected while you are sleeping, we are talking about full system compromise, leaked SSH keys, and data exfiltration."
🔗 View source
GitHub Repo 2025-2026
"Security framework that protects AI agents from prompt injection, command injection, and Unicode bypass attacks. Built in response to the Clinejection attack that compromised 4,000 developer machines through a malicious GitHub issue."
🔗 View source
GitHub Repo 2025-2026
"Agents Rule of Two: A Practical Approach to AI Agent Security - Meta's Oct 2025 framework stating that agents must satisfy no more than two of: (A) processing untrustworthy inputs, (B) access to sensitive data, (C) ability to change state externally"
🔗 View source
GitHub Repo 2025-2026
"One emerging threat — prompt injection attacks, where the adversary embeds malicious instructions in the web environment that divert the agent to instead perform tasks for the adversary."
🔗 View source
GitHub Issue 2024-XX-XX
"This could lead to content being tampered with, the injection of malicious third-party agents, and unintentionally invoking hacker tools that capture the privacy of users' input questions."
🔗 View source
HN Discussion 2026-01-29
"The security scanner is the part I'm most proud of. 35+ patterns detect prompt injection, jailbreaks, system prompt spoofing, shell injection (rm -rf, curl | sh), and Unicode obfuscation. It's not just regex — context-aware scoring so 'ignore' inside a code comment doesn't false positive"
🔗 View source
Strong Signal Medium Priority Monitor

Detect when agent output quality degrades over time (quality regression monitoring)

Evidence (3 sources):

HN Discussion 2026-03-01
"A support agent that began misclassifying refund requests as product questions, which meant customers never reached the refund flow. A document drafting agent that would occasionally hallucinate missing sections when parsing long specs. There's no stack trace or 500 error and you only figure this out when a customer is angry."
🔗 View source
HN Discussion 2025-12-15
"Component-specific latency (which component is the bottleneck?), Intermediate states (what was retrieved, what reasoning strategy was chosen), Failure attribution (which specific component caused the bad output?)"
🔗 View source
Blog Post 2025-12-06
"LLM-as-judge is the most common methodology for evaluating the quality or reliability of AI agents. However, it's an expensive workload and each user interaction (trace) can consist of hundreds of interactions (spans)."
🔗 View source
Emerging Pattern Medium Priority Diagnose

Minimize time to understand root cause of agent failures (automated failure analysis)

Evidence (3 sources):

HN Discussion 2026-03-01
"When issues surface, Sentrial diagnoses the root cause by analyzing conversation patterns, model outputs, and tool interactions, then recommends specific fixes."
🔗 View source
HN Discussion 2025-12-15
"The system automatically attributes failures to specific components based on observability data. Routing failures (wrong workflow), Retrieval failures (missed relevant docs), Reasoning failures (wrong strategy), Generation failures (poor output despite good inputs)"
🔗 View source
Blog Post 2025-11-01
"Early termination, skipped checks, and missed human escalations represent the 'silent killer' category in agent reliability frameworks. Implementing multi-stage validators solves the problem by gating every phase: planning, execution, and final output."
🔗 View source
Emerging Pattern Medium Priority Monitor

Increase visibility into multi-agent workflow interactions (complex system tracing)

Evidence (2 sources):

Product Hunt 2025-02-27
"Langfuse can visualize & analyze even complex LLM executions such as agents, nested chains, embedding retrieval and tool usage"
🔗 View source
Testimonial (Arize) 2026-01-XX
"Every request that we send to an agent system is automatically sent to Arize along with the full flow. If we have five agents in the system and each agent calls a number of tools, we log that entire interaction. We then use that data for monitoring and metrics to make sure the agents were called correctly, that they used the right configuration, the right model, the right tools, and so on."
🔗 View source
Emerging Pattern Low Priority Analyze

Increase accuracy of cost attribution by feature/user/workflow (unit economics)

Evidence (1 sources):

Reddit Thread 2026-02-04
"We track AI costs per-user and per-feature, not just aggregate spend. For unit economics, we log every LLM call with: user_id, feature, model, tokens, cost."
🔗 View source
Emerging Pattern Medium Priority Audit

Meet compliance requirements for agent audit trails (regulatory & governance)

Evidence (4 sources):

Blog Post 2026-XX-XX
"Audit-Ready Trail: the deterministic layer logs every authentication attempt and verification result, providing a traceable and audit-ready record required for regulatory scrutiny (e.g., GDPR, HIPAA, CCPA)."
🔗 View source
Blog Post 2025-01-09
"Comprehensive guide to AI agent compliance under the EU AI Act. Covers high-risk classification, human oversight requirements, audit trail infrastructure, and industry-specific obligations for financial services, healthcare, insurance, and government."
🔗 View source
Blog Post 2026-XX-XX
"The observability of 2026 will likely integrate more with governance, risk, and compliance tooling, giving risk officers a dashboard of AI compliance metrics alongside performance metrics."
🔗 View source
Blog Post 2026-02-10
"Prompts and completions flow directly from your container to your LLM provider. ClawStaff does not proxy, log, or store this content. Scoped access controls support least-privilege. Each agent accesses only the integrations and data categories you define. This maps directly to HIPAA minimum necessary, GDPR data minimization, and SOC 2 confidentiality criteria. Audit trail meets logging requirements across frameworks."
🔗 View source
Emerging Pattern Low Priority Optimize

Improve agent performance through systematic benchmarking & comparison

Evidence (2 sources):

Blog Post 2025-XX-XX
"Track metrics like inter-agent communication efficiency, task allocation optimality, and conflict resolution success rates. These indicators reveal how well your agents collaborate rather than just how they perform individually."
🔗 View source
Research Post 2025-06-04
"Benchmarks today target what are now considered core competencies for LLM agents. LLM agents are expected to break down complex problems into bite-sized pieces and generate a plan of action. Developers can now choose from PlanBench, MINT, and IBM's own ACPBench, among others, to test their agents' planning and reasoning chops."
🔗 View source

Competitive Reality

Verified claims about what competitors do and don't offer. Checked .

Decision-level execution trace (WHY agent decided, not just WHAT)

langfuse ◐ Partial
langsmith ◐ Partial
wandb weave ◐ Partial
helicone ✗ Missing
braintrust ◐ Partial
arize ◐ Partial
humanloop ✗ Missing
langwatch ◐ Partial

How we verified:

Langfuse: Shows WHAT (span tree, nested traces) but NOT WHY (decision reasoning). Source: Product Hunt, 2026-02-27.
LangSmith: Excellent span-level tracing for LangChain workflows, but decision reasoning is implicit, not explicit. Source: Product Hunt review, 2026-01-02.
W&B Weave: Comprehensive tracing with OTEL support, but execution trace (what happened), not decision trace (why it happened). Source: wandb.ai.
Helicone: Gateway/proxy model. Logs requests/responses, not agent decision flow. Source: docs.helicone.ai.
Braintrust: Strong tracing for prompts, responses, and tool calls. Does not explicitly surface decision-making rationale. Source: braintrust.dev.
Arize: Comprehensive agent tracing, but focused on execution steps, not explicit decision reasoning. Source: Microsoft Marketplace.
Humanloop: Agent builder focused, not agent observability/debugging focused. Source: humanloop.com/docs.
LangWatch: Claims "down to each decision" but needs hands-on testing to verify. Source: GitHub.

Last checked:

Real-time cost anomaly detection (behavior-based, not just threshold)

langfuse ✗ Missing
langsmith ✗ Missing
wandb weave ✗ Missing
helicone ✗ Missing
braintrust ✗ Missing
arize ✗ Missing
humanloop ✗ Missing
langwatch ✗ Missing

How we verified:

All tools: Log costs and provide dashboards. Langfuse has threshold-based alerts (60-90min lag, organization-level).
None: Have behavior-based anomaly detection (loop detection, cost spikes based on historical patterns).

Last checked:

Automated change validation (regression detection)

langfuse ◐ Partial
langsmith ◐ Partial
wandb weave ◐ Partial
helicone ✗ Missing
braintrust ✓ Has it
arize ✓ Has it
humanloop ✗ Missing
langwatch ✓ Has it

How we verified:

Braintrust: Explicit CI integration and regression detection. Source: braintrust.dev.
Arize: Continuous evals + templates suggests regression detection capability. Source: Arize docs.
LangWatch: Explicit regression dataset + simulation features. Source: LangWatch blog, MarkTechPost.
Langfuse, LangSmith, W&B Weave: Have evaluation frameworks but unclear on automated regression detection.
Helicone, Humanloop: No evaluation/regression testing features.

Last checked:

Prompt injection detection

langfuse ✗ Missing
langsmith ✗ Missing
wandb weave ✗ Missing
helicone ✗ Missing
braintrust ✗ Missing
arize ✗ Missing
humanloop ✗ Missing
langwatch ◐ Partial

How we verified:

LangWatch: "Real-time guardrails" mentioned in docs, but specifics on prompt injection detection not clear.
All others: No explicit prompt injection detection as a built-in feature.
External tools: AgentGuard, ClawSec, WASP benchmark exist but are external, not built into observability platforms.

Last checked:

Quality regression monitoring

langfuse ✓ Has it
langsmith ✓ Has it
wandb weave ✓ Has it
helicone ◐ Partial
braintrust ✓ Has it
arize ✓ Has it
humanloop ◐ Partial
langwatch ✓ Has it

How we verified:

6 of 8: Have quality regression monitoring (Langfuse, LangSmith, W&B, Braintrust, Arize, LangWatch).
2 of 8: Have partial (Helicone, Humanloop).

Last checked:

Root cause analysis automation

langfuse ✗ Missing
langsmith ✗ Missing
wandb weave ✗ Missing
helicone ✗ Missing
braintrust ✓ Has it
arize ✓ Has it
humanloop ✗ Missing
langwatch ✓ Has it

How we verified:

Braintrust, Arize, LangWatch: Have some form of automated failure attribution or root cause analysis.
Others: Provide traces and data, but root cause analysis is manual.

Last checked:

Multi-agent workflow trace

langfuse ✓ Has it
langsmith ✓ Has it
wandb weave ✓ Has it
helicone ✗ Missing
braintrust ✓ Has it
arize ✓ Has it
humanloop ✗ Missing
langwatch ✓ Has it

How we verified:

6 of 8: Support multi-agent workflow tracing (Langfuse, LangSmith, W&B, Braintrust, Arize, LangWatch).
2 of 8: Do not (Helicone, Humanloop).

Last checked:

Cost attribution by feature/user

langfuse ✓ Has it
langsmith ✓ Has it
wandb weave ✓ Has it
helicone ✓ Has it
braintrust ✓ Has it
arize ✓ Has it
humanloop ✗ Missing
langwatch ✓ Has it

How we verified:

7 of 8: Support cost attribution by user/feature (all except Humanloop).

Last checked:

Compliance audit trail

langfuse ◐ Partial
langsmith ✗ Missing
wandb weave ✗ Missing
helicone ✗ Missing
braintrust ✗ Missing
arize ◐ Partial
humanloop ✗ Missing
langwatch ✗ Missing

How we verified:

Langfuse: Self-hosting supports data sovereignty, unclear on immutable audit trail.
Braintrust: SOC 2, GDPR, HIPAA certifications exist, but not clear if agent actions are logged in immutable audit trail.
Arize: Enterprise-grade observability, likely supports audit trails but not explicitly documented.
Others: No compliance audit trail features found.

Last checked:

Performance benchmarking

langfuse ✓ Has it
langsmith ✓ Has it
wandb weave ✓ Has it
helicone ✗ Missing
braintrust ✓ Has it
arize ✓ Has it
humanloop ✗ Missing
langwatch ✓ Has it

How we verified:

6 of 8: Support performance benchmarking (Langfuse, LangSmith, W&B, Braintrust, Arize, LangWatch).
2 of 8: Do not (Helicone, Humanloop).

Last checked:

Interview Guide

Based on pre-research of 100+ sources, three high-priority hypotheses emerged with strong signal. Use these to structure customer discovery interviews with engineering teams shipping AI agents in production.

Strong Signal (10+ sources)

Engineering teams lose significant time debugging agent failures due to lack of decision-level visibility

Questions to ask:

  • Walk me through the last time an agent failed in production. What did you do?
  • How long did it take to diagnose the root cause? What tools did you use?
  • What information was missing that would have helped you debug faster?
  • Do current tools show you WHAT the agent did, or WHY it decided to do it?
  • Have you ever been stuck because you could see the trace but not the reasoning?

✓ What validates this:

  • 3+ teams spending >5 hours/week debugging agent failures
  • Existing tools (Langfuse, LangSmith, W&B) mentioned but insufficient for decision-level debugging
  • Engineers manually inspecting code/prompts because traces don't show reasoning
  • Specific examples of 'I saw the agent called tool X, but I don't know WHY it chose X over Y'

✗ What invalidates this:

  • Teams not running agents in production yet
  • Current tools solve this adequately ('LangSmith waterfall is all I need')
  • Debugging time is minimal (<1 hour/week)
  • Decision reasoning is not important ('I only care about the final output')
Strong Signal (8+ sources)

Production agent teams need real-time cost anomaly detection to catch runaway agent loops before budget damage

Questions to ask:

  • Have you ever had an agent cost spike due to a bug or runaway loop?
  • How do you currently monitor agent costs in production?
  • How long does it take to detect unusual cost patterns?
  • Do you have budget alerts set up? What triggers them?
  • What would 'cost anomaly detection' need to do to be valuable?

✓ What validates this:

  • At least one past incident of unexpected agent cost spike ($50+ unexpected)
  • Current monitoring is reactive (monthly invoices) not proactive (real-time alerts)
  • Desire for per-agent or per-workflow cost tracking with anomaly thresholds
  • Finance or engineering leadership cares about cost governance

✗ What invalidates this:

  • Cost is not a concern ('our scale is too small to worry')
  • Existing tools already solve this ('LiteLLM budget alerts work fine')
  • Manual monitoring is sufficient ('I check the dashboard daily')
Strong Signal (6+ sources)

Security-conscious teams need prompt injection detection because agents with elevated permissions are existential risks

Questions to ask:

  • Do your agents have access to sensitive data or can they take actions (API calls, database writes, file access)?
  • Have you thought about prompt injection risks? How do you mitigate them?
  • What would happen if an attacker could inject malicious instructions into your agent's context?
  • Do you have any security controls around agent inputs/outputs?
  • Would real-time prompt injection detection be valuable? Why or why not?

✓ What validates this:

  • Agents have elevated permissions (database access, API keys, file system access)
  • Security is a concern but no mitigation in place
  • Awareness of Clinejection or similar attacks
  • Desire for automated detection/blocking of suspicious inputs

✗ What invalidates this:

  • Agents are sandboxed with no sensitive access
  • Security team already has robust input filtering/validation
  • Risk is accepted ('we trust our users')