Langfuse
Graduated · Open-source LLM engineering platform: traces, evals, prompts
Best open-source observability suite. Traces every LLM call with token costs, latency, and the full prompt/response. Self-hostable with Docker. The prompt management UI is excellent.
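For flavour, a minimal sketch of Langfuse's drop-in OpenAI wrapper, assuming the Python SDK is installed and LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST are set in the environment (the model name is illustrative):

```python
# Minimal sketch: the drop-in wrapper traces each call with token usage,
# cost, and latency. Assumes Langfuse credentials in the environment.
from langfuse.openai import OpenAI  # drop-in replacement for openai.OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarise our Q3 incident report."}],
)
print(response.choices[0].message.content)
# The call now appears as a trace in the Langfuse UI.
```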
Phoenix (Arize)
Incubating · Open-source AI observability with embedding and RAG tracing
Strongest tool for RAG evaluation: UMAP visualisation of embeddings, retrieval quality scoring, and hallucination detection. Runs locally or in Arize Cloud.
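A sketch of the local workflow, assuming the arize-phoenix and openinference-instrumentation-openai packages; the project name is a placeholder:

```python
# Sketch: launch Phoenix locally and route OpenAI calls into it via
# OpenInference instrumentation. "rag-demo" is a placeholder project name.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # serves the Phoenix UI locally (default http://localhost:6006)
tracer_provider = register(project_name="rag-demo")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# From here, OpenAI calls are traced; embeddings can be explored via UMAP in the UI.
```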
Weave (W&B)
Incubating · Weights & Biases' LLM tracing and eval framework
Native W&B integration makes it ideal for teams already using wandb for ML experiments. Excellent trace/eval correlation. Strong Python SDK with minimal boilerplate.
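The minimal-boilerplate claim in practice, per Weave's quickstart pattern; the project path and the stub task are placeholders:

```python
# Sketch: init once, decorate functions with @weave.op, and every call
# is traced in the W&B UI. "my-team/llm-app" is a placeholder project.
import weave

weave.init("my-team/llm-app")

@weave.op()
def classify(ticket: str) -> str:
    # An LLM call would normally go here; a stub keeps the sketch runnable.
    return "billing" if "invoice" in ticket.lower() else "other"

classify("Where is my invoice for March?")  # traced automatically
```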
Helicone
Incubating · LLM gateway with logging, caching, and cost analytics
Proxy-based approach means zero SDK changes: add one header and get logging, caching, and rate-limiting. One-line integration is its superpower. Managed cloud only.
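The one-header integration, following Helicone's documented OpenAI proxy pattern (the caching header is optional):

```python
# Sketch: point the OpenAI client at the Helicone gateway and add one
# auth header. No other SDK changes are needed.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # Helicone's OpenAI proxy
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",  # optional: response caching
    },
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```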
OpenLLMetry
Sandbox · OpenTelemetry-based observability for LLMs
Best choice if your org is OpenTelemetry-native. Routes LLM traces into your existing Grafana/Datadog stack. Vendor-neutral but still maturing on evaluation tooling.
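A sketch of the OTel-native setup, assuming the traceloop-sdk package; the app name is a placeholder:

```python
# Sketch: OpenLLMetry instruments common LLM SDKs and emits standard
# OpenTelemetry spans. Init once; traces flow to whatever OTLP endpoint
# your org already runs (e.g. a Grafana or Datadog collector).
from traceloop.sdk import Traceloop

# TRACELOOP_BASE_URL can point at your own OTLP collector instead of
# Traceloop's cloud; "checkout-service" is a placeholder app name.
Traceloop.init(app_name="checkout-service")
# Subsequent OpenAI/Anthropic/etc. calls are exported as OTel spans.
```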
PromptLayer
Incubating · Prompt version control, A/B testing, and analytics platform
Excellent for letting non-engineers run prompt experiments. The versioned prompt registry is clean. Pairs well with LangChain. Limited eval depth compared to Langfuse.
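A hypothetical sketch of pulling a versioned prompt from the registry; the client and method names here are assumptions based on the SDK's registry concept, not verified signatures:

```python
# Hypothetical sketch: fetch a prompt template from PromptLayer's
# registry. Names and signatures are assumptions, not verified API.
from promptlayer import PromptLayer

pl = PromptLayer(api_key="pl_...")  # placeholder key
template = pl.templates.get("support-reply")  # fetch latest published version
print(template)
```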
Braintrust
Sandbox · End-to-end LLM evaluation and experimentation platform
Fastest SDK for building evaluation pipelines. Dataset versioning, scorer plugins, and team review workflows are polished. Growing fast in the enterprise evaluation space.
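A sketch following Braintrust's quickstart pattern, where data, task, and scorers plug into one Eval call; the project name and toy task are placeholders, and autoevals supplies the scorer:

```python
# Sketch of a Braintrust eval: a data source, a task function, and a
# list of scorers. "demo-project" and the task are placeholders.
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "demo-project",
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    task=lambda input: "Hi " + input,  # stand-in for an LLM call
    scores=[Levenshtein],              # string-similarity scorer
)
```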
Arize AI
Incubating · ML and LLM observability platform with drift detection and evaluation
Enterprise-grade observability covering both traditional ML and LLM workloads. Embedding drift detection, evaluation dashboards, and automated monitors. Pairs with Phoenix for open-source tracing.
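A heavily hedged sketch of logging predictions with the Arize pandas client; the column names, keys, and exact Schema fields are assumptions drawn from the SDK's documented pattern, not verified here:

```python
# Hedged sketch: log labelled predictions to Arize for monitoring and
# drift detection. Keys, columns, and Schema fields are assumptions.
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

client = Client(space_key="SPACE_KEY", api_key="API_KEY")  # placeholders
df = pd.DataFrame({"pred_id": ["1"], "prediction": ["spam"], "actual": ["ham"]})
schema = Schema(
    prediction_id_column_name="pred_id",
    prediction_label_column_name="prediction",
    actual_label_column_name="actual",
)
client.log(
    dataframe=df,
    model_id="email-filter",    # placeholder model
    model_version="v1",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=schema,
)
```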