October 1, 2025

Langfuse

Langfuse is an open-source observability and analytics layer for LLM applications. It lets you see every request and response, measure latency and cost, inspect prompts and generations, version and A/B test prompts, attach user feedback and evaluations, and trace multi-step workflows across services. You can self-host it or use the hosted cloud.

Think of Langfuse as a flight recorder for your AI app. It captures inputs, outputs, metadata, costs, and timing so you can debug failures, improve prompts, and prove quality with real data.

Example


Goal
User types the prompt: “Explain Langfuse in one line.”
You already call an LLM once and return its text. With Langfuse you swap in a wrapped client and add a few lines so the call is captured as a trace with a logged generation.

Python sketch

# before
from openai import OpenAI

client = OpenAI()

def answer(user_input: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        temperature=0.2,
    )
    return resp.choices[0].message.content


# after, with Langfuse (v2-style Python SDK)
from langfuse.openai import OpenAI  # drop-in wrapper with the same API as openai.OpenAI

# The wrapper reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and optionally
# LANGFUSE_HOST from the environment, then logs every completion automatically.
client = OpenAI()  # same call site as before

def answer(user_input: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        temperature=0.2,
        # Langfuse-specific kwargs; the wrapper strips them before calling OpenAI
        # and records a trace with timings, token counts, and estimated cost.
        name="simple-chat",
        user_id="u_123",
    )
    return resp.choices[0].message.content

# For multi-step workflows you can instead create traces and spans manually
# with Langfuse().trace(...) and trace.span(...), as in the RAG sketch below.

What the UI shows for this run
  1. A trace named simple-chat with a single span called chat-completion

  2. Input text exactly as the user typed it
    User input
    “Explain Langfuse in one line.”

  3. Model settings and metadata
    Model name, temperature, latency, token counts, estimated cost

  4. Output text from the model
    “Langfuse is an open source observability and analytics layer that lets you trace, evaluate, and improve LLM apps.”

  5. Optional feedback and scores
    You can click thumbs up or down, attach a numeric score, and tag it

  6. Prompt visibility
    The full prompt with variables resolved, version label, and diffs over time

UI output

Trace Overview
  Trace name   simple-chat
  Status       success
  User ID      u_123
  Started at   2025-09-30 10:22:03 PT
  Duration     0.52 s
  Request ID   req_3c0a1f

Timeline
  chat-completion   start 0.00 s   end 0.52 s   duration 0.52 s

Input
  User input: “Explain Langfuse in one line.”

Output
  Model answer: “Langfuse is an open source observability and analytics layer for LLM apps that lets you trace, evaluate, and improve them.”

Prompt & Messages
  role user   content “Explain Langfuse in one line.”

Model and Settings
  Provider      OpenAI-compatible client via Langfuse SDK wrap
  Model         gpt-4o-mini
  Temperature   0.2
  Max tokens    256
  Stop          default

Tokens and Cost (estimated)
  Input tokens    29
  Output tokens   42
  Total tokens    71
  Unit pricing    0.15 USD per 1K input, 0.60 USD per 1K output (example pricing used for estimation)
  Estimated cost  input 29 ÷ 1000 × 0.15 = 0.00435 USD
                  output 42 ÷ 1000 × 0.60 = 0.02520 USD
                  total ≈ 0.02955 USD

Metadata
  Route            POST /api/chat
  Environment      production
  App version      1.4.2
  Prompt version   v3.1
  Tags             quickcheck, single-turn, demo

Feedback and Evals
  Thumbs up, user comment “Clear and accurate”
  Scores: Correctness 1.0, Clarity 5 of 5, Style adherence pass
  Notes: No safety issues detected

Errors and Retries
  No errors, 0 retries

Search and Filter
  Facets: user_id u_123, model gpt-4o-mini, latency under 1 s, prompt_version v3.1, tag demo
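The cost figures are simple arithmetic over token counts. A quick check of the estimate above, using the example unit prices (not real rates):

# Reproduce the cost estimate from the trace above.
input_tokens, output_tokens = 29, 42
price_in, price_out = 0.15, 0.60  # USD per 1K tokens (example pricing, not real rates)

cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
print(f"{cost:.5f} USD")  # prints 0.02955 USD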

What it can do for you

  1. Centralized tracing across services and functions

  2. Prompt management with versioning and A/B tests (see the prompt sketch after this list)

  3. Automatic cost and token accounting

  4. Live production analytics such as error rates and latencies

  5. Human and automated evaluations that attach to traces

  6. SDKs for popular stacks in Python and JavaScript, plus simple HTTP APIs

  7. Self-hosted option for stricter data control
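As a quick taste of prompt management (item 2), here is a minimal sketch of fetching a versioned prompt at runtime, assuming the v2-style Python SDK. The prompt name "simple-chat" and its {{tool}} variable are assumptions; you would create them in the Langfuse UI first.

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

# Fetch the currently deployed version of a prompt managed in Langfuse.
# The name "simple-chat" is illustrative.
prompt = lf.get_prompt("simple-chat")

# Resolve template variables into the final prompt text.
text = prompt.compile(tool="Langfuse")

print(prompt.version)  # e.g. 3
print(text)

Because the prompt lives in Langfuse rather than in your codebase, you can roll out a new version, compare it against the old one in traces, and roll back without a deploy.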


Complex scenario

RAG

Imagine a RAG pipeline with retrieval, tool calls, and function routing. Langfuse would stitch these steps into one trace so you can see which documents were retrieved, how long each step took, and which prompt version performed best. The advantage is faster debugging, easier prompt iteration, and concrete data to justify changes.
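A minimal sketch of that instrumentation, assuming the v2-style Python SDK; retrieve, rerank, build_prompt, and call_llm are hypothetical stand-ins for your own pipeline:

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

def rag_answer(question: str) -> str:
    # One trace for the whole pipeline; each step becomes a child observation.
    trace = lf.trace(name="rag-pipeline", user_id="u_123")

    # Step 1: retrieval, wrapped in its own span so its latency shows separately
    retrieval = trace.span(name="retrieve", input=question)
    docs = retrieve(question)  # hypothetical vector-store lookup
    retrieval.end(output=[d.id for d in docs])

    # Step 2: reranking, also timed on its own
    rerank_span = trace.span(name="rerank", input=[d.id for d in docs])
    docs = rerank(question, docs)  # hypothetical reranker
    rerank_span.end(output=[d.id for d in docs])

    # Step 3: generation, logged with model metadata for cost accounting
    prompt = build_prompt(question, docs)  # hypothetical prompt builder
    generation = trace.generation(name="answer", model="gpt-4o-mini", input=prompt)
    output = call_llm(prompt)  # hypothetical LLM call
    generation.end(output=output)

    trace.update(output=output)
    return output

In the UI this shows up as one trace with three timed children, so a slow retrieval step or a regression after a prompt change is immediately visible.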

A/B testing

A simplified toy example that compares two prompt variants on like rate and latency:

from dataclasses import dataclass
from math import ceil
from statistics import median

@dataclass
class Run:
    variant: str     # "A" or "B"
    prompt: str
    user_input: str
    output: str
    latency_ms: int
    liked: bool

short_prompt = "Explain the tool briefly."
long_prompt = (
    "You are a helpful assistant. Explain Langfuse clearly in one concise paragraph. "
    "Include what it is, why it matters, and one concrete benefit."
)

# Two requests per variant.
# A has one liked and one not liked. B has both liked.
runs = [
    Run("A", short_prompt, "What is Langfuse?",
        "Langfuse tracks LLM calls so you can debug and improve.",
        latency_ms=330, liked=True),
    Run("A", short_prompt, "How does it help?",
        "It logs inputs and outputs.",
        latency_ms=310, liked=False),
    Run("B", long_prompt, "What is Langfuse?",
        "Langfuse is an open-source observability layer for LLM apps that captures "
        "inputs, outputs, latency, and cost to help you debug, evaluate, and iterate faster.",
        latency_ms=420, liked=True),
    Run("B", long_prompt, "How does it help?",
        "By recording detailed traces and evaluations, it reveals which prompts "
        "and models perform best and why.",
        latency_ms=480, liked=True),
]

def percentile(values, p):
    # Nearest-rank percentile; robust even for tiny samples like this one.
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered)))
    return ordered[rank - 1]

def summarize(variant: str):
    rs = [r for r in runs if r.variant == variant]
    like_rate = sum(r.liked for r in rs) / len(rs)
    latencies = [r.latency_ms for r in rs]
    return {
        "variant": variant,
        "n": len(rs),
        "like_rate": like_rate,  # proportion of runs the user liked
        "median_latency_ms": int(median(latencies)),
        "p95_latency_ms": percentile(latencies, 0.95),
    }

sumA = summarize("A")
sumB = summarize("B")

print("A/B Summary")
print(f"Variant A: n={sumA['n']}, like_rate={sumA['like_rate']:.2f}, "
      f"median_latency_ms={sumA['median_latency_ms']}, p95_latency_ms={sumA['p95_latency_ms']}")
print(f"Variant B: n={sumB['n']}, like_rate={sumB['like_rate']:.2f}, "
      f"median_latency_ms={sumB['median_latency_ms']}, p95_latency_ms={sumB['p95_latency_ms']}")

# Simple decision rule example
if sumB["like_rate"] > sumA["like_rate"]:
    print("B wins on user preference in this sample.")
elif sumB["like_rate"] < sumA["like_rate"]:
    print("A wins on user preference in this sample.")
else:
    print("Tie on user preference in this sample.")

# Expected console output
#
# A/B Summary
# Variant A: n=2, like_rate=0.50, median_latency_ms=320, p95_latency_ms=330
# Variant B: n=2, like_rate=1.00, median_latency_ms=450, p95_latency_ms=480
# B wins on user preference in this sample.
#
# Variant A is faster but only half liked; variant B is slightly slower but
# preferred on every run, so the simple rule declares B the winner.
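In a real deployment you would not hand-maintain a runs list: each request becomes a Langfuse trace tagged with its variant, and the thumbs up or down becomes a score attached to that trace. A hedged sketch of the logging side, again assuming the v2-style SDK (the trace name and tag format are illustrative):

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

def log_run(run: Run) -> None:
    # One trace per request, tagged with the prompt variant so the UI
    # can filter and compare A against B.
    trace = lf.trace(
        name="ab-test",
        tags=[f"variant:{run.variant}"],
        input=run.user_input,
        output=run.output,
    )
    # Record the thumbs up or down as a 0/1 score on the trace.
    lf.score(trace_id=trace.id, name="liked", value=1.0 if run.liked else 0.0)

With that in place, the like-rate and latency comparison above can be run over real production traces instead of a toy list.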



Similar products and how they compare

  1. LangSmith, from LangChain, focuses on deep integration with LangChain projects. Langfuse is framework agnostic and open source, with a self-hosting option many teams prefer

  2. Humanloop offers prompt management and evaluation. Langfuse emphasizes tracing and production analytics with a strong open source story

  3. Arize Phoenix and Weights & Biases Weave lean into broader ML observability and experiment tracking. Langfuse stays tightly focused on LLM app telemetry, with simple SDKs and a light footprint

  4. OpenAI built in usage dashboards are convenient for high level counts. Langfuse provides per request traces, prompt versions, evaluations, and joinable metadata that you control

Official documentation

https://langfuse.com/
