October 1, 2025

Langfuse

Langfuse is an open-source observability and analytics layer for LLM applications. It lets you see every request and response, measure latency and cost, inspect prompts and generations, version and A/B test prompts, attach user feedback and evaluations, and trace multi-step workflows across services. You can self-host it or use the hosted cloud.

Think of Langfuse as a flight recorder for your AI app. It captures inputs, outputs, metadata, costs, and timing so you can debug failures, improve prompts, and prove quality with real data.

Example


Goal
User types the prompt: “Explain Langfuse in one line.”
You already call an LLM once and return its text. With Langfuse you swap in a wrapped client and add a few lines so the call is captured as a trace with a logged generation.

Python sketch

# before
from openai import OpenAI

client = OpenAI()

def answer(user_input: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        temperature=0.2,
    )
    return resp.choices[0].message.content


# after, with Langfuse (v2-style Python SDK)
from langfuse.openai import OpenAI  # drop-in wrapper with the same API as openai.OpenAI

# The wrapper reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and optionally
# LANGFUSE_HOST from the environment, then logs every completion automatically.
client = OpenAI()  # same call site as before

def answer(user_input: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
        temperature=0.2,
        # Langfuse-specific kwargs; the wrapper strips them before calling OpenAI
        # and records a trace with timings, token counts, and estimated cost.
        name="simple-chat",
        user_id="u_123",
    )
    return resp.choices[0].message.content

# For multi-step workflows you can instead create traces and spans manually
# with Langfuse().trace(...) and trace.span(...), as in the RAG sketch below.

What the UI shows for this run
  1. A trace named simple-chat with a single span called chat-completion

  2. Input text exactly as the user typed it
    User input
    “Explain Langfuse in one line.”

  3. Model settings and metadata
    Model name, temperature, latency, token counts, estimated cost

  4. Output text from the model
    “Langfuse is an open source observability and analytics layer that lets you trace, evaluate, and improve LLM apps.”

  5. Optional feedback and scores
    You can click thumbs up or down, attach a numeric score, and tag it

  6. Prompt visibility
    The full prompt with variables resolved, version label, and diffs over time

UI output

Trace Overview
  Trace name   simple-chat
  Status       success
  User ID      u_123
  Started at   2025-09-30 10:22:03 PT
  Duration     0.52 s
  Request ID   req_3c0a1f

Timeline
  chat-completion   start 0.00 s   end 0.52 s   duration 0.52 s

Input
  User input: “Explain Langfuse in one line.”

Output
  Model answer: “Langfuse is an open source observability and analytics layer for LLM apps that lets you trace, evaluate, and improve them.”

Prompt & Messages
  role user   content “Explain Langfuse in one line.”

Model and Settings
  Provider      OpenAI-compatible client via Langfuse SDK wrap
  Model         gpt-4o-mini
  Temperature   0.2
  Max tokens    256
  Stop          default

Tokens and Cost (estimated)
  Input tokens    29
  Output tokens   42
  Total tokens    71
  Unit pricing    0.15 USD per 1K input, 0.60 USD per 1K output (example pricing used for estimation)
  Estimated cost  input 29 ÷ 1000 × 0.15 = 0.00435 USD
                  output 42 ÷ 1000 × 0.60 = 0.02520 USD
                  total ≈ 0.02955 USD

Metadata
  Route            POST /api/chat
  Environment      production
  App version      1.4.2
  Prompt version   v3.1
  Tags             quickcheck, single-turn, demo

Feedback and Evals
  Thumbs up, user comment “Clear and accurate”
  Scores: Correctness 1.0, Clarity 5 of 5, Style adherence pass
  Notes: No safety issues detected

Errors and Retries
  No errors, 0 retries

Search and Filter
  Facets: user_id u_123, model gpt-4o-mini, latency under 1 s, prompt_version v3.1, tag demo
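The cost figures are simple arithmetic over token counts. A quick check of the estimate above, using the example unit prices (not real rates):

# Reproduce the cost estimate from the trace above.
input_tokens, output_tokens = 29, 42
price_in, price_out = 0.15, 0.60  # USD per 1K tokens (example pricing, not real rates)

cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
print(f"{cost:.5f} USD")  # prints 0.02955 USD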

What it can do for you

  1. Centralized tracing across services and functions

  2. Prompt management with versioning and A/B tests (see the prompt sketch after this list)

  3. Automatic cost and token accounting

  4. Live production analytics such as error rates and latencies

  5. Human and automated evaluations that attach to traces

  6. SDKs for popular stacks in Python and JavaScript, plus simple HTTP APIs

  7. Self-hosted option for stricter data control
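As a quick taste of prompt management (item 2), here is a minimal sketch of fetching a versioned prompt at runtime, assuming the v2-style Python SDK. The prompt name "simple-chat" and its {{tool}} variable are assumptions; you would create them in the Langfuse UI first.

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

# Fetch the currently deployed version of a prompt managed in Langfuse.
# The name "simple-chat" is illustrative.
prompt = lf.get_prompt("simple-chat")

# Resolve template variables into the final prompt text.
text = prompt.compile(tool="Langfuse")

print(prompt.version)  # e.g. 3
print(text)

Because the prompt lives in Langfuse rather than in your codebase, you can roll out a new version, compare it against the old one in traces, and roll back without a deploy.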


Complex scenario

RAG

Imagine a RAG pipeline with retrieval, tool calls, and function routing. Langfuse would stitch these steps into one trace so you can see which documents were retrieved, how long each step took, and which prompt version performed best. The advantage is faster debugging, easier prompt iteration, and concrete data to justify changes.
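A minimal sketch of that instrumentation, assuming the v2-style Python SDK; retrieve, rerank, build_prompt, and call_llm are hypothetical stand-ins for your own pipeline:

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

def rag_answer(question: str) -> str:
    # One trace for the whole pipeline; each step becomes a child observation.
    trace = lf.trace(name="rag-pipeline", user_id="u_123")

    # Step 1: retrieval, wrapped in its own span so its latency shows separately
    retrieval = trace.span(name="retrieve", input=question)
    docs = retrieve(question)  # hypothetical vector-store lookup
    retrieval.end(output=[d.id for d in docs])

    # Step 2: reranking, also timed on its own
    rerank_span = trace.span(name="rerank", input=[d.id for d in docs])
    docs = rerank(question, docs)  # hypothetical reranker
    rerank_span.end(output=[d.id for d in docs])

    # Step 3: generation, logged with model metadata for cost accounting
    prompt = build_prompt(question, docs)  # hypothetical prompt builder
    generation = trace.generation(name="answer", model="gpt-4o-mini", input=prompt)
    output = call_llm(prompt)  # hypothetical LLM call
    generation.end(output=output)

    trace.update(output=output)
    return output

In the UI this shows up as one trace with three timed children, so a slow retrieval step or a regression after a prompt change is immediately visible.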

A/B testing

A simplified toy example that compares two prompt variants on like rate and latency:

from dataclasses import dataclass
from math import ceil
from statistics import median

@dataclass
class Run:
    variant: str     # "A" or "B"
    prompt: str
    user_input: str
    output: str
    latency_ms: int
    liked: bool

short_prompt = "Explain the tool briefly."
long_prompt = (
    "You are a helpful assistant. Explain Langfuse clearly in one concise paragraph. "
    "Include what it is, why it matters, and one concrete benefit."
)

# Two requests per variant.
# A has one liked and one not liked. B has both liked.
runs = [
    Run("A", short_prompt, "What is Langfuse?",
        "Langfuse tracks LLM calls so you can debug and improve.",
        latency_ms=330, liked=True),
    Run("A", short_prompt, "How does it help?",
        "It logs inputs and outputs.",
        latency_ms=310, liked=False),
    Run("B", long_prompt, "What is Langfuse?",
        "Langfuse is an open-source observability layer for LLM apps that captures "
        "inputs, outputs, latency, and cost to help you debug, evaluate, and iterate faster.",
        latency_ms=420, liked=True),
    Run("B", long_prompt, "How does it help?",
        "By recording detailed traces and evaluations, it reveals which prompts "
        "and models perform best and why.",
        latency_ms=480, liked=True),
]

def percentile(values, p):
    # Nearest-rank percentile; robust even for tiny samples like this one.
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered)))
    return ordered[rank - 1]

def summarize(variant: str):
    rs = [r for r in runs if r.variant == variant]
    like_rate = sum(r.liked for r in rs) / len(rs)
    latencies = [r.latency_ms for r in rs]
    return {
        "variant": variant,
        "n": len(rs),
        "like_rate": like_rate,  # proportion of runs the user liked
        "median_latency_ms": int(median(latencies)),
        "p95_latency_ms": percentile(latencies, 0.95),
    }

sumA = summarize("A")
sumB = summarize("B")

print("A/B Summary")
print(f"Variant A: n={sumA['n']}, like_rate={sumA['like_rate']:.2f}, "
      f"median_latency_ms={sumA['median_latency_ms']}, p95_latency_ms={sumA['p95_latency_ms']}")
print(f"Variant B: n={sumB['n']}, like_rate={sumB['like_rate']:.2f}, "
      f"median_latency_ms={sumB['median_latency_ms']}, p95_latency_ms={sumB['p95_latency_ms']}")

# Simple decision rule example
if sumB["like_rate"] > sumA["like_rate"]:
    print("B wins on user preference in this sample.")
elif sumB["like_rate"] < sumA["like_rate"]:
    print("A wins on user preference in this sample.")
else:
    print("Tie on user preference in this sample.")

# Expected console output
#
# A/B Summary
# Variant A: n=2, like_rate=0.50, median_latency_ms=320, p95_latency_ms=330
# Variant B: n=2, like_rate=1.00, median_latency_ms=450, p95_latency_ms=480
# B wins on user preference in this sample.
#
# Variant A is faster but only half liked; variant B is slightly slower but
# preferred on every run, so the simple rule declares B the winner.
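In a real deployment you would not hand-maintain a runs list: each request becomes a Langfuse trace tagged with its variant, and the thumbs up or down becomes a score attached to that trace. A hedged sketch of the logging side, again assuming the v2-style SDK (the trace name and tag format are illustrative):

from langfuse import Langfuse

lf = Langfuse()  # reads LANGFUSE_* env vars

def log_run(run: Run) -> None:
    # One trace per request, tagged with the prompt variant so the UI
    # can filter and compare A against B.
    trace = lf.trace(
        name="ab-test",
        tags=[f"variant:{run.variant}"],
        input=run.user_input,
        output=run.output,
    )
    # Record the thumbs up or down as a 0/1 score on the trace.
    lf.score(trace_id=trace.id, name="liked", value=1.0 if run.liked else 0.0)

With that in place, the like-rate and latency comparison above can be run over real production traces instead of a toy list.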



Similar products and how they compare

  1. LangSmith, from LangChain, focuses on deep integration with LangChain projects. Langfuse is framework agnostic and open source, with a self-hosting option many teams prefer

  2. Humanloop offers prompt management and evaluation. Langfuse emphasizes tracing and production analytics with a strong open source story

  3. Arize Phoenix and Weights & Biases Weave lean into broader ML observability and experiment tracking. Langfuse stays tightly focused on LLM app telemetry, with simple SDKs and a light footprint

  4. OpenAI built in usage dashboards are convenient for high level counts. Langfuse provides per request traces, prompt versions, evaluations, and joinable metadata that you control

Official documentation

https://langfuse.com/
