Self-Hosted Langfuse: See Why Your LLM App Lied — Traces, Evals, and Prompt Versions

Your RAG pipeline worked in staging. In production, users get hallucinated invoice totals and nobody can tell whether retrieval failed, the prompt drifted, or GPT-4o-mini just had a bad Tuesday. You're grepping server logs and guessing.

That's the gap Langfuse fills. It's an open-source, MIT-licensed LLM engineering platform for tracing, prompt management, evaluations, and datasets, with roughly 29k GitHub stars and a YC W23 pedigree. Not another chat UI. The backstage tooling for teams shipping AI features.

What it actually does

Langfuse sits beside your application and records what happened on every LLM call — inputs, outputs, latency, token counts, nested spans for retrieval, embeddings, and agent steps.

Observability / tracing. Instrument with the Python or JS SDK, drop in the OpenAI SDK wrapper, or wire LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, and dozens more. Open a trace and see the full tree: user message → retrieval → rerank → model call → post-processing. Click a bad answer and debug the actual failure point.

Prompt management. Version prompts centrally, deploy without redeploying your app (server-side caching keeps latency down), and compare v3 vs v7 when quality regresses.
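The idea behind "deploy without redeploying" can be sketched as a versioned registry where the app asks for a label rather than a hard-coded prompt. This is a conceptual toy, not the Langfuse API: PromptRegistry, its methods, and the prompt name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    versions: dict = field(default_factory=dict)  # name -> list of prompt texts
    labels: dict = field(default_factory=dict)    # (name, label) -> version number

    def publish(self, name, text):
        """Store a new version and return its 1-based version number."""
        history = self.versions.setdefault(name, [])
        history.append(text)
        return len(history)

    def promote(self, name, label, version):
        """Point a label such as "production" at an existing version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self.labels[(name, label)] = version

    def get(self, name, label="production"):
        version = self.labels[(name, label)]
        return self.versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.publish("invoice-summary", "Summarize the invoice for {customer}.")
v2 = registry.publish(
    "invoice-summary",
    "Summarize the invoice for {customer}. Quote totals verbatim.",
)
registry.promote("invoice-summary", "production", v2)

# The app only ever asks for the "production" label.
prompt = registry.get("invoice-summary").format(customer="Acme")

# Rolling back is a registry change, not a code deploy.
registry.promote("invoice-summary", "production", v1)
rolled_back = registry.get("invoice-summary").format(customer="Acme")
```

Because every version stays in the history, comparing an old and a new prompt after a quality regression is a lookup rather than an archaeology project in git.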

Evaluations. LLM-as-judge, code evaluators, user thumbs-up/down, manual labels, or custom pipelines via API. Run evals against datasets before you ship prompt changes to production.

Datasets. Build test sets for regression testing — "does our support bot still answer refund policy correctly after the model swap?"
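A code evaluator run over a small dataset might look like the sketch below. It is plain Python under stated assumptions: the golden cases, the canned answers, and the function names are made up, and the two generate functions stand in for real model calls with two prompt versions.

```python
import re

# Hypothetical golden cases: an input plus the total we expect in the answer.
DATASET = [
    {"input": "Invoice 1001: 2 x 40.00 + 19.99 shipping", "expected_total": "99.99"},
    {"input": "Invoice 1002: 3 x 15.50", "expected_total": "46.50"},
]


def extract_total(answer):
    """Return the last amount with two decimals in the answer, or None."""
    amounts = re.findall(r"\$?(\d+\.\d{2})", answer)
    return amounts[-1] if amounts else None


def total_matches(answer, expected_total):
    """Code evaluator: 1.0 if the extracted total is exactly right, else 0.0."""
    return 1.0 if extract_total(answer) == expected_total else 0.0


def run_eval(generate, dataset):
    """Average score of `generate` across every item in the dataset."""
    scores = [
        total_matches(generate(item["input"]), item["expected_total"])
        for item in dataset
    ]
    return sum(scores) / len(scores)


# Stand-ins for the app running with two different prompt versions.
def generate_current(text):
    canned = {"Invoice 1001": "The total is $99.99.", "Invoice 1002": "Total due: $46.50"}
    return canned[text.split(":")[0]]


def generate_candidate(text):
    canned = {"Invoice 1001": "The total is $99.99.", "Invoice 1002": "Total due: $45.60"}
    return canned[text.split(":")[0]]


current_score = run_eval(generate_current, DATASET)
candidate_score = run_eval(generate_candidate, DATASET)
print(f"current={current_score:.2f} candidate={candidate_score:.2f}")
```

A drop in the candidate's score is the signal to hold the prompt change before it reaches production; in practice you would record those scores alongside the traces instead of printing them.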

Playground. Tweak prompts and model settings against real traces without writing a one-off script. Found a bad trace at 2pm; iterate in the playground at 2:05.

Under the hood it uses ClickHouse for trace storage at scale — we've covered ClickHouse as an analytics database; Langfuse is what many teams run on top of it for LLM-specific workflows.

Langfuse vs the chat apps we've covered

Easy to confuse with our other AI posts — different jobs:

  • Open WebUI, Onyx, Khoj — user-facing chat and search
  • n8n — workflow automation that might call an LLM
  • Langfuse — developer platform that watches those apps (or your custom FastAPI agent) and tells you why they misbehaved

You might run Open WebUI for employees and Langfuse for the team maintaining the pipelines behind it. Or instrument a customer-facing chatbot built in LangChain and use Langfuse as the ops console.

Why self-host?

Traces are your product data. Every Langfuse trace can contain user queries, retrieved document chunks, full model responses, and sometimes API keys that slipped into metadata. Sending that to a US SaaS is a non-starter for many Canadian teams under PIPEDA or client NDAs.

Prompts and eval sets stay internal. Your v14 system prompt and golden test cases are intellectual property you don't want sitting on someone else's Postgres.

Air-gapped and hybrid setups. Self-hosted Langfuse on a Montreal VPS, app in your VPC, models via Azure Canada or local Ollama — traces never leave infrastructure you control.

MIT Community Edition. Core tracing, prompts, evals, and datasets are open source. The enterprise-licensed parts of the repo add SSO and advanced org features; check their pricing if you need those.

Telemetry opt-out. Self-hosted instances phone home basic usage stats to PostHog by default (not trace contents). Set TELEMETRY_ENABLED=false if even that is too much.

What running it takes

Fastest path — clone and compose:

git clone --depth=1 https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up

That pulls the web UI, worker, Postgres, ClickHouse, Redis, and MinIO. Plan real resources: ClickHouse wants RAM and fast disk once trace volume grows. Fine for a dev team on a 4 GB VPS; production with high traffic needs more — docs cover VM sizing, Kubernetes/Helm for serious deployments, and Terraform templates for AWS/Azure/GCP.
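Once the stack is starting, a small readiness check saves you from refreshing the browser. The sketch below uses only the standard library and assumes the web container listens on localhost:3000 and answers a health endpoint at /api/public/health; confirm both against your compose file and the docs before relying on it.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code for url, or 0 if nothing is listening yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server is up but answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(url, attempts=30, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False


# Example call (assumed URL, adjust host and port to your setup):
# wait_until_healthy("http://localhost:3000/api/public/health")
```

The fetch and sleep parameters exist only so the loop can be exercised without a running server; in a deploy script you would call it with the defaults and fail the job if it returns False.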

Set LANGFUSE_BASE_URL to your instance's URL and give the SDK your project API keys. Five lines of Python with the @observe() decorator and you're logging traces.
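A minimal sketch of that decorator in use, under a few assumptions: the import path differs between SDK versions, so the code tries two and falls back to a do-nothing decorator when the SDK isn't installed. The environment variable values are placeholders, and retrieve() and answer() are stand-ins for your own functions.

```python
# Assumed environment, set outside the code (values are placeholders):
#   LANGFUSE_PUBLIC_KEY=<project public key>
#   LANGFUSE_SECRET_KEY=<project secret key>
#   LANGFUSE_BASE_URL=http://localhost:3000

try:
    from langfuse import observe  # import path in recent SDK versions
except ImportError:
    try:
        from langfuse.decorators import observe  # import path in older SDK versions
    except ImportError:
        # Fallback so this sketch still runs without the SDK installed:
        # a decorator factory that returns the function unchanged.
        def observe(*_args, **_kwargs):
            def passthrough(fn):
                return fn
            return passthrough


@observe()
def retrieve(question):
    # Stand-in for a vector search.
    return ["Refunds are accepted within 30 days."]


@observe()
def answer(question):
    context = retrieve(question)  # shows up as a nested step when tracing is on
    # Stand-in for the model call.
    return f"Based on policy: {context[0]}"


print(answer("What is the refund policy?"))
```

Short-lived scripts may exit before queued traces are sent, so check the SDK docs for the flush call that matches your version.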

Put HTTPS and auth in front, because traces are sensitive. Use a VPN, SSO, or IP allowlisting for anything beyond a solo dev machine.

Who it's for (and who should skip it)

Good fit: teams building production LLM features (RAG bots, agents, copilots), agencies shipping AI for clients who want audit trails, anyone who needs "why did this answer suck?" answered in minutes, not days.

Maybe skip it: you're only chatting with Ollama in a browser and not writing code — Open WebUI doesn't need Langfuse. If you have one cron job calling GPT once a week, structured logging might be enough. Langfuse pays off when you have multiple prompts, models, or developers iterating in parallel.

Hosting it in Canada

We run Langfuse stacks on Canadian Docker hosting — sized for ClickHouse growth, TLS, backups on the trace store, and network placement so your app and observability layer stay in the same jurisdiction.

Tell us your trace volume and retention needs — we'll size disk before ClickHouse eats the partition and you're deleting the evidence of last week's outage.
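For a first guess before that conversation, the arithmetic is simple enough to script. Every number below is a hypothetical input, not a Langfuse or ClickHouse figure: measure your own average trace size, and treat the compression ratio (left at 1.0, meaning none) and the headroom factor as assumptions to tune.

```python
def estimate_disk_gb(traces_per_day, avg_trace_kb, retention_days,
                     compression_ratio=1.0, headroom=1.5):
    """Rough disk estimate in GB for a trace store.

    compression_ratio: raw size divided by on-disk size (1.0 = no compression).
    headroom: multiplier for merges, indexes, and growth you didn't plan for.
    """
    raw_kb = traces_per_day * avg_trace_kb * retention_days
    stored_gb = raw_kb / compression_ratio / (1024 ** 2)
    return stored_gb * headroom


# Hypothetical workload: 50k traces a day, 20 KB each, kept for 90 days.
estimate = estimate_disk_gb(traces_per_day=50_000, avg_trace_kb=20, retention_days=90)
print(f"Plan for roughly {estimate:.0f} GB")
```

With those made-up inputs the estimate lands near 129 GB, which is the kind of number worth knowing before you pick a VPS with a 40 GB disk.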

