Self-Hosted Maxun: No-Code Web Scraping When There's No API

Your analyst needs competitor pricing from twelve sites that don't have APIs. Your options: hire a dev to write Playwright scripts, pay a scraping SaaS per thousand rows, or watch someone paste data from ChatGPT that hallucinated the Q3 numbers. None of those scale when the client asks for weekly updates.

Maxun sits in the middle. ~16k GitHub stars, AGPL-3.0, and a no-code platform for web extraction — record your clicks in the browser, describe what you want in plain English for LLM-powered scraping, or crawl whole sites into structured data. Turn the results into APIs, spreadsheets, or clean Markdown for AI pipelines.

What it actually does

Maxun organizes work around robots — automated assistants that browse sites the way you would, but on a schedule.

Extract. Recorder mode watches you navigate a site and converts actions into a reusable robot — click listings, paginate, grab fields. AI mode lets you describe targets in natural language ("top 50 movies with rating and runtime from IMDb") and uses LLMs to structure the output.

Scrape. Full pages to Markdown or HTML plus screenshots — useful for feeding RAG pipelines or archiving competitor landing pages without manual copy-paste.
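
If that Markdown is headed for a RAG pipeline, it usually gets split into sections before embedding. A minimal sketch in Python, assuming the scraped page is already in hand as a Markdown string. The function name is illustrative and not part of Maxun, and it deliberately ignores edge cases such as "#" lines inside code blocks:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX headings: "# Title" to "###### Title"


def split_markdown_by_heading(markdown):
    """Split Markdown into (heading, body) pairs, one pair per heading."""
    sections = []
    heading, body = "", []

    def flush():
        # Keep a section only if it has a heading or some non-blank text.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Each pair can then be embedded separately, with the heading kept as metadata so retrieved chunks still say which part of the page they came from.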

Crawl. Walk entire sites within scope rules you define — sitemap discovery, depth limits, URL filters. Documentation sites, product catalogs, directory listings.
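
Scope rules of this kind boil down to a yes/no decision per discovered URL. The sketch below illustrates that decision in Python. The function and its parameters are hypothetical and are not Maxun's configuration format:

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def in_scope(url, depth, *, allowed_host, max_depth,
             include_patterns, exclude_patterns=()):
    """Return True if a discovered URL should be crawled (hypothetical rules)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    if parsed.hostname != allowed_host:
        return False  # stay on the target site
    if depth > max_depth:
        return False  # depth limit reached
    path = parsed.path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False  # exclusions win over inclusions
    return any(fnmatchcase(path, pattern) for pattern in include_patterns)
```

For a documentation crawl, that might mean include_patterns=["/docs/*"] with a max_depth of 3, so the crawler never wanders into the blog or the login pages.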

Search. Automated web search with time filters — discover URLs first, then scrape what you find.

APIs and integrations. Expose robots as REST endpoints, export to Google Sheets or Airtable, schedule runs, handle logins for authenticated pages. SDK and CLI for developers who want to trigger robots from code or wire them into n8n. MCP support for AI agent tooling.
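
As a rough idea of what triggering a robot from code could look like, here is a Python sketch. The endpoint path and the header name are assumptions made for illustration, not confirmed Maxun API details, so check the project's API docs for the real ones. The function only builds the request and does not send it:

```python
from urllib.request import Request


def build_run_request(base_url, robot_id, api_key):
    """Build (but do not send) a POST request that would start a robot run."""
    # ASSUMPTION: "/api/robots/<id>/runs" and the "x-api-key" header are
    # placeholders for illustration; they are not taken from the Maxun docs.
    url = f"{base_url.rstrip('/')}/api/robots/{robot_id}/runs"
    return Request(url, method="POST", headers={"x-api-key": api_key})
```

Passing the result to urllib.request.urlopen would actually fire the run. Keeping the build step separate makes it easy to log or inspect the request first.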

Under the hood: Playwright/Chromium for real browser rendering, PostgreSQL for state, MinIO for object storage — pagination, scrolling, and layout-change recovery built in.

Maxun vs changedetection vs Airbyte

We've covered adjacent data tools:

  • changedetection.io — alerts when a page changes; great for price watches, not for bulk structured extraction
  • Airbyte — ELT from official APIs and databases into warehouses; useless when the vendor never shipped an API

Maxun is for getting structured data off the open web when no connector exists. Pair changedetection for "tell me when it changes" with Maxun for "pull the whole table every Monday."

Why self-host?

Scraping targets are business intelligence. Competitor URLs, login flows, extracted pricing: that list is sensitive. Self-hosting on a Canadian VPS keeps robots, credentials, and output off a multi-tenant SaaS scraper in another country.

Control browser egress. Run Maxun behind your own proxies, rotate IPs on your terms, and limit runs to sites you're authorized to scrape, instead of relying on shared infrastructure that gets blocked because another customer hammered the same domain.

LLM routing you choose. AI extraction mode needs model access. Point it at your own endpoint so page content isn't sent to a third-party API under whatever data-retention and training policy that provider applies by default.

Scrape responsibly. The AGPL license covers what you may do with Maxun's code. It doesn't grant permission to ignore robots.txt, terms of service, or Canadian computer misuse laws. Only automate sites you're allowed to access. Maxun is a power tool; legal clearance is still on you.
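
A robots.txt check is easy to automate before a robot ever runs. This minimal sketch uses Python's standard urllib.robotparser and assumes you have already fetched the site's robots.txt as text. The helper name and the user agent string are illustrative:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call
    return parser.can_fetch(user_agent, url)
```

This only answers the robots.txt question. Terms of service and legal authorization still need a human to read them.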

What running it takes

Docker Compose stacks PostgreSQL, MinIO, backend (Playwright), and frontend:

mkdir maxun && cd maxun
# Copy docker-compose.yml and .env from the repo
openssl rand -base64 48  # run twice: one value each for JWT_SECRET and SESSION_SECRET
openssl rand -hex 32     # ENCRYPTION_KEY
docker compose up -d
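
If you'd rather script the secret generation than copy values by hand, the same three values can be produced with Python's standard library. This is a sketch that mirrors the openssl commands above (48 random bytes as base64, 32 random bytes as hex). The variable names come from the steps above, and the exact format Maxun expects should be checked against the repo's example .env:

```python
import base64
import secrets


def generate_env_secrets():
    """Generate one independent random value per secret."""
    return {
        # Equivalent of: openssl rand -base64 48
        "JWT_SECRET": base64.b64encode(secrets.token_bytes(48)).decode("ascii"),
        "SESSION_SECRET": base64.b64encode(secrets.token_bytes(48)).decode("ascii"),
        # Equivalent of: openssl rand -hex 32
        "ENCRYPTION_KEY": secrets.token_hex(32),
    }


if __name__ == "__main__":
    # Print KEY=value lines ready to paste into .env
    for name, value in generate_env_secrets().items():
        print(f"{name}={value}")
```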

Frontend defaults to port 5173, backend to 8080. Set PUBLIC_URL and BACKEND_URL to your real domain before going public. The backend container wants serious resources — 6 GB memory limit and 2 GB shared memory for headless Chromium in the default compose file. Scraping is not a Raspberry Pi hobby.
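
A forgotten localhost URL is an easy mistake to ship. The following is a small pre-flight check in Python, assuming your .env values are already loaded into a dict. The helper is hypothetical and is not something Maxun provides:

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}


def check_public_urls(env):
    """Return a list of problems found in PUBLIC_URL and BACKEND_URL."""
    problems = []
    for name in ("PUBLIC_URL", "BACKEND_URL"):
        host = urlparse(env.get(name, "")).hostname
        if not host:
            problems.append(f"{name} is missing or not a full URL")
        elif host in LOCAL_HOSTS:
            problems.append(f"{name} still points at {host}")
    return problems
```

An empty list means both values look like real hostnames. Anything else is worth fixing before you open the firewall.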

The project is still early-stage per the README, so pin image tags, read the upgrade docs, and expect rough edges. A hosted option exists at app.maxun.dev if you'd rather pay for ops.

Who it's for (and who should skip it)

Good fit: market research teams without dedicated scraping devs, agencies building lead lists from public directories, ops automating weekly data pulls from sites without APIs, AI teams needing clean Markdown from live web pages.

Maybe skip it if: the site offers a proper API (use Airbyte or a cron script instead); you only need change alerts (changedetection is lighter); or you won't allocate 6 GB of RAM and maintain Playwright (Maxun Cloud or a simpler fetch script may fit better).

Hosting it in Canada

Maxun wants CPU, RAM, and reliable egress. We run scraping stacks on Canadian Docker hosting with PostgreSQL and MinIO volumes, TLS on the UI, and firewall rules that match your compliance story.

Tell us your robot count and schedule frequency — we'll size the box before your first crawl saturates the host.

