That blog post you cited in a client report? Gone. The domain is parked, and the only Wayback Machine capture is incomplete and loads as a broken CSS mess. Your Pocket export has URLs. Half the destinations are 404. You saved the link, not the page.
ArchiveBox fixes the "I wish I'd saved that" problem. It is an open-source, self-hosted web archiver (~28k GitHub stars) that takes URLs, bookmarks, Pocket exports, RSS feeds, or browser history, then saves HTML, PDF, PNG screenshots, WARC, extracted article text, YouTube MP4s, and git clones of linked repos. Everything lands as plain files on disk, readable in twenty years without running ArchiveBox.
What it actually does
Feed ArchiveBox a URL (or a thousand). It runs multiple extractors in parallel and stores everything under data/archive/{id}/ — not a proprietary blob format.
Web pages. Original HTML, SingleFile snapshot, headless Chrome screenshot, PDF printout, wget clone, DOM dump, readability-extracted article text, response headers, favicon.
Media. yt-dlp pulls video, audio, subtitles, and thumbnails from YouTube, SoundCloud, and similar sites into the archive folder.
Code links. GitHub/GitLab URLs get a git clone of the repo at archive time.
Inputs everywhere. CLI one-liners, web UI on port 8000, browser extension for live archiving, Pocket/Pinboard/Instapaper exports, Netscape bookmark HTML, RSS, JSON, CSV, stdin pipes. Schedule recurring imports with archivebox schedule.
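A recurring import might look like the sketch below. It assumes the `archivebox schedule` flags documented for the 0.7.x releases (`--every`, `--depth`). The feed URL is a placeholder, and flag names may differ in your version, so check `archivebox schedule --help` first.

```shell
# Re-import a feed once a day and archive the pages it links to (depth 1).
# The feed URL is a placeholder; substitute your own RSS or bookmark feed.
docker compose run archivebox schedule --every=day --depth=1 'https://example.com/feed.xml'
```

Something still has to run the schedule in the background. Under Docker Compose that is typically a separate scheduler service, so check whether your compose file defines one.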
Under the hood: Chrome/Chromium, wget, yt-dlp, SingleFile, Readability — standard tools, not a black box. SQLite index plus folders you can grep, rsync, or mount on NFS.
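Because snapshots are ordinary files, ordinary tools work on them. Here is a minimal sketch of a full-text search with grep, assuming only the data/archive/{id}/ layout described above. The function name `search_snapshots` is made up for illustration and is not part of ArchiveBox.

```shell
# search_snapshots DATA_DIR PATTERN
# Prints the id of every snapshot folder under DATA_DIR/archive/ that contains
# at least one file matching PATTERN (fixed string, case-insensitive).
search_snapshots() {
    (
        cd "$1/archive" 2>/dev/null || exit 1
        # -r recurse, -l print matching file names only,
        # -i ignore case, -F treat PATTERN as a literal string
        grep -rliF -- "$2" . |
            cut -d/ -f2 |   # ./<id>/file.html -> <id>
            sort -u         # one line per snapshot
    )
}
```

Usage: `search_snapshots data 'link rot'` lists the snapshots that mention the phrase. The same folders can be copied elsewhere with rsync without any ArchiveBox-specific tooling.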
ArchiveBox vs changedetection.io
We covered changedetection.io for watching pages change — price drops, restocks, defacement alerts.
ArchiveBox is preservation, not monitoring. You archive once (or on a schedule) and keep a durable snapshot. changedetection tells you "the price changed Tuesday." ArchiveBox tells you "here's what the page looked like Tuesday, in PDF and HTML, forever." Different jobs; some teams run both.
Why self-host?
Your bookmarks are private. Browser history, legal research, client competitor pages, paywalled content you archive with your own cookies — that collection doesn't belong on a US SaaS with a vague retention policy.
PIPEDA and evidence. Lawyers and journalists use ArchiveBox for chain-of-custody on web evidence. Self-hosted in Canada means you control who accesses the archive and where disks live.
Disk is cheap; regret is expensive. Pocket shut down features. Twitter became X. Forums die. ArchiveBox lets you own copies before link rot wins — and stores them in formats that outlive any single vendor.
Stealth mode. By default ArchiveBox also submits to archive.org for redundancy. Disable that for local-only, air-gapped, or sensitive collections via config.
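Turning that off might look like this. The option name `SAVE_ARCHIVE_DOT_ORG` is the one documented for the 0.7.x releases, so treat it as an assumption and confirm it against the configuration docs for the version you run.

```shell
# Stop submitting archived URLs to archive.org (local-only collection).
docker compose run archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
# Print the stored value to confirm the change took effect.
docker compose run archivebox config --get SAVE_ARCHIVE_DOT_ORG
```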
What running it takes
Docker Compose is the recommended path — bundles Chrome, yt-dlp, and friends:
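# Create a working directory for the compose file and the data/ volume
mkdir -p ~/archivebox/data && cd ~/archivebox
# Download the project's Compose file
curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
# Initialize the collection (index and folder layout) inside data/
docker compose run archivebox init
# Start the web server
docker compose up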
Web UI at http://archivebox.localhost:8000 (admin subdomain for management). Add URLs via CLI:
# Archive a single URL
docker compose run archivebox add 'https://example.com'
# Or pipe URLs in on stdin (-T disables the pseudo-TTY so the pipe works)
echo 'https://another-page.com' | docker compose run -T archivebox add
Resource reality: headless Chrome and yt-dlp are hungry. Plan 2–4 GB RAM minimum for light personal use; serious bookmark imports and video archiving want more CPU, RAM, and fast disk. Archives grow — budget storage like you would for a photo library, because video snapshots add up fast.
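One way to budget disk is to archive a representative sample first, measure it, and extrapolate. The sketch below assumes the data/archive/{id}/ layout mentioned earlier. Both function names are hypothetical helpers, not ArchiveBox commands, and the projection is only as good as the sample (a few videos will skew it heavily).

```shell
# avg_snapshot_kb DATA_DIR
# Prints the mean on-disk size, in KiB, of the snapshot folders under
# DATA_DIR/archive/ (0 if there are none).
avg_snapshot_kb() {
    du -sk "$1"/archive/*/ 2>/dev/null |
        awk '{ total += $1; n++ } END { if (n) printf "%d\n", total / n; else print 0 }'
}

# project_gb AVG_KB URL_COUNT
# Projects total storage in GiB for URL_COUNT snapshots of AVG_KB KiB each.
project_gb() {
    awk -v kb="$1" -v n="$2" 'BEGIN { printf "%.1f\n", kb * n / 1024 / 1024 }'
}
```

Usage: `project_gb "$(avg_snapshot_kb data)" 5000` estimates the space a 5,000-link import would need at your current average.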
Put authentication in front of it if it is exposed beyond localhost. Config options can lock down the public index, snapshot, and add views. Logged-in archiving (importing Chrome cookies via personas) is powerful and sensitive, so read the ArchiveBox security wiki before archiving behind paywalls.
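A locked-down setup could be configured roughly as follows. The `PUBLIC_*` option names are the ones documented for the 0.7.x releases, so take them as an assumption and verify them for your version before relying on them.

```shell
# Require login to browse the index, view snapshots, and submit new URLs.
docker compose run archivebox config --set PUBLIC_INDEX=False
docker compose run archivebox config --set PUBLIC_SNAPSHOTS=False
docker compose run archivebox config --set PUBLIC_ADD_VIEW=False
# Create the admin account you will log in with.
docker compose run archivebox manage createsuperuser
```

This only gates the ArchiveBox UI itself. A reverse proxy with TLS is still worth adding for anything reachable from the internet.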
Who it's for (and who should skip it)
Good fit: researchers hoarding papers and threads, journalists preserving cited sources, lawyers collecting web evidence, anyone with 5,000 Pocket links and trust issues, teams building LLM training datasets from archived text.
Maybe skip it: if you only need "tell me when this price changes," changedetection.io is lighter. If you archive three URLs a year, saving pages by hand with the SingleFile browser extension might suffice. ArchiveBox pays off at volume or when you need multiple output formats and scheduled imports.
Hosting it in Canada
ArchiveBox is a disk-and-RAM app. We host it on Canadian Docker servers with storage sized for growth, nightly backups of your data/ volume, and TLS if you want the web UI reachable from outside the house.
Tell us your import volume — we'll quote disk before your first Pocket export fills the partition.