Chapter 8

AI on Your Infrastructure

Self-hosted vs cloud API, privacy, Canadian data paths.

Learning objectives

Compare cloud LLM API vs self-hosted open models
Map data residency for Canadian workloads
List hardware basics for local inference

Two deployment paths

Approach	Pros	Cons
Cloud API (OpenAI, Anthropic, etc.)	Usually the strongest models, no GPU operations, fast to ship	Data leaves your network; US vendor terms apply; usage-based billing
Self-hosted (Ollama, vLLM, llama.cpp on your VPS)	Data stays on your server (in Canada if the VPS is); fixed cost; air-gap possible	Weaker models on small hardware; you patch and monitor
Canadian hosted API (regional providers)	Balance of quality + residency	Smaller model choice; verify subprocessors

Self-host sketch on Workshop Co. Proxmox

# Run in an LXC container or VM; GPU passthrough is optional
# CPU-only: smaller models (e.g. 7B quantized) run on 16 GB RAM; slow but private

# Install Ollama using the official install script
curl -fsSL https://ollama.com/install.sh | sh
# Download a 3B model, then start an interactive session with it
ollama pull llama3.2:3b
ollama run llama3.2:3b

# Your PHP/Node app calls http://127.0.0.1:11434/api/generate
# Never expose the Ollama port to the public internet without auth
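
The app-side call mentioned in the comments above might look like the following in Python. This is a minimal sketch, assuming Ollama is running locally on its default port with llama3.2:3b already pulled. The helper names build_generate_request and ask_local_model are illustrative and not part of Ollama.

```python
import json
import urllib.request

# Local-only endpoint from the sketch above; never a public address.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2:3b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.load(reply)
    # With "stream": false, Ollama returns one JSON object with a "response" field.
    return body["response"]
```

Calling ask_local_model("When is the next class?") returns the model's text as a string. The generous timeout is a guess for CPU-only inference, which can be slow.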

Canadian angle

For a PIPEDA-sensitive FAQ (names or emails in chat logs), self-host on a Swift Host Canadian VPS or use a provider that contractually commits to processing in Canada. A default US API may be acceptable, but only for generic marketing copy.
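
One way to apply this split in code is to check each prompt before choosing where to send it. The sketch below is an assumption-laden illustration: it only detects email addresses with a simple regular expression (names cannot be caught this way), and the backend labels are hypothetical.

```python
import re

# Deliberately simple pattern; real PII detection needs more than one regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_email(text: str) -> bool:
    """Return True if the text appears to contain an email address."""
    return EMAIL_RE.search(text) is not None


def choose_backend(text: str) -> str:
    """Route email-bearing text to the Canadian path, everything else to the cloud API."""
    # Hypothetical labels; map them to your own client code.
    return "canadian_self_hosted" if contains_email(text) else "us_cloud_api"
```

A check like this is a safety net, not a guarantee. Treat anything it misses as a reason to default sensitive features to the Canadian path.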

Security checklist

API keys in env vars, rotated quarterly
Rate limit public chat widget (prevent token drain attacks)
Log prompts without storing full card numbers
Firewall inference port to localhost or VPN
Disclose “AI assistant” to users — no fake human support
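
The rate-limit and log-redaction items above can be sketched in Python as follows. The limits, the class name SlidingWindowLimiter, and the card-number pattern are all assumptions for illustration. The limiter keeps state in memory, so it only works for a single process.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Hypothetical limits for illustration; tune them to your traffic.
MAX_REQUESTS = 5
WINDOW_SECONDS = 60.0

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests: int = MAX_REQUESTS, window: float = WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def redact_cards(text: str) -> str:
    """Mask anything that looks like a card number, keeping only the last four digits."""
    def _mask(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD ****" + digits[-4:] + "]"
    return CARD_RE.sub(_mask, text)
```

In a chat handler, call allow() with the client's IP or session ID before forwarding the prompt, and pass the prompt through redact_cards() before writing it to a log.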

Decision matrix

Scenario: the Workshop Co. FAQ bot runs on the public website only, with no customer names in prompts. Pick cloud API or self-host and justify your choice in three bullets.

Sample: cloud API OK

No PII in prompts — only public class schedule JSON
Low volume — API cost < $5/mo
Accept vendor terms; add “do not submit personal info” disclaimer

If the chat collects an email address for follow-up, switch to self-hosting or a Canadian API and add a retention policy.
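
A retention policy can be enforced with a small scheduled cleanup. The sketch below assumes chat records are dictionaries with a created_at timestamp; the 30-day period and the field names are hypothetical, not a recommendation.

```python
from datetime import datetime, timedelta

# Hypothetical retention period; set it from your own policy.
RETENTION_DAYS = 30


def purge_expired(records: list, now: datetime, retention_days: int = RETENTION_DAYS) -> list:
    """Return only the records whose 'created_at' falls within the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [record for record in records if record["created_at"] >= cutoff]
```

Run something like this on a schedule against wherever follow-up emails are stored, so old records are removed without anyone having to remember.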

Quick quiz

Why must Ollama not listen on 0.0.0.0:11434 on a public VPS?

Answer

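Ollama's API has no built-in authentication, so anyone who can reach the port could run prompts on your GPU/CPU: abuse, cost, and a possible foothold into your network. Bind to localhost, or protect the port with an SSH tunnel or an authenticating proxy.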
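
A deployment script can guard against this mistake by checking the bind address before starting the service. The sketch below assumes the address is given as host:port, as in Ollama's OLLAMA_HOST setting, without a URL scheme; the function name is_loopback_bind is illustrative.

```python
import ipaddress


def is_loopback_bind(bind: str) -> bool:
    """Return True if a 'host:port' or bare host value binds only to the loopback interface."""
    host = bind.strip()
    if host.startswith("["):
        # Bracketed IPv6 form, e.g. "[::1]:11434".
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        # "host:port" form, e.g. "127.0.0.1:11434".
        host = host.rsplit(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unknown hostnames are treated as unsafe.
        return False
```

Here is_loopback_bind("127.0.0.1:11434") is True and is_loopback_bind("0.0.0.0:11434") is False, so a script can refuse to start when the check fails.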