Lifecycle, SLAs & RMA
Hardware failure, parts, maintenance windows.
Learning objectives
- Understand hardware failure rates and SLAs
- Navigate RMA and spare parts with a provider
- Plan maintenance windows with customers
Hardware fails — plan for it
Disks, PSUs, and fans wear out. Workshop Co.’s bare metal server runs 24/7 in an Edmonton data centre. Swift Host’s SLA covers network and power uptime, and the provider replaces failed components under warranty. Backups and graceful degradation remain your job.
SLA basics
| Term | Meaning |
|---|---|
| Uptime SLA | e.g. 99.9% network — excludes planned maintenance if notified |
| Hardware replacement | Often NBD (next business day) or 4-hour for premium |
| Planned maintenance | Firmware, switch upgrades — notice via email/ticket |
| Credits | Partial refund if SLA missed — read the MSA |
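To make an uptime percentage concrete, it helps to convert it into a downtime budget. The sketch below assumes a 30-day measurement period, which is an assumption for illustration only; the actual period and exclusions (such as notified planned maintenance) are defined in the provider's MSA. The function name is hypothetical.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime an uptime SLA tolerates over a measurement period.

    The 30-day default is an assumption; check the MSA for the real period.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Compare a few common SLA tiers over the assumed 30-day period.
for sla in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% network SLA leaves roughly 43 minutes of unplanned downtime per month before credits could apply.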
RMA workflow
- Monitor alerts — SMART disk warning, IPMI ECC errors
- Open ticket with logs, serial numbers, slot location
- Provider schedules swap — may need maintenance window
- Hot-swap drive if RAID redundant; otherwise brief outage
- Verify rebuild; close ticket; update asset inventory
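A ticket that arrives with every detail from the second step tends to avoid a round of follow-up questions. The following is a minimal sketch of assembling that information into a ticket body. The class, field names, and sample values are all hypothetical placeholders, not a Swift Host format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RmaRequest:
    """Details a provider typically needs to approve a hardware swap."""
    component: str
    serial_number: str
    slot: str
    symptom: str
    log_lines: List[str] = field(default_factory=list)

    def to_ticket_body(self) -> str:
        lines = [
            f"Component: {self.component}",
            f"Serial number: {self.serial_number}",
            f"Slot/location: {self.slot}",
            f"Symptom: {self.symptom}",
            "Logs:",
        ]
        if self.log_lines:
            # Indent log excerpts so they stand apart from the summary fields.
            lines.extend(f"  {entry}" for entry in self.log_lines)
        else:
            lines.append("  (none attached)")
        return "\n".join(lines)


# Placeholder values only; substitute the real serial, slot, and log excerpt.
request = RmaRequest(
    component="Disk",
    serial_number="EXAMPLE-SN-0001",
    slot="Bay 2",
    symptom="SMART warning reported by monitoring",
    log_lines=["<paste relevant SMART or IPMI log excerpt here>"],
)
print(request.to_ticket_body())
```

Keeping the same fields in the asset inventory makes the final step, updating the record after the swap, a matter of changing the serial number.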
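During a RAID 5 rebuild after a disk replacement, the array has no redundancy left, so a second disk failure loses the data. RAID is not a backup: confirm that a current backup exists before the swap, and never treat RAID as “safe enough” to delay backups.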
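A rough estimate shows how the rebuild window translates into risk. This sketch assumes disks fail independently at a constant rate derived from an annualized failure rate (AFR). The 2% AFR, 24-hour rebuild, and 4-disk array are illustrative assumptions, not figures for Workshop Co.'s server.

```python
import math

HOURS_PER_YEAR = 365 * 24


def second_failure_probability(surviving_disks: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one surviving disk fails before the rebuild ends.

    Assumes independent failures at a constant rate derived from the
    annualized failure rate (AFR). Both are simplifying assumptions.
    """
    # Convert the yearly failure probability into a constant hourly rate.
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    # Chance that one given disk survives the whole rebuild window.
    p_disk_survives = math.exp(-rate_per_hour * rebuild_hours)
    # The array survives only if every remaining disk survives.
    return 1.0 - p_disk_survives ** surviving_disks


# Hypothetical 4-disk RAID 5: 3 surviving disks, 2% AFR, 24-hour rebuild.
risk = second_failure_probability(surviving_disks=3, afr=0.02, rebuild_hours=24)
print(f"Chance of a second failure during rebuild: {risk:.4%}")
```

The number this model produces is small, but it likely understates the real risk: disks bought together tend to age together, and a rebuild puts heavy read load on the survivors. That is the reason the backup has to exist before the rebuild starts.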
Worked example — PSU failure
Workshop Co.’s server has dual redundant PSUs. One fails; IPMI reports amber PSU status. Server keeps running. Swift Host ships replacement PSU under RMA; technician hot-swaps during business hours. No customer-facing outage because redundancy held.
If both PSUs fail, or the server is a single-PSU budget tower, expect downtime until the physical repair is complete.
Maintenance communication
Workshop Co. emails class registrants if booking may be offline. Template:
“Scheduled maintenance Sunday 2–4 AM MT. Online booking may be unavailable. Classes are unaffected.” Add a public status page if traffic grows.
Try it yourself — failure playbook
Write a four-bullet playbook for Workshop Co. to follow when Swift Host opens emergency maintenance on its server (unknown hardware fault).
Answer
- Confirm backup age < 24h; test restore path
- Enable static “classes still run — book by phone” banner if site down
- Monitor IPMI + provider ticket channel
- Post-mortem: was redundancy adequate? update runbook
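The first bullet can be checked by a script instead of by eye. Here is a minimal sketch, assuming the latest backup is a single file whose modification time reflects when the backup finished. The function names are hypothetical, and the 24-hour threshold comes from the playbook above.

```python
import time
from pathlib import Path
from typing import Optional

MAX_BACKUP_AGE_HOURS = 24  # threshold from the playbook above


def age_hours(modified_epoch: float, now_epoch: float) -> float:
    """Hours elapsed between a file's modification time and now."""
    return (now_epoch - modified_epoch) / 3600


def backup_is_fresh(backup_path: Path, now_epoch: Optional[float] = None) -> bool:
    """True if the backup file was modified less than 24 hours ago.

    Raises FileNotFoundError if the backup file does not exist, which is
    itself a finding worth acting on.
    """
    now_epoch = time.time() if now_epoch is None else now_epoch
    modified_epoch = backup_path.stat().st_mtime
    return age_hours(modified_epoch, now_epoch) < MAX_BACKUP_AGE_HOURS
```

A call might look like `backup_is_fresh(Path("/backups/latest.tar.gz"))`, where the path is a placeholder. A fresh timestamp only shows that a backup ran, not that it restores, so the second half of the bullet (testing the restore path) still needs to be done by hand.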
End-of-life planning
The server is 5 years old: the warranty has expired and parts are scarce. List the signs that it is time to refresh the hardware, and the steps to migrate to its replacement.
Answer
Refresh when: SMART errors are rising, the CPU can no longer handle the load, or the old platform lacks ECC support.
Migrate by: provisioning new metal, restoring from backups, doing a blue/green DNS cutover (see Book 1, TTL planning), and decommissioning the old server after a 30-day overlap.
Quick quiz
- What is an RMA?
- Why dual PSUs in data centre servers?
- Does SLA replace the need for backups?
Answers
- Return Merchandise Authorization — warranty hardware replacement process.
- One PSU can fail without immediate shutdown.
- No — SLA covers provider obligations, not your data integrity.