Chapter 7

Lifecycle, SLAs & RMA

Hardware failure, parts, maintenance windows.

Learning objectives

  • Understand hardware failure rates and SLAs
  • Navigate RMA and spare parts with a provider
  • Plan maintenance windows with customers

Hardware fails — plan for it

Disks, power supply units (PSUs), and fans wear out. Workshop Co.’s bare metal server runs 24/7 in Edmonton’s data centre. Swift Host’s SLA covers network and power uptime; your job is backups and graceful degradation. The provider replaces failed components under warranty.

SLA basics

Term                  Meaning
Uptime SLA            e.g. 99.9% network; excludes planned maintenance if notified
Hardware replacement  Often NBD (next business day) or 4-hour for premium
Planned maintenance   Firmware, switch upgrades; notice via email/ticket
Credits               Partial refund if SLA missed; read the MSA
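
An uptime percentage is easier to reason about as a downtime budget. The sketch below converts one into minutes per billing period; the function name and the 30-day period are assumptions for illustration, and a real MSA may define the measurement period differently.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 30.0) -> float:
    """Minutes of unplanned downtime an uptime SLA tolerates over one period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The SLA tolerates the fraction of the period that is NOT covered by uptime.
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% uptime allows {minutes:.1f} minutes of downtime per 30 days")
```

At 99.9%, the budget works out to roughly 43 minutes in a 30-day period, which is why planned maintenance is usually excluded from the calculation.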

RMA workflow

  1. Monitor alerts — SMART disk warning, IPMI ECC errors
  2. Open ticket with logs, serial numbers, slot location
  3. Provider schedules swap — may need maintenance window
  4. Hot-swap drive if RAID redundant; otherwise brief outage
  5. Verify rebuild; close ticket; update asset inventory
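
Step 2 goes faster when the ticket is complete on the first attempt. As a minimal sketch, the class below collects the details listed above and refuses to render a ticket with blanks. The class name, field names, and output layout are hypothetical and not a Swift Host format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RmaTicket:
    """Details a provider typically needs before scheduling a part swap."""
    component: str        # e.g. "disk", "PSU", "fan"
    serial_number: str    # serial of the failed part
    slot: str             # physical location, e.g. "bay 2"
    symptom: str          # the alert that triggered the ticket
    log_excerpts: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of required fields that are empty or whitespace only."""
        required = {
            "component": self.component,
            "serial_number": self.serial_number,
            "slot": self.slot,
            "symptom": self.symptom,
        }
        return [name for name, value in required.items() if not value.strip()]

    def render(self) -> str:
        """Plain-text ticket body; raises if a required field is blank."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("ticket incomplete, missing: " + ", ".join(missing))
        lines = [
            f"Component: {self.component}",
            f"Serial number: {self.serial_number}",
            f"Slot: {self.slot}",
            f"Symptom: {self.symptom}",
            "Logs:",
        ]
        if self.log_excerpts:
            lines.extend("  " + entry for entry in self.log_excerpts)
        else:
            lines.append("  (no logs attached)")
        return "\n".join(lines)
```

The same record can be reused in step 5 to update the asset inventory once the swap is verified.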

RAID rebuild risk

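During a RAID 5 rebuild after a disk replacement, the array has no redundancy left, so a second disk failure loses the data. RAID is not a backup: confirm that current backups exist before the rebuild starts, and never treat RAID as “safe enough to delay backup.”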
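
To get a feel for the size of that risk, here is a rough estimate of the chance that another disk fails during the rebuild window. It assumes disks fail independently at a constant annualized failure rate, which is a simplification: disks from the same batch tend to age together, rebuilds add load, and unrecoverable read errors are ignored. Treat the result as a lower bound. The function name and the example figures are illustrative assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def second_failure_probability(
    remaining_disks: int,
    annual_failure_rate: float,
    rebuild_hours: float,
) -> float:
    """Chance that at least one surviving disk fails before the rebuild ends.

    Assumes independent failures at a constant annualized failure rate.
    """
    if remaining_disks < 1:
        raise ValueError("remaining_disks must be at least 1")
    if not 0.0 <= annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be in [0, 1)")
    if rebuild_hours < 0:
        raise ValueError("rebuild_hours must not be negative")
    # Probability that a single disk survives the rebuild window.
    survive_one = (1.0 - annual_failure_rate) ** (rebuild_hours / HOURS_PER_YEAR)
    # All surviving disks must make it through for the array to stay intact.
    return 1.0 - survive_one ** remaining_disks


if __name__ == "__main__":
    # Illustrative numbers only: 3 surviving disks, 2% AFR, 24-hour rebuild.
    p = second_failure_probability(3, 0.02, 24)
    print(f"Estimated chance of a second failure: {p:.4%}")
```

Even when the estimate looks small, the consequence is total data loss, so the backup requirement stands regardless of the number.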

Worked example — PSU failure

Workshop Co.’s server has dual redundant PSUs. One fails, and IPMI reports an amber PSU status, but the server keeps running on the remaining unit. Swift Host ships a replacement PSU under RMA, and a technician hot-swaps it during business hours. There is no customer-facing outage because the redundancy held.

If both PSUs fail, or the server is a single-PSU budget tower, expect downtime until the physical repair is complete.

Maintenance communication

Workshop Co. emails class registrants if booking may be offline. Template:

Maintenance notice

“Scheduled maintenance Sunday 2–4 AM MT: online booking may be unavailable. Classes unaffected.” Post a status page as well if traffic grows.
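
If notices go out regularly, a small helper keeps the wording consistent. This is only a sketch that loosely follows the template above; the function and parameter names are assumptions.

```python
def maintenance_notice(
    day: str,
    window: str,
    timezone_label: str,
    affected: str,
    unaffected: str,
) -> str:
    """Build a one-line maintenance notice naming what is and is not affected."""
    return (
        f"Scheduled maintenance {day} {window} {timezone_label}: "
        f"{affected} may be unavailable. {unaffected} unaffected."
    )


if __name__ == "__main__":
    print(maintenance_notice("Sunday", "2-4 AM", "MT", "online booking", "Classes"))
```

Stating the time zone and the unaffected service explicitly answers the two questions registrants are most likely to ask.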

Try it yourself — failure playbook

Write four bullet points for Workshop Co. to follow when Swift Host opens emergency maintenance on their server (unknown hardware fault).

Answer
  • Confirm backup age < 24h; test restore path
  • Enable static “classes still run — book by phone” banner if site down
  • Monitor IPMI + provider ticket channel
  • Post-mortem: was redundancy adequate? Update the runbook.
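
The first bullet can be automated. The sketch below checks whether the most recent backup is under 24 hours old; how the backup timestamp is obtained depends on the backup tool and is left out. The function name and constant are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_BACKUP_AGE = timedelta(hours=24)


def backup_is_fresh(
    last_backup: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_BACKUP_AGE,
) -> bool:
    """Return True if the last backup is strictly younger than max_age.

    Both datetimes must be timezone-aware; mixing aware and naive
    values raises TypeError when they are subtracted.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - last_backup
    if age < timedelta(0):
        raise ValueError("backup timestamp is in the future")
    return age < max_age


if __name__ == "__main__":
    checked_at = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    print(backup_is_fresh(checked_at - timedelta(hours=23), checked_at))
    print(backup_is_fresh(checked_at - timedelta(hours=30), checked_at))
```

A fresh timestamp only shows that a backup ran; the restore path still needs its own test.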

End-of-life planning

The server is 5 years old: the warranty has expired and parts are scarce. List the signs that it is time to refresh the hardware, then outline the migration steps.

Answer

Refresh when: SMART errors are rising, the CPU is insufficient for the load, or the old platform has no ECC support.
Migrate by: provisioning new metal, restoring backups, doing a blue/green DNS cutover (Book 1 TTL planning), and decommissioning the old server after a 30-day overlap.

Quick quiz

  1. What is an RMA?
  2. Why do data centre servers have dual PSUs?
  3. Does an SLA replace the need for backups?

Answers
  1. Return Merchandise Authorization — warranty hardware replacement process.
  2. One PSU can fail without immediate shutdown.
  3. No — SLA covers provider obligations, not your data integrity.