Week Introduction
"Assume breach": no defense is perfect, so security maturity is measured by how quickly you detect, contain, and recover from incidents. The goal isn't zero incidents (impossible); it's minimizing dwell time and blast radius.
This week explores the incident response lifecycle: how organizations detect anomalies, investigate alerts, contain threats, eradicate attackers, and recover operations. You'll learn why speed matters (attackers move laterally while you investigate), why preparation is everything (you can't plan during crisis), and how post-incident learning prevents future compromises.
Learning Outcomes (Week 11 Focus)
By the end of this week, you should be able to:
- LO4 - Risk Reasoning: Explain the incident response lifecycle and why each phase matters
- LO8 - Integration: Connect technical controls (logging, monitoring) to incident detection and response
- LO3 - Threat Landscape: Analyze real breach timelines to identify failures in detection, containment, or recovery
Lesson 11.1 · What Is an Incident? (Event vs Incident vs Breach)
Critical distinctions:
- Security Event: Any observable occurrence in a system
  - Examples: Failed login, firewall block, antivirus alert, port scan
  - Volume: Thousands per day (most are noise)
- Security Incident: Event that violates security policy or indicates compromise
  - Examples: Confirmed malware infection, unauthorized access, data exfiltration, ransomware deployment
  - Trigger: Requires investigation and response
- Data Breach: Incident where sensitive data is accessed or stolen by an unauthorized party
  - Examples: Customer database stolen, credit cards compromised, intellectual property exfiltrated
  - Consequence: Legal notification requirements, regulatory fines, reputational damage
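The event → incident → breach escalation can be made concrete with a toy triage helper. A minimal sketch (the boolean field names are hypothetical, invented for illustration; real pipelines derive these signals from alert enrichment):

```python
def classify(observation: dict) -> str:
    """Triage an observation into event / incident / breach tiers.

    `observation` uses hypothetical boolean fields for illustration.
    """
    if observation.get("sensitive_data_accessed_by_unauthorized_party"):
        return "data breach"        # triggers legal notification duties
    if observation.get("policy_violated_or_compromise_indicated"):
        return "security incident"  # requires investigation and response
    return "security event"         # logged; usually noise

print(classify({"policy_violated_or_compromise_indicated": True}))
# security incident
```

Note the ordering: a breach is checked first because every breach is also an incident, and every incident is also an event.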
Incident severity classification (prioritizes response):
- Critical (P1): Active breach, data exfiltration, ransomware, production systems down → respond immediately (24/7)
- High (P2): Confirmed compromise, no data loss yet, limited spread → respond within hours
- Medium (P3): Suspicious activity, potential compromise, needs investigation → respond within 24 hours
- Low (P4): Policy violation, no security impact, informational → handle during business hours
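The P1-P4 scheme maps naturally to a response-SLA lookup. A minimal sketch (the 4-hour P2 window is an assumed value; the text says only "within hours"):

```python
# Severity-to-SLA mapping for the P1-P4 scheme above.
RESPONSE_SLA_HOURS = {
    "P1": 0,     # critical: respond immediately, 24/7
    "P2": 4,     # high: within hours (4h is an assumed example SLA)
    "P3": 24,    # medium: within 24 hours
    "P4": None,  # low: handle during business hours
}

def due_within(priority: str) -> str:
    """Render the response deadline for a given priority."""
    hours = RESPONSE_SLA_HOURS[priority]
    if hours == 0:
        return "immediately"
    if hours is None:
        return "business hours"
    return f"{hours}h"

print(due_within("P1"))  # immediately
```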
Why dwell time matters (time from compromise to detection):
- Industry dwell time: recent Mandiant M-Trends reports put the global median at roughly 10-21 days
- During dwell time, attackers: Escalate privileges, move laterally, establish persistence, steal data
- Goal: Detect in hours/days, not weeks/months
Lesson 11.2 · The Incident Response Lifecycle (NIST Framework)
NIST SP 800-61 Incident Response Framework (industry standard):
- Phase 1: Preparation (before an incident occurs)
  - Activities: Build the IR team, define roles, create playbooks, deploy monitoring tools, conduct tabletop exercises
  - Key outputs: Incident response plan, contact lists, pre-approved actions, communication templates
  - Why it matters: You can't plan during a crisis → preparation determines response speed
  - Example: Pre-negotiated contracts with forensic firms (you can't negotiate during a ransomware attack)
- Phase 2: Detection & Analysis (identify and investigate)
  - Activities: Monitor alerts, triage events, investigate suspicious activity, determine scope
  - Key questions: Is this a real incident? What systems are affected? What data is at risk?
  - Challenges: Alert fatigue (false positives), incomplete logs, attacker evasion techniques
  - Example: SIEM alert fires for an unusual login → analyst investigates → confirms account compromise
- Phase 3: Containment (stop the bleeding)
  - Activities: Isolate affected systems, block attacker C2 (command & control), prevent lateral movement
  - Short-term containment: Disconnect compromised servers from the network (stops spread immediately)
  - Long-term containment: Patch vulnerabilities, reset credentials, rebuild systems
  - Trade-off: Containment disrupts business but limits damage
  - Example: Ransomware detected → immediately isolate all servers → prevents encryption from spreading
- Phase 4: Eradication (remove the threat)
  - Activities: Delete malware, close vulnerabilities, remove attacker access and persistence mechanisms
  - Why separate from containment: You must ensure the attacker is fully removed before restoring
  - Risk: Incomplete eradication → attacker returns immediately after systems are restored
  - Example: Find and remove all backdoor accounts, scheduled tasks, and web shells
- Phase 5: Recovery (restore operations)
  - Activities: Rebuild systems from clean backups, restore data, verify integrity, monitor closely
  - Phased approach: Restore critical systems first, watch for re-compromise
  - Validation: Confirm the attacker is truly gone before declaring "all clear"
  - Example: Restore from a pre-infection backup, reset all passwords, increase monitoring
- Phase 6: Post-Incident Activity (learn and improve)
  - Activities: Conduct a post-mortem, document the timeline, identify gaps, update defenses
  - Key questions: How did the attacker get in? Why didn't we detect it sooner? What failed?
  - Output: Lessons-learned report, remediation plan, updated IR playbooks
  - Example: "Attacker used an unpatched VPN → implement a 48-hour patch SLA for critical vulnerabilities"
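The six phases are sequential, and the "eradicate fully before you recover" ordering can be encoded directly. A minimal sketch (phase names follow NIST SP 800-61; the playbook actions are illustrative examples from the list above, not an exhaustive runbook):

```python
from enum import Enum
from typing import Optional

class Phase(Enum):
    PREPARATION = 1
    DETECTION_ANALYSIS = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    POST_INCIDENT = 6

# Illustrative playbook skeleton: each phase maps to example actions.
PLAYBOOK = {
    Phase.CONTAINMENT: ["isolate affected hosts", "block C2 domains"],
    Phase.ERADICATION: ["remove backdoor accounts", "delete web shells"],
    Phase.RECOVERY:    ["restore from clean backup", "reset all passwords"],
}

def next_phase(current: Phase) -> Optional[Phase]:
    """Phases advance strictly in order: eradication must complete
    before recovery begins, so the attacker can't simply return."""
    members = list(Phase)
    i = members.index(current)
    return members[i + 1] if i + 1 < len(members) else None

print(next_phase(Phase.ERADICATION))  # Phase.RECOVERY
```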
Real-world breach timeline example (SolarWinds supply chain attack):
- March 2020: Attackers compromise SolarWinds build system (initial access)
- March-May 2020: Attackers insert backdoor into Orion software updates (supply chain poisoning)
- June-November 2020: Trojanized updates distributed to 18,000 customers (widespread deployment)
- December 2020: FireEye detects breach, public disclosure (detection after 9 months)
- Dwell time: 9+ months from compromise to detection
- Lesson: Supply chain attacks evade traditional defenses → need behavioral monitoring, not just signatures
Lesson 11.3 · Detection and Monitoring: The MTTD Challenge
Core metric: Mean Time to Detect (MTTD)
MTTD = Time from initial compromise to detection. Recent Mandiant M-Trends reports put the global median at roughly 10-21 days; best-in-class teams detect in hours. Every day attackers remain undetected, they escalate privileges, move laterally, and steal more data.
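MTTD itself is just an average of per-incident compromise-to-detection gaps. A sketch using approximate SolarWinds dates from Lesson 11.2 (the exact days of month are assumed for illustration):

```python
from datetime import datetime

def mttd_days(compromise_times, detection_times):
    """Mean time to detect, in days, averaged over paired incidents."""
    deltas = [(detected - compromised).total_seconds() / 86400
              for compromised, detected in zip(compromise_times, detection_times)]
    return sum(deltas) / len(deltas)

# SolarWinds-style example: compromised March 2020, detected December 2020.
compromised = [datetime(2020, 3, 1)]
detected = [datetime(2020, 12, 8)]
print(round(mttd_days(compromised, detected)))  # 282 (days), i.e. 9+ months
```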
Detection pyramid (layers of visibility):
- Level 1: Network monitoring
  - Tools: Firewall logs, IDS/IPS (Snort, Suricata), NetFlow analysis
  - Detects: Port scans, C2 beaconing, lateral movement, data exfiltration
  - Limitation: Encrypted traffic (HTTPS) hides content
- Level 2: Endpoint monitoring
  - Tools: EDR (CrowdStrike, SentinelOne), antivirus, system logs
  - Detects: Malware execution, suspicious processes, registry changes, file modifications
  - Limitation: Agents can be disabled; only sees individual hosts
- Level 3: Application/authentication logs
  - Tools: SIEM (Splunk, ELK), authentication logs, application audit trails
  - Detects: Failed logins, privilege escalation, unusual access patterns, data queries
  - Limitation: Requires centralized log collection and correlation
- Level 4: User behavior analytics (UBA)
  - Tools: Machine-learning anomaly detection, baseline behavioral profiling
  - Detects: Insider threats, compromised accounts behaving abnormally
  - Example: Employee who normally accesses 10 files/day suddenly downloads 10,000 → alert
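The Level 4 example (a ~10 files/day baseline versus a sudden 10,000-file download) is essentially an outlier test against the user's own history. A toy z-score sketch (real UBA products use far richer behavioral models than this):

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's file-access count if it sits far outside the
    user's baseline (simple z-score; 3 sigma is an assumed cutoff)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # flat baseline: any change is anomalous
    return abs(today - mu) / sigma > z_threshold

baseline = [8, 12, 10, 9, 11, 10, 10]   # ~10 files/day
print(is_anomalous(baseline, 10_000))   # True -> raise an alert
```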
Alert fatigue problem:
- Reality: Security teams receive thousands of alerts daily, 90%+ are false positives
- Consequence: Real attacks buried in noise, analysts become desensitized
- Solutions: Tuning (reduce false positives), automation (tier alerts by severity), threat intelligence (prioritize known bad indicators)
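The three mitigations (tuning, tiering, threat intelligence) can be combined in a single triage pass. A minimal sketch with hypothetical alert fields (`tuned_out`, `severity`, `src_ip` are invented for illustration):

```python
def triage(alerts, known_bad_ips):
    """Drop tuned-out false positives, then sort the rest:
    threat-intel hits first, then by severity (P1 before P4)."""
    kept = [a for a in alerts if not a.get("tuned_out", False)]

    def key(alert):
        intel_hit = alert.get("src_ip") in known_bad_ips
        # False sorts before True, so intel hits come first.
        return (not intel_hit, alert.get("severity", 4))

    return sorted(kept, key=key)

alerts = [
    {"id": 1, "severity": 3, "src_ip": "10.0.0.5"},
    {"id": 2, "severity": 1, "src_ip": "203.0.113.9"},
    {"id": 3, "severity": 2, "src_ip": "198.51.100.7", "tuned_out": True},
]
queue = triage(alerts, known_bad_ips={"10.0.0.5"})
print([a["id"] for a in queue])  # [1, 2]
```

Alert 3 is suppressed by tuning, and alert 1 jumps the queue despite its lower severity because its source IP matches a threat feed.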
Detection indicators (what to monitor):
- Failed authentication attempts (brute force attacks)
- Privilege escalation (user suddenly has admin rights)
- Off-hours access (login at 3 AM from unusual location)
- Large data transfers (exfiltration in progress)
- New scheduled tasks (persistence mechanism)
- Communication with known bad IPs (C2 servers, threat feeds)
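Several of these indicators reduce to counting events over a log window. A sketch for the first one, brute-force detection (the 10-attempt threshold and the log-tuple layout are assumed for illustration):

```python
from collections import Counter

def brute_force_sources(failed_logins, threshold=10):
    """Return source IPs with >= threshold failed logins.

    `failed_logins` is an iterable of (timestamp, src_ip, username)
    tuples, assumed pre-filtered to one sliding window (e.g. 5 minutes).
    """
    counts = Counter(src_ip for _, src_ip, _ in failed_logins)
    return {ip for ip, n in counts.items() if n >= threshold}

window = [("t", "203.0.113.9", "root")] * 12 + [("t", "10.0.0.8", "alice")] * 3
print(brute_force_sources(window))  # {'203.0.113.9'}
```

The same count-and-threshold shape works for large data transfers (bytes per host) or new scheduled tasks (creations per day); only the field being counted changes.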