Week Introduction
"Assume breach": no defense is perfect, so security maturity is measured by how quickly you detect, contain, and recover from incidents. The goal isn't zero incidents (impossible); it's minimizing dwell time and blast radius.
This week explores the incident response lifecycle: how organizations detect anomalies, investigate alerts, contain threats, eradicate attackers, and recover operations. You'll learn why speed matters (attackers move laterally while you investigate), why preparation is everything (you can't plan during a crisis), and how post-incident learning prevents future compromises.
Learning Outcomes (Week 11 Focus)
By the end of this week, you should be able to:
- LO4 - Risk Reasoning: Explain the incident response lifecycle and why each phase matters
- LO8 - Integration: Connect technical controls (logging, monitoring) to incident detection and response
- LO3 - Threat Landscape: Analyze real breach timelines to identify failures in detection, containment, or recovery
Lesson 11.1 · What Is an Incident? (Event vs Incident vs Breach)
Critical distinctions:
- Security Event: Any observable occurrence in a system
  Examples: Failed login, firewall block, antivirus alert, port scan
  Volume: Thousands per day (most are noise)
- Security Incident: Event that violates security policy or indicates compromise
  Examples: Confirmed malware infection, unauthorized access, data exfiltration, ransomware deployment
  Trigger: Requires investigation and response
- Data Breach: Incident where sensitive data is accessed or stolen by an unauthorized party
  Examples: Customer database stolen, credit cards compromised, intellectual property exfiltrated
  Consequence: Legal notification requirements, regulatory fines, reputational damage
Incident severity classification (prioritizes response):
- Critical (P1): Active breach, data exfiltration, ransomware, production systems down → respond immediately (24/7)
- High (P2): Confirmed compromise, no data loss yet, limited spread → respond within hours
- Medium (P3): Suspicious activity, potential compromise, needs investigation → respond within 24 hours
- Low (P4): Policy violation, no security impact, informational → handle during business hours
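The severity tiers above can be turned into a response deadline. The sketch below is illustrative only: the function names and the concrete P1, P2, and P4 durations are assumptions (the list above fixes only the 24-hour P3 window), and a real program would tune them to its own policy.

```python
from datetime import datetime, timedelta

# Response windows per severity. Only the 24-hour P3 window comes from the
# list above; the other durations are illustrative assumptions.
RESPONSE_WINDOW = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
    "P4": timedelta(hours=72),
}


def classify(active_breach: bool, confirmed_compromise: bool, suspicious: bool) -> str:
    """Pick the highest severity tier whose condition holds."""
    if active_breach:
        return "P1"
    if confirmed_compromise:
        return "P2"
    if suspicious:
        return "P3"
    return "P4"


def respond_by(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which a responder should have picked up the incident."""
    return detected_at + RESPONSE_WINDOW[severity]
```

For example, suspicious activity flagged at 09:00 classifies as P3 and should be picked up by 09:00 the next day.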
Why dwell time matters (time from compromise to detection):
- Global median dwell time: 21 days in 2021 (Mandiant M-Trends 2022), falling to 10 days in 2023 (Mandiant M-Trends 2024)
- During dwell time, attackers: Escalate privileges, move laterally, establish persistence, steal data
- Goal: Detect in hours/days, not weeks/months
Lesson 11.2 · The Incident Response Lifecycle (NIST Framework)
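NIST SP 800-61 Incident Response Framework (industry standard). NIST SP 800-61r2 itself groups containment, eradication, and recovery into one phase; they are split out below so each gets its own treatment: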
- Phase 1: Preparation (Before incident occurs)
  Activities: Build IR team, define roles, create playbooks, deploy monitoring tools, conduct tabletop exercises
  Key outputs: Incident response plan, contact lists, pre-approved actions, communication templates
  Why it matters: You can't plan during a crisis; preparation determines response speed
  Example: Pre-negotiated contracts with forensic firms (can't negotiate during ransomware attack)
- Phase 2: Detection & Analysis (Identify and investigate)
  Activities: Monitor alerts, triage events, investigate suspicious activity, determine scope
  Key questions: Is this a real incident? What systems are affected? What data is at risk?
  Challenges: Alert fatigue (false positives), incomplete logs, attacker evasion techniques
  Example: SIEM alert fires for unusual login → analyst investigates → confirms account compromise
- Phase 3: Containment (Stop the bleeding)
  Activities: Isolate affected systems, block attacker C2 (command & control), prevent lateral movement
  Short-term containment: Disconnect compromised servers from network (stops spread immediately)
  Long-term containment: Patch vulnerabilities, reset credentials, rebuild systems
  Trade-off: Containment disrupts business but limits damage
  Example: Ransomware detected → immediately isolate all servers → prevents encryption spread
- Phase 4: Eradication (Remove threat)
  Activities: Delete malware, close vulnerabilities, remove attacker access/persistence mechanisms
  Why separate from containment: Must ensure attacker is fully removed before restoring
  Risk: Incomplete eradication → attacker returns immediately after systems restored
  Example: Find and remove all backdoor accounts, scheduled tasks, web shells
- Phase 5: Recovery (Restore operations)
  Activities: Rebuild systems from clean backups, restore data, verify integrity, monitor closely
  Phased approach: Restore critical systems first, watch for re-compromise
  Validation: Confirm attacker truly gone before declaring "all clear"
  Example: Restore from pre-infection backup, reset all passwords, increase monitoring
- Phase 6: Post-Incident Activity (Learn and improve)
  Activities: Conduct post-mortem, document timeline, identify gaps, update defenses
  Key questions: How did attacker get in? Why didn't we detect sooner? What failed?
  Output: Lessons learned report, remediation plan, updated IR playbooks
  Example: "Attacker used unpatched VPN → implement 48-hour patch SLA for critical vulnerabilities"
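The ordering of the phases, and the rule that recovery waits on verified eradication, can be sketched as a small state machine. This is one possible model under stated assumptions: the `Phase` enum, the `next_phase` helper, and the `eradication_verified` flag are hypothetical names introduced for illustration, not part of NIST SP 800-61.

```python
from enum import IntEnum


class Phase(IntEnum):
    PREPARATION = 1
    DETECTION_ANALYSIS = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    POST_INCIDENT = 6


def next_phase(current: Phase, eradication_verified: bool = False) -> Phase:
    """Advance one phase; recovery is blocked until eradication is verified."""
    if current is Phase.POST_INCIDENT:
        # Lessons learned feed back into preparation for the next incident.
        return Phase.PREPARATION
    if current is Phase.ERADICATION and not eradication_verified:
        # Restoring too early risks the attacker returning immediately.
        return Phase.ERADICATION
    return Phase(current + 1)
```

Looping from post-incident activity back to preparation reflects the idea that each incident should improve readiness for the next one.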
Real-world breach timeline example (SolarWinds supply chain attack):
- September 2019: Attackers gain access to SolarWinds and begin test code injection in the Orion build process (initial access)
- February 2020: SUNBURST backdoor compiled into Orion builds (supply chain poisoning)
- March-June 2020: Trojanized updates released; as many as 18,000 customers install them (widespread deployment)
- December 2020: FireEye detects breach, public disclosure (about 9 months after the first trojanized update shipped)
- Dwell time: 9+ months from the first trojanized update to detection, about 15 months from initial access
- Lesson: Supply chain attacks evade traditional defenses, so detection needs behavioral monitoring, not just signatures
Lesson 11.3 · Detection and Monitoring: The MTTD Challenge
Core metric: Mean Time to Detect (MTTD)
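MTTD = average time from initial compromise to detection, measured across incidents. Reported global median dwell time has fallen from 21 days (2021) to 10 days (2023) in Mandiant's M-Trends reports. Best-in-class: Hours. Every day attackers remain undetected, they escalate privileges, move laterally, and steal more data.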
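As a minimal sketch of the metric, assume each incident record carries a compromise timestamp and a detection timestamp. The helper names and the sample dates below are invented for illustration.

```python
from datetime import datetime
from statistics import mean


def dwell_days(compromised_at: datetime, detected_at: datetime) -> float:
    """Dwell time for one incident, in days."""
    return (detected_at - compromised_at).total_seconds() / 86400


def mttd_days(incidents) -> float:
    """Mean time to detect across (compromised_at, detected_at) pairs."""
    return mean(dwell_days(c, d) for c, d in incidents)


# Hypothetical incident history: dwell times of 10, 2, and 0.5 days.
incidents = [
    (datetime(2024, 1, 1), datetime(2024, 1, 11)),
    (datetime(2024, 2, 1), datetime(2024, 2, 3)),
    (datetime(2024, 3, 1, 0, 0), datetime(2024, 3, 1, 12, 0)),
]
```

Because a single long-running compromise can dominate a mean, reports often quote the median instead; `statistics.median` could be swapped in for `mean` here.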
Detection pyramid (layers of visibility):
- Level 1: Network monitoring
  Tools: Firewall logs, IDS/IPS (Snort, Suricata), NetFlow analysis
  Detects: Port scans, C2 beaconing, lateral movement, data exfiltration
  Limitation: Encrypted traffic (HTTPS) hides content
- Level 2: Endpoint monitoring
  Tools: EDR (CrowdStrike, SentinelOne), antivirus, system logs
  Detects: Malware execution, suspicious processes, registry changes, file modifications
  Limitation: Agents can be disabled, only sees individual hosts
- Level 3: Application/authentication logs
  Tools: SIEM (Splunk, ELK), authentication logs, application audit trails
  Detects: Failed logins, privilege escalation, unusual access patterns, data queries
  Limitation: Requires centralized log collection and correlation
- Level 4: User behavior analytics (UBA)
  Tools: Machine learning anomaly detection, baseline behavioral profiling
  Detects: Insider threats, compromised accounts behaving abnormally
  Example: Employee who normally accesses 10 files/day suddenly downloads 10,000 → alert
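The UBA example can be approximated with a simple statistical baseline. The sketch below uses a z-score against a user's own history, which is an assumption chosen for brevity; commercial UBA products use richer models, and the function name and threshold are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Flag today's count if it sits more than `threshold` sample standard
    deviations above the user's own baseline. Only upward spikes are flagged."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any increase is unusual.
        return today > mu
    return (today - mu) / sigma > threshold


# Hypothetical baseline: files accessed per day over the past week.
history = [9, 10, 11, 10, 12, 8, 10]
```

With this baseline, a day with 12 file accesses stays within normal variation, while 10,000 is flagged.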
Alert fatigue problem:
- Reality: Security teams receive thousands of alerts daily, 90%+ are false positives
- Consequence: Real attacks buried in noise, analysts become desensitized
- Solutions: Tuning (reduce false positives), automation (tier alerts by severity), threat intelligence (prioritize known bad indicators)
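One way to combine severity tiering with threat intelligence is to score each alert and work the queue from the top. This is a sketch under assumptions: the weights, the `Alert` fields, and the `KNOWN_BAD_IPS` set are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
from dataclasses import dataclass

# Illustrative weights; a real team would tune these.
SEVERITY_WEIGHT = {"critical": 4, "high": 3, "medium": 2, "low": 1}

# Stand-in for a threat intelligence feed of known bad indicators.
KNOWN_BAD_IPS = {"203.0.113.7"}


@dataclass
class Alert:
    name: str
    severity: str
    remote_ip: str


def priority(alert: Alert) -> int:
    """Severity weight, boosted when the remote IP is a known bad indicator."""
    score = SEVERITY_WEIGHT[alert.severity]
    if alert.remote_ip in KNOWN_BAD_IPS:
        score += 5
    return score


def triage(alerts):
    """Return alerts ordered from highest to lowest priority."""
    return sorted(alerts, key=priority, reverse=True)
```

The boost means a medium-severity alert that touches a known C2 address outranks a high-severity alert with no intelligence match.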
Detection indicators (what to monitor):
- Failed authentication attempts (brute force attacks)
- Privilege escalation (user suddenly has admin rights)
- Off-hours access (login at 3 AM from unusual location)
- Large data transfers (exfiltration in progress)
- New scheduled tasks (persistence mechanism)
- Communication with known bad IPs (C2 servers, threat feeds)
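As an example of turning one of these indicators into a rule, the sketch below flags source IPs with repeated failed logins inside a sliding time window. The function name, the threshold of 5 failures, and the 10-minute window are assumptions chosen for illustration; in practice this logic usually lives in a SIEM correlation rule.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def brute_force_sources(failed_logins, max_failures=5, window=timedelta(minutes=10)):
    """Return source IPs with at least `max_failures` failed logins in `window`.

    failed_logins: iterable of (timestamp, source_ip), ordered by timestamp.
    """
    recent = defaultdict(deque)  # per-IP timestamps still inside the window
    flagged = set()
    for ts, ip in failed_logins:
        q = recent[ip]
        q.append(ts)
        # Drop failures that have aged out of the window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= max_failures:
            flagged.add(ip)
    return flagged
```

A slow attacker who spaces attempts beyond the window evades this rule, which is one reason to pair it with the other indicators in the list.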
Lesson 11.4 · Recovery and Business Continuity
Recovery objectives (define acceptable downtime/data loss):
- RTO (Recovery Time Objective): Maximum acceptable downtime
  Example: E-commerce site RTO = 4 hours (every hour down = $100K lost revenue)
  Drives: Redundancy investment (hot standby, failover, geographic distribution)
- RPO (Recovery Point Objective): Maximum acceptable data loss
  Example: Banking transactions RPO = 0 (cannot lose any transactions)
  Drives: Backup frequency (continuous replication, hourly snapshots, daily backups)
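The two objectives reduce to simple comparisons. In the sketch below, the helper names are hypothetical, and the worst-case data loss is assumed to equal the interval between backups (a failure just before the next backup loses everything since the last one).

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the backup interval, so it must not exceed the RPO."""
    return backup_interval <= rpo


def meets_rto(estimated_restore: timedelta, rto: timedelta) -> bool:
    """The estimated time to restore service must fit inside the RTO."""
    return estimated_restore <= rto


def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Lost revenue for an outage at a flat hourly rate."""
    return hours_down * cost_per_hour
```

Using the figures above: an outage that runs to the full 4-hour RTO at $100K per hour costs $400K, and hourly snapshots cannot satisfy an RPO of 0, which is why that requirement drives continuous replication.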
Backup strategy (3-2-1 rule):
- 3 copies of data (production + 2 backups)
- 2 different media (disk + tape, or cloud + local)
- 1 offsite (geographic separation protects against site disasters)
- Addition: Air-gapped/immutable backups (ransomware can't encrypt what it can't reach)
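The rule can be checked mechanically against an inventory of data copies. This is a minimal sketch: the `DataCopy` record and its fields are hypothetical, and the production copy is counted as one of the three, as in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    media: str               # e.g. "disk", "tape", "cloud"
    offsite: bool
    immutable: bool = False  # air-gapped or write-once storage


def satisfies_3_2_1(copies) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def has_immutable_copy(copies) -> bool:
    """The ransomware-era addition: at least one copy attackers cannot modify."""
    return any(c.immutable for c in copies)
```

A setup can pass the 3-2-1 check and still fail the immutability check, so the two are kept as separate functions.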
Recovery validation (don't trust, verify):
- Test backup restoration regularly (untested backups = no backups)
- Verify data integrity (checksums, sample testing)
- Confirm attacker eradication (scan restored systems, monitor for reinfection)
- Phased restoration (critical systems first, watch for anomalies before full recovery)
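Checksum verification can be sketched with Python's standard `hashlib` module. The sketch assumes a manifest of SHA-256 hashes was recorded when the backup was taken; the function names and the in-memory dictionaries are simplifications for illustration (a real check would read files from disk).

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_restore(restored: dict, manifest: dict) -> list:
    """Return names of files that are missing or whose hash differs.

    restored: file name -> restored bytes
    manifest: file name -> expected SHA-256 hex digest from backup time
    """
    failures = []
    for name, expected in manifest.items():
        data = restored.get(name)
        if data is None or sha256_of(data) != expected:
            failures.append(name)
    return failures
```

An empty result means every file in the manifest was restored intact. It says nothing about whether the backup itself predates the compromise, so it complements, rather than replaces, scanning restored systems.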
Post-incident review (blameless postmortem):
- Timeline reconstruction: When did compromise occur? How did attacker move? What was stolen?
- Root cause analysis: What vulnerability was exploited? Why didn't existing controls prevent it?
- Detection gaps: Why did it take X days to detect? What monitoring was missing?
- Response evaluation: What went well? What slowed response? How to improve?
- Remediation plan: Patch vulnerabilities, update playbooks, add monitoring, train staff
Learning from others' incidents:
- Read public breach reports (Verizon DBIR, vendor incident postmortems)
- Subscribe to threat intelligence feeds (know current attacker TTPs)
- Participate in information sharing groups (ISACs for your industry)
- Conduct tabletop exercises (simulate breaches to test response)
Lesson 11.5 · Secure SDLC: Prevent, Detect, Respond
Secure software is not a single tool; it's a workflow. Mature teams assume bugs will exist and build systems that reduce impact, increase visibility, and make recovery fast.
A defensible baseline:
- Threat modeling (identify assets, attackers, boundaries)
- Code review with security invariants (authz, input handling, secrets)
- Dependency hygiene and patching cadence
- Logging, monitoring, and incident response readiness
Week Wrap-Up: Self-Check Questions
Answer in your own words (short paragraphs):
- Where are the trust boundaries in a login flow?
- Why is client-side validation insufficient for security?
- If an attacker gets a valid user session token, what risks follow?
- How should a team decide what to fix first: severity, likelihood, exposure, or impact?
- What does "secure by design" mean for a feature that handles money or personal data?
Lab 11 · Incident Response Playbook
Time estimate: 40-50 minutes
Objective: Create an incident response playbook for a specific attack scenario. You will map the IR lifecycle (detection → containment → eradication → recovery → lessons learned) to concrete actions and decisions.
Task Overview
Choose one incident scenario:
- Ransomware attack: Files encrypted, ransom note displayed
- Account compromise: Executive email account hacked, sending phishing to contacts
- Web application breach: SQL injection, customer database accessed
- Insider threat: Employee downloading sensitive files before resignation
For your chosen scenario, document:
- Detection indicators: What alerts/symptoms would reveal this incident? (Specific log entries, user reports, monitoring alerts)
- Immediate actions (first 30 minutes): Containment steps to stop spread (isolate systems, disable accounts, block C2 domains)
- Investigation (hours 1-4): Scope determination (How many systems affected? What data accessed? When did compromise start?)
- Eradication (hours 4-24): Remove threat (Delete malware, close vulnerability, remove persistence mechanisms)
- Recovery (days 1-3): Restore operations (Rebuild systems, restore backups, verify integrity)
- Post-incident (week after): Lessons learned (Root cause, detection gaps, remediation plan)
Success criteria:
- ✓ Concrete, actionable steps (not vague "investigate the incident")
- ✓ Time-bounded phases (hour 0-1, hours 1-4, etc.)
- ✓ Defined decision points (When to notify customers? Legal? Law enforcement?)
- ✓ Recovery validation (How do you confirm threat is gone?)
- ✓ Lessons learned → defensive improvements
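If you draft your playbook as structured data, the first two success criteria can be checked automatically. The sketch below is optional and purely illustrative: `PlaybookPhase` and `playbook_gaps` are hypothetical names, and hours are counted from the moment of detection.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookPhase:
    name: str
    start_hour: float
    end_hour: float
    actions: list = field(default_factory=list)


def playbook_gaps(phases) -> list:
    """Return a list of problems: phases without actions, empty time
    windows, or phases that start before the previous one ends."""
    problems = []
    for p in phases:
        if not p.actions:
            problems.append(f"{p.name}: no concrete actions")
        if p.end_hour <= p.start_hour:
            problems.append(f"{p.name}: empty time window")
    for prev, cur in zip(phases, phases[1:]):
        if cur.start_hour < prev.end_hour:
            problems.append(f"{cur.name}: starts before {prev.name} ends")
    return problems
```

An empty list means every phase has at least one action and the time windows run in order; it does not judge whether the actions themselves are specific enough.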
Hands-On Labs (Free & Essential)
Practice incident detection, investigation, and response workflows before moving to reading resources.
TryHackMe: Intro to SOC
What you'll do: Explore how Security Operations Centers detect and triage security events.
Why it matters: Incident response starts with detection. SOC workflows are the frontline of response.
Time estimate: 1-1.5 hours
TryHackMe: Incident Response
What you'll do: Walk through a structured incident response scenario from detection to recovery.
Why it matters: Reinforces the NIST IR lifecycle and highlights time-critical decision points.
Time estimate: 1.5-2 hours
PicoCTF Practice: Forensics (Incident Artifacts)
What you'll do: Analyze basic forensic artifacts to practice evidence gathering and timeline thinking.
Why it matters: IR depends on reliable evidence. These challenges build that investigation mindset.
Time estimate: 1-2 hours
Lab Tip: For each lab, note the first signal that indicated "this is an incident." Early detection is the hardest part.
Resources (Free + Authoritative)
Work through these in order. Focus on the IR lifecycle and real-world response strategies.
NIST SP 800-61r2 - Computer Security Incident Handling Guide
What to read: Section 2 (Organizing IR Capability) and Section 3 (IR Lifecycle).
Why it matters: Authoritative government framework for incident response. Used globally as best practice.
Time estimate: 30 minutes (skim structure, focus on lifecycle phases)
Computerphile - Incident Response Explained (Video)
What to watch: Real-world perspective on handling security incidents.
Why it matters: Practical insights from experienced incident responders.
Time estimate: 15 minutes
SANS Incident Handler's Handbook
What to read: Quick reference guide - detection, containment, eradication steps.
Why it matters: Practical checklist for incident responders. Industry-standard reference.
Time estimate: 20 minutes
Verizon DBIR - Incident Timeline Analysis
What to read: Section on "Timeline of Attacks" - how quickly attackers move through kill chain.
Why it matters: Real-world data on dwell time, detection speed, attacker TTPs.
Time estimate: 20 minutes
Weekly Reflection Prompt
Aligned to LO4 (Risk Reasoning) and LO8 (Integration)
Write 200-300 words answering this prompt:
Explain why "assume breach" is a more realistic security posture than "prevent all attacks." Use your Lab 11 incident response playbook as an example.
In your answer, include:
- Why detection speed (MTTD - Mean Time to Detect) matters more than preventing every attack
- The six phases of the incident response lifecycle covered this week (based on NIST SP 800-61)
- Why preparation is critical (you can't plan during a crisis)
- How dwell time (time from compromise to detection) enables attacker lateral movement
- The difference between RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
- Why post-incident reviews matter for preventing future compromises
What good looks like: You demonstrate understanding that perfect prevention is impossible (zero-days exist, humans make mistakes, supply chains are complex). You explain that mature security programs measure success by detection speed and recovery capability, not zero incidents. You connect technical controls (logging, monitoring, backups) to IR phases. You show that incident response is a continuous learning cycle: each incident improves future defenses.