CSY103 Week 10 - Practice regex pattern matching before moving to reading resources.

Opening Framing: The Power of Pattern Matching

Security data is messy. Log files contain thousands of lines with embedded IP addresses, timestamps, usernames, and error codes. Threat reports mix IOCs with prose. Malware samples contain strings that might be URLs, registry keys, or file paths.

Regular expressions (regex) let you describe patterns and find all matches. Instead of writing complex string manipulation code, you write a pattern like \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} and find every string shaped like an IPv4 address in a document.

Regex is notoriously tricky to learn, but incredibly powerful once mastered. This week, we'll build from simple patterns to complex extractions, always focused on security applications.

Key insight: Regex is a declarative language for patterns. Instead of telling Python how to find something step by step, you describe what you're looking for and let the regex engine find it.
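
To make the contrast concrete, here is a minimal sketch that finds the first IPv4-shaped token two ways: once with step-by-step string handling and once with a pattern. The function names find_ip_manually and find_ip_with_regex are hypothetical, chosen only for this illustration.

```python
import re

text = "Failed login from 192.168.1.50 port 22"

def find_ip_manually(text):
    """Imperative approach (hypothetical helper): split, then check each token."""
    for token in text.split():
        parts = token.split(".")
        if len(parts) == 4 and all(p.isdigit() and 1 <= len(p) <= 3 for p in parts):
            return token
    return None

def find_ip_with_regex(text):
    """Declarative approach (hypothetical helper): describe the shape, let re search."""
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", text)
    return match.group() if match else None

print(find_ip_manually(text))
print(find_ip_with_regex(text))
```

Both calls print 192.168.1.50 for this input. The manual version depends on whitespace-separated tokens, so a trailing comma after the address makes it return None, while the pattern still finds the address.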

1) Regex Basics: Literals and Metacharacters

Python's re module provides regex functionality:

import re

# Simple literal match
text = "Failed login from 192.168.1.50"
match = re.search("Failed", text)
if match:
    print(f"Found: {match.group()}")  # "Failed"

# Find all matches
text = "IPs: 10.0.0.1, 10.0.0.2, 10.0.0.3"
matches = re.findall(r"10\.0\.0\.\d", text)
print(matches)  # ['10.0.0.1', '10.0.0.2', '10.0.0.3']

Essential Metacharacters:

. - matches any single character (except newline)
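\d - matches any digit (0-9; with str patterns in Python 3, also other Unicode decimal digits unless re.ASCII is set)
\w - matches word characters (a-z, A-Z, 0-9, _; with str patterns, also Unicode letters and digits unless re.ASCII is set)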
\s - matches whitespace (space, tab, newline)
\. - matches literal dot (escaped)
^ - matches start of string
$ - matches end of string (or just before a trailing newline)

import re

# \d matches digits
text = "Port 443 is open"
print(re.findall(r"\d+", text))  # ['443']

# \w matches word characters
text = "user=admin"
print(re.findall(r"\w+", text))  # ['user', 'admin']

# . matches any character
text = "a1b2c3"
print(re.findall(r".\d", text))  # ['a1', 'b2', 'c3']
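
The anchors and the escaped dot from the list above are not exercised in the examples so far. The following sketch shows them on a made-up log line and a made-up candidate string; both inputs are assumptions for illustration.

```python
import re

line = "sshd: Failed password for root"  # made-up sample line

# ^ anchors the match to the start of the string
starts_with_sshd = re.search(r"^sshd", line) is not None
# $ anchors the match to the end of the string
ends_with_root = re.search(r"root$", line) is not None
# "Failed" appears mid-string, so anchoring it with ^ finds nothing
failed_at_start = re.search(r"^Failed", line) is not None

# An unescaped dot matches any character, so it over-matches
candidates = "10.0.0.1 10a0b0c1"
loose = re.findall(r"10.0.0.1", candidates)
strict = re.findall(r"10\.0\.0\.1", candidates)

print(starts_with_sshd, ends_with_root, failed_at_start)  # True True False
print(loose)   # ['10.0.0.1', '10a0b0c1']
print(strict)  # ['10.0.0.1']
```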

Key insight: Always use raw strings (r"...") for regex patterns. This prevents Python from interpreting backslashes before the regex engine sees them.
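
A short sketch of what goes wrong without a raw string, using \b as the example. In a normal Python string, \b is the backspace character; in a regex, \b is the word-boundary assertion. The sample text is an assumption for illustration.

```python
import re

text = "port 22 open"

plain = "\bport\b"   # Python converts each \b to a backspace character (\x08)
raw = r"\bport\b"    # the regex engine receives \b, its word-boundary assertion

print(len(plain), len(raw))          # 6 8
print(re.search(plain, text))        # None: the text contains no backspace characters
print(re.search(raw, text).group())  # port
```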

2) Quantifiers: How Many?

Quantifiers specify how many times a pattern should match:

import re

text = "aaa ab abbb a"

# * - zero or more
print(re.findall(r"ab*", text))   # ['a', 'a', 'a', 'ab', 'abbb', 'a']

# + - one or more
print(re.findall(r"ab+", text))   # ['ab', 'abbb']

# ? - zero or one
print(re.findall(r"ab?", text))   # ['a', 'a', 'a', 'ab', 'ab', 'a']

# {n} - exactly n times
print(re.findall(r"ab{2}", text)) # ['abb']

# {n,m} - between n and m times
print(re.findall(r"ab{1,3}", text)) # ['ab', 'abbb']

Security Application: IP Address Pattern

import re

# IP address pattern breakdown:
# \d{1,3} - 1 to 3 digits
# \. - literal dot (escaped)
# Repeated 4 times

ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

log = """
Connection from 192.168.1.50 to 10.0.0.25
Blocked: 203.0.113.100
Internal: 172.16.0.1
"""

ips = re.findall(ip_pattern, log)
print(ips)  # ['192.168.1.50', '10.0.0.25', '203.0.113.100', '172.16.0.1']
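
Note that \d{1,3} accepts any one to three digits, so the pattern also matches out-of-range strings such as 999.300.1.1. One possible way to tighten the results, sketched here as an assumption rather than the course's prescribed method, is to treat regex matches as candidates and validate each with the standard library's ipaddress module. The helper name valid_ipv4s and the sample text are hypothetical.

```python
import ipaddress
import re

# Candidate pattern; \b word boundaries are added here as an assumption
IP_CANDIDATE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def valid_ipv4s(text):
    """Return regex candidates that also parse as real IPv4 addresses."""
    valid = []
    for candidate in IP_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if out of range
        except ValueError:
            continue
        valid.append(candidate)
    return valid

sample = "ok 192.168.1.50, bad 999.300.1.1, version 1.2.3.4"
print(valid_ipv4s(sample))  # ['192.168.1.50', '1.2.3.4']
```

The version string 1.2.3.4 still passes because it is a syntactically valid address. Validation removes impossible values, but it cannot tell whether a match was meant as an IP address.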

Greedy vs. Non-Greedy:

import re

html = "<div>content</div><div>more</div>"

# Greedy (default) - matches as much as possible
print(re.findall(r"<div>.*</div>", html))
# ['<div>content</div><div>more</div>']

# Non-greedy (?) - matches as little as possible
print(re.findall(r"<div>.*?</div>", html))
# ['<div>content</div>', '<div>more</div>']

Key insight: Quantifiers are greedy by default—they match as much as possible. Add ? after a quantifier to make it non-greedy.

3) Character Classes and Groups

Character classes match one character from a set. Groups capture portions of the match:

import re

# Character class: [abc] matches a, b, or c
text = "cat bat rat hat"
print(re.findall(r"[cbr]at", text))  # ['cat', 'bat', 'rat']

# Range: [a-z] matches any lowercase letter
print(re.findall(r"[a-z]+", "Hello World"))  # ['ello', 'orld']

# Negation: [^abc] matches anything EXCEPT a, b, c
print(re.findall(r"[^aeiou]+", "hello"))  # ['h', 'll']

# Common character classes
# [0-9] same as \d
# [a-zA-Z0-9_] same as \w
# [0-9a-fA-F] matches hexadecimal

Capturing Groups:

import re

# Parentheses create capturing groups
log = "user=admin action=login status=success"

# Extract key-value pairs
pattern = r"(\w+)=(\w+)"
matches = re.findall(pattern, log)
print(matches)  # [('user', 'admin'), ('action', 'login'), ('status', 'success')]

# Named groups for clarity
pattern = r"(?P<key>\w+)=(?P<value>\w+)"
for match in re.finditer(pattern, log):
    print(f"{match.group('key')}: {match.group('value')}")

Security Application: Log Parsing

import re

log_line = "2024-01-15 09:23:45 WARN Failed login user=admin src=192.168.1.50"

# Extract structured data with groups
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)"
match = re.match(pattern, log_line)

if match:
    date, time, level, message = match.groups()
    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {level}")
    print(f"Message: {message}")

Key insight: Groups let you extract specific parts of a match. Named groups (?P<name>) make your patterns self-documenting.
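
As a sketch of the self-documenting point, the log-parsing pattern above can be rewritten with named groups so that the match converts directly to a dictionary via groupdict(). The group names and the parse_line helper are assumptions chosen for this illustration.

```python
import re

# Same structure as the log-parsing pattern, with hypothetical group names
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.+)"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

record = parse_line("2024-01-15 09:23:45 WARN Failed login user=admin src=192.168.1.50")
print(record["level"])           # WARN
print(record["message"])         # Failed login user=admin src=192.168.1.50
print(parse_line("not a log"))   # None
```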

4) Common Security Patterns

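Here are commonly used starting-point patterns for security data extraction. Each is deliberately basic: the IPv4 pattern, for example, also matches out-of-range values such as 999.999.999.999.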

import re

# IPv4 Address (basic)
ip_pattern = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

# MD5 Hash (32 hex characters)
md5_pattern = r"\b[a-fA-F0-9]{32}\b"

# SHA256 Hash (64 hex characters)
sha256_pattern = r"\b[a-fA-F0-9]{64}\b"

# Email Address (simplified)
email_pattern = r"\b[\w.-]+@[\w.-]+\.\w+\b"

# URL (basic)
url_pattern = r"https?://[^\s]+"

# Domain (basic)
domain_pattern = r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"

Testing Patterns:

import re

def extract_iocs(text):
    """Extract common IOC types from text."""
    iocs = {
        "ips": re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", text),
        "md5s": re.findall(r"\b[a-fA-F0-9]{32}\b", text),
        "sha256s": re.findall(r"\b[a-fA-F0-9]{64}\b", text),
        "urls": re.findall(r"https?://[^\s\"'<>]+", text),
        "emails": re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", text)
    }
    return iocs

# Test
sample = """
The malware connected to 192.168.1.50 and 10.0.0.25.
Hash: 5d41402abc4b2a76b9719d911017c592
C2 server: https://evil.com/beacon
Contact: attacker@malware.net
"""

results = extract_iocs(sample)
for ioc_type, values in results.items():
    if values:
        print(f"{ioc_type}: {values}")

Key insight: Build a library of tested patterns. IOC extraction patterns are reusable across many security scripts and tools.
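
One way to keep such a library "tested" is to pair each pattern with inputs that should and should not match. This is a minimal sketch, not a full test suite; the PATTERNS and CASES names and the run_cases helper are hypothetical.

```python
import re

# Two patterns from this section, stored under hypothetical names
PATTERNS = {
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}

# (pattern name, input text, should it match?)
CASES = [
    ("md5", "5d41402abc4b2a76b9719d911017c592", True),
    ("md5", "5d41402abc4b2a76b9719d911017c59", False),  # 31 characters: too short
    ("md5", "a" * 64, False),                           # a SHA256-length string is not an MD5
    ("sha256", "a" * 64, True),
    ("sha256", "a" * 63, False),
]

def run_cases():
    """Return the cases whose actual result differs from the expected one."""
    failures = []
    for name, text, expected in CASES:
        actual = PATTERNS[name].search(text) is not None
        if actual != expected:
            failures.append((name, text))
    return failures

failures = run_cases()
print("all cases pass" if not failures else failures)
```

The negative cases matter as much as the positive ones: they show that the \b boundaries stop the MD5 pattern from matching inside a longer hex string.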

5) Search, Match, Sub, and Split

The re module provides several functions for different use cases:

import re

text = "Error 404: Page not found at 10:45:30"

# re.search() - find first match anywhere
match = re.search(r"\d+", text)
print(match.group())  # "404"

# re.match() - match at START of string only
match = re.match(r"\d+", text)
print(match)  # None (doesn't start with digit)

# re.findall() - find ALL matches (returns list)
matches = re.findall(r"\d+", text)
print(matches)  # ['404', '10', '45', '30']

# re.finditer() - find all matches (returns iterator)
for match in re.finditer(r"\d+", text):
    print(f"Found '{match.group()}' at position {match.start()}")

Substitution and Splitting:

import re

# re.sub() - replace matches
log = "User password is s3cr3t123"
redacted = re.sub(r"password is \S+", "password is [REDACTED]", log)
print(redacted)  # "User password is [REDACTED]"

# Defang IOCs for safe sharing
ioc = "http://malware.com/beacon"
defanged = re.sub(r"^http", "hxxp", ioc)  # ^ limits the change to the scheme
defanged = re.sub(r"\.", "[.]", defanged)
print(defanged)  # "hxxp://malware[.]com/beacon"

# re.split() - split on pattern
log = "error;warning;info;debug"
levels = re.split(r"[;,|]", log)
print(levels)  # ['error', 'warning', 'info', 'debug']
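
re.sub() can also reuse captured groups in the replacement through backreferences such as \1. The sketch below redacts the value of selected key=value fields while keeping the key; the field names, the sample line, and the redact helper are assumptions for illustration.

```python
import re

# Hypothetical list of sensitive field names; group 1 captures the key
SECRET_FIELD = re.compile(r"\b(password|token|api_key)=\S+")

def redact(line):
    """Replace the value of each sensitive field, keeping the key via \\1."""
    return SECRET_FIELD.sub(r"\1=[REDACTED]", line)

line = "user=admin password=s3cr3t123 token=abc123 action=login"
print(redact(line))
# user=admin password=[REDACTED] token=[REDACTED] action=login
```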

Compiled Patterns for Performance:

import re

# Compile pattern once, use many times
ip_regex = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

logs = ["Connection from 10.0.0.1", "Request to 192.168.1.50", "No IP here"]

for log in logs:
    matches = ip_regex.findall(log)
    if matches:
        print(f"Found: {matches}")

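Key insight: Compile patterns you use repeatedly. The pattern is parsed once and stored under a reusable name, which keeps hot loops clean. Python's re module also caches recently used patterns internally, so the speedup is often modest, but compiling makes the reuse explicit.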

Real-World Context: Regex in Security Tools

Regular expressions power security tools across the industry:

YARA Rules: YARA, the pattern-matching tool for malware research, uses regex extensively. String definitions like $url = /https?:\/\/[^\s]+/ match URL strings embedded in binaries.

Snort/Suricata: IDS rules use PCRE (Perl Compatible Regular Expressions) to match network traffic patterns for detecting attacks.

Log Analysis: SIEM queries often include regex for pattern matching. Splunk's rex command and ELK's Grok patterns are regex-based extraction tools.

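MITRE ATT&CK Reference: Technique T1027 (Obfuscated Files or Information) describes how adversaries encode, encrypt, or otherwise obfuscate content to evade analysis. Regex helps surface obfuscation artifacts, such as long encoded strings, in logs, files, and network traffic.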

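Key insight: Regex is a transferable skill. The core concepts carry over to Python, grep, sed, YARA, Snort, and most security tools, though each tool's syntax differs in places (for example, POSIX grep and sed do not support \d by default).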

Guided Lab: IOC Extractor Tool

Let's build a comprehensive IOC extraction tool using regex.

Step 1: Create Sample Threat Report

Create threat_report.txt:

THREAT INTELLIGENCE REPORT
Date: 2024-01-15

Indicators of Compromise:
- C2 Server: 203.0.113.50 (primary)
- Backup C2: 198.51.100.25
- Exfil endpoint: https://data-collector.evil.com/upload
- Dropper MD5: 5d41402abc4b2a76b9719d911017c592
- Payload MD5: 098f6bcd4621d373cade4e832627b4f6
- Sender: phishing@fake-bank.com
- File: C:\Users\Public\Documents\update.exe

Step 2: Create the IOC Extractor

Create ioc_extractor.py:

import re
import json

class IOCExtractor:
    def __init__(self):
        self.patterns = {
            "ipv4": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
            "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
            "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
            "url": re.compile(r"https?://[^\s\"'<>]+"),
            "email": re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")
        }
    
    def extract(self, text):
        results = {}
        for ioc_type, pattern in self.patterns.items():
            matches = sorted(set(pattern.findall(text)))  # de-duplicate, deterministic order
            if matches:
                results[ioc_type] = matches
        return results
    
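    def defang(self, ioc, ioc_type):
        """Defang a URL or IPv4 IOC; other types are returned unchanged."""
        if ioc_type in ["url", "ipv4"]:
            ioc = re.sub(r"\.", "[.]", ioc)
        if ioc_type == "url":
            # Anchor with ^ so only the scheme is rewritten, not "http" later in the URL
            ioc = re.sub(r"^http", "hxxp", ioc)
        return ioc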

if __name__ == "__main__":
    extractor = IOCExtractor()
    
    with open("threat_report.txt", "r") as f:
        text = f.read()
    
    iocs = extractor.extract(text)
    
    print("Extracted IOCs:")
    for ioc_type, values in iocs.items():
        print(f"\n{ioc_type.upper()}:")
        for v in values:
            print(f"  {v}")
    
    with open("extracted_iocs.json", "w") as f:
        json.dump(iocs, f, indent=2)
    print("\nSaved to extracted_iocs.json")

Step 3: Run and Analyze

Run python3 ioc_extractor.py and examine the extracted IOCs.

Step 4: Reflection (mandatory)

Why do we compile regex patterns in the __init__ method?
How does defanging make IOCs safe to share?
What IOC types might the current patterns miss?
How would you add IPv6 address extraction?

Week 10 Outcome Check

By the end of this week, you should be able to:

Write basic regex patterns with metacharacters
Use quantifiers for flexible matching
Apply character classes and groups
Extract common security IOCs with regex
Use search, findall, sub, and split effectively
Compile patterns for performance

Next week: Security Libraries—where we leverage Python's security ecosystem for hashing, encoding, and more.

🎯 Hands-On Labs (Free & Essential)

Practice regex pattern matching before moving to reading resources.

🎮 TryHackMe: Python Basics (Regex)

What you'll do: Use regex to extract patterns from logs and text.
Why it matters: Regex turns noisy data into structured evidence.
Time estimate: 1-1.5 hours

📝 Lab Exercise: IOC Regex Extractor

Task: Write regexes to extract IPs, hashes, and URLs from a sample log file.
Deliverable: Script that outputs lists of each IOC type.
Why it matters: IOC extraction is a core triage task.
Time estimate: 60-90 minutes

🏁 PicoCTF Practice: General Skills (Regex)

What you'll do: Solve beginner challenges that rely on pattern matching.
Why it matters: Regex is essential for detecting indicators in text.
Time estimate: 1-2 hours

💡 Lab Tip: Start with simple patterns and test incrementally to avoid over-matching.

🛡️ Secure Coding: Regex Safety

Regex is powerful, but unsafe patterns can cause performance issues or allow bypasses in security checks.

Regex safety checklist:
- Anchor patterns with ^ and $ where possible
- Avoid catastrophic backtracking (nested quantifiers)
- Limit input length before matching
- Prefer explicit character classes over wildcards
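
A minimal sketch of the checklist applied to input validation. It assumes a hypothetical username rule (3 to 32 lowercase letters, digits, or underscores) and a hypothetical length cap. It also shows why ^...$ alone can be bypassed in Python: $ matches just before a trailing newline, whereas re.fullmatch() requires the whole string to match.

```python
import re

MAX_INPUT_LENGTH = 64  # hypothetical cap, applied before any matching

# Explicit character class with a bounded quantifier. Nested quantifiers
# such as (a+)+ are avoided because they can backtrack catastrophically.
USERNAME = re.compile(r"[a-z0-9_]{3,32}")

def is_valid_username(value):
    """Validate the whole string, rejecting oversized input first."""
    if len(value) > MAX_INPUT_LENGTH:
        return False
    return USERNAME.fullmatch(value) is not None

# $ matches before a trailing newline, so this looser check accepts "admin\n"
loose_check = re.search(r"^[a-z0-9_]{3,32}$", "admin\n") is not None

print(loose_check)                   # True
print(is_valid_username("admin\n"))  # False
print(is_valid_username("admin"))    # True
```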

📚 Building on CSY101 Week-13: Model regex abuse as a denial-of-service vector.

Resources

Complete the required resources to build your foundation.

Python Docs - Regular Expression Operations · 45-60 min · 50 XP · Resource ID: csy103_w10_r1 (Required)
Real Python - Regular Expressions in Python · 60-90 min · 50 XP · Resource ID: csy103_w10_r2 (Required)
Regex101 - Online Regex Tester · Interactive tool · 25 XP · Resource ID: csy103_w10_r3 (Optional)

Lab: Log Pattern Matcher

Goal: Build a tool that extracts specific patterns from various log formats.

Linux/Windows Path (same for both)

Create log_pattern_matcher.py
Implement regex patterns for:
- Apache/Nginx access log entries (IP, timestamp, request, status)
- SSH auth.log failed login attempts
- Windows Event Log format (basic)
Create sample log files for each format
Extract structured data into dictionaries
Generate summary statistics (top IPs, status codes, etc.)
Handle malformed log entries gracefully

Deliverable (submit):

Your log_pattern_matcher.py script
Sample log files used for testing
One paragraph: Why is regex preferred over string methods for log parsing?

Checkpoint Questions

What does \d{1,3} match?
What is the difference between * and + quantifiers?
How do you match a literal dot in a regex pattern?
What is a capturing group and how do you create one?
When should you use re.search() vs re.match()?
Why compile regex patterns for repeated use?

Weekly Reflection

Reflection Prompt (200-300 words):

This week you learned regular expressions—a powerful but complex tool for pattern matching. Regex enables precise extraction from messy, unstructured security data.

Reflect on these questions:

What made regex challenging to learn? What strategies helped?
How does IOC extraction with regex compare to manual searching through threat reports?
Where else might you apply regex in security work beyond IOC extraction?
What are the limitations of regex-based IOC extraction?

A strong reflection will honestly assess the learning curve while recognizing the power regex provides for security data processing.

Verified Resources & Videos

Regex Reference: Python Docs - Regular Expression HOWTO
YARA Rules: YARA Documentation
Security perspective (MITRE ATT&CK): MITRE ATT&CK — Obfuscated Files or Information (T1027)

Regex is a skill that transfers everywhere—Python, grep, sed, YARA, SIEM queries, and more. The patterns you learned this week will serve you throughout your security career. Next week: leveraging security libraries.