Opening Framing: The Power of Pattern Matching
Security data is messy. Log files contain thousands of lines with embedded IP addresses, timestamps, usernames, and error codes. Threat reports mix IOCs with prose. Malware samples contain strings that might be URLs, registry keys, or file paths.
Regular expressions (regex) let you describe patterns and find all matches. Instead of writing complex string manipulation code, you write a pattern like \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} and find every string shaped like an IPv4 address in a document.
Regex is notoriously tricky to learn, but incredibly powerful once mastered. This week, we'll build from simple patterns to complex extractions, always focused on security applications.
Key insight: Regex is a declarative language for patterns. Instead of telling Python how to find something step by step, you describe what you're looking for and let the regex engine find it.
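To make the declarative idea concrete, here is a minimal sketch that finds runs of digits two ways. The function names and the sample line are made up for illustration; the point is only that the regex version states what to find, while the loop spells out how.

```python
import re

def find_numbers_manually(text):
    """Imperative approach: walk the string and collect runs of digits."""
    numbers, current = [], ""
    for ch in text:
        if ch.isdigit():
            current += ch
        elif current:
            numbers.append(current)
            current = ""
    if current:
        numbers.append(current)
    return numbers

def find_numbers_with_regex(text):
    """Declarative approach: describe 'one or more digits' and let the engine search."""
    return re.findall(r"\d+", text)

line = "Blocked 3 attempts on port 22 from host 7"
print(find_numbers_manually(line))    # ['3', '22', '7']
print(find_numbers_with_regex(line))  # ['3', '22', '7']
```

Both functions return the same list for this ASCII input, but only the loop has to track state by hand.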
1) Regex Basics: Literals and Metacharacters
Python's re module provides regex functionality:
import re
# Simple literal match
text = "Failed login from 192.168.1.50"
match = re.search("Failed", text)
if match:
    print(f"Found: {match.group()}") # "Failed"
# Find all matches
text = "IPs: 10.0.0.1, 10.0.0.2, 10.0.0.3"
matches = re.findall(r"10\.0\.0\.\d", text)
print(matches) # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
Essential Metacharacters:
. - matches any single character (except newline)
\d - matches any digit (0-9)
\w - matches word characters (a-z, A-Z, 0-9, _)
\s - matches whitespace (space, tab, newline)
\. - matches a literal dot (escaped)
^ - matches start of string
$ - matches end of string
import re
# \d matches digits
text = "Port 443 is open"
print(re.findall(r"\d+", text)) # ['443']
# \w matches word characters
text = "user=admin"
print(re.findall(r"\w+", text)) # ['user', 'admin']
# . matches any character
text = "a1b2c3"
print(re.findall(r".\d", text)) # ['a1', 'b2', 'c3']
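The remaining metacharacters from the list (\s, \., ^ and $) can be sketched the same way. The sample strings below are invented for illustration, and using ^...$ to check a whole value is one common idiom rather than the only option.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string,
# so the whole value must consist of digits.
port_pattern = r"^\d+$"
print(bool(re.search(port_pattern, "443")))       # True
print(bool(re.search(port_pattern, "443 open")))  # False

# \s matches each whitespace character: space, tab, newline.
print(re.findall(r"\s", "a b\tc\n"))  # [' ', '\t', '\n']

# An unescaped dot matches any character, so it can over-match.
candidates = ["10.0.0.1", "10a0b0c1"]
loose = [c for c in candidates if re.search(r"^10.0.0.1$", c)]
strict = [c for c in candidates if re.search(r"^10\.0\.0\.1$", c)]
print(loose)   # ['10.0.0.1', '10a0b0c1']
print(strict)  # ['10.0.0.1']
```

The loose pattern accepts "10a0b0c1" because each unescaped dot is allowed to match a letter.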
Key insight: Always use raw strings (r"...") for regex
patterns. This prevents Python from interpreting backslashes before
the regex engine sees them.
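A small sketch of what goes wrong without the r prefix, using \b (word boundary in regex, backspace in an ordinary Python string). The sample text is made up for illustration.

```python
import re

text = "a cat sat"

# Without the r prefix, Python turns "\b" into a single backspace
# character (0x08) before the regex engine ever sees the pattern.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))          # 5 7
print(re.search(plain, text))        # None
print(re.search(raw, text).group())  # cat
```

The non-raw pattern looks for a literal backspace on either side of "cat", so it silently finds nothing.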
2) Quantifiers: How Many?
Quantifiers specify how many times a pattern should match:
import re
text = "aaa ab abbb a"
# * - zero or more
print(re.findall(r"ab*", text)) # ['a', 'a', 'a', 'ab', 'abbb', 'a']
# + - one or more
print(re.findall(r"ab+", text)) # ['ab', 'abbb']
# ? - zero or one
print(re.findall(r"ab?", text)) # ['a', 'a', 'a', 'ab', 'ab', 'a']
# {n} - exactly n times
print(re.findall(r"ab{2}", text)) # ['abb']
# {n,m} - between n and m times
print(re.findall(r"ab{1,3}", text)) # ['ab', 'abbb']
Security Application: IP Address Pattern
import re
# IP address pattern breakdown:
# \d{1,3} - 1 to 3 digits
# \. - literal dot (escaped)
# Repeated 4 times
ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
log = """
Connection from 192.168.1.50 to 10.0.0.25
Blocked: 203.0.113.100
Internal: 172.16.0.1
"""
ips = re.findall(ip_pattern, log)
print(ips) # ['192.168.1.50', '10.0.0.25', '203.0.113.100', '172.16.0.1']
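Note that \d{1,3} accepts any one to three digits, so this pattern also matches out-of-range values such as 999.300.1.1. One possible way to tighten the result, sketched here, is to keep the simple regex and filter its candidates with the standard library's ipaddress module. The function name and the sample log are illustrative assumptions.

```python
import ipaddress
import re

ip_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

def extract_valid_ipv4(text):
    """Return regex matches that the ipaddress module accepts as IPv4 addresses."""
    valid = []
    for candidate in re.findall(ip_pattern, text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # for example, an octet above 255
        valid.append(candidate)
    return valid

log = "Connection from 192.168.1.50, bogus value 999.300.1.1"
print(re.findall(ip_pattern, log))  # ['192.168.1.50', '999.300.1.1']
print(extract_valid_ipv4(log))      # ['192.168.1.50']
```

Keeping the regex loose and validating afterwards is often easier to read than encoding the 0-255 range in the pattern itself.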
Greedy vs. Non-Greedy:
import re
html = "<div>content</div><div>more</div>"
# Greedy (default) - matches as much as possible
print(re.findall(r"<div>.*</div>", html))
# ['<div>content</div><div>more</div>']
# Non-greedy (?) - matches as little as possible
print(re.findall(r"<div>.*?</div>", html))
# ['<div>content</div>', '<div>more</div>']
Key insight: Quantifiers are greedy by default: they match as much as
possible. Add ? after a quantifier (for example *? or +?) to make it non-greedy.
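The same effect shows up when pulling quoted fields out of a log line. The sketch below uses an invented request line and compares the greedy form, the non-greedy form, and a negated character class, which is a common alternative to the non-greedy quantifier.

```python
import re

line = 'GET "/login" agent="curl/8.0" referer="-"'

# Greedy: .* runs to the LAST quote on the line.
greedy = re.findall(r'"(.*)"', line)
# Non-greedy: .*? stops at the first closing quote.
lazy = re.findall(r'"(.*?)"', line)
# Negated class: "anything that is not a quote" cannot run past one.
negated = re.findall(r'"([^"]*)"', line)

print(greedy)   # ['/login" agent="curl/8.0" referer="-']
print(lazy)     # ['/login', 'curl/8.0', '-']
print(negated)  # ['/login', 'curl/8.0', '-']
```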
3) Character Classes and Groups
Character classes match one character from a set. Groups capture portions of the match:
import re
# Character class: [abc] matches a, b, or c
text = "cat bat rat hat"
print(re.findall(r"[cbr]at", text)) # ['cat', 'bat', 'rat']
# Range: [a-z] matches any lowercase letter
print(re.findall(r"[a-z]+", "Hello World")) # ['ello', 'orld']
# Negation: [^abc] matches anything EXCEPT a, b, c
print(re.findall(r"[^aeiou]+", "hello")) # ['h', 'll']
# Common character classes
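# [0-9] same as \d for ASCII digits (\d also matches other Unicode digits in str patterns)
# [a-zA-Z0-9_] same as \w for ASCII text (\w also matches Unicode letters in str patterns)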
# [0-9a-fA-F] matches hexadecimal
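As a possible use of the hexadecimal class, the sketch below pulls \xNN escape sequences out of an obfuscated string and decodes them. The sample string is invented for illustration.

```python
import re

# Raw string so the backslashes stay literal in the sample data.
sample = r"cmd=\x77\x68\x6f\x61\x6d\x69 junk=\xZZ"

# \\x matches a literal backslash followed by x; the class limits
# the next two characters to hexadecimal digits.
hex_escapes = re.findall(r"\\x[0-9a-fA-F]{2}", sample)

# Drop the leading \x from each match and decode the remaining hex pairs.
decoded = bytes.fromhex("".join(e[2:] for e in hex_escapes)).decode("ascii")

print(len(hex_escapes))  # 6
print(decoded)           # whoami
```

The \xZZ fragment is skipped because Z is outside the character class.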
Capturing Groups:
import re
# Parentheses create capturing groups
log = "user=admin action=login status=success"
# Extract key-value pairs
pattern = r"(\w+)=(\w+)"
matches = re.findall(pattern, log)
print(matches) # [('user', 'admin'), ('action', 'login'), ('status', 'success')]
# Named groups for clarity
pattern = r"(?P<key>\w+)=(?P<value>\w+)"
for match in re.finditer(pattern, log):
    print(f"{match.group('key')}: {match.group('value')}")
Security Application: Log Parsing
import re
log_line = "2024-01-15 09:23:45 WARN Failed login user=admin src=192.168.1.50"
# Extract structured data with groups
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)"
match = re.match(pattern, log_line)
if match:
    date, time, level, message = match.groups()
    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Level: {level}")
    print(f"Message: {message}")
Key insight: Groups let you extract specific parts of a match. Named
groups, written (?P<name>...), make your patterns self-documenting.
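Named groups also combine well with the match object's groupdict() method, which returns the captured fields as a dictionary. The sketch below rewrites the log-parsing pattern with named groups; the function name parse_log_line and the group names are illustrative choices.

```python
import re

# Adjacent string literals are joined by Python into one pattern.
LOG_PATTERN = (
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.+)"
)

def parse_log_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = re.match(LOG_PATTERN, line)
    return match.groupdict() if match else None

record = parse_log_line(
    "2024-01-15 09:23:45 WARN Failed login user=admin src=192.168.1.50"
)
print(record["level"])                   # WARN
print(record["message"])                 # Failed login user=admin src=192.168.1.50
print(parse_log_line("not a log line"))  # None
```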
4) Common Security Patterns
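Here are commonly used starting patterns for security data extraction. They are deliberately basic: the IPv4 pattern, for example, also matches out-of-range values such as 999.999.999.999.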
import re
# IPv4 Address (basic)
ip_pattern = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
# MD5 Hash (32 hex characters)
md5_pattern = r"\b[a-fA-F0-9]{32}\b"
# SHA256 Hash (64 hex characters)
sha256_pattern = r"\b[a-fA-F0-9]{64}\b"
# Email Address (simplified)
email_pattern = r"\b[\w.-]+@[\w.-]+\.\w+\b"
# URL (basic)
url_pattern = r"https?://[^\s]+"
# Domain (basic)
domain_pattern = r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"
Testing Patterns:
import re
def extract_iocs(text):
    """Extract common IOC types from text."""
    iocs = {
        "ips": re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", text),
        "md5s": re.findall(r"\b[a-fA-F0-9]{32}\b", text),
        "sha256s": re.findall(r"\b[a-fA-F0-9]{64}\b", text),
        "urls": re.findall(r"https?://[^\s\"'<>]+", text),
        "emails": re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", text)
    }
    return iocs
# Test
sample = """
The malware connected to 192.168.1.50 and 10.0.0.25.
Hash: 5d41402abc4b2a76b9719d911017c592
C2 server: https://evil.com/beacon
Contact: attacker@malware.net
"""
results = extract_iocs(sample)
for ioc_type, values in results.items():
    if values:
        print(f"{ioc_type}: {values}")
Key insight: Build a library of tested patterns. IOC extraction patterns are reusable across many security scripts and tools.
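One way to keep such a library trustworthy is to store each pattern next to strings it should and should not match, and re-check them whenever a pattern changes. The sketch below is an assumption about how that could look; IOC_PATTERNS, TEST_CASES, and check_patterns are hypothetical names, and the sample values are illustrative.

```python
import re

# Hypothetical pattern library; names are illustrative.
IOC_PATTERNS = {
    "ipv4": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    "md5": r"\b[a-fA-F0-9]{32}\b",
    "sha256": r"\b[a-fA-F0-9]{64}\b",
}

MD5_SAMPLE = "5d41402abc4b2a76b9719d911017c592"
SHA256_SAMPLE = "a" * 64  # placeholder value with the right length

# For each pattern: (strings that should match, strings that should not).
TEST_CASES = {
    "ipv4": (["10.0.0.25"], ["10.0.0"]),
    "md5": ([MD5_SAMPLE], ["5d41402abc4b2a76", SHA256_SAMPLE]),
    "sha256": ([SHA256_SAMPLE], [MD5_SAMPLE]),
}

def check_patterns():
    """Return a list of (pattern name, string, problem) for every failed case."""
    failures = []
    for name, (should_match, should_not_match) in TEST_CASES.items():
        pattern = IOC_PATTERNS[name]
        for s in should_match:
            if not re.search(pattern, s):
                failures.append((name, s, "expected a match"))
        for s in should_not_match:
            if re.search(pattern, s):
                failures.append((name, s, "expected no match"))
    return failures

print(check_patterns())  # []
```

The \b anchors are what stop the MD5 pattern from matching the first 32 characters of a longer hash.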
5) Search, Match, Sub, and Split
The re module provides several functions for different use cases:
import re
text = "Error 404: Page not found at 10:45:30"
# re.search() - find first match anywhere
match = re.search(r"\d+", text)
print(match.group()) # "404"
# re.match() - match at START of string only
match = re.match(r"\d+", text)
print(match) # None (doesn't start with digit)
# re.findall() - find ALL matches (returns list)
matches = re.findall(r"\d+", text)
print(matches) # ['404', '10', '45', '30']
# re.finditer() - find all matches (returns iterator)
for match in re.finditer(r"\d+", text):
    print(f"Found '{match.group()}' at position {match.start()}")
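The difference between these functions matters most when validating input. The sketch below also uses re.fullmatch(), a related function from the same module that requires the entire string to match; the sample value is made up for illustration.

```python
import re

value = "443; rm -rf /"

# search() and match() both succeed because the string starts with digits.
print(re.search(r"\d+", value).group())  # 443
print(re.match(r"\d+", value).group())   # 443

# fullmatch() only succeeds if the whole string fits the pattern.
print(re.fullmatch(r"\d+", value))          # None
print(re.fullmatch(r"\d+", "443").group())  # 443
```

For "is this value a port number?" style checks, fullmatch() (or a pattern anchored with ^ and $) is usually the safer choice.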
Substitution and Splitting:
import re
# re.sub() - replace matches
log = "User password is s3cr3t123"
redacted = re.sub(r"password is \S+", "password is [REDACTED]", log)
print(redacted) # "User password is [REDACTED]"
# Defang IOCs for safe sharing
ioc = "http://malware.com/beacon"
defanged = re.sub(r"http", "hxxp", ioc)
defanged = re.sub(r"\.", "[.]", defanged)
print(defanged) # "hxxp://malware[.]com/beacon"
# re.split() - split on pattern
log = "error;warning;info;debug"
levels = re.split(r"[;,|]", log)
print(levels) # ['error', 'warning', 'info', 'debug']
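The defanging steps can be wrapped into a pair of helpers so indicators can be made safe for sharing and restored later. This is a minimal sketch; defang and refang are hypothetical helper names, and real-world defanging conventions vary between teams.

```python
import re

def defang(ioc):
    """Make a URL or IP non-clickable for safe sharing."""
    # ^ restricts the replacement to the scheme at the start of the string.
    ioc = re.sub(r"^http", "hxxp", ioc, flags=re.IGNORECASE)
    return re.sub(r"\.", "[.]", ioc)

def refang(ioc):
    """Reverse defang() so the indicator can be fed back into tools."""
    ioc = re.sub(r"^hxxp", "http", ioc, flags=re.IGNORECASE)
    # Both brackets and the dot need escaping to be matched literally.
    return re.sub(r"\[\.\]", ".", ioc)

url = "https://evil.com/beacon.php"
safe = defang(url)
print(safe)                   # hxxps://evil[.]com/beacon[.]php
print(refang(safe) == url)    # True
print(defang("10.0.0.25"))    # 10[.]0[.]0[.]25
```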
Compiled Patterns for Performance:
import re
# Compile pattern once, use many times
ip_regex = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
logs = ["Connection from 10.0.0.1", "Request to 192.168.1.50", "No IP here"]
for log in logs:
    matches = ip_regex.findall(log)
    if matches:
        print(f"Found: {matches}")
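Key insight: Compile patterns you use repeatedly. The pattern is parsed once and stored under a descriptive name. Python's re module also caches recently used patterns internally, so the speed gain is usually modest; the clearer, reusable code is often the bigger benefit.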
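A compiled pattern is convenient when the same search runs over many lines. The sketch below counts how often each IP-shaped string appears across a list of log lines; the function name and sample logs are assumptions for illustration.

```python
import re
from collections import Counter

# Compiled once at module level, reused for every line.
IP_REGEX = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def count_ips(lines):
    """Count occurrences of each IP-shaped string across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(IP_REGEX.findall(line))
    return counts

logs = [
    "Failed login from 10.0.0.1",
    "Failed login from 10.0.0.1",
    "Request to 192.168.1.50 from 10.0.0.1",
    "No IP here",
]
print(count_ips(logs).most_common(1))  # [('10.0.0.1', 3)]
```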
Real-World Context: Regex in Security Tools
Regular expressions power security tools across the industry:
YARA Rules: YARA, the pattern-matching tool for malware
research, supports regex strings alongside text and hex strings. Rules like
$url = /https?:\/\/[^\s]+/ flag URL strings embedded in binaries.
Snort/Suricata: IDS rules use PCRE (Perl Compatible Regular Expressions) to match network traffic patterns for detecting attacks.
Log Analysis: SIEM queries often include regex for
pattern matching. Splunk's rex command and ELK's Grok
patterns are regex-based extraction tools.
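MITRE ATT&CK Reference: Technique T1027 (Obfuscated Files or Information) covers adversaries hiding payloads through encoding or obfuscation, and detecting it often comes down to pattern matching. Regex helps identify such attack patterns in logs, files, and network traffic.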
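Key insight: Regex is a transferable skill. The core concepts carry over to Python, grep, sed, YARA, Snort, and most security tools, although each uses its own dialect. For example, the POSIX basic syntax that grep and sed use by default does not support \d.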
Guided Lab: IOC Extractor Tool
Let's build a comprehensive IOC extraction tool using regex.