Regular expressions (regex) are powerful pattern-matching tools used in Octoparse to clean, filter, and extract precise text from scraped data. This cheatsheet covers essential syntax and practical examples for web scraping.
1. Basic Regex Syntax
Pattern | Example | Matches |
. | a.c | "abc", "a2c" (any single character) |
\d | \d\d | "42", "01" (two digits) |
\w | \w+ | "hello", "A1_" (alphanumeric + underscore) |
\s | a\sb | "a b" (whitespace) |
[ ] | [aeiou] | "a", "e" (any vowel) |
\D | \D | "A", "!" (non-digit characters) |
^ | ^Hello | "Hello" at the start of a string |
$ | world$ | "world" at the end of a string |
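The basic tokens can be tried outside Octoparse before they go into a task. The following is a minimal sketch using Python's `re` module as a stand-in (an assumption: Octoparse's own regex engine may differ in small details). The sample strings and variable names are invented for illustration.

```python
import re

# Sample strings are made up; each pattern illustrates one basic token.
any_char = re.findall(r"a.c", "abc a2c ac")           # "." = any single character
two_digits = re.findall(r"\d\d", "room 42, floor 01")  # "\d" = one digit
words = re.findall(r"\w+", "hello A1_ !")              # "\w" = letter, digit or underscore
has_space = re.search(r"a\sb", "a b") is not None      # "\s" = whitespace
vowels = re.findall(r"[aeiou]", "regex")               # character class
non_digits = re.findall(r"\D", "A1!")                  # "\D" = anything but a digit
starts = re.search(r"^Hello", "Hello world") is not None  # "^" anchors at the start
ends = re.search(r"world$", "Hello world") is not None    # "$" anchors at the end

print(any_char, two_digits, words, vowels, non_digits)
print(has_space, starts, ends)
```

Note that `a.c` does not match "ac": the dot always consumes exactly one character.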
2. Quantifiers (Repetition)
Pattern | Example | Matches |
* | a*b | "b", "aaab" (0+ repetitions) |
+ | a+b | "ab", "aaab" (1+ repetitions) |
? | colou?r | "color", "colour" (optional) |
{n} | \d{3} | "123" (exactly 3 digits) |
{n,} | \w{4,} | "hello", "regex" (4+ characters) |
{n,m} | \d{2,4} | "12", "1234" (2 to 4 digits) |
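To see where each quantifier starts and stops matching, it helps to test whole strings rather than substrings. Here is a small sketch in Python (again assuming `re` behaves closely enough to Octoparse's engine); `full` is a hypothetical helper defined only for this example.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "a*b accepts 'b'": full(r"a*b", "b"),              # zero repetitions allowed
    "a+b rejects 'b'": not full(r"a+b", "b"),          # at least one "a" required
    "colou?r accepts both": full(r"colou?r", "color") and full(r"colou?r", "colour"),
    "\\d{3} rejects '12'": not full(r"\d{3}", "12"),   # exactly three digits
    "\\w{4,} rejects 'abc'": not full(r"\w{4,}", "abc"),
    "\\d{2,4} rejects '12345'": not full(r"\d{2,4}", "12345"),
}

for label, ok in checks.items():
    print(label, ok)

# Quantifiers are greedy: {2,4} takes as many digits as it can.
greedy = re.findall(r"\d{2,4}", "12345")
print(greedy)
```

The last line shows a common scraping pitfall: on "12345" the pattern grabs "1234" and the leftover "5" is too short to match again.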
3. Groups & Lookarounds
Pattern | Example | Matches |
( ) | (foo)+ | "foo", "foofoo" (capture group) |
(?: ) | (?:ab)+ | "abab" (non-capturing group) |
(?= ) | google(?=\.com) | "google" in "google.com" (positive lookahead) |
(?! ) | \d{3}(?!USD) | "123" in "123EUR" (negative lookahead) |
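Groups and lookarounds are easier to understand by looking at what ends up in the match versus what is only checked. The sketch below uses Python's `re` as an assumed stand-in for Octoparse's engine; the patterns `google(?=\.com)` and `\d{3}(?!USD)` are illustrative choices, not taken from Octoparse documentation.

```python
import re

# Capturing group: the group keeps the text of its last repetition.
captured = re.fullmatch(r"(foo)+", "foofoo").group(1)

# Non-capturing group: groups the tokens but stores nothing.
no_groups = re.fullmatch(r"(?:ab)+", "abab").groups()

# Positive lookahead: ".com" must follow, but is not part of the match.
ahead = re.search(r"google(?=\.com)", "google.com").group()
ahead_miss = re.search(r"google(?=\.com)", "google.org")

# Negative lookahead: match three digits only when "USD" does not follow.
not_usd = re.search(r"\d{3}(?!USD)", "123EUR").group()
usd_blocked = re.search(r"\d{3}(?!USD)", "123USD")

# Pitfall: with \d+ the engine backtracks to "12", which is followed by "3", not "USD".
backtracked = re.search(r"\d+(?!USD)", "123USD").group()

print(captured, no_groups, ahead, ahead_miss, not_usd, usd_blocked, backtracked)
```

Using a fixed count such as `\d{3}` (or adding a word boundary) avoids the backtracking surprise shown in the last case.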
4. Practical Examples for Web Scraping
A. Email Extraction
[\w.-]+@[\w.-]+\.\w+
Matches:
user@example.com
B. Phone Numbers (US Format)
\(\d{3}\) \d{3}-\d{4}
Matches:
(123) 456-7890
C. URLs
https?://(?:www\.)?\w+\.\w+(?:/\S*)?
Matches:
https://octoparse.com/docs
D. Prices
\$\d+(?:\.\d{1,2})?
Matches:
$19.99, $100
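The four patterns above can be checked against a sample page fragment before they are pasted into a task. This is a minimal sketch in Python; the sample text and the constant names are invented for illustration, and Octoparse's engine is assumed to treat these patterns the same way.

```python
import re

# Patterns copied from sections A-D above.
EMAIL = r"[\w.-]+@[\w.-]+\.\w+"
PHONE = r"\(\d{3}\) \d{3}-\d{4}"
URL = r"https?://(?:www\.)?\w+\.\w+(?:/\S*)?"
PRICE = r"\$\d+(?:\.\d{1,2})?"

# Hypothetical scraped text.
sample = (
    "Contact user@example.com or call (123) 456-7890. "
    "Docs: https://octoparse.com/docs Price: $19.99, was $100"
)

emails = re.findall(EMAIL, sample)
phones = re.findall(PHONE, sample)
urls = re.findall(URL, sample)
prices = re.findall(PRICE, sample)

print(emails)
print(phones)
print(urls)
print(prices)
```

These patterns are deliberately loose. The URL pattern, for instance, expects a single `name.tld` host, so a subdomain other than `www.` would be cut short; tighten them when the target site needs it.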
5. Regex in Octoparse
How to Use Regex
In "Text Process" Step: Select "Extract by Regex" → Enter your pattern.
For Data Validation: Use "Filter by Regex" to exclude mismatches (e.g., ^[A-Z]{2}\d{3}$ for product codes).
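The validation idea can be prototyped offline. The sketch below assumes the filter keeps rows that match the pattern and drops the rest; the list of codes is made up, and Python's `re` stands in for Octoparse's engine.

```python
import re

# Two uppercase letters followed by exactly three digits, anchored at both ends.
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{3}$")

# Hypothetical scraped values.
codes = ["AB123", "ab123", "ABC123", "AB1234", "XY007"]

valid = [c for c in codes if PRODUCT_CODE.match(c)]
rejected = [c for c in codes if not PRODUCT_CODE.match(c)]

print(valid)
print(rejected)
```

The `^` and `$` anchors matter here: without them, "ABC123" and "AB1234" would pass because they contain a valid code as a substring.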
Pro Tips
✔ Test First: Use Regex101.com to debug patterns.
✔ Escape Specials: Use \ for literal ., *, etc. (e.g., \.com).
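✔ Combine with XPath: Use regex in matches(), not contains(), which only tests for a literal substring: //div[matches(text(), '\d+% off')]. Note that matches() is an XPath 2.0 function, so it only works where the XPath engine supports 2.0.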
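The "Escape Specials" tip is worth seeing in action, because an unescaped dot fails silently by matching too much. Here is a short Python sketch; `re.escape` is a Python standard-library helper and is not claimed to exist in Octoparse, where escaping is done by hand.

```python
import re

# Unescaped dot: matches any character followed by "com".
loose = re.search(r".com", "telecom")

# Escaped dot: requires a literal "." before "com".
strict_miss = re.search(r"\.com", "telecom")
strict_hit = re.search(r"\.com", "octoparse.com")

# re.escape adds the backslashes for you when the text is meant literally.
escaped = re.escape("3.5*2")

print(loose.group(), strict_miss, strict_hit.group(), escaped)
```

Running a literal string through an escaping helper in a scratch script, then pasting the result into the tool, is one way to avoid missing a special character.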
6. Quick Reference Table
Task | Regex Pattern |
Extract hashtags | #\w+ |
Remove HTML tags | <[^>]+> |
Match dates (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} |
Split camelCase | (?<=[a-z])(?=[A-Z]) |
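The four tasks in the quick reference can be sketched in Python as follows. The patterns used here (`#\w+`, `<[^>]+>`, `\d{4}-\d{2}-\d{2}`, and a lookbehind/lookahead pair for camelCase) are common choices assumed for illustration, and the sample strings are invented.

```python
import re

# Extract hashtags: "#" followed by word characters.
hashtags = re.findall(r"#\w+", "Loving #regex and #WebScraping!")

# Remove HTML tags: drop anything between "<" and the next ">".
plain = re.sub(r"<[^>]+>", "", "<p>Hello <b>world</b></p>")

# Match dates in YYYY-MM-DD form (shape only, no calendar validation).
dates = re.findall(r"\d{4}-\d{2}-\d{2}", "Posted 2024-01-15, updated 2024-02-01")

# Split camelCase: insert a space between a lowercase and an uppercase letter.
spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "splitCamelCase")

print(hashtags)
print(plain)
print(dates)
print(spaced)
```

The tag-stripping pattern is a quick cleanup for simple fragments, not an HTML parser, and the date pattern would also accept impossible dates such as 2024-13-40.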