
Regular Expression (Regex) Cheatsheet for Data Extraction

Updated over 3 months ago

Regular expressions (regex) are powerful pattern-matching tools used in Octoparse to clean, filter, and extract precise text from scraped data. This cheatsheet covers essential syntax and practical examples for web scraping.

1. Basic Regex Syntax

Pattern   Example   Matches
.         a.c       "abc", "a2c" (any single character)
\d        \d\d      "42", "01" (two digits)
\w        \w+       "hello", "A1_" (alphanumeric + underscore)
\s        a\sb      "a b" (whitespace)
[abc]     [aeiou]   "a", "e" (any vowel)
[^abc]    [^0-9]    "A", "!" (non-digit characters)
^         ^Hello    "Hello" at the start of a string
$         world$    "world" at the end of a string

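The basic patterns in this section can be tried outside Octoparse. The sketch below uses Python's standard re module, which is an assumption for illustration only; the helper name find_all and the sample strings are hypothetical, and Octoparse's own regex engine may differ in small details.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# (pattern, sample text) pairs; the sample strings are made up.
samples = [
    (r"a.c", "abc a2c ac"),          # "ac" is too short to match
    (r"\d\d", "room 42, floor 01"),
    (r"\w+", "hello A1_"),
    (r"a\sb", "a b"),
    (r"[aeiou]", "regex"),
    (r"[^0-9]", "A1!"),
    (r"^Hello", "Hello world"),
    (r"world$", "Hello world"),
]

for pattern, text in samples:
    print(pattern, "on", repr(text), "->", find_all(pattern, text))
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regex engine sees them.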

2. Quantifiers (Repetition)

Pattern   Example    Matches
*         a*b        "b", "aaab" (0+ repetitions)
+         a+b        "ab", "aaab" (1+ repetitions)
?         colou?r    "color", "colour" (optional)
{n}       \d{3}      "123" (exactly 3 digits)
{n,}      \w{4,}     "hello", "regex" (4+ characters)
{n,m}     \d{2,4}    "12", "1234" (2 to 4 digits)

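To see how the quantifiers behave at their limits, the following sketch checks whole strings against each pattern. It assumes Python's re module; matches_fully and the test strings are hypothetical names chosen for this example.

```python
import re

def matches_fully(pattern, text):
    """True when the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, candidate strings) pairs; the last string in each list
# is expected to fail the pattern.
examples = [
    (r"a*b", ["b", "aaab", "a"]),
    (r"a+b", ["ab", "aaab", "b"]),
    (r"colou?r", ["color", "colour", "colouur"]),
    (r"\d{3}", ["123", "12"]),
    (r"\w{4,}", ["hello", "abc"]),
    (r"\d{2,4}", ["12", "1234", "12345"]),
]

for pattern, texts in examples:
    for text in texts:
        print(pattern, repr(text), matches_fully(pattern, text))

# Quantifiers are greedy: \d{2,4} takes as many digits as it can,
# so the leftover single digit is not matched.
print(re.findall(r"\d{2,4}", "12345"))
```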

3. Groups & Lookarounds

Pattern   Example         Matches
(abc)     (foo)+          "foo", "foofoo" (capture group)
(?:abc)   (?:ab)+         "abab" (non-capturing group)
(?=abc)   \w+(?=\.com)    "google" in "google.com" (positive lookahead)
(?!abc)   \d{3}(?!USD)    "123" in "123EUR" (negative lookahead)

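Groups and lookarounds are easier to follow when you look at what is actually captured. Here is a minimal sketch, assuming Python's re module; the function names are hypothetical and exist only for this illustration.

```python
import re

def capture_demo(text):
    """Return (whole match, last captured repetition) for (foo)+."""
    m = re.fullmatch(r"(foo)+", text)
    return (m.group(0), m.group(1)) if m else None

def domain_name(text):
    """Return the word directly before '.com', without the '.com'."""
    m = re.search(r"\w+(?=\.com)", text)
    return m.group() if m else None

def amount_not_usd(text):
    """Return three digits that are not directly followed by 'USD'."""
    m = re.search(r"\d{3}(?!USD)", text)
    return m.group() if m else None

print(capture_demo("foofoo"))                   # whole match and group 1
print(re.fullmatch(r"(?:ab)+", "abab").groups())  # no groups are captured
print(domain_name("google.com"))
print(amount_not_usd("123EUR"))
print(amount_not_usd("123USD"))
# Pitfall: in "1234USD" the digits "123" are followed by "4", not "USD",
# so the negative lookahead still lets them match.
print(amount_not_usd("1234USD"))
```

Lookaheads check the text that follows without consuming it, which is why ".com" and "EUR" are absent from the results.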

4. Practical Examples for Web Scraping

A. Email Extraction

[\w.-]+@[\w.-]+\.\w+
  • Matches: user@example.com

B. Phone Numbers (US Format)

\(\d{3}\) \d{3}-\d{4}
  • Matches: (123) 456-7890

C. URLs

https?://(?:www\.)?\w+\.\w+(?:/\S*)?
  • Matches: https://octoparse.com/docs

D. Prices

\$\d+(?:\.\d{1,2})?
  • Matches: $19.99, $100
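
The four patterns above can be run together against one block of scraped text. The sketch below assumes Python's re module; the PATTERNS dictionary, the extract helper, and the sample text are hypothetical and only illustrate the patterns, not how Octoparse applies them internally.

```python
import re

# Patterns copied from examples A to D above.
PATTERNS = {
    "email": r"[\w.-]+@[\w.-]+\.\w+",
    "phone": r"\(\d{3}\) \d{3}-\d{4}",
    "url": r"https?://(?:www\.)?\w+\.\w+(?:/\S*)?",
    "price": r"\$\d+(?:\.\d{1,2})?",
}

def extract(text):
    """Return a dict mapping each pattern name to its list of matches."""
    return {name: re.findall(p, text) for name, p in PATTERNS.items()}

# Made-up sample text standing in for a scraped page.
sample = (
    "Contact user@example.com or (123) 456-7890. "
    "Docs: https://octoparse.com/docs Plans from $19.99 to $100"
)

for name, found in extract(sample).items():
    print(name, found)
```

These patterns are deliberately simple. For instance, the URL pattern uses \w+ for the host, so it does not cover hyphenated or multi-level domains; widen it if your data needs that.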


5. Regex in Octoparse

How to Use Regex

  1. In "Text Process" Step:

    • Select "Extract by Regex" → Enter your pattern.

  2. For Data Validation:

    • Use "Filter by Regex" to exclude mismatches (e.g., ^[A-Z]{2}\d{3}$ for product codes).
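
To preview what the product-code filter would keep before configuring it, you can try the same pattern locally. This is a sketch in Python's re module, not Octoparse's implementation; keep_valid and the sample values are hypothetical.

```python
import re

# Two uppercase letters followed by exactly three digits.
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{3}$")

def keep_valid(values):
    """Return only the values that match the product-code pattern."""
    return [v for v in values if PRODUCT_CODE.search(v)]

# Made-up scraped values: wrong case, extra digit, and a leading space
# are all rejected because ^ and $ anchor the whole value.
scraped = ["AB123", "ab123", "AB1234", "XY007", " AB123"]
print(keep_valid(scraped))
```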

Pro Tips

Test First: Use Regex101.com to debug patterns.
Escape Specials: Use \ for literal ., *, etc. (e.g., \.com).
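Combine with XPath: Use regex in matches(): //div[matches(text(), '\d+% off')]. Note that matches() is an XPath 2.0 function; XPath 1.0 has no regex support, and its contains() function tests plain substrings only.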
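
The effect of escaping is easiest to see side by side. The sketch below assumes Python's re module; literal_pattern and the sample strings are hypothetical names for this illustration.

```python
import re

def literal_pattern(text):
    """Escape every regex metacharacter so text is matched literally."""
    return re.escape(text)

sample = "example.com welcome"

# Unescaped, the dot matches any character, so "lcom" inside
# "welcome" is matched as well.
print(re.findall(r".com", sample))

# Escaped, only a literal ".com" is matched.
print(re.findall(r"\.com", sample))

# re.escape builds the escaped form for you.
print(literal_pattern("$19.99"))

# The discount pattern from the XPath tip, applied to plain text.
print(re.findall(r"\d+% off", "Now 25% off, was 10% off"))
```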


6. Quick Reference Table

Task                       Regex Pattern
Extract hashtags           #\w+
Remove HTML tags           <[^>]+> (replace with an empty string)
Match dates (YYYY-MM-DD)   \d{4}-\d{2}-\d{2}
Split camelCase            ([a-z])([A-Z]) (replace with $1 $2)
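
The quick-reference tasks can be sketched as small helper functions. This assumes Python's re module, and the function names and sample inputs are hypothetical. Python writes backreferences in the replacement as \1 and \2, whereas many other tools use $1 and $2.

```python
import re

def hashtags(text):
    """Return all hashtags such as #scraping."""
    return re.findall(r"#\w+", text)

def strip_tags(html):
    """Remove HTML tags by replacing them with an empty string."""
    return re.sub(r"<[^>]+>", "", html)

def dates(text):
    """Return all substrings shaped like YYYY-MM-DD."""
    return re.findall(r"\d{4}-\d{2}-\d{2}", text)

def split_camel(name):
    """Insert a space between a lowercase and an uppercase letter."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", name)

print(hashtags("Launch day #scraping #data_2024!"))
print(strip_tags("<p>Price: <b>$5</b></p>"))
print(dates("Posted 2024-01-15, updated 2024-02-01"))
print(split_camel("splitCamelCase"))
```

The date pattern checks the shape only; it also accepts impossible dates such as 2024-99-99, so validate the values separately if that matters.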
