
Regular Expression Tool

Learn how to use Octoparse's AI RegEx tool to clean data

Updated over a week ago

A regular expression (RegEx) is a sequence of characters that defines a search pattern. String-searching algorithms use such patterns for "find" or "find and replace" operations on strings. You can learn the basics of regular expressions here.

In Octoparse, you can use RegEx to match/replace characters in a field value to refine the extracted data directly.
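To make "match" and "replace" concrete, here is a minimal Python sketch of the two operations applied to a field value. It runs outside Octoparse, and the sample value and both patterns are made up for illustration; they are not taken from the tool.

```python
import re

# Hypothetical extracted field value (illustrative only).
raw_price = "Price: $1,299.00 (incl. tax)"

# "Match": keep only the part of the value that fits the pattern.
match = re.search(r"\$[\d,]+\.\d{2}", raw_price)
matched_price = match.group(0) if match else ""

# "Replace": remove every character that is not a digit or a dot.
numeric_price = re.sub(r"[^\d.]", "", matched_price)

print(matched_price)
print(numeric_price)
```

Matching first and replacing second keeps each pattern short, which is roughly how chained Clean Data steps can be thought of.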

The Octoparse RegEx tool is a built-in tool that generates regular expressions automatically from the criteria you set. It is especially helpful if you know little about regular expression syntax.


How to Access the RegEx Tool

In Octoparse, there are two ways to access the Octoparse RegEx tool:

1. Via the Clean Data Menu

  • Select the data field you wish to customize.

  • Click the "..." button and choose Clean Data.

  • Click Add step and select the RegEx option.

2. Via the Sidebar

  • Locate and click the Tools icon in the left-hand sidebar navigation.


Understanding the RegEx Tool Interface

Version 8.8.0 and later

1. RegEx Patterns

This is a library of pre-built, commonly used regular expressions. You can browse or search for a pattern that fits your needs (e.g., matching emails, phone numbers, URLs, or specific date formats). This is the fastest way to apply a powerful RegEx without building it yourself.
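As a rough idea of what such pre-built patterns can look like, the Python sketch below defines a few simplified ones. These are assumptions for illustration, not the actual entries in Octoparse's library, and real-world email or URL matching usually needs stricter patterns.

```python
import re

# Simplified, illustrative patterns; NOT the entries shipped with Octoparse.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "iso_date": r"\d{4}-\d{2}-\d{2}",
    "url": r"https?://[^\s]+",
}

# Hypothetical sample text.
sample = "Contact sales@example.com before 2024-05-31 or visit https://example.com/pricing today."

# Collect every match for each named pattern.
found = {name: re.findall(pattern, sample) for name, pattern in PATTERNS.items()}

for name, matches in found.items():
    print(name, matches)
```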

2. AI RegEx Generator

Tired of writing complex regular expressions? Use our AI RegEx Generator to build them instantly by simply showing the tool what you want to extract.

How it Works:

  1. Find the data field you want to clean, then click More >> Clean Data for that field.

  2. Add a Match with Regular Expression step to the cleaning workflow of that field.

  3. Click Need help with RegEx? Try our RegEx tools!

  4. For each test string, manually highlight only the text you want to match (e.g., NY in Suffolk County, NY).

  5. Click Generate. The AI will analyze your examples and propose a RegEx pattern.

  6. Click Test to verify the pattern works against all your samples.

  7. Click Apply & Save, give your pattern a name, and confirm.

Example Input & Output:

  • Input String: Suffolk County, NY

  • You Highlight: NY

  • Generated RegEx: A pattern that matches the state code (e.g., NY, NC).
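The article does not show the exact expression the AI produces, so the following Python sketch uses one plausible pattern for this example as an assumption. The "Wake County" sample and the `extract_state` helper are hypothetical.

```python
import re

# Assumed pattern: two uppercase letters at the end of the string,
# preceded by ", " (the lookbehind keeps the comma out of the match).
STATE_CODE = re.compile(r"(?<=, )[A-Z]{2}$")

def extract_state(value):
    """Return the state code from a 'County, ST' string, or None if absent."""
    m = STATE_CODE.search(value)
    return m.group(0) if m else None

# The first sample comes from the example above; the others are invented.
samples = ["Suffolk County, NY", "Wake County, NC", "No state here"]
codes = [extract_state(s) for s in samples]
print(codes)
```

Testing a generated pattern against several samples, including one that should not match, mirrors what the Test button is for.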

3. RegEx Builder

This is the evolution of the classic "Generate" tab. It provides a user-friendly, form-based interface for building a custom regular expression by selecting options and filling in parameters (e.g., "Starts with," "Ends with," "Contains"). It automatically translates your choices into the correct RegEx syntax, which suits those who are learning RegEx or who prefer a visual approach.


How to Use the Octoparse RegEx Builder

STEP 1:

Check the options you need and fill in their parameters (1), then click Generate (2) to produce a regular expression (3).

  • "Start/End with": Matches content that starts or ends with the character(s) you enter in the box. The entered characters themselves are excluded from the result.

  • "Include Start/End": Available only when "Start/End with" is checked. When it is checked, the match result includes the text string you entered.

  • "Contain One": Matches content that contains the character(s) you entered.
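The options above can be approximated in Python as follows. This is a sketch of how "Start/End with" and "Include Start/End" might map to regex syntax, assuming lookarounds for the excluded case. The `build_pattern` function is hypothetical and is not Octoparse's actual generator, and "Contain One" is left out because its exact behavior is not specified here.

```python
import re

def build_pattern(start="", end="", include_start=False, include_end=False):
    """Hypothetical mapping from builder options to a regex string."""
    s, e = re.escape(start), re.escape(end)
    # Excluded delimiters become lookarounds, so they must be present
    # but are not part of the match; included ones are matched literally.
    prefix = "" if not start else (s if include_start else f"(?<={s})")
    suffix = "" if not end else (e if include_end else f"(?={e})")
    # Lazy body when an end delimiter exists, greedy to end of line otherwise.
    body = ".*?" if end else ".*"
    return prefix + body + suffix

# Hypothetical sample text.
text = "SKU: [AB-1234] in stock"

excluded = re.search(build_pattern("[", "]"), text).group(0)
included = re.search(
    build_pattern("[", "]", include_start=True, include_end=True), text
).group(0)

print(excluded)
print(included)
```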

STEP 2:

Click the Match button (4). Check the Match All box first if you want every match returned.

STEP 3:

Once you are satisfied with the previewed matches, click the "Apply" button to confirm and implement the changes.

Before version 8.8.0

The main interface of the RegEx tool consists of 4 parts:

1. Original Text

  • If opened from the Clean Data menu, this area automatically displays the extracted text from your selected field.

  • If opened from the sidebar, you can manually type or paste a sample text string here to test your expressions.

2. Configuration Tabs (Generate/Reference/Sample)

  • Generate: This is the main tab for creating expressions. You can check various options and fill in parameters to have Octoparse build a RegEx for you automatically.

  • Reference & Sample: These tabs are reserved for future tutorials and guides.

3. Regular Expression

  • This box displays the auto-generated RegEx code based on your selections in the "Generate" tab.

  • Check the "Match All" box to find every occurrence that matches the pattern, then click the "Match" button to test the expression.

4. Matches

  • This box shows the results of the RegEx operation. The first match is displayed by default; if "Match All" is checked, all matches will be listed in order.
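The difference between the default first match and "Match All" corresponds to the following Python sketch. The sample text and pattern are invented for illustration.

```python
import re

# Hypothetical sample text and pattern.
text = "Order A-101 shipped; order A-202 pending; order A-303 cancelled."
pattern = r"A-\d{3}"

# Default behavior: only the first match is returned.
first = re.search(pattern, text)
first_match = first.group(0) if first else None

# "Match All": every match is returned, in order of appearance.
all_matches = re.findall(pattern, text)

print(first_match)
print(all_matches)
```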
