A regular expression (RegEx) is a text string that defines a search pattern. String-searching algorithms use the pattern for "find" or "find and replace" operations on strings. You can pick up some basics of regular expressions here.

In Octoparse, you can use RegEx to match or replace characters in a field value, refining the extracted data directly.
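As a rough illustration of what "match" and "replace" mean for a field value, here is a minimal Python sketch. The sample value and both patterns are made up for this example. Octoparse applies expressions through its own interface, not through Python, so treat this only as a way to see the two operations side by side.

```python
import re

# Hypothetical extracted field value (not taken from Octoparse)
field_value = "Price: $1,299.00 (in stock)"

# "Match": keep only the part of the value that fits the pattern
match = re.search(r"\$[\d,]+\.\d{2}", field_value)
price = match.group(0) if match else ""

# "Replace": substitute every character that fits the pattern,
# here removing anything that is not a digit or a dot
digits_only = re.sub(r"[^\d.]", "", price)

print(price)
print(digits_only)
```

Matching narrows the value down to the piece you care about, while replacing rewrites characters inside it. The two are often used one after the other, as above.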

The Octoparse RegEx tool is a built-in tool that generates regular expressions automatically from criteria you set. It is especially helpful if you know little about regular expression syntax.


Where to find the RegEx tool?

In Octoparse, there are two ways to access the RegEx tool:

1. In the Clean Data options

  • Select the data field you want to customize

  • Click "..." and choose "Clean Data"

  • Click "Add step"

  • Choose "Replace with Regular Expression" or "Match with regular expression"

1.gif
  • Click "Not sure about RegEx? Try the RegEx tool!"

1.png

2. From the Sidebar Navigation

  • Select the "Tool Box" icon from the bottom of the sidebar navigation

  • Click "RegEx Tool"

2.png

The Interface of the RegEx tool

The main interface of the RegEx tool consists of 4 parts:

3.png

1. Original Text

If you open the RegEx tool within the Clean Data options, the extracted text string will be displayed here.

If you open it from the Sidebar Navigation, type or paste the text string into the Original Text box yourself.

2. Generate/Reference/Sample

This part has 3 tabs.

  • The Generate tab contains checkboxes for various options. Check the boxes and fill in the parameters, and Octoparse generates the regular expression you need automatically.

    • These options let you set conditions that pick out the part of the data you want.

    • For details, see the section below (How to use Octoparse Regular Expression Tool?).

  • The Reference and Sample tabs are currently empty because the reference tutorials have not been prepared yet.

3. Regular Expression

The regular expression will be generated automatically in the "Regular Expression" box after you check the option boxes and fill in the parameters in the "Generate" tab.

Check "Match All" if you'd like to see all matches. Then click the "Match" button to check that the expression finds what you want.

4. Matches

Once an expression has been generated and you click "Match", the first match is displayed in the Matches box.

If you've checked "Match All", all matches are displayed in the box in order.
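The difference between the first match and "Match All" can be sketched in Python as follows. The text and the expression are hypothetical and chosen only for illustration. The RegEx tool shows the same distinction in its Matches box without any code.

```python
import re

# Hypothetical original text and expression, for illustration only
original_text = "Order 1024 shipped, order 2048 pending, order 4096 cancelled"
expression = r"\d+"

# Without "Match All": only the first match is returned
first = re.search(expression, original_text)
first_match = first.group(0) if first else None

# With "Match All": every match, in order of appearance
all_matches = re.findall(expression, original_text)

print(first_match)
print(all_matches)
```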


How to use Octoparse Regular Expression Tool?

Click the 3 buttons one by one, in order (Generate, Match, Apply), to get the result you need.

4.png

STEP 1: Check the options and fill in the needed parameters (1), then Generate (2) a Regular Expression (3)

  • "Start/End with": Picks up the content that starts or ends with the characters you enter in the box. The entered characters themselves are excluded from the match.

  • "Include Start/End": This option can only be used when "Start/End with" is checked. Once you check "Include Start/End", the match result includes the characters you entered.

  • "Contain One": Picks up the content that contains the characters you entered.
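To make the "Start/End with" and "Include Start/End" options more concrete, here is a minimal Python sketch of how such options could translate into a pattern. The helper `build_pattern` is hypothetical and written for this example. It is an assumption about one reasonable way to express these options, using lookbehind and lookahead to exclude the boundary text, and it is not necessarily the expression Octoparse generates. "Contain One" is left out of the sketch.

```python
import re

def build_pattern(start="", end="", include=False):
    """Hypothetical helper: build a pattern from 'Start with' / 'End with' text."""
    s, e = re.escape(start), re.escape(end)
    # Stop at the first end marker if one is given, otherwise run to the end of the line
    body = ".*?" if end else ".*"
    if include:
        # "Include Start/End": the boundary text is part of the match
        return f"{s}{body}{e}"
    # Default: the boundary text must be present but is excluded from the match
    head = f"(?<={s})" if start else ""
    tail = f"(?={e})" if end else ""
    return f"{head}{body}{tail}"

# Hypothetical field value
text = "SKU: AB-123; Price: $45"

excluded = re.search(build_pattern("SKU: ", ";"), text).group(0)
included = re.search(build_pattern("SKU: ", ";", include=True), text).group(0)
start_only = re.search(build_pattern(start="Price: "), text).group(0)

print(excluded)
print(included)
print(start_only)
```

In this sketch the only difference between the two modes is whether the boundary text sits inside a lookaround (excluded) or in the pattern itself (included).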

STEP 2: Click the Match button (4). Check the Match All box first if you'd like to see all matches.

STEP 3: Apply (5) the Regular Expression to get the result
