Having the same problem in Octoparse version 8? Click here.
A Regular Expression (RegExp) is a special text string that describes a search pattern. A RegExp can be used to perform pattern matching and "search-and-replace" operations on text.
You can pick up the basics of Regular Expressions here.
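As a rough illustration outside Octoparse, the two operations mentioned above (pattern matching and search-and-replace) can be sketched with Python's standard `re` module. The sample text and the patterns are made up for this example and are not produced by Octoparse.

```python
import re

# Hypothetical sample text, not taken from Octoparse.
text = "Price: $19.99 (was $24.99)"

# Pattern matching: find the first dollar amount in the text.
first = re.search(r"\$\d+\.\d{2}", text)
first_price = first.group(0) if first else None

# Search-and-replace: drop the dollar sign from every amount.
cleaned = re.sub(r"\$(\d+\.\d{2})", r"\1", text)

print(first_price)
print(cleaned)
```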
What is the Octoparse Regular Expression Tool?
The Octoparse RegEx Tool is a built-in tool that generates regular expressions automatically from the criteria you set. It is especially helpful if you know little about regular expression syntax.
In Octoparse, there are two ways to access the RegEx Tool:
Method 1: From either of the 2 Octoparse reformat options, via "Try RegEx Tool"
· Select the data field you want to customize
· Click "Customize data field"
· Click "Refine extracted data"
· Click "Add Step"
· Click "Replace with regular expression" or "Match with regular expression"
- Replace with regular expression
- Match with regular expression
Method 2: From the Sidebar Navigation
· Select "Tools" from the sidebar navigation
· Click "RegEx Tool"
The main interface of the Regular Expression Tool consists of 4 parts:
1. Source Text
If you open the RegEx Tool within the reformatting options, the extracted text string will be displayed here.
If you open it from the Sidebar Navigation, type or paste the text string directly into the Source Text box.
2. RegEx setting/Reference/Sample
There are 3 tabs in this part.
- In the "Auto Generate" tab, there are checkboxes for various options. You can check these boxes and fill in some parameters, and Octoparse will automatically generate the regular expression you need.
- You can also click the "Reference" tab to read the Regular Expression tutorials on W3Schools, or click the "Sample" tab to check some examples on W3Schools.
3. Regular Expression
The regular expression will be generated automatically in the "Regular Expression" box after you check the option boxes and fill in the parameters in the "Auto Generate" tab.
Check "Match All" if you'd like to see all matches. Then click the "Match" button to check whether the expression finds what you want.
4. Matches
After you generate an expression and click "Match", the first match is displayed in the Matches box.
If you've checked "Match All", all matches are displayed in the box in order.
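The difference between a single match and "Match All" can be sketched in Python as follows. This only mirrors the behaviour described above using the standard `re` module; the source text and expression are invented for the example, and how Octoparse evaluates expressions internally is not shown here.

```python
import re

# Hypothetical source text and expression for illustration only.
source_text = "Order 1001 shipped, order 1002 pending, order 1003 cancelled"
expression = r"\d{4}"

# Without "Match All": only the first match is returned.
m = re.search(expression, source_text)
first_match = m.group(0) if m else ""

# With "Match All": every match, in order of appearance.
all_matches = re.findall(expression, source_text)

print(first_match)
print(all_matches)
```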
How to use the Octoparse Regular Expression Tool?
Simply click 3 buttons in order (Generate, Match, Apply) to get the result you need.
· Check the options and fill in the needed parameters
There are 5 options provided:
- "Start/End with"
Extract the content that starts or ends with the character(s) you enter in the box. The entered characters themselves are excluded from the result.
- "Include Start/End"
This option can only be used together with "Start/End with". Once you check "Include Start/End", the match result will include the text string you've entered.
- "Contain One"
Extract the content that contains the character(s) you've entered.
· Click the "Generate" button
· Click the "Match" button
Remember to check "Match All" if you'd like to have all matches.
· Click the "Apply" button to apply the result
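The Generate and Match steps above can be approximated in Python as a minimal sketch. The helper `generate_pattern` and its parameters are hypothetical names invented for this example; the expressions Octoparse actually generates may look different. The sketch covers only "Start/End with" and "Include Start/End", not "Contain One".

```python
import re

def generate_pattern(start="", end="", include_start=False, include_end=False):
    """Hypothetical helper: build a pattern for the text between `start` and `end`."""
    if start:
        # A lookbehind keeps the start text out of the match unless it is included.
        head = re.escape(start) if include_start else f"(?<={re.escape(start)})"
    else:
        head = ""
    if end:
        # A lookahead keeps the end text out of the match unless it is included.
        tail = re.escape(end) if include_end else f"(?={re.escape(end)})"
        body = ".*?"  # lazy: stop at the first occurrence of `end`
    else:
        tail = ""
        body = ".*"   # no end text: run to the end of the line
    return head + body + tail

# Hypothetical source text for illustration only.
source_text = "SKU: A-1234; Color: Blue; Size: M"

# "Generate": start with "Color: ", end with ";"
pattern = generate_pattern(start="Color: ", end=";")

# "Match": take the first match
match = re.search(pattern, source_text)
result = match.group(0) if match else ""

# Same options with "Include Start/End" checked
inclusive_pattern = generate_pattern("Color: ", ";", include_start=True, include_end=True)
inclusive = re.search(inclusive_pattern, source_text).group(0)

print(result)
print(inclusive)
```

In this sketch the start and end text are passed through `re.escape`, so characters such as "." or "$" in the entered text are treated literally.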
Let’s see some practical use cases in When and how to use Regular Expression Tool – a guide for beginners.
Article in Spanish: Herramienta de Expresión Regular de Octoparse
You can also read web scraping articles on the official website
- Use Regular Expressions in Octoparse
- Web scraping | Introduction to Octoparse XPath Tool
- Use Regular Expression to Reformat Captured Data
- Re-format data extracted