How and when to use Regular Expressions in Octoparse - a guide for beginners
If you are totally new to regular expressions, this tutorial can help you get your feet wet and quickly start using them like a pro.
1. Pick up the information you need from a text string (Match with Regular Expression)
If the data you want begins or ends with a fixed string, it is especially easy to pick it up with the Octoparse RegEx Tool. Below are two of the most common use cases.
- Pick up URLs from HTML
As you know, most URLs look very similar. Typical URLs often share a form: they start with "https" and end with ".com" or ".html". Sometimes the URLs you want do not share such a form, but they are all followed by the same string.
Step 1. Identify the pattern of the URLs you want
In the source code, all the URLs start with "https" but do not share the same ending. Fortunately, each one is followed by the same attribute, "hreflang", which can serve as the shared ending string.
Step 2. Fill in the parameters based on the pattern you've found
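The pattern from Step 1 (start with "https", stop at the shared "hreflang" attribute) can be sketched outside Octoparse as well. The snippet below is only an illustration in Python, not the expression the RegEx Tool generates; the HTML fragment and the example.com URLs are made up for the demonstration.

```python
import re

# Hypothetical HTML fragment; the real page's markup is not shown in this guide.
html = (
    '<link rel="alternate" href="https://example.com/en/page-1" hreflang="en">\n'
    '<link rel="alternate" href="https://example.com/es/pagina-1" hreflang="es">\n'
)

# Start at "https", match as little as possible, and stop right before the
# closing quote that is followed by the shared "hreflang" attribute.
# The lookahead (?=...) checks for that ending without including it in the match.
pattern = re.compile(r'https.*?(?="\s+hreflang)')

urls = pattern.findall(html)
print(urls)
```

The lazy quantifier `.*?` matters here: a greedy `.*` would run to the last "hreflang" on the line instead of the first one after each URL.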
- Pick up the "hidden" information from HTML
You can use the same approach to obtain data "hidden behind" the HTML, such as the star rating. When you extract the HTML of an element on the page, you get the HTML markup contained within that element. Figure out the shared form of the data you want, and then leave the most difficult work, writing the proper regular expression, to the Octoparse RegEx Tool.
Tips: To learn more about extracting from HTML, please refer to Extract data from the source code.
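As a rough illustration of pulling a "hidden" value out of extracted HTML, the Python sketch below assumes the rating sits in a `title` attribute between two fixed strings. The markup, class names, and wording are hypothetical; inspect the actual element HTML to find the real surrounding strings.

```python
import re

# Hypothetical markup for a rating element; real pages will differ.
element_html = (
    '<div class="review">'
    '<i class="star-rating" title="4.5 out of 5 stars"></i>'
    '</div>'
)

# The rating is preceded by 'title="' and followed by ' out of 5 stars'.
# The lookbehind and lookahead keep both fixed strings out of the result.
match = re.search(r'(?<=title=")[\d.]+(?= out of 5 stars)', element_html)

# Convert to a number only if the pattern was found.
rating = float(match.group()) if match else None
print(rating)
```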
2. Remove unwanted information from a lengthy text (Replace with Regular Expression)
- Remove unwanted spaces
In most cases, you can just sit back and leave the writing to the Octoparse RegEx Tool. Sometimes, though, it is easier and faster to type the shorthand characters directly. Below are some of the most frequently used characters in Octoparse.
Character | Meaning
\s | Matches a single white space character, including space, tab, form feed, line feed, and other Unicode spaces.
\S | Matches a single character other than white space.
\t | Matches a horizontal tab.
\n | Matches a line feed.
Here is an example to show how to remove the unwanted spaces with "\s".
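The same replacement can be tried in a few lines of Python, where "\s" has the same meaning for this purpose. This is a minimal sketch with a made-up input string, not a copy of the Octoparse dialog.

```python
import re

# Hypothetical scraped text with stray tabs, line feeds, and repeated spaces.
raw = "  Octoparse \t RegEx\n\n  Tool   "

# Replace every run of white space (\s+) with a single space, then trim the ends.
collapsed = re.sub(r"\s+", " ", raw).strip()

# Replace each white space character with nothing to delete all of it.
stripped = re.sub(r"\s", "", raw)

print(collapsed)
print(stripped)
```

Replacing with a single space keeps words separated, while replacing with an empty string suits values such as phone numbers or IDs where no spaces should remain.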
Tips: For more about regular expressions, please refer to the JavaScript RegExp Reference.