Modifying XPath in Octoparse is an essential skill for flexible and accurate data scraping. The Octoparse XPath Tool helps you write a correct XPath expression and check what it matches, so a little effort spent learning it can save a lot of time when configuring tasks.
In this tutorial, I will show how to use the Octoparse XPath Tool. Before reading this article, you should be familiar with the basics of HTML and XPath.
There are two ways to access the Octoparse XPath Tool.
Option 1: From the task configuration interface
- Select the data field you want to customize
- Click "Customize data field"
- Click "Customize XPath"
- Click "Try XPath Tool"
Option 2: From the sidebar
- Select "Tools" from the sidebar navigation
- Click "XPath Tool"
The Octoparse XPath Tool consists of 4 main parts:
The Browser
(Option 1) When you launch the XPath Tool directly from the task configuration interface, the current page is loaded automatically in the XPath Tool's built-in browser.
(Option 2) When you open the XPath Tool from the sidebar, it loads the Octoparse homepage (www.octoparse.com). You can then enter the target URL in the browser and click "Open" to load that web page in the built-in browser.
The HTML/Matches
The corresponding source code is shown in the "HTML" tab, but you can also use a browser such as Chrome or Firefox to view the source code in a more legible, structured form.
Once you have generated an XPath, you can see any matched content by clicking the "Matches" tab.
The XPath Setting/Reference/Demo
In the "Auto Generate" tab, there are checkboxes for various options. Check the boxes you need, fill in their parameters, and click the "Generate" button to generate an XPath expression. You can also click buttons such as "Sub-element" and "Parent" to extend the XPath expression.
You can also click the "Reference" tab to read our tutorials on XPath, or the "Demo" tab to check XPath examples on W3Schools.
The XPath Result
The XPath expression is generated automatically in the XPath Result after you check the option boxes and fill in the parameters in the "Auto Generate" tab. Click the "Match" button to see whether the current XPath finds the elements you want on the web page.
Below is a brief introduction to each option in the "Auto Generate" tab.
Item Tag Name:
"Item Tag Name" refers to the tag names in the source code, such as SPAN, A, or DIV. They appear as sky blue text in Firefox and as purple text in Chrome.
Check the box for "Item Tag Name" when you want to include a specific tag name in your XPath expression. Octoparse will generate an XPath that finds all elements with the tag name you filled in.
E.g., check "Item Tag Name", set "Item Position" to "1", type in "span" for the tag name, and click the "Generate" button. You'll get the XPath expression "//SPAN", which locates any element with the tag name "span".
"Item Position" refers to the position of the item among its siblings. The default value is 1, that is, the first item among the siblings. To select the third item, set the parameter to "3"; you can select any position you want.
Item ID, Item Name, Item Style Class:
In some cases, an element has attributes, such as an "id" attribute, a "name" attribute, or a "class" attribute.
In the Octoparse XPath Tool, "Item ID" refers to the "id" attribute, "Item Name" to the "name" attribute, and "Item Style Class" to the "class" attribute.
To locate elements by any of the three attributes, just check the box and fill in the right value.
E.g., select "Item Style Class", type in "section-result-opening-hours", and click "Generate". The Tool will generate the XPath expression "//*[@class='section-result-opening-hours']", which locates all elements whose "class" attribute has the value "section-result-opening-hours", such as the text "Open until 1:00 AM" in the screenshot below.
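The attribute predicate can be checked with a short Python sketch. The markup and the id value "result-1" are invented for illustration, and ElementTree again stands in for Octoparse's own engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical search-result markup; only the class value comes from the article.
SAMPLE = """
<div>
  <div id="result-1" class="section-result">
    <span class="section-result-title">Harbor Grill</span>
    <span class="section-result-opening-hours">Open until 1:00 AM</span>
  </div>
  <div id="result-2" class="section-result">
    <span class="section-result-title">Noodle Bar</span>
    <span class="section-result-opening-hours">Closed</span>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Same predicate as //*[@class='section-result-opening-hours']:
# any element whose class attribute equals the value exactly.
hours = [
    el.text
    for el in root.findall(".//*[@class='section-result-opening-hours']")
]

# The same pattern works for the id attribute ("Item ID").
first_result = root.find(".//*[@id='result-1']")

print(hours)
print(first_result.get("class"))
```

Note that the comparison is an exact match on the whole attribute value, so an element with class="section-result featured" would not be matched by [@class='section-result'].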
"Item Text" refers to the the content of a tag element. You can use it if you want to locate the elements whose content is exactly the same text you fill in.
E.g., select "Item Text" and type in "Seafood" and click "Generate", it will generate XPath expression "//*[text(), 'Seafood']", with which only the elements whose content is the text "Seafood" could be located.
When using the option, you must make sure everything of the text you input is exactly the same as that in the source code, including the blank spaces, the punctuation, the full-angle and half-angle. Hence, to make sure that you enter the correct text, you can view the source code in the browser like Chrome and copy the text inside angle brackets by double-clicking from the original source code.
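The sketch below shows how strict an exact text match is. It uses hypothetical list items and ElementTree's [.='text'] predicate (available in Python 3.7 and later), which is the closest counterpart that module offers to text()='Seafood'; it is not the expression Octoparse itself evaluates.

```python
import xml.etree.ElementTree as ET

# Hypothetical items that differ only in spacing or letter case.
SAMPLE = """
<ul>
  <li>Seafood</li>
  <li>Seafood </li>
  <li>seafood</li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Only the element whose text is exactly "Seafood" matches.
exact = root.findall(".//*[.='Seafood']")
matched = [el.text for el in exact]

# The trailing space and the lowercase variant are different strings,
# so each needs its own exact value to be found.
with_trailing_space = root.findall(".//*[.='Seafood ']")
lowercase = root.findall(".//*[.='seafood']")

print(matched)
print(len(with_trailing_space), len(lowercase))
```

This is why copying the text straight from the source code is safer than retyping it.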
Item Text Contains:
"Item Text Contains" is used to find elements that contain the text you want.
E.g., select "Item Text Contains", type in "burger", and click "Generate". You'll get the XPath expression "//*[contains(text(), 'burger')]", which locates any element that contains the text "burger".
Item Text Start With:
"Item Text Start With" is used to locate elements whose content starts with the text you fill in.
E.g., select "Item Text Start With", type in "burger", and click "Generate". You'll get the XPath expression "//*[starts-with(text(), 'burger')]", which locates any element whose content starts with "burger".
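To see how the two text options differ, the following sketch emulates both predicates in plain Python. ElementTree's XPath subset does not include contains() or starts-with(), so the helper functions text_contains and text_starts_with are hypothetical stand-ins that approximate the expressions by checking each element's leading text; the menu items are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical menu items.
SAMPLE = """
<ul>
  <li>burger and fries</li>
  <li>Cheeseburger</li>
  <li>veggie burger</li>
  <li>Seafood</li>
</ul>
"""

root = ET.fromstring(SAMPLE)


def text_contains(tree, needle):
    """Approximate //*[contains(text(), needle)] on the leading text."""
    return [el.text for el in tree.iter() if el.text and needle in el.text]


def text_starts_with(tree, prefix):
    """Approximate //*[starts-with(text(), prefix)] on the leading text."""
    return [
        el.text for el in tree.iter()
        if el.text and el.text.startswith(prefix)
    ]


containing = text_contains(root, "burger")
starting = text_starts_with(root, "burger")

print(containing)
print(starting)
```

Both checks are case-sensitive, as the XPath functions are, so searching for "Burger" would match none of these items.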
"Sub-element" button: It is used for selecting the child node of current XPath expression by generating "/" in the XPath Result.
"Parent" button: It is used for selecting the parent node of current XPath expression by generating "/parent::" in the XPath Result.
"Previous" button: It is used for selecting the previous node of current XPath expression by generating “/preceding-sibling::” in the XPath Result.
"Next" button: It is used for selecting the next node of current XPath expression by generating “/following-sibling::” in the XPath Result.