What is "Extract Data"?
"Extract Data" is an essential step in every task: it holds all the data fields you want to collect. Within this step you can clean data, modify XPath, and reorder, copy, or delete data fields.
Without this step, your task cannot be executed.
How to add "Extract Data" to the workflow?
There are two ways to generate an "Extract Data" action.
1. Click the element on the web page to auto-generate one (the most common way)
To take data from the page, click the element first. Then choose one of the "Extract..." options on the Tips panel, and an "Extract Data" action appears in the workflow.
The options can be:
"Extract text/URL of the selected element"
"Extract the inner/outer HTML of the selected element"
"Extract data in the loop"
2. Add from the workflow
When you hover over the workflow, an icon appears. Click the icon to display the drop-down options, and choose "Extract Data" to add this step to the workflow.
To adjust the settings for the data fields, double-click the "Extract Data" step name or click its gear icon.
After opening "Action Settings", you will see four main features.
1. Extract data in the loop
This option is normally ticked when you extract data directly from a listing page instead of clicking into the detail page to pull out data.
Here is an example of a product listing page.
To learn more about extracting data from a listing result page, check this guide: Extract a list.
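As a rough illustration of what extracting in a loop means, the Python sketch below walks over every item of a made-up listing page and reads the same fields from each item, using XPath relative to that item. The markup, class names, and field names are invented for this example, and the code is not how Octoparse works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page: every <li> is one loop item.
LISTING = """
<ul>
  <li class="product"><h2>Desk lamp</h2><span class="price">$19.99</span></li>
  <li class="product"><h2>Office chair</h2><span class="price">$89.00</span></li>
  <li class="product"><h2>Bookshelf</h2><span class="price">$45.50</span></li>
</ul>
"""

def extract_in_loop(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # The loop selects the items; each field XPath is relative to one item.
    for item in root.findall("./li[@class='product']"):
        rows.append({
            "title": item.findtext("./h2"),
            "price": item.findtext("./span[@class='price']"),
        })
    return rows

rows = extract_in_loop(LISTING)
for row in rows:
    print(row)
```

Each loop item becomes one row of output, which is why this option suits listing pages where all the fields sit next to each other.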
2. Define data fields
The data fields are listed here. You can delete, move, or clean them, and add fields such as extraction time or current page URL from a pre-defined list.
You can also revise the XPath of a certain data field here if it is not located correctly in the output.
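To show why revising an XPath can matter, here is a small Python sketch on an invented snippet of markup. It assumes a product block with two span elements; the class names are hypothetical. A loose XPath picks up the wrong element, while a more specific one pins the field to the intended element.

```python
import xml.etree.ElementTree as ET

# Hypothetical product block with two <span> elements.
PRODUCT = "<div><span class='badge'>New</span><span class='price'>$19.99</span></div>"
root = ET.fromstring(PRODUCT)

# Too broad: matches the first <span>, which is the badge, not the price.
loose_value = root.findtext(".//span")

# Revised: the attribute predicate pins the field to the price element.
precise_value = root.findtext(".//span[@class='price']")

print(loose_value)    # New
print(precise_value)  # $19.99
```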
To better understand what those icons stand for, check the following details:
Batch delete: delete multiple data fields at one time
Import: import data field(s) from a data file [Octoparse extract config (*.oec)]
Export: export data field(s) to a data file [Octoparse extract config (*.oec)]
Customize XPath: edit the XPath of a data field (To learn more about XPath, check this guide: What is XPath and how to use it in Octoparse)
More actions:
- Customize field: to select what information (text, HTML, an attribute value, or URL) you need to scrape from the page element. To learn more about it, check this guide: Extract element text/URL/image/HTML/attribute.
- Clean data: to reformat the extracted data into the form you need (e.g. add a prefix or suffix, transform the time, replace text, etc). To learn more about it, check this guide: Re-format data extracted.
- Combine data: to combine the same field of data from other loop items. To learn more about it, check this guide: Combine data extracted.
- When data cannot be found: when this data field is empty in some cases, you can choose to leave all the fields blank, leave this field blank, or use a fixed value.
- Move field: to move the position of a certain data field to the top or bottom, up or down.
- Copy: to duplicate a certain data field
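The "Clean data" and "When data cannot be found" actions above can be pictured with a short Python sketch. All function names, rule names, and sample values here are hypothetical and chosen for illustration only; they are not Octoparse settings or APIs.

```python
from datetime import datetime

def clean_price(raw):
    """Replace unwanted characters, then add a prefix."""
    return "USD " + raw.replace("$", "").replace(",", "").strip()

def clean_date(raw):
    """Transform the time format from MM/DD/YYYY to YYYY-MM-DD."""
    return datetime.strptime(raw, "%m/%d/%Y").strftime("%Y-%m-%d")

def apply_missing_rule(row, field, rule, fixed_value=""):
    """Mimic the three 'when data cannot be found' choices for one field."""
    if row.get(field):                 # the value was found, nothing to do
        return dict(row)
    filled = {**row, field: ""}        # default: leave this field blank
    if rule == "blank_all":            # leave all the fields blank
        return {name: "" for name in filled}
    if rule == "fixed_value":          # use a fixed value instead
        filled[field] = fixed_value
    return filled

row = {"title": "Desk lamp", "price": "", "listed": "03/15/2024"}
print(clean_price("$1,299.00"))   # USD 1299.00
print(clean_date(row["listed"]))  # 2024-03-15
print(apply_missing_rule(row, "price", "fixed_value", "N/A"))
```

Keeping each cleaning step as its own small function mirrors the way cleaning steps are applied to a field one after another.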
3. Trigger
Trigger is used when you want to scrape data based on certain conditions.
For example, you can have a line of data discarded whenever Field1 in that line is not blank. Check out more details about Trigger.
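The idea of a condition paired with an action can be sketched in Python as follows. This is only a conceptual model under assumed names (apply_trigger, condition, action); it does not reflect how Trigger is configured or implemented in Octoparse.

```python
def apply_trigger(rows, condition, action):
    """Run `action` on rows where `condition` holds; keep the rest unchanged.

    `action` returns the row to keep, or None to drop the line.
    """
    kept = []
    for row in rows:
        result = action(row) if condition(row) else row
        if result is not None:
            kept.append(result)
    return kept

rows = [
    {"Field1": "sold out", "Field2": "Desk lamp"},
    {"Field1": "", "Field2": "Office chair"},
]

# Condition from the example: Field1 is not blank. Action: drop the line.
kept = apply_trigger(
    rows,
    condition=lambda row: row["Field1"] != "",
    action=lambda row: None,
)
print(kept)  # [{'Field1': '', 'Field2': 'Office chair'}]
```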
4. Before action is performed (add wait time)
This setting lets you add a wait time before the action is executed.
Websites load their data at different speeds, so you sometimes need to add a wait time or a waiting condition to give the web page more time to load.
You can check this guide for different use cases: Wait before action.
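For readers who think in code, a wait with a waiting condition behaves roughly like the polling loop below. This is a generic Python sketch, not Octoparse's implementation; the function name, the timeout values, and the stand-in check are all assumptions made for illustration.

```python
import time

def wait_before_action(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it is true or `timeout` seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)

# Stand-in for "the element has appeared": true on the third check.
checks = {"count": 0}

def element_loaded():
    checks["count"] += 1
    return checks["count"] >= 3

ready = wait_before_action(element_loaded, timeout=2.0, poll_interval=0.05)
print(ready)
```

A fixed wait time corresponds to always sleeping for the full duration, while a waiting condition lets the action start as soon as the page is ready.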
If you have any trouble with setting up your task, you're welcome to submit a ticket to our Support team.
Spanish version of this article: Extraer datos