Having data auto-detected is cool, but no algorithm is perfect. There will be occasions when the data you need is not accurately detected. In this lesson, we'll go over some easy fixes you can apply to optimize your scraping task.
1. If your target data fields were not detected
When Octoparse auto-detects data on a web page, it scans the whole page and extracts one or more sets of data using its machine learning algorithm. If your target data is not detected on the first attempt, you can switch to the next set of data by clicking Switch auto-detect results. The fraction displayed (for example, 1/3) means Octoparse has detected 3 sets of data and you are looking at the first one.
2. If the auto-detected pagination is incorrect
If the auto-detection fails to locate the pagination correctly, you can easily fix it by clicking on Edit, and then following the instructions on the Tips panel to re-select the correct Next Page button.
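For readers who think in code, re-selecting the Next Page button amounts to telling the scraper which link to follow on each page. The Python sketch below is purely illustrative and is not Octoparse code or its API; the names `scrape_all_pages`, `get_rows`, `get_next_page`, and `max_pages` are hypothetical, and pages are modelled as plain dictionaries.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def scrape_all_pages(
    first_page: Page,
    get_rows: Callable[[Page], List[str]],
    get_next_page: Callable[[Page], Optional[Page]],
    max_pages: int = 100,
) -> List[str]:
    """Collect rows from each page, following the Next Page target until none is left."""
    rows: List[str] = []
    page: Optional[Page] = first_page
    visited = 0
    # max_pages is a safety cap in case the "next" target loops back on itself.
    while page is not None and visited < max_pages:
        rows.extend(get_rows(page))
        page = get_next_page(page)
        visited += 1
    return rows


# Hypothetical in-memory "site": each page lists its rows and the key of the next page.
SITE: Dict[str, Page] = {
    "p1": {"rows": ["a", "b"], "next": "p2"},
    "p2": {"rows": ["c"], "next": "p3"},
    "p3": {"rows": ["d", "e"], "next": None},
}


def rows_of(page: Page) -> List[str]:
    return list(page["rows"])


def next_of(page: Page) -> Optional[Page]:
    return SITE[page["next"]] if page["next"] else None


all_rows = scrape_all_pages(SITE["p1"], rows_of, next_of)
print(all_rows)
```

If `get_next_page` points at the wrong element, the loop either stops early or visits the wrong pages, which is the situation the Edit option is meant to correct.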
TIP: If the "Next" button or "Load more" button is never detected, check out these tutorials for adding the pagination manually:
3. If you need to scroll further down the page to load more data
Whenever Octoparse detects a web page with infinite scroll, it automatically sets the number of times to scroll down the page. If you prefer to scroll more before capturing the data, you can adjust the number of scrolls by clicking on Edit and then completing the settings.
In this case, Repeats is the number of times Octoparse scrolls on this page, and Wait is the dwell time between scrolls.
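Conceptually, the two settings describe a simple loop: scroll, pause, and repeat. The Python sketch below illustrates that idea only; it is not how Octoparse is implemented, and `scroll_to_load`, `scroll_once`, and `wait_seconds` are hypothetical names (treating Wait as a value in seconds is also an assumption).

```python
import time
from typing import Callable, List


def scroll_to_load(
    scroll_once: Callable[[], None],
    repeats: int,
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Scroll `repeats` times, pausing `wait_seconds` after each scroll.

    Returns the total time spent waiting, in seconds.
    """
    if repeats < 0 or wait_seconds < 0:
        raise ValueError("repeats and wait_seconds must be non-negative")
    for _ in range(repeats):
        scroll_once()        # trigger one scroll (hypothetical callback)
        sleep(wait_seconds)  # give newly loaded items time to appear
    return repeats * wait_seconds


# Usage with stand-ins so the example runs without a browser.
scroll_log: List[str] = []
total_wait = scroll_to_load(
    scroll_once=lambda: scroll_log.append("scroll"),
    repeats=5,
    wait_seconds=2,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(len(scroll_log), total_wait)
```

The sketch also shows the trade-off: more repeats or a longer wait load more data, but the total waiting time grows as repeats multiplied by wait.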
4. If you need to click links on the page to get more detailed data
In many cases, you will need to click on each product link to reach the product detail page, which gives you more specific information such as the product description. Octoparse offers an easy option for this on the Tips panel.
Select Click on link(s) to scrape the linked page(s) and choose the data field to click on.
Alternatively, you can choose "Click a link on the web page" and select the link directly from the web page.
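In code terms, this option means: for every row in the list, open its link and add the fields found on the linked page to that row. The Python sketch below only illustrates that pattern; it does not use Octoparse, and `scrape_linked_pages`, `link_field`, `fetch_detail`, and the sample data are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def scrape_linked_pages(
    list_rows: List[Row],
    link_field: str,
    fetch_detail: Callable[[str], Row],
) -> List[Row]:
    """For each list row, open the link in `link_field` and merge in the detail fields."""
    merged: List[Row] = []
    for row in list_rows:
        detail = fetch_detail(row[link_field])
        merged.append({**row, **detail})  # detail fields are added to the list fields
    return merged


# Hypothetical data standing in for a product list page and its detail pages.
listing: List[Row] = [
    {"title": "Lamp", "link": "/item/1"},
    {"title": "Desk", "link": "/item/2"},
]
detail_pages: Dict[str, Row] = {
    "/item/1": {"description": "A small reading lamp."},
    "/item/2": {"description": "A standing desk."},
}

products = scrape_linked_pages(listing, "link", detail_pages.__getitem__)
print(products[0])
```

Here `link_field` plays the role of the data field you choose to click on.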
5. Working with the workflow directly
When you build a scraping task in Octoparse, it simulates real human browsing actions, such as opening a web page and clicking on a page element/button to extract data automatically. The whole extraction process is defined automatically in a workflow with each individual step/action representing a particular instruction in the scraping task.
Though Octoparse tries to make things easier for you by auto-generating the workflow through auto-detection, you can technically build the workflow from scratch or edit the auto-generated workflow to ensure the task does what you need it to do.
There are many different types of actions you can add to the workflow. Each step/action has various settings that you can modify to fine-tune your scraping task.
- Rearrange steps of the workflow by dragging and dropping to the right spot.
- Click to check and modify the settings of the specific step.
- To add an extra step to the workflow, place your mouse where you would like to insert the step. Wait until the + sign appears, then click on it and select the action you'd like to add.
- Rename, copy or delete a step by right-clicking each step on the workflow.
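One way to picture these edits is to treat the workflow as an ordered list of steps, each with a name, an action, and its own settings. The Python sketch below models that idea under stated assumptions: the `Step` and `Workflow` classes, the step names, and the setting keys are hypothetical and do not reflect Octoparse's internal format or any API.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    name: str
    action: str
    settings: Dict[str, Any] = field(default_factory=dict)


class Workflow:
    """An ordered list of steps supporting the edits listed above."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def move(self, old_index: int, new_index: int) -> None:
        """Rearrange: take a step out and drop it at a new position."""
        self.steps.insert(new_index, self.steps.pop(old_index))

    def insert(self, index: int, step: Step) -> None:
        """Add an extra step at the chosen position."""
        self.steps.insert(index, step)

    def rename(self, index: int, new_name: str) -> None:
        self.steps[index].name = new_name

    def duplicate(self, index: int) -> None:
        """Copy a step and place the independent copy right after the original."""
        self.steps.insert(index + 1, copy.deepcopy(self.steps[index]))

    def delete(self, index: int) -> None:
        del self.steps[index]

    def names(self) -> List[str]:
        return [step.name for step in self.steps]


# Start with two steps in the wrong order, then fix and extend the workflow.
workflow = Workflow([Step("Extract Data", "extract"), Step("Go to Web Page", "open_page")])
workflow.move(1, 0)                                   # drag the page-open step to the top
workflow.insert(1, Step("Scroll", "scroll", {"repeats": 5}))
workflow.rename(1, "Scroll Page")
workflow.steps[1].settings["repeats"] = 10            # modify one step's settings
workflow.duplicate(1)
workflow.delete(2)                                    # remove the copy again
print(workflow.names())
```

Because the steps run in list order, moving or inserting a step changes what the task does, which is why it is worth checking each step's settings after rearranging.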
Continue to Lesson 3: Refine your data