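In many cases, pagination is not an option for loading content. Instead, you will need to either scroll down the page continuously (infinite scrolling) or click a "Load More" button to reveal more content.
(An updated version of this tutorial for the latest version, 8.1, is also available.)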
This tutorial will show you how to configure a task in Octoparse to deal with these two situations, making sure all available data is extracted.
1) Infinite scrolling
With the proper settings, Octoparse scrolls down a page the same way you would do it manually. All you need to do is tell Octoparse which page to scroll, how many times to scroll, and how long to wait between two scrolls.
1) Navigate to the webpage that needs to be scrolled
This should be either an "Open Webpage" action or a "Click" action, depending on how the page is connected to the previous action in the workflow.
2) From "Advanced Option", locate the option for "Scroll Down"
3) Check "Scroll down to bottom of the page when finished loading"
4) Input the desired number for "Scroll times" and the number of seconds between the scrolls
5) From the drop-down menu, choose whether you would like to scroll down to the bottom of the page or scroll down for one screen.
6) Click "OK" to save the settings
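The settings above boil down to three values: how many times to scroll, how long to wait between scrolls, and whether each scroll goes to the bottom of the page or down one screen. The following is only a minimal Python sketch of that logic, not Octoparse's own code or API. `ScrollSettings`, `run_scrolls`, and the `scroll` callback are hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScrollSettings:
    """Hypothetical container mirroring the three scroll settings."""
    scroll_times: int          # how many times to scroll
    interval_seconds: float    # wait between two scrolls
    to_bottom: bool = True     # True: scroll to page bottom; False: one screen


def run_scrolls(settings: ScrollSettings,
                scroll: Callable[[bool], None],
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `scroll` the configured number of times, pausing between scrolls."""
    done = 0
    for i in range(settings.scroll_times):
        scroll(settings.to_bottom)
        done += 1
        # Pause only between scrolls, not after the last one.
        if i < settings.scroll_times - 1:
            sleep(settings.interval_seconds)
    return done
```

For example, `run_scrolls(ScrollSettings(5, 2.0), my_scroll)` would call the assumed `my_scroll` function five times with a two-second pause between calls.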
Infinite scrolling is easy to set up, but to find the most appropriate settings you may want to test run the task to check whether you have assigned enough scroll times and whether the scrolling works at the right pace.
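As a starting point for the test run, the number of scrolls can be estimated from how many items you expect and how many appear per scroll. This is a rough sketch that assumes every scroll loads a fixed batch of items, which real sites may not do; `estimate_scroll_times` and its parameters are hypothetical names, not Octoparse settings.

```python
import math


def estimate_scroll_times(total_items: int,
                          items_per_scroll: int,
                          initially_visible: int = 0) -> int:
    """Estimate how many scrolls are needed to reveal `total_items`.

    Assumes each scroll reveals exactly `items_per_scroll` new items.
    """
    if items_per_scroll <= 0:
        raise ValueError("items_per_scroll must be positive")
    remaining = max(total_items - initially_visible, 0)
    # Round up so a partial final batch still gets its own scroll.
    return math.ceil(remaining / items_per_scroll)
```

Treat the result as a lower bound and add a few extra scrolls when testing, since slow loading can make a scroll reveal nothing.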
2) Click "Load more" button
In addition to infinite scrolling, some webpages require clicking a "Load More" or "Show More" button, and more content then loads dynamically via AJAX.
To capture all available content from the page, we will configure Octoparse to first click the "Load More" button repeatedly until all the information needed is revealed, and then capture all the information at once.
Let's see how it is done using Health.usnews.com as an example.
1) Navigate to the page if you are not already there. Notice more content gets loaded every time you click on the "Load More" button located at the bottom of the page.
2) Hover over the "Load More" button and click on it (or right-click if left click triggers the link).
3) The Action Panel offers a list of possible next actions. Select "Loop click the selected link". This tells Octoparse to click the button repeatedly.
4) Now toggle the workflow switch at the top to see the workflow generated by Octoparse. Octoparse identifies the click as a pagination action, but a "Load More" click usually loads content via AJAX, so adjust the click settings as follows:
- Click on the action "Click to paginate" from the workflow
- From "Advanced Options", select "Load the page with AJAX" and set the timeout as long as needed (e.g. 1 or 2 seconds is usually enough).
If you only want to click the "Load More" button a fixed number of times (X), select the Pagination Loop in the workflow, open the "When loop end" setting under "Advanced options", and set the execution times to X.
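The click loop described above can be pictured as follows: keep clicking while the button is still present, wait for the AJAX timeout after each click, and stop early if a click limit is set. This is an illustrative Python sketch only. `FakePage` is a made-up stand-in for a real page, and `click_until_done`, `ajax_timeout`, and `max_clicks` are hypothetical names rather than Octoparse options.

```python
import time
from typing import Callable, Optional


class FakePage:
    """Stand-in for a page with a "Load More" button (not a real browser)."""

    def __init__(self, total_items: int, batch_size: int) -> None:
        self.total_items = total_items
        self.batch_size = batch_size
        self.visible = min(batch_size, total_items)

    def has_load_more(self) -> bool:
        # The button disappears once everything is visible.
        return self.visible < self.total_items

    def click_load_more(self) -> None:
        self.visible = min(self.visible + self.batch_size, self.total_items)


def click_until_done(page: FakePage,
                     ajax_timeout: float = 1.0,
                     max_clicks: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Click "Load More" until it is gone or `max_clicks` is reached."""
    clicks = 0
    while page.has_load_more():
        if max_clicks is not None and clicks >= max_clicks:
            break  # corresponds to a fixed number of executions (X)
        page.click_load_more()
        clicks += 1
        sleep(ajax_timeout)  # give the AJAX request time to finish
    return clicks
```

Leaving `max_clicks` as `None` corresponds to clicking until no more content loads, while setting it plays the role of the execution-times limit.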
5) Now you can build a list of the sections to loop through (see lesson 4).
6) Then proceed to extract the specific data fields from each section (see lesson 4).
7) Test run the task with "Local Extraction". Every website works differently, so it is important to always test run the task and check that all steps in the workflow are executed correctly.
1. If the extraction loop was built inside the pagination loop, drag it out manually, since the first loop should finish before the second one starts.
2. If an action was made by mistake, use "Undo Action" to cancel it.
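To see why the first tip matters, compare running the extraction after the click loop with running it inside the click loop. The sketch below is a simplified Python model, assuming the page keeps earlier items visible after every click; the function names and sample data are hypothetical.

```python
from typing import List


def extract_after_loading(batches: List[List[str]]) -> List[str]:
    """Finish the click loop first, then extract everything once."""
    visible: List[str] = []
    for batch in batches:       # each iteration models one "Load More" click
        visible.extend(batch)
    return list(visible)        # extraction runs a single time


def extract_inside_loop(batches: List[List[str]]) -> List[str]:
    """Extraction nested in the click loop re-reads items already seen."""
    visible: List[str] = []
    rows: List[str] = []
    for batch in batches:
        visible.extend(batch)
        rows.extend(visible)    # everything visible so far is extracted again
    return rows


batches = [["a", "b"], ["c", "d"], ["e"]]
print(extract_after_loading(batches))     # ['a', 'b', 'c', 'd', 'e']
print(len(extract_inside_loop(batches)))  # 11
```

Under this assumption, the nested version returns duplicate rows, which is one plausible reason to keep the two loops separate.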