Paginated content is everywhere. On your favorite e-commerce site, for example, products are rarely listed on one single page; instead they are spread across multiple pages, and that is pagination. So if you want to scrape product data from the site, you need to configure your task with pagination so that the products on every page are included.
This tutorial covers two common ways to deal with pagination:
1) Extract multiple pages using the "Next" button [check video tutorial]
- Load the list page/search result page in the built-in browser if you are not already there
- When the page is loaded, locate and click on the "Next" button
- From "Action Tips", select "Loop click next page"
- Switch to workflow mode by toggling the icon at the upper right, and notice that a "Click to paginate" step is automatically generated and added to the workflow.
(To finish setting up the task, learn how to capture items on a list and how to capture data from each item page by clicking into a list.)
If the paginated content is loaded dynamically via AJAX, set a 2- to 4-second AJAX timeout for the "Click to paginate" step. Do not set an AJAX timeout if the site does not use AJAX.
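Octoparse performs the "Loop click next page" pattern for you inside its built-in browser, but the underlying loop is worth seeing in code. The sketch below simulates it in Python with lxml, using hypothetical markup (a `div.product` per item and an `a.next` link) and in-memory pages standing in for HTTP fetches:

```python
# A minimal sketch of "Next"-button pagination. The class names, URLs, and
# page contents here are made up for illustration; a real scraper would
# fetch each URL over HTTP and use the site's actual selectors.
from lxml import html

# In-memory stand-ins for three paginated list pages.
PAGES = {
    "/list?page=1": '<html><body><div class="product">A</div>'
                    '<a class="next" href="/list?page=2">Next</a></body></html>',
    "/list?page=2": '<html><body><div class="product">B</div>'
                    '<a class="next" href="/list?page=3">Next</a></body></html>',
    "/list?page=3": '<html><body><div class="product">C</div></body></html>',
}

def fetch(url):
    """Stand-in for an HTTP GET (e.g. requests.get(url).text)."""
    return PAGES[url]

def scrape_all(start_url):
    products, url = [], start_url
    while url:
        tree = html.fromstring(fetch(url))
        products += tree.xpath('//div[@class="product"]/text()')
        # Follow the "Next" link if present; stop when it disappears.
        nxt = tree.xpath('//a[@class="next"]/@href')
        url = nxt[0] if nxt else None
    return products

print(scrape_all("/list?page=1"))  # ['A', 'B', 'C']
```

The last page simply lacks a "Next" link, which ends the loop; this is also why the AJAX timeout above matters in Octoparse, where the click may not trigger a full page load.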
2) Extract data from multiple pages when there is no "Next" button (Page number links only)
Sometimes the "Next" button is not available but only the page number links like this:
In this case, we need to modify the XPath of the "Click to paginate" action in the workflow. We'll first add a pagination loop using page number "1", although the loop will not work properly without further adjustment.
- Click on page number "1"
- From the "Action Tips", select "Loop click the selected link" to create a pagination "Loop Item". (Learn more about using XPath in Octoparse )
The auto-generated pagination loop will not work properly here because we selected page number "1" to loop through. With the current setup, Octoparse will simply keep clicking on "1" as it tries to paginate, extracting duplicate data endlessly.
Now we need to modify the XPath of the "Click to paginate" action; this is the most important part of handling page-number pagination.
The XPath syntax most often used here is "following-sibling", which selects all the siblings after the current node.
For example, when we are on page 1, our goal is to click on page number "2" to get to page 2, then page 3, and so on.
1) To do this, first write the XPath that locates the currently selected page node
Inspect the source code and locate the code for the currently selected page (this can often be done by right-clicking on the page number "1" and then selecting "Inspect Source Code" or a similar command). In the example below, the code for the node of page 1 is: <li class="nav-pageitem selected">.
Thus, the XPath of the selected-page item would be:
//li[@class="nav-pageitem selected"]
2) Select the 2nd-page node with the XPath syntax "following-sibling"
As the 2nd page is found within the first "li" tag following the current "li" node, the correct XPath would be:
//li[@class="nav-pageitem selected"]/following-sibling::li[1]
3) For clicking on the link, we need to locate the "a" tag, i.e. the anchor linking to the 2nd page.
Now we have the complete XPath:
//li[@class="nav-pageitem selected"]/following-sibling::li[1]/a
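The three-step XPath above can be checked outside Octoparse. The snippet below evaluates it with Python and lxml against sample markup that mirrors the <li class="nav-pageitem selected"> structure from the example (the surrounding `ul` and the `href` values are assumptions for illustration):

```python
# Verify the following-sibling XPath against sample page-number markup.
# The href values and the ul wrapper are invented for this demo.
from lxml import html

PAGE_BAR = """
<ul>
  <li class="nav-pageitem selected"><a href="/p1">1</a></li>
  <li class="nav-pageitem"><a href="/p2">2</a></li>
  <li class="nav-pageitem"><a href="/p3">3</a></li>
</ul>
"""

tree = html.fromstring(PAGE_BAR)

# Step 1: locate the currently selected page node.
selected = tree.xpath('//li[@class="nav-pageitem selected"]')

# Steps 2 and 3: its first following li sibling, then the anchor inside it.
next_link = tree.xpath(
    '//li[@class="nav-pageitem selected"]/following-sibling::li[1]/a')
print(next_link[0].get("href"))  # /p2
```

Because the XPath is anchored on whichever `li` currently carries the "selected" class, the same expression keeps pointing at the next page as the loop advances, which is exactly what the pagination loop needs.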
4) Replace the auto-generated XPath for the pagination loop with the new XPath
- Click on the pagination loop and, in the settings panel on the right side, enter the new XPath in the "Single element" textbox
5) Double-check the XPath to make sure it works for other pages
- Click on the pagination loop
- Click on the "Click to paginate" action
- Check whether the webpage has successfully paginated to the subsequent page
- Repeat the above step for more pages
While XPath is used to locate particular items on a web page, it is based on the page's source code. Hence the XPath given in this example will most likely not apply to other websites, but you can always use the same method to write an XPath that works for your target website.
Article in Spanish: Extraer varias páginas a través de la paginación
You can also read web scraping articles on the official website.