You are browsing a tutorial guide for Octoparse's latest version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Extracting data from multiple pages using pagination is very common since most of the time, you'll need more than just one page of data for your project.

If Octoparse keeps scraping the last page and never stops, chances are it can still find and click the "Next" button even after it has reached the last page. We generally refer to this as an "endless loop" issue.

There are two ways to solve it:

  1. Set up conditions to end the pagination loop

  2. Modify the XPath manually to stop detecting the "Next" button when it is on the last page


1. Set up loop ending condition - Exit loop when

The Exit loop when option allows you to end the pagination loop after it has repeated a set number of times. For example, if you'd like to scrape the first 50 pages of data, set the number of repeats to 50. Octoparse will then click the "Next" button 50 times and exit the pagination loop when it reaches page 50.

This is an easy and effective way to resolve the issue if you know the exact number of pages you'd like to fetch data from. Follow the steps below to set up end-loop conditions:

  • Go to the settings of the Pagination loop

  • Find the Exit loop when option at the bottom of the settings

  • Tick the box and enter a number for the number of repeats

  • Click Apply to save the new settings
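The effect of a repeat limit can be sketched in a few lines of Python. This is only an illustrative simulation, not Octoparse's implementation: the function `simulate_pagination` and its parameters are hypothetical, and it assumes that one repeat equals one pass through the loop body, which may differ from how Octoparse counts internally.

```python
from typing import List, Optional


def simulate_pagination(
    last_page: int,
    next_found_on_last_page: bool,
    max_repeats: Optional[int] = None,
    safety_cap: int = 200,
) -> List[int]:
    """Return the page numbers visited by a simulated pagination loop.

    All names here are hypothetical; `safety_cap` only exists so that the
    simulated endless loop terminates in this demo.
    """
    visited: List[int] = []
    page = 1
    while len(visited) < safety_cap:
        # "Exit loop when": stop once the configured number of repeats is reached.
        if max_repeats is not None and len(visited) >= max_repeats:
            break
        visited.append(page)  # stands in for "extract data from this page"
        on_last = page == last_page
        if on_last and not next_found_on_last_page:
            break  # no "Next" button found, so the loop ends on its own
        if not on_last:
            page += 1  # clicking "Next" loads the following page
        # Otherwise "Next" is still found on the last page and the click
        # just keeps us there: this is the endless loop.
    return visited


# Without a limit, the last page is scraped over and over (until the cap).
print(simulate_pagination(last_page=3, next_found_on_last_page=True, safety_cap=6))
# With a limit of 3 repeats, the loop stops after three pages.
print(simulate_pagination(last_page=3, next_found_on_last_page=True, max_repeats=3))
```

In this sketch the limit works even though the "Next" button is still detected on the last page, which is why the option is useful when you know how many pages you need.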


2. Modify XPath

If the issue cannot be resolved by setting up a loop-ending condition, you may need to modify the XPath of the pagination loop. Octoparse uses XPath to locate any elements on the page, including the "Next" button. In most cases, Octoparse can generate the XPath automatically and accurately; however, you may still need to revise the XPath manually from time to time. For example, in the case of an endless loop, you'll need to write an XPath that can precisely locate the "Next" button on all pages except the last page.

Tip: We suggest that you use the Chrome extension XPath Helper to write the XPath. You can check out how to write an XPath in the tutorial: What is XPath and how to use it in Octoparse.

Let's use an example to show you how to write an XPath that works for this purpose.

As you can see from the two screenshots below, the auto-generated XPath, when tested with XPath Helper, locates the "Next" button on both the first page and the last page.

On the first page:

first_page.jpg

On the last page:

last_page.jpg

Now, we need to find out the difference between the button on the first and the last pages and utilize the difference to write the XPath. We can right-click on the button in Chrome to inspect the HTML code of the button.

On the first page:

first_page1.jpg

On the last page:

last_page1.jpg

Notice that the HTML code for the two buttons is different: on the last page, the button has an additional attribute, "aria-disabled".

We can use this observation to write a new XPath that locates the "Next" button only when it is NOT on the last page. The predicate [not(@aria-disabled)] excludes any element that has the "aria-disabled" attribute. The new XPath is: //a[@class="pagination__next icon-link"][not(@aria-disabled)]

Enter the new XPath into XPath Helper on both the first page and the last page to verify that it locates the "Next" button on the first page but not on the last page.

On the first page:

first_page2.jpg

On the last page:

last_page2.jpg

Great! We've got no matching nodes on the last page, and this is exactly what we want: an XPath that selects the "Next" button on the first page but not on the last page. To be more thorough, you can also check that "Next" is still selected on page 2, page 3, and so on.
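The same check can be reproduced outside the browser. The sketch below uses Python's standard xml.etree.ElementTree module on two small, hypothetical snippets of markup; the real page's HTML is only visible in the screenshots, so the `href` values and the `aria-disabled="true"` value are assumptions. ElementTree's limited XPath support does not include `not()`, so the `[not(@aria-disabled)]` predicate is emulated with a Python filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for the "Next" button on each page.
FIRST_PAGE = (
    '<nav><a class="pagination__next icon-link" href="?page=2">Next</a></nav>'
)
LAST_PAGE = (
    '<nav><a class="pagination__next icon-link" aria-disabled="true" '
    'href="#">Next</a></nav>'
)


def find_active_next(page_markup: str) -> list:
    """Emulate //a[@class="pagination__next icon-link"][not(@aria-disabled)]."""
    root = ET.fromstring(page_markup)
    # First predicate: exact match on the class attribute.
    candidates = root.findall(".//a[@class='pagination__next icon-link']")
    # Second predicate: keep only elements WITHOUT an aria-disabled attribute.
    return [a for a in candidates if "aria-disabled" not in a.attrib]


print(len(find_active_next(FIRST_PAGE)))  # one match on the first page
print(len(find_active_next(LAST_PAGE)))   # no match on the last page
```

Note that `@class="..."` compares the whole attribute value, so a sketch like this (and the XPath itself) would stop matching if the site added or reordered class names on the button.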

Once you have the new XPath ready, follow the steps below to apply the new XPath to the pagination loop:

  • Go to the settings of the Pagination loop

  • Replace the original XPath with the revised XPath

  • Click Apply to save the new settings

change_Xpath.jpg
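To see why replacing the XPath ends the loop, here is a rough simulation of a three-page site crawled with the original and the revised locator. Everything in it is hypothetical: the `PAGES` markup, the helper names, and the `max_steps` guard that keeps the demo from running forever. It is not how Octoparse executes a pagination loop internally.

```python
import xml.etree.ElementTree as ET

NEXT = ".//a[@class='pagination__next icon-link']"

# Hypothetical markup: the button is marked aria-disabled only on the last page.
PAGES = {
    1: '<nav><a class="pagination__next icon-link" href="?page=2">Next</a></nav>',
    2: '<nav><a class="pagination__next icon-link" href="?page=3">Next</a></nav>',
    3: '<nav><a class="pagination__next icon-link" aria-disabled="true" '
       'href="#">Next</a></nav>',
}


def original_locator(root):
    """Matches the button on every page, including the last one."""
    return root.find(NEXT)


def revised_locator(root):
    """Matches the button only when it has no aria-disabled attribute."""
    for a in root.findall(NEXT):
        if "aria-disabled" not in a.attrib:
            return a
    return None


def crawl(locate_next, max_steps=8):
    """Visit pages until the locator finds no "Next" button (or the guard trips)."""
    visited, page = [], 1
    for _ in range(max_steps):
        visited.append(page)
        if locate_next(ET.fromstring(PAGES[page])) is None:
            break  # nothing to click: the pagination loop ends
        page = min(page + 1, max(PAGES))  # a click on the last page goes nowhere
    return visited


print(crawl(original_locator))  # keeps revisiting page 3 until max_steps
print(crawl(revised_locator))   # stops after page 3
```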

To sum up, the endless loop issue is not daunting to deal with. Depending on your scraping requirements, you may set up conditions to end the loop or revise the XPath to fix it.
