Extracting data from multiple pages using pagination is very common since most of the time you'll need more than just one page of data for your project.
If Octoparse keeps scraping the last page and never stops, chances are it can still find and click the "Next" button even after it has reached the last page. We generally refer to this as the "endless loop" issue.
There are two ways to solve it: set up a condition that forces the pagination loop to end, or modify the XPath manually so that Octoparse no longer detects the "Next" button when it is on the last page.
1. Set up loop ending condition - "Exit loop when"
The "Exit loop when" option lets you force the pagination loop to end after a set number of repeats. For example, to scrape the first 50 pages of data, set the number of repeats to 49: Octoparse will click the "Next" button 49 times and exit the pagination loop when it reaches page 50.
This is an easy and effective way to resolve the issue if you know the exact number of pages you'd like to fetch data from. Follow the steps below to set up end-loop conditions:
1) Go to the settings of the pagination loop
2) Find and open the setting for "Exit Loop"
3) Check the box and enter a number for the number of repeats
4) Click "OK" to save the new settings
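The repeat count above is easy to get off by one, so here is a minimal Python sketch of the arithmetic. It does not call Octoparse; the function name `scrape_pages` and its variables are hypothetical and only simulate a loop that exits after a fixed number of "Next" clicks.

```python
def scrape_pages(total_pages):
    """Simulate a pagination loop that exits after a fixed number of repeats."""
    # Page 1 is already loaded, so N pages need N - 1 clicks on "Next".
    max_clicks = total_pages - 1
    current_page = 1
    visited = [current_page]      # "scrape" the first page
    clicks = 0
    while clicks < max_clicks:    # plays the role of the "Exit loop when" condition
        current_page += 1         # stand-in for clicking "Next"
        clicks += 1
        visited.append(current_page)
    return clicks, visited


clicks, visited = scrape_pages(50)
print(clicks, visited[-1])  # 49 50
```

In other words, 49 repeats cover pages 1 through 50, because the first page is scraped before any click happens.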
2. Modify XPath
If the issue cannot be resolved by setting up a loop-ending condition, you may need to modify the XPath of the pagination loop. Octoparse uses XPath to locate elements on the page, including the "Next" button. For the most part, Octoparse generates the XPath automatically and accurately, but you may still need to revise it manually from time to time. In the case of an endless loop, you'll need to write an XPath that locates the "Next" button on every page except the last one.
Let's use an example to show you how to write an XPath that works for this purpose.
As you can see from the two screenshots below, the "Next" button is located by an XPath auto-generated by the Firepath plugin, both on the first page and on the last page.
On the first page:
On the last page:
Now, notice how the "class" attribute of the "a" tag differs between the first page and the last page: it is "gspr next" on the first page and "gspr next-d" on the last page.
We will then make use of this observation and write a new XPath to locate the "Next" button only when it is NOT on the last page. The new XPath is //a[@class='gspr next'].
Simply enter the new XPath into Firepath and check what it matches on the first page and on the last page. It should locate the "Next" button on the first page and match nothing on the last page.
On the first page:
On the last page:
Great! We've got no matching nodes on the last page, and this is exactly what we want: an XPath that selects the "Next" button on the first page but not on the last page. To be more thorough, you can also check that "Next" is selected on page 2, page 3, and so on.
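If you don't have Firepath at hand, the same check can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the two markup snippets are hypothetical stand-ins for the real pages (only the class values "gspr next" and "gspr next-d" come from the example above), and `xml.etree.ElementTree` supports a limited XPath subset that expects `.//a[...]` rather than `//a[...]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; only the class values are taken from the example.
FIRST_PAGE = '<div><a class="gspr next" href="?page=2">Next</a></div>'
LAST_PAGE = '<div><a class="gspr next-d">Next</a></div>'

# ElementTree form of //a[@class='gspr next']
NEXT_XPATH = ".//a[@class='gspr next']"


def find_next(markup, xpath=NEXT_XPATH):
    """Return every element in the markup that the XPath matches."""
    root = ET.fromstring(markup)
    return root.findall(xpath)


print(len(find_next(FIRST_PAGE)))  # 1
print(len(find_next(LAST_PAGE)))   # 0
```

The exact-match predicate is what makes this work: "gspr next-d" is a different attribute value, so the disabled button on the last page is not selected.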
Once you have the new XPath ready, follow the steps below to apply the new XPath to the pagination loop:
1) Go to the settings of the loop item
2) Replace the original XPath with the revised XPath
3) Click "OK" to save the new settings.
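To see why the revised XPath ends the loop, here is a rough Python simulation of a pagination loop. It is not how Octoparse is implemented; the three-page site in `PAGES` and the `paginate` function are hypothetical, and the loop simply stops as soon as the XPath finds no "Next" button.

```python
import xml.etree.ElementTree as ET

# ElementTree form of //a[@class='gspr next']
NEXT_XPATH = ".//a[@class='gspr next']"

# Hypothetical three-page site: the last page carries the "next-d" class.
PAGES = [
    "<div><a class='gspr next'>Next</a></div>",
    "<div><a class='gspr next'>Next</a></div>",
    "<div><a class='gspr next-d'>Next</a></div>",
]


def paginate(pages, next_xpath):
    """Visit pages in order until the XPath no longer matches a "Next" button."""
    index = 0
    scraped = []
    while True:
        scraped.append(index + 1)              # "scrape" the current page
        root = ET.fromstring(pages[index])
        if root.find(next_xpath) is None:      # no "Next" button found: exit the loop
            break
        index += 1                             # stand-in for clicking "Next"
    return scraped


print(paginate(PAGES, NEXT_XPATH))  # [1, 2, 3]
```

Each page is scraped exactly once, and the loop exits on page 3 without any repeat limit.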
If you would like to learn more about XPath modification, check out our XPath tutorial.
To sum up, the endless loop issue need not be daunting. Depending on your scraping requirements, you can set up a condition to end the loop or revise the XPath to fix it. If you have any trouble getting an XPath that works for you, feel free to reach out. We'd be more than happy to help.