What to do if Octoparse scrapes the last page forever?
Extracting data from multiple pages using pagination is very common, since most projects need more than one page of data.
If Octoparse keeps scraping the last page and never stops, chances are it can still find and click the "Next" button even after it has reached the last page. We generally refer to this as an "endless loop" issue.
There are two ways to solve it. You can set up a condition that forces the pagination loop to end, or you can modify the XPath manually so Octoparse no longer detects the "Next" button when it is on the last page.
1. Set up loop ending condition - "Exit loop when"
The "Exit loop when" option allows you to force the pagination loop to end after it has repeated a certain number of times. For example, if you'd like to scrape the first 50 pages of data, you can set 50 as the number of repeats. Octoparse will then click the "Next" button 50 times and exit the pagination loop when it reaches page 50.
This is an easy and effective way to resolve the issue if you know the exact number of pages you'd like to fetch data from. Follow the steps below to set up end-loop conditions:
1) Go to the settings of the Pagination loop
2) Find the "Exit Loop" option at the bottom of the settings
3) Tick the box and enter a number for the number of repeats
4) Click Apply to save the new settings
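To see why the number of repeats matters, here is a minimal Python sketch of the idea. It is a simplified model, not Octoparse's actual implementation: the function name `run_pagination_loop` and its parameters are made up for illustration, and it assumes each repeat scrapes the current page and then clicks a "Next" button that never disappears.

```python
def run_pagination_loop(last_page, max_repeats):
    """Simulate a pagination loop whose "Next" button is always found.

    Hypothetical model: each repeat scrapes the current page and then
    clicks "Next". On the last page the click just reloads that page.
    Returns the page numbers scraped, in order.
    """
    scraped = []
    page = 1
    for _ in range(max_repeats):  # plays the role of the "Exit loop when" limit
        scraped.append(page)
        # The faulty "Next" button still works on the last page,
        # so we stay there instead of moving forward.
        page = min(page + 1, last_page)
    return scraped


# Limit matches the real number of pages: every page is scraped once.
print(run_pagination_loop(last_page=5, max_repeats=5))  # [1, 2, 3, 4, 5]

# Limit is too high: the last page is scraped again and again until the limit.
print(run_pagination_loop(last_page=5, max_repeats=8))  # [1, 2, 3, 4, 5, 5, 5, 5]
```

In this model the limit stops the loop, but it only gives clean results when it matches the number of pages you actually want.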
2. Modify XPath
If the issue cannot be resolved by setting up a loop-ending condition, you may need to modify the XPath of the pagination loop. Octoparse uses XPath to locate elements on the page, including the "Next" button. In most cases, Octoparse generates the XPath automatically and accurately; however, you may still need to revise it manually from time to time. In the case of an endless loop, you'll need to write an XPath that locates the "Next" button on every page except the last one.
Let's use an example to show you how to write an XPath that works for this purpose.
As you can see from the two XPath Helper screenshots below, the auto-generated XPath locates the "Next" button on both the first page and the last page.
On the first page:
On the last page:
Now we need to find out how the button on the first page differs from the one on the last page, and use that difference to write the XPath. We can right-click the button in Chrome to inspect its HTML code.
On the first page:
On the last page:
Notice how the HTML code for the two buttons is different: the button on the last page has an "aria-disabled" attribute. We can use this observation to write a new XPath that locates the "Next" button only when it is NOT on the last page. The new XPath is //a[@class="pagination__next icon-link"][not(@aria-disabled)]
Enter the new XPath into XPath Helper to check what it matches on the first page and on the last page.
On the first page:
On the last page:
Great! We get no matching nodes on the last page, which is exactly what we want: an XPath that selects the "Next" button on the first page but not on the last page. Of course, you can be more thorough by checking that "Next" can also be selected on page 2, page 3, etc.
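If you'd like to double-check the logic outside the browser, a rough Python sketch like the one below can help. Only the class name and the "aria-disabled" attribute come from the example above; the two HTML snippets and the function name `find_next_buttons` are assumptions for illustration. Python's built-in ElementTree supports only a subset of XPath and has no not() function, so the [not(@aria-disabled)] part is emulated with a plain Python filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the two buttons. The real pages will contain
# more attributes; only the class and aria-disabled matter here.
FIRST_PAGE = '<div><a class="pagination__next icon-link" href="/page/2">Next</a></div>'
LAST_PAGE = '<div><a class="pagination__next icon-link" aria-disabled="true">Next</a></div>'


def find_next_buttons(markup):
    """Return the "Next" links that are not marked aria-disabled."""
    root = ET.fromstring(markup)
    # Equivalent of //a[@class="pagination__next icon-link"]
    candidates = root.findall(".//a[@class='pagination__next icon-link']")
    # Equivalent of the extra [not(@aria-disabled)] predicate
    return [a for a in candidates if "aria-disabled" not in a.attrib]


print(len(find_next_buttons(FIRST_PAGE)))  # 1
print(len(find_next_buttons(LAST_PAGE)))   # 0
```

One match on the first page and none on the last page mirrors what XPath Helper reports for the revised XPath.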
Once you have the new XPath ready, follow the steps below to apply the new XPath to the pagination loop:
1) Go to the settings of the Pagination loop
2) Replace the original XPath with the revised XPath
3) Click Apply to save the new settings
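The effect of swapping the XPath can be illustrated with a small simulation. This is a sketch under assumptions, not how Octoparse works internally: the three-page `PAGES` data, the `crawl` function and its `safety_cap` parameter are all hypothetical, and the disabled-button check is done in Python because ElementTree cannot evaluate not().

```python
import xml.etree.ElementTree as ET

# Hypothetical three-page site; page 3 is the last page.
PAGES = {
    1: '<nav><a class="pagination__next icon-link" href="?p=2">Next</a></nav>',
    2: '<nav><a class="pagination__next icon-link" href="?p=3">Next</a></nav>',
    3: '<nav><a class="pagination__next icon-link" aria-disabled="true">Next</a></nav>',
}


def next_found(markup, skip_disabled):
    """Report whether a "Next" link is located on the page."""
    links = ET.fromstring(markup).findall(".//a[@class='pagination__next icon-link']")
    if skip_disabled:
        # Mimics the revised XPath with [not(@aria-disabled)]
        links = [a for a in links if "aria-disabled" not in a.attrib]
    return bool(links)


def crawl(skip_disabled, safety_cap=10):
    """Scrape pages until "Next" is not found or the safety cap is hit."""
    page, visited = 1, []
    while len(visited) < safety_cap:
        visited.append(page)
        if not next_found(PAGES[page], skip_disabled):
            break  # no "Next" button located: the loop ends on its own
        page = min(page + 1, max(PAGES))  # clicking on the last page stays there
    return visited


print(crawl(skip_disabled=True))   # [1, 2, 3]
print(crawl(skip_disabled=False))  # [1, 2, 3, 3, 3, 3, 3, 3, 3, 3]
```

With the original locator the loop only stops because of the safety cap, while the revised locator lets it end naturally after the last page.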
To sum up, the endless loop issue is not daunting to deal with. Depending on your scraping requirements, you can either set up a condition to end the loop or revise the XPath to fix it. If you have any trouble getting an XPath that works for you, feel free to reach out; we'd be more than happy to help.
Author: Brian
Editor: Yina