If you are serious about scraping a website, chances are you will want to navigate through its different pages and extract data from each one of them. The first step, however, is to identify the kind of pagination you are dealing with and work from there. A few examples are:
- Paginate using the "Next" button
- Paginate without the "Next" button
- Paginate with infinite scrolling
- Paginate using the "Load more" button
In this tutorial, we will focus on how to create a pagination action when there is no next page button on the page. More specifically, one that requires clicking the numbered links when you want to turn the page, like the ones below.
Now, let's explore the various ways you can create a pagination action with no next page button in Octoparse.
1. Create a pagination with Auto-detect
If you are building a new task with Webpage Auto-detect, Octoparse automatically scans the web page for web data and pagination links.
If you have "Auto-detect" enabled in Settings, the auto-detect process will be initiated automatically.
If Octoparse detects any pagination links on the web page, pagination options will be provided in the Tips panel upon completion of the Auto-detect Process. You can click "Check" to see the link detected by Octoparse or click "Edit" to actually edit the link if it has not been detected correctly.
As we all know, web pages come in many different forms. There will be times when Auto-detect fails to detect the pagination links or detects the wrong links. In this case, you can turn to one of the solutions below.
2. Using "Batch generate" to create the URLs for all pages
An alternative but very effective way to approach scraping multiple pages of a website is to first collect the URLs of all the pages you need to scrape and then build a task using that list of URLs.
Take a close look at the URLs of the different pages. Do you notice something like this?
If you see a pattern similar to the example above, with only the page number changing in the URLs of the different pages, you can easily batch generate all the page URLs and scrape as many pages as needed. Once you have the links generated, Octoparse goes on to scrape all the pages automatically.
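To illustrate the idea behind batch generation, here is a minimal Python sketch that builds a list of page URLs from a pattern in which only the page number changes. It runs outside Octoparse and is not how Octoparse implements the feature. The function name batch_generate_urls, the {page} placeholder, and the example.com URL are all assumptions made for illustration.

```python
def batch_generate_urls(template, first_page, last_page, step=1):
    """Return one URL per page by filling the {page} placeholder in template."""
    return [
        template.format(page=number)
        for number in range(first_page, last_page + 1, step)
    ]


# Hypothetical URL pattern, used for illustration only.
TEMPLATE = "https://example.com/products?page={page}"

urls = batch_generate_urls(TEMPLATE, first_page=1, last_page=5)
for url in urls:
    print(url)
```

A list produced this way could then be pasted into a task as the list of URLs to scrape. If the site counts by offset rather than by page number (for example 0, 20, 40), the step argument covers that case.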
3. Create a pagination manually
Even when Auto-detect fails and the page URLs show no pattern, you can still create a pagination action manually.
It is a two-step process. First, you write/find the XPath of the page element that actually takes you to the next page (e.g. if you are on page 1, you want to click page 2; if you are on page 2, you want to click page 3, and so on). Second, you revise the XPath of the "Click to Paginate" action of the workflow in Octoparse. Sounds complicated? No worries, let's dive into an example.
XPath knowledge is not mandatory but is extremely helpful for creating the task that does exactly what you need in Octoparse. Check out What is XPath and how to use it in Octoparse to learn more about how to use XPath to create the perfect web scraper.
Say that you'll need to create a pagination step for this web page (http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text=) manually.
Step 1. Load the page in Octoparse and click on page number link "1". Then, select "Loop click single element". A "Loop Item" should be generated automatically in the Workflow.
Step 2. Leave the Octoparse App for now and follow the steps below to write/find the XPath you need for the pagination action.
1) Copy and paste the current page URL (http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text=) into your own browser (e.g. Chrome). Now, you need to install a browser add-on called XPath Helper.
2) In your browser, click to launch the XPath Helper.
3) Locate the page numbers on the web page, right click the page-number link "1" and select the Inspect option.
4) By now, your screen should look like this. The highlighted code corresponds to the link of page 1.
5) Next, right click the highlighted code, select "Copy", then "Copy XPath". You have just now copied the XPath of page-number link "1".
This is the XPath you've copied:
6) Looking at the source code, you can see that the page-2 element is located one line below the page-1 element.
Using the XPath axis "following-sibling", which selects the nodes that come after the current node at the same level, you can modify the copied XPath for the page-1 element into one that locates the page following it (page-2 in this case).
So the correct XPath that is always going to locate the next page following the current page is:
Note: Adding "/following-sibling::a" to the end of the previous XPath makes it match the link elements (a) that come after the first-page element at the same level. The first match is the link to the next page.
Enter the correct XPath in the Query section of the XPath Helper, and you will see that page "2" is correctly located using the XPath.
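If you would like to see how "following-sibling" behaves outside the browser, the following Python sketch evaluates such an XPath with the third-party lxml library (installed with pip install lxml). The markup is a made-up pager in which the current page carries a "current" class. It is not the markup of the example site, and the XPath, the helper names pager_markup and next_page_text, and the class name are assumptions for illustration only.

```python
from lxml import html


def pager_markup(current, total=3):
    """Build a hypothetical pager with `total` links; `current` gets class="current"."""
    links = "".join(
        '<a{} href="?page={}">{}</a>'.format(
            ' class="current"' if number == current else "", number, number
        )
        for number in range(1, total + 1)
    )
    return '<html><body><div class="pager">{}</div></body></html>'.format(links)


# Hypothetical XPath of the current-page element (not the one from the example site).
CURRENT_PAGE_XPATH = "//div[@class='pager']/a[@class='current']"


def next_page_text(markup):
    """Return the text of the first <a> sibling after the current page, or None."""
    tree = html.fromstring(markup)
    matches = tree.xpath(CURRENT_PAGE_XPATH + "/following-sibling::a[1]")
    return matches[0].text if matches else None


print(next_page_text(pager_markup(current=1)))  # link that follows page 1
print(next_page_text(pager_markup(current=3)))  # last page: nothing follows
```

The [1] predicate in this sketch keeps only the nearest following link. On the last page the query matches nothing, which is the condition under which a pagination loop would be expected to stop.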
Step 3. Now that you have the correct XPath ready and tested, go back to Octoparse and replace the existing XPath with the new one.
Double-click "Pagination" to open the settings menu.
Replace the existing XPath with the new XPath. Click "OK" to save.
Step 4. Final check! Click the Pagination box, then the Click to Paginate action. Octoparse should turn to the next page if everything is set up correctly. If necessary, repeat the process to further check how the pagination action is working.
If you still have trouble dealing with pagination without the next button, submit a ticket to our support team! We're here to help.