When you scrape a website, you often need to navigate through its pages and extract data from each one. The first step is to identify the type of pagination you are dealing with and work from there. A few examples are:
- Paginate using a "Next" button
- Paginate without a "Next" button
- Paginate with infinite scroll
- Paginate using a "Load more" button
In this tutorial, we will focus on how to create a pagination action when there is no "Next" button on the page, specifically when you have to click numbered links to turn the page.
You can create a pagination action with no "Next" button in Octoparse in three ways:
- Create pagination with Auto-detect
- Use "Batch Generate" to create URLs for all pages
- Create pagination manually
1. Create pagination with Auto-detect
With the Auto-detect function, Octoparse automatically scans the web page for data and pagination links. You can enable it in your settings so that the auto-detect process starts automatically every time.
If Octoparse detects any pagination links on the web page, a pagination option appears in the Tips panel when the auto-detect process completes. You can click Check to see the link Octoparse detected, or click Edit to change the link if it is not correct.
Web pages come in many different forms, so Auto-detect will sometimes miss the pagination links or detect the wrong ones. In that case, you can turn to one of the solutions below.
2. Use "Batch Generate" to create URLs for all pages
An alternative and very effective way to scrape multiple pages of a website is to first collect the URLs of all the pages you need to scrape, and then build a task from that list of URLs.
Take a closer look at the URLs of the different pages. Do you notice a pattern, such as a page number that changes from one URL to the next?
If only the page number changes between the URLs of the different pages, you can easily batch generate all the page URLs and scrape as many pages as needed. Once you have the links generated, Octoparse will go on to scrape all the pages automatically.
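To make the URL pattern idea concrete, here is a minimal Python sketch of what batch generation amounts to. It is not how Octoparse implements "Batch Generate"; the function name `batch_generate_urls` and the `example.com` URL template are assumptions made up for illustration.

```python
from __future__ import annotations


def batch_generate_urls(template: str, first_page: int, last_page: int, step: int = 1) -> list[str]:
    """Fill the {page} placeholder with every page number in the range (inclusive)."""
    return [template.format(page=n) for n in range(first_page, last_page + 1, step)]


# Hypothetical URL pattern in which only the page number changes.
TEMPLATE = "https://example.com/products?page={page}"

urls = batch_generate_urls(TEMPLATE, 1, 5)
for url in urls:
    print(url)
```

If the site counts by offset rather than by page (for example 0, 20, 40), the `step` argument covers that case as well.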
3. Create pagination manually
Even if Auto-detect fails and the page URLs do not show a pattern, you can still create a pagination action manually.
It will be a two-step process:
STEP 1: Write or find the XPath of the page element that takes you to the next page (e.g., if you are on page 1, you want to click page 2; if you are on page 2, you want to click page 3, and so on).
STEP 2: Revise the XPath of the Pagination in the workflow in Octoparse.
TIP: XPath knowledge is not mandatory but is extremely helpful to create a task that does exactly what you need in Octoparse. Check out What is XPath and how to use it in Octoparse to learn more about using XPath to create the perfect web scraper.
Sounds complicated? No worries, let's dive into an example.
To follow along, you can use the example URL given in the steps below:
- Click the pagination section on the webpage and click Loop click single element
- Get the right XPath
- Copy and paste the current page URL (http://www.enzolifesciences.com/product-listing/?product_type=Antibodies&application=&text=) into your browser (e.g., Chrome).
Note: You need to download a browser add-on tool called XPath Helper.
- In your browser, click to launch the XPath Helper.
- Locate the page numbers on the web page, right-click page 1 and select the Inspect option.
- By now, your screen should look like this. The highlighted code corresponds to the link on page 1.
- Next, right-click the highlighted code, select Copy, then Copy XPath. You have now copied the XPath of the page-1 element.
This is the XPath you've copied:
- Looking at the source code, you can find that page 2 is located one line below the page-1 element.
Using the XPath axis following-sibling, which selects the sibling nodes that come after the current node, you can modify the XPath copied for the page-1 element so that it points to the page that follows it (page 2 in this case).
The correct XPath, which always locates the page following the current page, is:
Note: By adding /following-sibling::a to the end of the previous XPath, it now looks for the link element (a) that follows the first-page element.
Enter the correct XPath into the Query section of the XPath Helper, and you'll find that page 2 is correctly located using the XPath.
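To see what the following-sibling step does, the sketch below emulates it in Python on a small, made-up pagination fragment. The markup (a `span` with class `current` followed by `a` links) and the helper name `next_page_link` are assumptions for illustration; the real page's HTML and the XPath you copied may differ. Python's built-in ElementTree does not support the following-sibling axis, so the helper walks the siblings by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; the real page's HTML will differ.
PAGINATION = """
<div class="pagination">
  <span class="current">1</span>
  <a href="?page=2">2</a>
  <a href="?page=3">3</a>
</div>
"""


def next_page_link(root, current_page_class="current"):
    """Return the first <a> sibling after the current-page element, or None."""
    # ElementTree keeps no parent pointers, so check every element as a possible parent.
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.get("class") == current_page_class:
                # Same idea as appending /following-sibling::a and taking the nearest match.
                for sibling in children[index + 1:]:
                    if sibling.tag == "a":
                        return sibling
                return None
    return None


root = ET.fromstring(PAGINATION)
link = next_page_link(root)
print(link.text, link.get("href"))
```

When the current page is the last one, no following link exists and the helper returns None, which is the point where a pagination loop would stop.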
- Revise the existing XPath with the new XPath
Copy and paste the new XPath under the pagination, then click "Apply" to confirm.