Lesson 6: Pagination - Capture data from multiple pages
Let's have a little recap. So far you've learned how to scrape data directly from a webpage, scrape data from a list, and scrape data from detail pages by clicking into one or more links on a webpage. Guess what? You are now geared up for more complex jobs, such as one that captures data from multiple pages. This lesson will show you how to scrape data that spans many pages by having Octoparse click the "Next" button automatically, a process generally called "pagination".
Creating a paginating scraper
We'll use our blog page again in this example: https://www.octoparse.com/blog
1. Once the blog page is loaded within Octoparse, locate the "Next" button on the webpage and click it.
2. Following the instructions in "Action Tips", select "Loop click the selected link" (or "Loop click next page"). A Pagination Loop will be automatically generated and added to the workflow.
Step-by-step Gif
A pagination loop has now been created. Octoparse will click the "Next" button repeatedly until the last page is reached. Alternatively, if you want to extract from only a specific number of pages, you can set the number of times the loop should be executed. For example, to extract data from the first 4 pages, set the number of executions to 3; Octoparse will then paginate only 3 times, stopping once it arrives at page 4.
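The relationship between the number of clicks and the number of pages visited can be sketched in a few lines of Python. This is only a conceptual model, not how Octoparse is implemented: the function name paginate and the parameters total_pages and max_clicks are hypothetical, and the "website" is simulated with a simple page counter.

```python
from typing import List, Optional


def paginate(total_pages: int, max_clicks: Optional[int] = None) -> List[int]:
    """Return the page numbers visited, starting from page 1.

    total_pages and max_clicks are hypothetical names used for illustration.
    max_clicks=None means "keep clicking Next until the last page".
    """
    current = 1
    visited = [current]  # page 1 is loaded before any click happens
    clicks = 0
    # A "Next" button is assumed to exist on every page except the last one.
    while current < total_pages:
        if max_clicks is not None and clicks >= max_clicks:
            break  # execution limit reached, stop paginating
        current += 1  # simulate one click on "Next"
        clicks += 1
        visited.append(current)
    return visited


print(paginate(10, max_clicks=3))  # [1, 2, 3, 4]
print(paginate(3))                 # [1, 2, 3]
```

Note that three clicks yield four pages, because the first page is already loaded before the loop clicks anything.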
Tips! For the scraper to paginate properly, it is critical that the Pagination Loop is positioned correctly. Octoparse executes the steps of the workflow in a top-down, inside-out manner. Therefore, any scraping action that needs to be performed before paginating to the second page must be positioned within the Pagination Loop. In this example, the task is built to first load the webpage, loop through the list to extract data on the first page, turn the page, then repeat the same scraping on the second page, and so on. Because the scraping steps must be performed before turning the page, the List Loop should be built within the Pagination Loop.
Rearrange the steps in the workflow whenever needed.
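The nesting described in the tip, a List Loop placed inside a Pagination Loop, can be illustrated with a small Python sketch. It is a conceptual model under stated assumptions rather than Octoparse's actual engine: SITE is a made-up in-memory stand-in for a paginated website, and scrape_site is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a paginated site: each page holds its list items
# and the key of the next page (None when there is no "Next" button).
SITE: Dict[str, dict] = {
    "page-1": {"items": ["Post A", "Post B"], "next": "page-2"},
    "page-2": {"items": ["Post C", "Post D"], "next": "page-3"},
    "page-3": {"items": ["Post E"], "next": None},
}


def scrape_site(site: Dict[str, dict], start: str) -> List[str]:
    """Extract every list item from every page, in order."""
    rows: List[str] = []
    current: Optional[str] = start
    while current is not None:          # Pagination Loop (outer)
        page = site[current]
        for item in page["items"]:      # List Loop (inner)
            rows.append(item)           # extract data from the current page
        current = page["next"]          # click "Next" only after extracting
    return rows


print(scrape_site(SITE, "page-1"))
# ['Post A', 'Post B', 'Post C', 'Post D', 'Post E']
```

If the inner loop were moved after the outer loop instead of inside it, the sketch would turn through every page first and have nothing left to extract from, which is why the List Loop belongs inside the Pagination Loop.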
From: https://www.octoparse.com/tutorial-7/capture-data-from-multiple-pages