An updated tutorial for the latest version, 8.1, is available here. Check it out now!
1) Click through the links in a list
To do this, create a "Loop Item" that clicks each product link on the search result page in turn.
- Click the first product title that contains the URL of the product page. The selected title is highlighted in green, and all other similar product titles are highlighted in red.
- Click the second product title that contains the URL.
- Select "Loop click each URL" from "Action Tips". Notice that a Loop-click step is automatically generated and added to the workflow.
To loop through all the links in the list, it is important to select the anchor text. Octoparse automatically identifies the tag of each selected item, so when you select an item with a URL, the selected tag will be "A", which stands for anchor, the element that usually links one page to another.
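Octoparse is a point-and-click tool, so no code is needed to follow this step. Purely as an illustration of why the "A" tag matters, here is a minimal Python sketch that collects the URL and anchor text of every anchor in a result list. The sample markup, the class name AnchorCollector, and the product paths are all hypothetical, and this is not how Octoparse is implemented internally.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical search-result markup; real pages will differ.
SAMPLE_PAGE = """
<ul>
  <li><a href="/product/1">Blue Kettle</a> <span>$20</span></li>
  <li><a href="/product/2">Red Toaster</a> <span>$35</span></li>
  <li><span>Sponsored</span></li>
</ul>
"""

collector = AnchorCollector()
collector.feed(SAMPLE_PAGE)
for href, text in collector.links:
    print(href, text)
```

Only the two anchors are collected. The price spans and the "Sponsored" item carry no URL, which is why selecting a non-anchor element would leave the loop with nothing to click through.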
2) Capture the information you need from the product page
Once the "Loop Item" is created, Octoparse clicks the first link in the list and loads the first product page in the built-in browser.
Now click the specific data fields you need. These selections will be used as a template for all the other product pages.
- Click on target data fields such as title, review, price, etc.
- Select "Extract data" from "Action Tips" to complete the extraction action. Notice that an "Extract data" step is automatically generated and added to the workflow. The extracted data is shown in the "Data field" pane next to the workflow designer.
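The idea of using the first product page as a template can be sketched in Python as follows. This is only an analogy under stated assumptions: the field names, the CSS class names in TEMPLATE, the sample pages, and the helper extract() are hypothetical, and the matching is deliberately simplified (exact match on a single class attribute value). Octoparse's own extraction logic is not shown here.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Capture the first text found inside elements whose class matches the template."""

    def __init__(self, template):
        super().__init__()
        # template maps field name -> class name; invert it for lookup by class.
        self.class_to_field = {cls: field for field, cls in template.items()}
        self.row = {field: None for field in template}
        self._current = None  # field currently waiting for its text

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.class_to_field:
            self._current = self.class_to_field[cls]

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.row[self._current] = data.strip()
            self._current = None


def extract(template, html):
    """Apply one template to one page and return a row of field values."""
    parser = FieldExtractor(template)
    parser.feed(html)
    return parser.row


# Hypothetical template, defined once from the first product page.
TEMPLATE = {"title": "product-title", "price": "price", "review": "review-count"}

# Hypothetical product pages; the second one has no review element.
PAGES = [
    '<h1 class="product-title">Blue Kettle</h1>'
    '<span class="price">$20</span><span class="review-count">12 reviews</span>',
    '<h1 class="product-title">Red Toaster</h1><span class="price">$35</span>',
]

rows = [extract(TEMPLATE, page) for page in PAGES]
for row in rows:
    print(row)
```

The same template is reused for every page, and a field that is missing on a given page is left as None rather than shifting the other values.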
Setting a wait time in "Advanced Options" for steps like "Click Item" or "Extract Data" can effectively prevent skipped data and make the crawling process more human-like. (Usually 2-5 seconds works well.)
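For readers who think in code, the effect of a per-step wait time can be sketched as a randomized pause before each step. The function name wait_before_step and its parameters are hypothetical, and picking a random value in the 2-5 second range is an assumption made for this sketch; in Octoparse the wait time is simply configured in "Advanced Options".

```python
import random
import time


def wait_before_step(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it.

    The sleep function is injectable so the pause can be simulated in a dry run.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Dry run: record the delays instead of actually sleeping.
recorded = []
for step in ("Click Item", "Extract Data"):
    delay = wait_before_step(sleep=recorded.append)
    print(f"{step}: would wait {delay:.1f}s")
```

Calling wait_before_step() with no arguments performs a real pause. The delay gives the page time to finish loading before the next step runs, which is what helps avoid skipped data.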
Spanish version of this article: Lección 5: Obtener datos - Capture datos de cada página de elementos