When we scrape product information from e-commerce websites, we often need to extract data not only from the search results page but also from each product's detail page. In this tutorial, we will show you how to build a customized crawler that does both.

Let's say we need to search for "camera lens" on eBay. See the sample URL below:

https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=camera+lens&_sacat=0&LH_TitleDesc=0&_odkw=camera+lens&_osacat=0

list_page_vs_detail_page.jpg

In this case, we want to extract the title of each camera lens from the listing page first and then go to its detail page for the item specifics. There are two ways to do this (a conceptual code sketch of this two-level crawl follows the list):

  1. Use the auto-detect web page function to create the workflow

  2. Manually create the workflow
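
For readers who want to see the idea outside of Octoparse, here is a minimal sketch of the same two-level crawl written with Python's requests and BeautifulSoup libraries. It is only a conceptual illustration: the a.s-item__link selector, the item limit, and the request headers are assumptions about eBay's current markup, not anything Octoparse generates.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}
SEARCH_URL = "https://www.ebay.com/sch/i.html?_nkw=camera+lens"

# Level 1: the search results (listing) page -- collect detail-page links.
listing = BeautifulSoup(
    requests.get(SEARCH_URL, headers=HEADERS, timeout=30).text, "html.parser"
)
links = [a.get("href") for a in listing.select("a.s-item__link") if a.get("href")][:5]  # assumed selector, first 5 items

# Level 2: each product's detail page -- read whatever elements we need.
for url in links:
    detail = BeautifulSoup(requests.get(url, headers=HEADERS, timeout=30).text, "html.parser")
    page_title = detail.title.get_text(strip=True) if detail.title else url
    print(page_title)
```

Octoparse builds the same listing-then-detail structure visually, which is what the two methods below walk through.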


1. Use auto-detect web page to create the workflow

The smart detection feature in Octoparse 8.X is more powerful than ever. We can use it to detect the data on the web page automatically and save some time (a simplified code sketch of what detecting repeated page elements involves follows these steps).

  • Click Auto-detect web page data in the Tips box and wait for it to complete

  • Switch between the auto-detect results to find your desired data fields (result 1 in this case)

switch.jpg
  • In the Data Preview section, modify the data fields by renaming them and removing the ones you don't want

data_preview.jpg
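
To give a rough intuition of what "detecting" a results page involves, the snippet below uses a naive heuristic: count how often each tag/class combination repeats on the page and treat the most repeated one as the result block. This is not Octoparse's actual detection algorithm, only a simplified illustration; the search URL is the same sample query used above.

```python
from collections import Counter

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.ebay.com/sch/i.html?_nkw=camera+lens"  # sample search from above
html = requests.get(SEARCH_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Count how often each (tag, class list) signature appears on the page.
signatures = Counter(
    (el.name, tuple(el.get("class", [])))
    for el in soup.find_all(True)
    if el.get("class")
)

# The most repeated signature is a rough guess for the "one result item" block.
(tag, classes), count = signatures.most_common(1)[0]
print(f"Most repeated block: <{tag} class='{' '.join(classes)}'> appears {count} times")
```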

When we search for popular product lines like the one in this example, chances are we need to navigate through multiple search result pages and extract data from each of them (a short pagination sketch follows the steps below).

  • Click on the Check button to see if Octoparse has successfully located a next page button

  • Uncheck Add a page scroll and click Create workflow

1.jpg
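
Conceptually, the pagination step keeps loading the next results page until no next-page link is found. Below is a hedged sketch of that loop; the .s-item__title and a.pagination__next selectors are assumptions about eBay's markup and may need re-inspecting, and the page cap only keeps the example short.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}
url = "https://www.ebay.com/sch/i.html?_nkw=camera+lens"
MAX_PAGES = 5  # safety cap so the sketch stays short

for _ in range(MAX_PAGES):
    soup = BeautifulSoup(requests.get(url, headers=HEADERS, timeout=30).text, "html.parser")
    titles = [t.get_text(strip=True) for t in soup.select(".s-item__title")]  # assumed selector
    print(f"{url} -> {len(titles)} titles")

    next_link = soup.select_one("a.pagination__next")  # assumed next-page selector
    if not next_link or not next_link.get("href"):
        break  # no next page located, stop paginating
    url = next_link["href"]
```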

Octoparse has now created a Loop Item in the workflow, which scrapes the search results page. Next, we will build the steps that open each product's detail page (a browser-automation sketch of the same idea follows the steps below).

  • Select Click on link(s) to scrape the linked page(s)

  • Choose a field with the URLs you want to click

2.jpg
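
The "Click on link(s) to scrape the linked page(s)" step can be pictured with the Selenium sketch below (not something Octoparse outputs). The a.s-item__link selector is an assumption; the sketch collects the detail-page URLs first and then navigates to each one, which avoids the stale-element errors that clicking and going back would cause.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_nkw=camera+lens")

# Collect the detail-page URLs from the results page first (assumed selector),
# then open them one by one instead of clicking back and forth.
links = driver.find_elements(By.CSS_SELECTOR, "a.s-item__link")
urls = [a.get_attribute("href") for a in links[:5]]

for url in urls:
    driver.get(url)                              # open the linked detail page
    print(driver.title, "->", driver.current_url)

driver.quit()
```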

Octoparse has now taken us to the detail page for further data extraction. We can capture the information we want from the page (the sketch after these steps shows the same extraction in code).

  • Click on any web element you want to extract

  • Click Extract the text of the element from the Tips panel

  • Modify the data field names in the Data Preview section

Extract_data.jpg
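
In code terms, "extract the text of the element" amounts to selecting an element and reading its text. The sketch below does that for a couple of fields on a single detail page; the item URL is hypothetical and both selectors are assumptions about eBay's markup.

```python
import requests
from bs4 import BeautifulSoup

DETAIL_URL = "https://www.ebay.com/itm/1234567890"  # hypothetical item URL
FIELDS = {
    "title": "h1.x-item-title__mainTitle",  # assumed selector
    "price": "div.x-price-primary",         # assumed selector
}

soup = BeautifulSoup(
    requests.get(DETAIL_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text,
    "html.parser",
)

record = {}
for name, css in FIELDS.items():
    el = soup.select_one(css)
    record[name] = el.get_text(strip=True) if el else None  # "Extract the text of the element"

print(record)
```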

2. Manually create the workflow

In case the auto-detect function fails for some websites, we can also set up the workflow manually. See the steps below:

  • Select the first item on the list page

  • Click Select all on the Tips panel

  • Click Extract text of the selected elements

A Loop Item has now been added to the workflow, but only the product title has been scraped so far. We can add other fields (a code sketch of this loop follows below).

  • Select any information you want to scrape from the results page

  • Choose Extract text of the element

manually_create_the_workflow.gif
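
The manual Loop Item corresponds to iterating over every result block and pulling several fields from each. Here is a minimal sketch of that loop, assuming li.s-item result blocks and the title/price selectors shown, all of which may need re-inspecting on the live page.

```python
import requests
from bs4 import BeautifulSoup

def text_or_none(parent, css):
    """Return the stripped text of the first match, or None if nothing matches."""
    el = parent.select_one(css)
    return el.get_text(strip=True) if el else None

soup = BeautifulSoup(
    requests.get(
        "https://www.ebay.com/sch/i.html?_nkw=camera+lens",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    ).text,
    "html.parser",
)

# One dictionary per result block -- the code analogue of the Loop Item.
rows = [
    {
        "title": text_or_none(item, ".s-item__title"),
        "price": text_or_none(item, ".s-item__price"),
    }
    for item in soup.select("li.s-item")  # assumed result-block selector
]
print(len(rows), "items scraped from the results page")
```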

Then we need to build an action that clicks the product title link to open its detail page (a click-and-return sketch follows the steps below).

  • Select the first title on the list page

  • Click Click element

click_element.jpg
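
The "Click element" step behaves roughly like the Selenium sketch below: click an item's title link, read what is needed on the detail page, then go back for the next item. The selector is an assumption, the loop is capped at three items to keep the sketch short, and it assumes the links open in the same tab; re-finding the links after every back-navigation avoids stale element references.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.ebay.com/sch/i.html?_nkw=camera+lens")

for i in range(3):  # first three items only, to keep the sketch short
    # Re-find the links after each back-navigation; old references go stale.
    links = driver.find_elements(By.CSS_SELECTOR, "a.s-item__link")  # assumed selector
    if i >= len(links):
        break
    links[i].click()     # the "Click element" step (assumes same-tab navigation)
    print(driver.title)  # read whatever is needed on the detail page here
    driver.back()        # return to the results page for the next item

driver.quit()
```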

Once we are taken to the detail page, we can extract the information from the Item specifics section (a sketch that parses these label/value pairs follows the steps below).

  • Click on any web element you want to extract

  • Click Extract the text of the element from the Tips panel

  • Modify the data field names in the Data Preview section
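
The Item specifics section is a set of label/value pairs, so in code it maps naturally onto a dictionary. The sketch below pairs labels with values using two class names that are assumptions about eBay's current markup (and a hypothetical item URL); re-inspect the page before relying on them.

```python
import requests
from bs4 import BeautifulSoup

DETAIL_URL = "https://www.ebay.com/itm/1234567890"  # hypothetical item URL
soup = BeautifulSoup(
    requests.get(DETAIL_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text,
    "html.parser",
)

# Pair each label with its value; both class names are assumptions about
# eBay's current "Item specifics" markup.
labels = [el.get_text(strip=True) for el in soup.select(".ux-labels-values__labels")]
values = [el.get_text(strip=True) for el in soup.select(".ux-labels-values__values")]
item_specifics = dict(zip(labels, values))

print(item_specifics)
```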
