You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

Clutch is a leading ratings and reviews platform for B2B service providers, featuring companies in over 100 countries and 500 industries. It categorizes companies by geographic location, field of expertise, and focus on proven skills, and uses the data it gathers to formulate a fair rating for each firm.

This tutorial will show you how to scrape a company listing page for company details from clutch.co with Octoparse.

The sample URL we will scrape in this tutorial is:

https://clutch.co/agencies/digital?geona_id=26487
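
If you plan to adapt this task to a different listing, it can help to see how the sample URL breaks down. The short Python sketch below uses only the standard library to split the URL into its host, path, and query parameters. This tutorial does not say what the `geona_id` value stands for, so the code only displays it and does not interpret it.

```python
from urllib.parse import urlsplit, parse_qs

# The sample URL from this tutorial.
SAMPLE_URL = "https://clutch.co/agencies/digital?geona_id=26487"

parts = urlsplit(SAMPLE_URL)
query = parse_qs(parts.query)  # maps each parameter name to a list of values

print("Host: ", parts.netloc)
print("Path: ", parts.path)
print("Query:", query)
```

Changing the path or the query value and pasting the new URL into Octoparse should open a different listing, assuming the site accepts that combination.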

Here are the major steps of the tutorial: [Download task file here]

  1. Create a Go to Web Page - to open the target web page

  2. Set up pagination - to scrape data from all pages

  3. Create Loop Item - to go through all the companies

  4. Extract more data - to extract other information about the companies

  5. Run the task - to get your desired data


1. Create a Go to Web Page - to open the target web page

  • Enter the page URL on the home screen and click Start to create a new task


You can also create the task in Custom Task mode and enter the URL there.

  • Find the +New button on the sidebar. Click on it and then select Custom Task.

custom_task.jpg
  • Input the URL into the URL Input box and click Save to start.

custom_task_2.png

2. Set up pagination - to scrape data from all pages

To have Octoparse extract data from every page, set up pagination first. Scroll to the bottom of the page, then:

  • Click the Next button

  • Select Loop click next page in the Tips panel
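
Conceptually, the Loop click next page action repeats the same steps on each page until there is no next button left. The Python sketch below illustrates that idea only; it is not how Octoparse is implemented. `FAKE_SITE`, the page ids, and the company names are hypothetical placeholders standing in for a real paginated listing.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a paginated listing:
# page id -> (items on that page, id of the next page or None).
FAKE_SITE: Dict[str, Tuple[List[str], Optional[str]]] = {
    "page-1": (["Company A", "Company B"], "page-2"),
    "page-2": (["Company C", "Company D"], "page-3"),
    "page-3": (["Company E"], None),  # last page: no next button
}


def scrape_all_pages(start: str, max_pages: int = 100) -> List[str]:
    """Collect items page by page, following 'next' until it disappears."""
    collected: List[str] = []
    current: Optional[str] = start
    visited = 0
    while current is not None and visited < max_pages:
        items, next_page = FAKE_SITE[current]
        collected.extend(items)  # extract data from the current page
        current = next_page      # "click" the next button
        visited += 1
    return collected


all_items = scrape_all_pages("page-1")
print(all_items)
```

The `max_pages` cap is a safety limit added for this sketch, so that a listing whose next button never disappears cannot loop forever.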


3. Create Loop Item - to go through all the companies

  • Click any company name. Octoparse highlights all similar titles in red.

  • Click Select All in the Tips panel.

  • Select Extract text of the selected links.


The loop item is then generated in your workflow, covering all 50 companies on the page.

Note: If the loop contains more than 50 items, sponsored results or ads on the page have probably been included as well.

In that case, change the loop item XPath to the following to exclude the sponsored results: //ul[@class='directory-list shortlist']/li[@data-position]

loop_xpath.png
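
To see why the `[@data-position]` condition excludes sponsored entries, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The markup is a hypothetical, simplified stand-in: it assumes sponsored items lack a `data-position` attribute, which is what the XPath above implies, and the real clutch.co page is more complex. ElementTree supports only a subset of XPath and requires a leading `.` for searches relative to an element, so the expression is adapted accordingly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; company names are placeholders.
SAMPLE_HTML = """
<div>
  <ul class="directory-list shortlist">
    <li class="sponsored">Sponsored Company</li>
    <li data-position="1">Company A</li>
    <li data-position="2">Company B</li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE_HTML)

# Same expression as in the tutorial, with a leading "." for ElementTree.
LOOP_XPATH = ".//ul[@class='directory-list shortlist']/li[@data-position]"

all_items = root.findall(".//ul/li")      # every list item, ads included
organic_items = root.findall(LOOP_XPATH)  # only items with data-position

print(len(all_items), "items in total")
print(len(organic_items), "items after filtering")
print([li.text for li in organic_items])
```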

4. Extract more data - to extract other information about the companies

To extract information other than the company name:

  • Click on your desired data (Location in this case)

  • Select Extract text of the selected element

add_info.png

A new data field then appears in the data preview section:

data_preview.png
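
The relationship between the loop item and the extracted fields can be pictured as one output row per company, with one column per field. The Python sketch below illustrates this with the standard library only. The markup, the class names `company-name` and `location`, and all values are hypothetical placeholders, not the actual clutch.co structure or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the class names below are placeholders.
SAMPLE_HTML = """
<ul>
  <li data-position="1">
    <a class="company-name">Company A</a>
    <span class="location">City A</span>
  </li>
  <li data-position="2">
    <a class="company-name">Company B</a>
    <span class="location">City B</span>
  </li>
</ul>
"""


def text_of(element):
    """Return the stripped text of an element, or "" if it is missing."""
    if element is None or element.text is None:
        return ""
    return element.text.strip()


root = ET.fromstring(SAMPLE_HTML)

rows = []
for item in root.findall("./li[@data-position]"):  # one pass per loop item
    rows.append({
        "Company": text_of(item.find("./a[@class='company-name']")),
        "Location": text_of(item.find("./span[@class='location']")),
    })

for row in rows:
    print(row)
```

Each additional field you add in Octoparse corresponds to one more key in the row, looked up relative to the current loop item.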

5. Run the task - to get your desired data

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for a Run Task window to pop up

  • Select Run on your device to run the task on your local device

  • Wait for the task to complete


Here is a sample output from a local run:

sample.png