You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!

GoodFirms is a research and review platform that helps software buyers and service seekers opt for the best software or firm. At the same time, it helps IT companies and software vendors to boost user acquisition stats, market share, and brand awareness.

In four steps, this tutorial will show you how to scrape company info, such as company name, location, and website, from GoodFirms.

searchresult.png

To follow along, you can use the URL below:

https://www.goodfirms.co/directory/cms/top-website-development-companies

Here are the main steps of this tutorial: [Download task file here]

  1. Create a Go to Webpage - to open the target website

  2. Auto-detect the webpage - to create a workflow

  3. Modify the setting of Pagination - to locate the pagination button accurately

  4. Run the task - to get your target data


1. Create a Go to Webpage - to open the target website

  • Enter the page URL on the home screen and click Start to create a new task

start.png

2. Auto-detect the webpage - to create a workflow

  • Choose Auto-detect webpage data and wait for the detection to complete

autodetect.jpg
  • Check the data fields in Data preview, then delete unwanted fields or rename them if needed (double-click a field name to rename it)

delete_or_rename.jpg
  • Uncheck Add a page scroll

  • Click Create workflow

createworkflow.jpg

3. Modify the setting of Pagination - to locate the pagination button accurately

  • Click on the Pagination box

  • Replace the auto-generated Matching XPath with: //li[@class='next']/a[@title='Next page']

  • Click Apply to save the change

pagination_xpath.png

NOTE: To learn more about XPath in Octoparse, please check: What is XPath and how to use it in Octoparse?
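
To see what the Matching XPath above selects, here is a minimal Python sketch that runs it outside Octoparse using only the standard library. The markup in SAMPLE and the helper name find_next_links are hypothetical and made up for illustration; the real GoodFirms page may be structured differently, and this is not how Octoparse evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, invented for illustration only.
SAMPLE = """
<ul class="pagination">
  <li class="prev"><a href="?page=1" title="Previous page">Prev</a></li>
  <li class="active"><a href="?page=2" title="Page 2">2</a></li>
  <li class="next"><a href="?page=3" title="Next page">Next</a></li>
</ul>
"""


def find_next_links(markup):
    """Return the <a> elements matched by the tutorial's pagination XPath."""
    root = ET.fromstring(markup)
    # ElementTree requires a leading "." for a descendant search;
    # the rest of the expression is the XPath from the step above.
    return root.findall(".//li[@class='next']/a[@title='Next page']")


links = find_next_links(SAMPLE)
print([a.get("href") for a in links])
```

Both conditions must hold for a match: the li element's class attribute must be exactly "next", and the a element's title must be exactly "Next page". An li with a class such as "next disabled" would not match, which is one reason a precise XPath helps the pagination loop stop on the last page.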

  • Click on the Click to Paginate box in the workflow

  • Select the Option panel

  • Tick Load with AJAX > set the AJAX timeout (7-10s recommended)

ajax.png

NOTE: To learn why you need to set up an AJAX timeout, please check: Handling AJAX


4. Run the task - to get your target data

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for a Run Task window to pop up

  • Select Run on your device to run the task on your local device

  • Wait for the task to complete

Here is sample output from a local run:

sample.png