Crunchbase is a website for finding business information about companies. It covers investments and funding, founders and people in leadership positions, mergers and acquisitions, news, and industry trends.

In this tutorial, we will show you how to scrape company information from Crunchbase, starting from a search result page URL.

For Crunchbase, you can also use our easy-to-use "Task Template" on the main screen of the Octoparse scraping tool. Just enter a few parameters and the task is ready to go. For further details, see Task Templates.

1.1.png

Crunchbase shows only the first 5 results per search to free users. Please make sure you have a Crunchbase Pro account before you start configuring the task.

In Task 1, we will scrape each company's detail page URL. In Task 2, we will scrape data such as the company name, location, introduction, and funding info from each company detail page.

To follow along, you can get a search result page URL of your own or use this one: https://www.crunchbase.com/discover/organization.companies/9472f4f3410c0010e2780a286ce97f9e

Here are the main steps in this tutorial:

Task 1: Extract all the URLs of detail pages on the search result pages [Download task file here]

  1. "Go to Web Page" - open the target web page

  2. Log in to the website and save cookies

  3. Auto-detect web page data - create the workflow

  4. Select the company link to scrape

  5. Create pagination - scrape data from multiple pages

  6. Start extraction - run the task and get data

Task 2: Collect the company information from scraped URLs [Download task file here]

  1. Input a batch of the scraped URLs - loop open the detail pages

  2. Extract data - select the data for extraction

  3. Modify the XPath of fields

  4. Start extraction - run the task and get data


Task 1: Extract the detail page URLs on the search result pages

1. "Go to Web Page" - open the target web page

  • Input the URL on the home screen and click "Start"

1.png

2. Log in to the website and save cookies

  • Switch on Browse mode, then log in with your account details

2.gif

  • Open the settings of the "Go to Web Page" action

  • Tick "Use cookie" and click "Use cookie from the current page"

  • Click "OK" to save it

2.1gif.gif

3. Auto-detect web page data - create the workflow

  • Switch off Browse mode

  • Select Auto-detect web page data and wait for the detection to complete

3.gif

  • Delete unwanted fields in the Data Preview section

3.2.png

  • Untick the "Add a page scroll" option and click "Create workflow" on the Tips panel

mceclip2.png

Octoparse will generate a Loop Item in the workflow:

mceclip1.png

4. Select the company link to scrape

  • Select the first company name on the web page (the first line should already be highlighted in red)

  • Click the A tag on the Tips panel

  • Choose "Extract the URL of the selected link"

4.gif

  • Select the other information for the first company to scrape its text

  • Rename the fields if needed

4.1.gif

5. Create pagination - scrape data from multiple pages

  • Select the Next button on the web page

  • Choose "Loop click single element"

  • Set an appropriate AJAX timeout

5.gif

6. Start extraction - run the task and get data

  • Click “Start Extraction” on the upper left side

  • Select “Local Extraction” to run the task on your computer

mceclip3.png

After the data is scraped, you can export it to an Excel file.


Task 2: Collect the company information from scraped URLs

1. Input a batch of the scraped URLs - loop open the detail pages

  • Click + New, and select Advanced Mode

  • Input the URLs scraped from Task 1

6.gif
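
If you exported the Task 1 results to a CSV file, it can help to clean the URL column before pasting it into Task 2. The sketch below is one possible way to do that in Python. The column name `Company_URL`, the sample rows, and the function name `load_detail_urls` are all assumptions made for illustration; use the field name you chose in Task 1.

```python
import csv
import io


def load_detail_urls(csv_file, column="Company_URL"):
    """Return unique, non-empty URLs from one column, keeping first-seen order."""
    seen = set()
    urls = []
    for row in csv.DictReader(csv_file):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample standing in for the Task 1 export (not real Crunchbase data).
sample_export = io.StringIO(
    "Company_Name,Company_URL\n"
    "Example One,https://www.crunchbase.com/organization/example-one\n"
    "Example Two,https://www.crunchbase.com/organization/example-two\n"
    "Example One,https://www.crunchbase.com/organization/example-one\n"
    "Example Three,\n"
)

detail_urls = load_detail_urls(sample_export)
print("\n".join(detail_urls))  # one URL per line, ready to paste into Task 2
```

With a real export, you would pass `open("your_export.csv", newline="", encoding="utf-8")` instead of the in-memory sample.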

2. Extract data - select the data for extraction

  • Select Company name on the web page

  • Choose "Extract text of the selected element"

  • Do the same to scrape the company's other basic information

7.gif

  • Rename the fields if needed

8.gif

3. Modify the XPath of fields

The funding fields vary from one company page to another. For example, the Apple page contains fields like "Number of Acquisitions" and "Stock Symbol", but the Shine page does not. Even a shared field like "Total Funding Amount" does not appear in the same position on every page.

7.1.png

So we need to modify the XPath of these fields so that it locates the correct field on every page. Take the field "Total Funding Amount" as an example. Since the field title does not change, we can locate the field value via the title. The XPath for "Total Funding Amount" is: //span[contains(text(),'Total Funding')]/../../following-sibling::*[1]

  • Open the settings of the "Extract Data" action

  • Click "Customize XPath" of the field

  • Input the modified XPath

  • Click OK to save it

9.1.gif

The XPath of other fields can be modified in the same way.
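
To see why locating the value via its title works regardless of position, here is a small Python sketch that walks the same path as the XPath above: find the span whose text contains the title, go up two levels, then take the next sibling element. The HTML snippets are simplified, made-up markup (the real Crunchbase page structure may differ), and the helper name `value_after_label` is hypothetical. Python's standard library does not support `contains()` or `following-sibling` in XPath, so the steps are written out by hand.

```python
import xml.etree.ElementTree as ET


def value_after_label(root, label_text):
    """Mimic //span[contains(text(), label)]/../../following-sibling::*[1]."""
    # ElementTree has no parent pointers, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        if label_text not in (span.text or ""):
            continue
        parent = parent_of.get(span)             # first "/.."
        grandparent = parent_of.get(parent)      # second "/.."
        container = parent_of.get(grandparent)   # holds the siblings
        if container is None:
            continue
        siblings = list(container)
        position = siblings.index(grandparent)
        if position + 1 < len(siblings):
            # following-sibling::*[1] is the element right after the title block
            return "".join(siblings[position + 1].itertext()).strip()
    return None


# Made-up pages: the title sits in a different position on each one.
PAGE_A = (
    "<ul>"
    "<li><div><label><span>Number of Acquisitions</span></label></div><p>3</p></li>"
    "<li><div><label><span>Total Funding Amount</span></label></div><p>$10M</p></li>"
    "</ul>"
)
PAGE_B = (
    "<ul>"
    "<li><div><label><span>Total Funding Amount</span></label></div><p>$2M</p></li>"
    "</ul>"
)

funding_a = value_after_label(ET.fromstring(PAGE_A), "Total Funding")
funding_b = value_after_label(ET.fromstring(PAGE_B), "Total Funding")
print(funding_a, funding_b)
```

Because the lookup is anchored on the title text rather than on a fixed position, the same expression finds the value on both sample pages, and it returns nothing when a page has no such field.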

4. Start extraction - run the task and get data

  • Click “Start Extraction” on the upper left side

  • Select “Local Extraction” to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)

mceclip4.png

Here is the sample data output:

mceclip3.png
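
Since not every company page has every funding field, some cells in the output may be empty. The sketch below shows one way to count empty cells per column after exporting to CSV, which can help you tell a field that is truly missing from an XPath that needs adjusting. It assumes missing fields are exported as empty cells; the column names and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_blank_cells(csv_file):
    """Count empty cells per column in an exported CSV."""
    blanks = Counter()
    reader = csv.DictReader(csv_file)
    for row in reader:
        for column in reader.fieldnames:
            if not (row.get(column) or "").strip():
                blanks[column] += 1
    return dict(blanks)


# Made-up sample standing in for the Task 2 export (not real Crunchbase data).
sample_output = io.StringIO(
    "Company_Name,Total_Funding_Amount,Stock_Symbol\n"
    "Example One,$10M,EX1\n"
    "Example Two,$2M,\n"
    "Example Three,,\n"
)

blank_counts = count_blank_cells(sample_output)
print(blank_counts)
```

A column that is blank in almost every row is worth checking against a few company pages by hand before you rely on the data.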