Scrape company information from Crunchbase
Crunchbase is a website for finding business information about companies. It covers investments and funding, founders and people in leadership positions, mergers and acquisitions, news, and industry trends.
In this tutorial, we are going to show you how to scrape company information from Crunchbase with a search result page URL.
For Crunchbase, you can also use our easy-to-use "Task Template" on the main screen of the Octoparse scraping tool. All you need to do is enter a few parameters and the task is ready to go. For further details, you may check it out here: Task Templates
Crunchbase shows only the first 5 results per search to free users. Please make sure you have a Crunchbase Pro account before starting the task configuration.
We will scrape each company's detail page URL in Task 1, then scrape data such as the Company name, Location, Introduction, and Funding info from each company's detail page in Task 2.
To follow through, you can get a search result page URL first or use this one: https://www.crunchbase.com/discover/organization.companies/9472f4f3410c0010e2780a286ce97f9e
Here are the main steps in this tutorial:
Task 1: Extract all the URLs of detail pages on the search result pages [Download task file here]
- "Go To Web Page" - open the target web page
- Log in to the website and save cookies
- Auto-detect web page data - create the workflow
- Select the company link to scrape
- Create pagination - scrape data from multiple pages
- Start extraction - run the task and get data
Task 2: Collect the company information from scraped URLs [Download task file here]
- Input a batch of the scraped URLs - loop open the detail pages
- Extract data - select the data for extraction
- Modify the XPath of fields
- Start extraction - run the task and get data
Task 1: Extract the detail page URLs on the search result pages
1. "Go to Web Page"- open the target web page
- Input the URL on the home screen and click "Start"
2. Log in to the website and save cookies
- Click the toggle to switch to Browse mode
- Log in to the website just as you would in a regular browser
- Open the settings of the "Go to Web Page" action
- Tick "Use cookie" and click "Use cookie from the current page"
- Click "OK" to save it
3. Auto-detect web page data - create the workflow
- Turn off "Browse mode"
- Click "Auto-detect web page data" and wait for the detection to complete
- Delete unwanted fields on the Data Preview
- Choose "Create workflow" on the Tips panel
Octoparse generates a Loop Item in the workflow.
4. Select the company link to scrape
- Select the first company name on the web page (the first line should already be highlighted in red)
- Click "A" tag on the Tips panel
- Choose "Extract the URL of the selected link"
- Select the first company's other information to scrape it as text
- Rename the fields if needed
5. Create pagination - scrape data from multiple pages
- Select the Next button on the web page
- Choose "Loop click single element"
- Select a proper AJAX timeout
6. Start extraction - run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer
After scraping the data, you can export it to an Excel file.
Task 2: Collect the company information from scraped URLs
1. Input a batch of the scraped URLs - loop open the detail pages
- Click + New, and select Advanced Mode
- Input the URLs scraped from Task 1
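Before pasting the URLs in, it can help to clean up the Task 1 output so that each detail page is opened only once. The sketch below assumes you exported Task 1 as CSV rather than Excel. The column name `Company_URL` and the sample rows are hypothetical placeholders; use whatever field name you gave the link in Task 1.

```python
import csv
import io


def urls_from_export(csv_text, column="Company_URL"):
    """Return the unique, non-empty URLs from one column, in first-seen order."""
    seen = set()
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up export content; the third row repeats the first on purpose.
EXPORT = """Company_Name,Company_URL
Example One,https://www.crunchbase.com/organization/example-one
Example Two,https://www.crunchbase.com/organization/example-two
Example One,https://www.crunchbase.com/organization/example-one
"""

# One URL per line, ready to paste into the URL input box.
print("\n".join(urls_from_export(EXPORT)))
```

For a real export, read the file contents into a string first and pass that string to `urls_from_export`.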
2. Extract data - select the data for extraction
- Select Company name on the web page
- Choose "Extract text of the selected element"
- Do the same for the other basic company information
- Rename the fields if needed
3. Modify the XPath of fields
For the funding information, the fields vary between company pages. For example, the page for Apple contains fields like "Number of Acquisitions" and "Stock Symbol", but the page for Shine does not. Even a shared field like "Total Funding Amount" is not in the same position on every page.
So we need to modify the XPath of these fields to locate the correct field on different pages. Let's take the field "Total Funding Amount" as an example. Since the field title does not change, we can locate the field value via the title. The XPath for "Total Funding Amount" is: //span[contains(text(),'Total Funding')]/../../following-sibling::*[1]
- Open the settings of the "Extract Data" action
- Click "Customize XPath" of the field
- Input the modified XPath
- Click OK to save it
Other fields' XPath can be modified in the same way.
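To see why a title-based XPath holds up across pages, here is a minimal Python sketch that walks the same path by hand: find the span whose text contains the title, go up two levels (the two `..` steps), and take the next sibling element (`following-sibling::*[1]`). It uses only the standard library. Because `xml.etree.ElementTree` supports neither `contains()` nor the `following-sibling` axis, the steps are written out explicitly. The markup and values are hypothetical simplifications, not real Crunchbase HTML, and `value_after_title` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for two company pages. The field position
# differs between them, and the values are made up for illustration.
PAGE_A = """
<ul>
  <li><div><label><span>Number of Acquisitions</span></label></div><div>12</div></li>
  <li><div><label><span>Total Funding Amount</span></label></div><div>$1.5M</div></li>
</ul>
"""
PAGE_B = """
<ul>
  <li><div><label><span>Total Funding Amount</span></label></div><div>$20K</div></li>
</ul>
"""


def value_after_title(root, title_fragment):
    """Return the text of the element that follows the title's grandparent."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        if title_fragment not in (span.text or ""):
            continue
        label = parent_of.get(span)                     # first ".."
        container = parent_of.get(label) if label is not None else None  # second ".."
        group = parent_of.get(container) if container is not None else None
        if group is None:
            continue
        siblings = list(group)
        position = siblings.index(container)
        if position + 1 < len(siblings):                # following-sibling::*[1]
            return "".join(siblings[position + 1].itertext()).strip()
    return None


for page in (PAGE_A, PAGE_B):
    print(value_after_title(ET.fromstring(page), "Total Funding"))
```

The lookup depends only on the title text and the local nesting around it, so it returns the right value whether the field is first or second on the page, and returns `None` when the page has no such field.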
4. Start extraction - run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample data output:
Author: Yina
Contact us at any time if you need our help!