Scrape job data from Glassdoor
In this tutorial, we will show you how to scrape job information from glassdoor.com.
To save time, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start directly with a ready-to-use template. For further details, see: Task Templates
If you want to create the task from scratch, please follow the steps below. To follow through, you may want to use the URL in the tutorial:
We will scrape data such as the company title, type, address, and other related information from each job details page with Octoparse.
Here are the main steps in this tutorial. [Download the demo task from here]
- Open the target web page
- Auto-detect the web page to generate the workflow
- Click into each job link to scrape more information
- Select the target data fields
- Save and start to run the task and get data
1) Open the target web page
- Enter the URL on the home page and click Start
2) Auto-detect the web page to generate the workflow
- Click "Auto-detect web page data" and wait for the detection to complete
- Go to "Data preview" to see if you're okay with the current data output
- You can delete unnecessary data fields directly by clicking the icon
- You can also modify the data field names here directly by clicking the icon
- Click "Create workflow"
Octoparse will generate a workflow like this:
Because the next page is loaded with AJAX, we need to enable AJAX loading for the "Click to Paginate" action.
- Open the action settings of "Click to Paginate"
- Tick "Load with AJAX" and set the AJAX timeout to 7-10s
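To make the role of the AJAX timeout more concrete, here is a minimal Python sketch of the general idea: after a click, keep checking whether the page content has changed, and give up once a fixed number of seconds has passed. It is an illustration only and does not show how Octoparse implements the setting; `wait_for_change` and `read_content` are hypothetical names.

```python
import time


def wait_for_change(read_content, old_content, timeout=7.0, interval=0.5):
    """Poll read_content() until it differs from old_content or the timeout expires.

    read_content is a hypothetical callable that returns the current page content.
    Returns the new content, or None if nothing changed within the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_content()
        if current != old_content:
            return current  # new content arrived before the timeout
        if time.monotonic() >= deadline:
            return None  # timed out with no change
        time.sleep(interval)
```

In this sketch, a longer timeout tolerates slower page loads, at the cost of a longer wait when the content never changes.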
If all the data you need can be scraped from the listing page, you can skip ahead to "Save and start to run the task and get data". If you want to click into each detail link to get more information, please follow the next step.
3) Click into each job link to scrape more information
- Choose "Click on link(s) to scrape the linked page(s)"
- Select "Click on an extracted data field" and pick the field you want to click on from the drop-down menu. You can confirm that it is the correct link in the data preview section
- Click "Confirm"
Glassdoor does not open each job detail page in a new tab but loads it with AJAX on the current page, so we need to modify some settings for the "Click the URLs in the list" action.
- Open the action settings of the "Click the URLs in the list" action
- Uncheck "Open in a new tab"
- Tick "Load with AJAX" and set the AJAX timeout to 7-10s
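The workflow built so far has a nested structure: an outer loop that pages through the listing and an inner loop that clicks each job link and reads its details. The short Python sketch below models that structure on made-up in-memory data. It is only a conceptual picture, not Octoparse's internal logic; `run_workflow`, the field names, and the sample values are all hypothetical.

```python
def run_workflow(pages):
    """Walk listing pages in order and collect one row per job link.

    pages is a list of listing pages; each page is a list of dicts that
    stand in for a job link and the detail content it loads.
    """
    rows = []
    for page_number, listing in enumerate(pages, start=1):  # pagination loop
        for job in listing:  # loop over the job links on this page
            detail = job["detail"]  # stands in for clicking the link
            rows.append({
                "page": page_number,
                "title": job["title"],
                "company": detail.get("company", ""),  # empty if missing
            })
    return rows


# Made-up sample data, for illustration only.
sample_pages = [
    [{"title": "Data Analyst", "detail": {"company": "Acme"}}],
    [{"title": "Data Engineer", "detail": {}}],
]
rows = run_workflow(sample_pages)
```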
4) Select the target data fields
- Select information on the web page
- Choose "Extract text of the selected element"
- Repeat the above steps to extract all the data you need
Tips! If a pop-up appears on the web page, you can switch to Browse mode by clicking the button
- Edit the name of the data field if needed
5) Save and start to run the task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run task on your device" to run the task on your computer.
Tips! The task can only be run on your local device. It cannot run in the Cloud because of Glassdoor's anti-scraping settings.
Here is the sample output.
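If you save the results as a CSV file, a small script can tidy them afterwards, much like deleting and renaming fields in the data preview. The following Python sketch assumes a CSV export; the column names (`Field1`, `Field2`, `Field3`) and values are placeholders rather than the actual output of this task, and `clean_rows` is a hypothetical helper.

```python
import csv
import io


def clean_rows(csv_text, keep, rename=None):
    """Keep only the listed columns, optionally rename them, and strip whitespace."""
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        cleaned.append({
            rename.get(col, col): (row.get(col) or "").strip()
            for col in keep
        })
    return cleaned


# Made-up export with placeholder column names and values.
export = "Field1,Field2,Field3\n Data Analyst ,Acme,unused\n"
result = clean_rows(
    export,
    keep=["Field1", "Field2"],
    rename={"Field1": "job_title", "Field2": "company"},
)
```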
Contact us anytime if you need our help!
Author: Kara
Editor: Yina