In this tutorial, I am going to show you, step by step, how to scrape job information from Glassdoor.com using Octoparse.
Before we get started, open Glassdoor in your browser and type in the job title and the location. In this case, we are looking for “Content Manager” in “New York”, so type that in. After the page finishes loading, copy the URL. This is the URL we will use in this demonstration.
Note: To learn more about AJAX, see https://youtu.be/MuOC1yCKai0
Step One: Enter the URL of the website you would like to scrape
- Build a new task by clicking “Advanced Mode”, and enter the URL of the website. In this case, it is the search results page for “Content Manager” in “New York” on Glassdoor. Copy the URL and paste it into the box.
- Then click “Save and Run” in the bottom right corner.
Step Two: Create a pagination loop
- Go to the job listing area and scroll down to find the pagination bar. Then click the “Next Page” button. A command panel called “Action Tips” shows up whenever you click something on the page. It asks what you want to do with the selected element.
- In this case, select “Loop click next page”.
- Click “OK” to save the step.
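If you think in code, the pagination loop we just built behaves roughly like the sketch below. This is only an illustration of the idea, not how Octoparse works internally: the `PAGES` dictionary is made-up stand-in data for the search results, and `next_page` is a hypothetical helper that plays the role of clicking the “Next Page” button.

```python
# Hypothetical stand-in for paginated search results: each "page" holds
# some job titles and the key of the following page (None on the last page).
PAGES = {
    "page-1": {"jobs": ["Content Manager", "Senior Content Manager"], "next": "page-2"},
    "page-2": {"jobs": ["Content Strategist"], "next": "page-3"},
    "page-3": {"jobs": ["Content Marketing Manager"], "next": None},
}


def next_page(current):
    """Play the role of clicking "Next Page"; None means there is no button."""
    return PAGES[current]["next"]


def loop_click_next_page(start, max_pages=100):
    """Visit pages one after another until the "Next Page" button disappears."""
    visited = []
    current = start
    # max_pages is a safety cap so a broken "next" link cannot loop forever.
    while current is not None and len(visited) < max_pages:
        visited.append(current)
        current = next_page(current)
    return visited


if __name__ == "__main__":
    print(loop_click_next_page("page-1"))
```

The loop stops on its own when the last page has no “Next Page” button, which is the same reason the task ends after the final page of results.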
Step Three: Create a “Loop item”
- To create a loop item, select the element, in this case the job title from the listing. You will notice that the selected element, here the first job title, is highlighted in green. This means it has been successfully selected. You will also notice that the other job titles are highlighted in pink at the same time. This means that Octoparse has found similar elements.
- Follow the guide from “Action Tips”, and click “Select All” since we want Octoparse to get all job information, not just the first one.
- Then click “Loop click the selected link” to create a “Loop Item.” Octoparse will click through each job title.
- Now go to the settings area, where we need to make a few adjustments. Uncheck “Auto retry”, since we don’t need this function in this tutorial. Octoparse selects AJAX load by default because Glassdoor uses this technique on its website. Now set the timeout to “10” seconds.
- Click to save the step.
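To see why a timeout matters after each click, here is a minimal sketch of a loop that clicks every item and then waits for the content to load. It uses a general polling pattern and is not a description of how Octoparse implements its AJAX setting, which may simply wait for the configured time. The names `wait_for_ajax`, `loop_click_items`, `click`, and `is_loaded` are all hypothetical and exist only for this example.

```python
import time


def wait_for_ajax(is_loaded, timeout=10.0, poll_interval=0.1):
    """Poll is_loaded() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_loaded():
            return True
        if time.monotonic() >= deadline:
            return False  # gave up: the content did not show up in time
        time.sleep(poll_interval)


def loop_click_items(job_titles, click, is_loaded, timeout=10.0):
    """Click each item in turn and record whether its details loaded in time."""
    results = {}
    for title in job_titles:
        click(title)
        results[title] = wait_for_ajax(is_loaded, timeout=timeout)
    return results


if __name__ == "__main__":
    opened = []
    status = loop_click_items(
        ["Content Manager", "Senior Content Manager"],
        click=opened.append,      # stand-in for a real click
        is_loaded=lambda: True,   # pretend the details load instantly
    )
    print(status)
```

A timeout that is too short risks moving on before the job details appear, while one that is too long only slows the run down, so 10 seconds is a cautious choice for this tutorial.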
Step Four: Data extraction
Now we need to select the extraction field.
- To extract the data, click the element, in this case the job title, and select “extract the text of the selected element” from Action Tips. You can rename the field to whatever you want. In this case, name it “Job title”.
- You may find that the data you need is split across different sections. On Glassdoor, the company’s information is not in the same section as the job information. We can tell Octoparse to go to that section and find the company’s information for us. In this case, go to the Company section by clicking the “Company” button, then select the “Click element” command to tell Octoparse to open this section with a click.
- Now select the data we need. In this case, select “Size”, then choose the “extract the text of the selected element” command to tell Octoparse to get the text.
- For instance, click “Headquarter”, select the “Click element” command, and then choose the “extract text of the selected element” command to extract the text.
Pay attention here. We need to modify the settings whenever the page changes during extraction. In this case, part of the webpage changes when we jump from the job information to the company’s information. So we need to adjust the settings of the “Click Element” action: go to the settings area and choose AJAX loading.
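The idea behind “extract the text of the selected element” can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made for the example: `SAMPLE_HTML` is made-up markup, the class names are not Glassdoor’s real ones, and the values are placeholders. The sketch only shows how named fields map to the text of matching elements.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a job detail page; the class names and
# values are assumptions for this example, not Glassdoor's real markup.
SAMPLE_HTML = """
<div class="job"><h2 class="job-title">Content Manager</h2></div>
<div class="company">
  <span class="size">51 to 200 employees</span>
  <span class="headquarters">New York, NY</span>
</div>
"""

# Field name we choose -> class of the element whose text we want.
FIELDS = {"Job title": "job-title", "Size": "size", "Headquarter": "headquarters"}


class TextExtractor(HTMLParser):
    """Collect the text of the first element matching each configured class.

    Simplification: capture stops at the first closing tag, so this only
    handles target elements that contain plain text and no nested tags.
    """

    def __init__(self, fields):
        super().__init__()
        self._class_to_field = {cls: name for name, cls in fields.items()}
        self._current = None  # field currently being captured, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            field = self._class_to_field.get(cls)
            if field and field not in self.record:
                self._current = field
                self.record[field] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        if self._current is not None:
            self.record[self._current] = self.record[self._current].strip()
            self._current = None


def extract_record(html, fields=FIELDS):
    """Return a dict of field name -> extracted text for one page."""
    parser = TextExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.record


if __name__ == "__main__":
    print(extract_record(SAMPLE_HTML))
```

Each key in `FIELDS` plays the same role as the field name you type in Octoparse, such as “Job title”, and each extracted record corresponds to one row of the final data.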
Step Five: Run the task and get data
- After setting up the rules, we can run the task by clicking “start extraction”.
- Then select “Local extraction” to run the task.