In this tutorial, we will show you how to build a web scraper with Octoparse to collect job posting information from Indeed.
With Octoparse, we can scrape job postings from Indeed for any field, such as data science or recruitment. After you design the crawler, the entire scraping process is automated, with no coding needed.
Let's see how it's done!
Before we start, we need to obtain the URL of the target result page by searching Indeed for the keywords "DevOps" and "Dallas-Fort Worth, TX".
The URL of the resulting page is the one we will scrape data from.
We will scrape job titles and descriptions in this tutorial.
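As a rough illustration of how the two search terms end up in a result-page URL, the sketch below percent-encodes them into a query string. The base URL and the parameter names q and l are assumptions made for this example, not values taken from this tutorial; always copy the real URL from your browser's address bar after running the search.

```python
from urllib.parse import urlencode

# Assumption: the base URL and the parameter names "q" (keywords) and
# "l" (location) are illustrative only. Copy the real URL from your browser.
BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(keywords: str, location: str) -> str:
    """Return a result-page URL with both search terms percent-encoded."""
    query = urlencode({"q": keywords, "l": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("DevOps", "Dallas-Fort Worth, TX"))
```

Note how the space and the comma in the location are encoded; pasting an unencoded URL into a tool can silently change the search.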
Here are the main steps in this tutorial [Download task file here]
- "Go To Web Page" - to open the target website
- Create a pagination - to extract multiple web pages
- Modify XPath - to paginate correctly
- Create a "Loop Item" - to loop extract each element on each row
- Extract data - to select data from your target website
- Run your task - to get the data you want
1) "Go To Web Page" - to open the target website
- Create your task with "Advanced mode".
- Paste the URL we just got into the "Extraction URL" box and save it to move on.
- Check "Block Pop-up" to avoid all possible pop-up windows and click "OK" to save
We always suggest you turn on "Workflow" to get a better picture of what you are doing with the task.
2) Create a pagination - to extract multiple web pages
- Scroll down to find the "Next" button.
- Select the "A" tag and click "Loop click the selected link". The "A" tag has to be selected manually because Octoparse does not automatically locate it for this button.
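Conceptually, a pagination loop repeats two actions: extract the rows on the open page, then click "Next" until no "Next" link is left. The sketch below shows that control flow only. It is not Octoparse's implementation; the PAGES dictionary is a hypothetical stand-in for a website, and max_pages is an assumed safety limit.

```python
from typing import List

# Hypothetical site: each page lists its rows and the key of the next page.
# "next" is None on the last page, i.e. there is no "Next" button to click.
PAGES = {
    "page1": {"rows": ["job A", "job B"], "next": "page2"},
    "page2": {"rows": ["job C", "job D"], "next": "page3"},
    "page3": {"rows": ["job E"], "next": None},
}


def crawl(start: str, max_pages: int = 100) -> List[str]:
    """Collect rows page by page, following "next" until it runs out."""
    collected: List[str] = []
    current = start
    visited = 0
    while current is not None and visited < max_pages:
        page = PAGES[current]
        collected.extend(page["rows"])  # extract data from the open page
        current = page["next"]          # "click" Next, or stop when absent
        visited += 1
    return collected


print(crawl("page1"))
```

The max_pages guard is a design choice for the sketch: it keeps the loop from running forever if a "Next" link ever points back to an earlier page.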
3) Modify XPath - to paginate correctly
XPath is a language that lets you locate specific elements on a page precisely, based on their tags and attributes. Before you write your own XPath, you first need to inspect the HTML structure of the page.
- Find the correct XPath with the FirePath/Firebug extension in the Firefox browser.
The correct XPath is //span[contains(text(),'Next')][@class="np"]/../..
- Click the pagination loop in your workflow and paste the correct XPath into the "Single element" box under "Advanced Options"
The Firebug extension is very useful for looking up the elements of an HTML document. (Firebug is now only available for old versions of Firefox. Get the old versions of Firefox here.)
An XPath that you write yourself gives Octoparse more flexibility and accuracy than the XPath auto-generated by clicking elements during task configuration. So if you cannot extract data from the next page, check "Single element" in the "Loop mode".
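To see what the pagination XPath from this step selects, it can help to walk through it by hand. The sketch below reproduces each part of the expression in plain Python on a small, hypothetical piece of pagination markup (simplified so that it parses as XML; the real page will differ). Python's built-in ElementTree does not support contains() or text(), so those predicates are written as ordinary Python checks.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, simplified so it parses as XML.
SAMPLE = """
<div class="pagination">
  <a href="/jobs?start=0"><span class="pn">1</span></a>
  <a href="/jobs?start=10"><span class="pn"><span class="np">Next</span></span></a>
</div>
"""


def find_next_link(root):
    """Mimic //span[contains(text(),'Next')][@class="np"]/../.. by hand."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        has_next_text = "Next" in (span.text or "")  # contains(text(),'Next')
        is_np = span.get("class") == "np"            # [@class="np"]
        if has_next_text and is_np:
            parent = parent_of.get(span)             # first "/.."
            return parent_of.get(parent)             # second "/.."
    return None


root = ET.fromstring(SAMPLE)
link = find_next_link(root)
print(link.tag, link.get("href"))
```

The two "/.." steps are the important part: the text "Next" sits in an inner span, but the element that must be clicked is the enclosing "A" tag two levels up.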
4) Create a "Loop Item" - to loop extract each element on each row
- Select all information of the first two listing items
When creating a loop, it is usually better to select all of the information in an item rather than only part of it. Selecting only part of it can cause problems later, when you need more information than you originally selected.
- Click "Extract text from the selected elements" on the "Action Tips" panel
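A "Loop Item" visits every listing on the page and, at this point, extracts all of the text in each one as a single field. The sketch below imitates that on hypothetical markup (the tag names, the "item" class, and the sample jobs are assumptions for illustration), which shows why such a field mixes title, company, and description together.

```python
import xml.etree.ElementTree as ET

# Hypothetical result list; the real page structure will differ.
SAMPLE = """
<div id="results">
  <div class="item"><h2>DevOps Engineer</h2><span>Acme Corp</span><p>Maintain CI pipelines.</p></div>
  <div class="item"><h2>Cloud Engineer</h2><span>Globex</span><p>Automate deployments.</p></div>
</div>
"""


def loop_items(root):
    """Return the combined text of every listing item, one string per item."""
    texts = []
    for item in root.findall("./div[@class='item']"):
        # itertext() yields the text of the item and all of its descendants.
        pieces = [t.strip() for t in item.itertext() if t.strip()]
        texts.append(" ".join(pieces))
    return texts


items = loop_items(ET.fromstring(SAMPLE))
for text in items:
    print(text)
```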
5) Extract data - to select data from your target website
- Select the job title of the first item and click "Extract text of the selected elements" in the "Action Tips" panel
- Select the description of the first item and click "Extract text of the selected elements" in the "Action Tips" panel
- Delete the first field because it contains all of the messy text from the first listing item
- Extract other data you want and rename the field names if necessary
For the description field, we need to modify the XPath. The correct XPath is .//td[@id='resultsCol']/div[contains(@class,'row')].
- Select "Description" and click
- Select "Customize XPath" and change "Matching XPath" to the correct XPath above
- Click "OK" to save
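To check what the XPath above matches, the sketch below applies it step by step to a small, hypothetical page. Only the td id "resultsCol" and the "row" class come from the XPath itself; the inner tags (h2, a, span with class "summary"), the "sideCol" cell, and the sample jobs are assumptions for illustration. Because ElementTree lacks contains(), that predicate is written as a Python substring check.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, simplified so it parses as XML.
SAMPLE = """
<table><tr>
  <td id="resultsCol">
    <div class="row result"><h2><a>DevOps Engineer</a></h2><span class="summary">Maintain CI pipelines.</span></div>
    <div class="sponsored">Advertisement</div>
    <div class="row result"><h2><a>Site Reliability Engineer</a></h2><span class="summary">Run production systems.</span></div>
  </td>
  <td id="sideCol"><div class="row">Not a job listing</div></td>
</tr></table>
"""


def extract_jobs(root):
    """Return title and description for each matching row."""
    jobs = []
    # .//td[@id='resultsCol']/div -> div children of the results column only
    for div in root.findall(".//td[@id='resultsCol']/div"):
        # [contains(@class,'row')] -> keep divs whose class mentions "row"
        if "row" not in div.get("class", ""):
            continue
        # Paths below are relative to the current row (hypothetical markup).
        title = div.findtext("./h2/a", default="").strip()
        description = div.findtext("./span[@class='summary']", default="").strip()
        jobs.append({"title": title, "description": description})
    return jobs


jobs = extract_jobs(ET.fromstring(SAMPLE))
for job in jobs:
    print(job["title"], "|", job["description"])
```

In this sample, the advertisement div and the div in the side column are both skipped: the first fails the class test, and the second is not inside the td with id "resultsCol".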