In this tutorial, we will show you how to build a web scraper with Octoparse to collect job posting information from Indeed.
With Octoparse, we can scrape Indeed job postings in fields such as data science and recruitment. Once the crawler is designed, the entire scraping process is automated, with no coding needed.
Let's see how it's done!
Before we start, we need to obtain the URL of the target results page by searching Indeed for the keyword "DevOps" and the location "Dallas-Fort Worth, TX".
The search gives us the URL of the page we need to scrape:
We will scrape job titles and descriptions in this tutorial.
Check out the main steps covered below. [Download example task file]
1) "Go To Web Page" - to open the target website
· Create your task in "Advanced mode".
· Paste the URL we just obtained into the "Extraction URL" box and save it to move on.
We recommend turning on "Workflow" so you get a better picture of what each step of the task is doing.
2) Create pagination - to extract multiple web pages
· Scroll down to find the "Next" button. Clicking the button does not automatically select its "A" tag, so we need to select the "A" tag ourselves and then click "Loop click the selected link".
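The "Loop click the selected link" action amounts to a simple loop: extract the current page, follow "Next", and stop when no "Next" link remains. Here is a minimal Python sketch of that logic. It assumes a hypothetical in-memory PAGES dictionary standing in for real web pages; the URLs, job titles, and the crawl helper are illustrative and are not part of Octoparse.

```python
# Hypothetical stand-in for a paginated site: each "page" holds its
# results and the URL its "Next" button points to (None on the last page).
PAGES = {
    "/jobs?start=0": {
        "results": ["DevOps Engineer", "Cloud Engineer"],
        "next": "/jobs?start=10",
    },
    "/jobs?start=10": {
        "results": ["Site Reliability Engineer"],
        "next": "/jobs?start=20",
    },
    "/jobs?start=20": {
        "results": ["Release Manager"],
        "next": None,
    },
}


def crawl(start_url, pages, max_pages=50):
    """Collect results page by page until there is no "Next" link."""
    collected = []
    url = start_url
    visited = 0
    # max_pages is a safety cap so a broken "Next" link cannot loop forever.
    while url is not None and visited < max_pages:
        page = pages[url]
        collected.extend(page["results"])  # "extract data" step
        url = page["next"]                 # "click Next" step
        visited += 1
    return collected


titles = crawl("/jobs?start=0", PAGES)
print(titles)
```

The loop ends on the last page because its "next" value is None, which mirrors how a pagination loop stops once the "Next" button can no longer be found.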
3) Modify XPath - to paginate correctly
XPath is a language that lets you locate specific elements on a page precisely, based on their tags and attributes. Before you write your own XPath, you need to inspect the HTML structure of the page.
· Find the correct XPath with the FirePath/Firebug extension in the Firefox browser. The correct XPath is //span[contains(text(),'Next')][@class="np"]/../..
· Click the pagination loop in your workflow and paste the correct XPath into the "Single element" box under "Advanced Options".
· The Firebug extension is very useful for looking up the elements of an HTML document. (Firebug is now only available for old versions of Firefox. Get the old versions of Firefox here.)
· A hand-written XPath is often more flexible and accurate than the XPath Octoparse generates automatically when you click elements during task configuration. If you cannot extract data from the next page, check "Single element" under "Loop mode" and supply your own XPath.
· If you are new to XPath, you can learn more from the XPath tutorials. [Click here]
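To see what the pagination XPath selects, it can help to walk through it outside Octoparse. The sketch below is only an illustration: the SAMPLE markup is a hypothetical, simplified pagination block (not copied from Indeed), and find_next_link is a made-up helper. Python's built-in ElementTree supports only a subset of XPath and has no contains(), so the text check and the two "/.." parent steps are emulated in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (an assumption for this sketch).
SAMPLE = """
<div class="pagination">
  <a href="/jobs?start=0"><span class="pn">1</span></a>
  <a href="/jobs?start=10"><span class="pn"><span class="np">Next</span></span></a>
</div>
""".strip()


def find_next_link(root):
    """Emulate //span[contains(text(),'Next')][@class="np"]/../.. ."""
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        # [@class="np"] and contains(text(),'Next')
        if span.get("class") == "np" and "Next" in (span.text or ""):
            outer = parent_of.get(span)  # first "/.."
            if outer is None:
                return None
            return parent_of.get(outer)  # second "/.."
    return None


root = ET.fromstring(SAMPLE)
link = find_next_link(root)
print(link.tag, link.get("href"))  # prints: a /jobs?start=10
```

In this sample, the matching span sits two levels below the "A" tag, which is why the XPath climbs two parents: the loop needs to click the link itself, not the text inside it.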
4) Extract data - to select data from your target website
· Select the first job title, then click "Select all" and "Extract link text" in the "Action Tips" panel.
· Paste the correct XPath into the "Variable list" box under "Advanced Options" and click "OK" to save. The correct XPath is .//td[@id='resultsCol']/div[contains(@class,'row')].
· Extract any other data you want and rename the fields if necessary.
5) Run your task - to get the data you want
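The extraction step can be pictured as "loop over every job row, then read fields from inside each row". The following Python sketch illustrates that idea under stated assumptions: the SAMPLE markup, the "summary" class, and the extract_jobs helper are hypothetical and only loosely modeled on the loop XPath above. Because ElementTree has no contains(), the class test is done in Python after selecting the div children of the resultsCol cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified results markup (an assumption for this sketch).
SAMPLE = """
<table><tr>
  <td id="resultsCol">
    <div class="row result">
      <h2><a href="/job/1">DevOps Engineer</a></h2>
      <span class="summary">Build CI pipelines.</span>
    </div>
    <div class="ad-banner">Sponsored</div>
    <div class="row result">
      <h2><a href="/job/2">Site Reliability Engineer</a></h2>
      <span class="summary">Run production systems.</span>
    </div>
  </td>
</tr></table>
""".strip()


def text_of(element):
    """Return the stripped text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def extract_jobs(root):
    """Emulate .//td[@id='resultsCol']/div[contains(@class,'row')]."""
    jobs = []
    for div in root.findall(".//td[@id='resultsCol']/div"):
        # contains(@class,'row') is a substring test on the class attribute.
        if "row" not in div.get("class", ""):
            continue
        jobs.append({
            "title": text_of(div.find(".//a")),  # like "Extract link text"
            "description": text_of(div.find(".//span[@class='summary']")),
        })
    return jobs


for job in extract_jobs(ET.fromstring(SAMPLE)):
    print(job["title"], "|", job["description"])
```

Selecting the rows first and then reading fields relative to each row keeps every title paired with its own description, and the class filter skips blocks such as the "ad-banner" div that are not job rows.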
· Click "Start Extraction" and then select "Local Extraction".