In this tutorial, we will show you how to scrape the product reviews from Amazon.com.
For Amazon product scraping, you can also use the ready-made "Task Template" on the main screen of the Octoparse scraping tool. Enter a few parameters and the task is ready to run. For further details, see: Task Templates
To follow along, you may want to use this URL in the tutorial:
We will use Octoparse to scrape data such as each customer's name, rating, review title, time, and review content from the product details page.
Here are the main steps in this tutorial: [Download task file here]
- "Go To Web Page" - to open the targeted web page
- Create a pagination loop - to scrape all the reviews from multiple pages
- Create a "Loop Item" to scrape all the reviews on one page
- Extract data - to select the data and remove the unwanted information
- Run extraction - to run your task and get data
1) "Go To Web Page" - to open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Website" box
- Click "Save URL" to move on
If you are landing on Amazon for the first time, you may encounter a robot check that asks you to enter a security code. In this case, switch to browser mode by clicking the button at the top right of the built-in browser, then type in the code to pass the check. Finally, click the button again to switch back to select mode.
2) Create a pagination loop - to scrape all the reviews from multiple pages
If you are using the latest version of Octoparse, "Workflow Mode" is turned on automatically. If not, you can turn it on with the "Workflow" button in the top-right corner of Octoparse.
- Scroll down the page and click “see all reviews”
This step is needed to scrape all the reviews, because it opens the page that lists every review.
- Scroll down the page and click the “Next page” button
- Click "Loop click the selected link" on the "Action Tips"
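As a rough mental model of what the pagination loop does, here is a minimal Python sketch. It is not Octoparse's actual implementation: the `PAGES` data and the `scrape_all_reviews` function are hypothetical stand-ins, with each in-memory "page" holding its reviews and a pointer to the next page.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a paginated review listing:
# each page holds its reviews and the key of the next page (None on the last page).
PAGES = {
    "page-1": {"reviews": ["review A", "review B"], "next": "page-2"},
    "page-2": {"reviews": ["review C", "review D"], "next": "page-3"},
    "page-3": {"reviews": ["review E"], "next": None},
}


def scrape_all_reviews(start: str) -> list[str]:
    """Walk the pages like a pagination loop: extract, then follow "Next page"."""
    collected: list[str] = []
    current: Optional[str] = start
    while current is not None:            # outer pagination loop
        page = PAGES[current]
        for review in page["reviews"]:    # inner loop over the reviews on one page
            collected.append(review)
        current = page["next"]            # no "Next page" link means the loop ends
    return collected


all_reviews = scrape_all_reviews("page-1")
print(all_reviews)
```

The outer loop corresponds to clicking "Next page" repeatedly, and the inner loop corresponds to the "Loop Item" built in the next step.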
3) Create a "Loop Item" - to scrape all the reviews on one page
The reviews are organized on the page as a list. We need to build a "Loop Item" that extracts each review one by one.
- Select the first title in the built-in browser
- Click "Select all"
- Click "Extract data from the selected elements"
For more detail on capturing a list of items, here is the tutorial you might need.
After looping through the reviews, we found 11 items in the loop list instead of 10. The last one is not a review. This happens because the XPath is not accurate enough, so we have to modify the XPath in "Loop item".
- Click "Loop item"
- Add "[position()<11]" at the end of the original XPath
The final XPath should be "//body/div/div/div/div/div/div/div/div/div[position()<11]". The added predicate keeps only the first 10 matched items, since each page holds only 10 reviews.
Modifying XPath in Octoparse is important when you need to locate items accurately.
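To see what the "[position()<11]" predicate does, here is a small Python sketch. The element tree is a hypothetical stand-in for the review container, not Amazon's real markup. Python's standard-library parser supports only a limited XPath subset, so the predicate is emulated with a 1-based position check rather than evaluated directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical container: 10 review blocks plus one trailing non-review block.
container = ET.Element("div")
for i in range(1, 11):
    ET.SubElement(container, "div", {"class": "review"}).text = f"review {i}"
ET.SubElement(container, "div", {"class": "footer"}).text = "not a review"

# Like the original XPath, this matches all 11 div children.
matched = container.findall("./div")

# Emulate [position()<11]: XPath positions are 1-based,
# so keep only the nodes at positions 1 through 10.
first_ten = [node for pos, node in enumerate(matched, start=1) if pos < 11]

print(len(matched), len(first_ten))
```

The unwanted 11th element is dropped because its position is not less than 11.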
4) Extract data - to select the data
- Click the data in the built-in browser
You should only select data from the first review which is highlighted in red.
- Click "Extract text from the selected element"
- Repeat the steps above for all the information you need
- Rename the fields by selecting from the pre-defined list or typing your own names
- Click "OK" to save the result.
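If you prefer to tidy the extracted fields after export rather than inside Octoparse, a sketch like the following may help. It is only an illustration: the `Review` class, the `clean_review` function, the sample row, and the assumed rating text format ("4.0 out of 5 stars") are all hypothetical, so adjust them to match your actual output.

```python
from dataclasses import dataclass


@dataclass
class Review:
    name: str
    rating: float
    title: str
    time: str
    content: str


def clean_review(raw: dict[str, str]) -> Review:
    """Trim whitespace and reduce the rating text to its leading number."""
    rating_text = raw["rating"].strip()            # assumed form: "4.0 out of 5 stars"
    return Review(
        name=raw["name"].strip(),
        rating=float(rating_text.split()[0]),      # keep only the leading number
        title=raw["title"].strip(),
        time=raw["time"].strip(),
        content=" ".join(raw["content"].split()),  # collapse line breaks and repeated spaces
    )


# Hypothetical raw row, as extracted text might look before cleanup.
raw_row = {
    "name": " Jane D. ",
    "rating": "4.0 out of 5 stars",
    "title": "Works as described\n",
    "time": "Reviewed on January 1, 2020",
    "content": "Arrived on time.\n\n  Does the job.",
}
review = clean_review(raw_row)
print(review)
```

The five fields mirror the data this tutorial extracts: name, rating, title, time, and review content.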
5) Run extraction - to run your task and get data
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is the sample output.