Scrape product information from Amazon
In this tutorial, we are going to show you how to scrape product information from Amazon.com.
Alternatively, you can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Amazon templates to save time. Octoparse has launched Amazon templates designed for different countries such as Germany, France, the US, Spain, and India. With a template, there is no need to configure the scraping task yourself. For further details, you may check it out here: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial.
In Task 1, we will scrape the URL of each product details page. In Task 2, we will extract product details such as the title, price, and brand from those pages. Splitting one task into two can improve the extraction speed to a certain degree, especially when using Octoparse Cloud Extraction.
To follow through, you may want to use this URL in the tutorial:
This tutorial will also cover:
- Dealing with AJAX for pagination
Here are the main steps in this tutorial: [Download task file here]
Task 1: Extract all the URLs of detail pages on the search result pages
- "Go to Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple search results pages
- Build a "Loop Item" - loop extract each URL on the search results pages
- Start extraction - run the task and get data
Task 2: Collect the product information from scraped URLs
- Input a batch of the scraped URLs - loop open the detail pages
- Extract data - select the data for extraction
- Start extraction - run the task and get data
Task 1: Extract all the URLs of detail pages on the search result pages
1. "Go To Web Page" - open the target web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like amazon.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the "Next" button
- Click "Loop click the selected element" on the "Action Tips" panel
- Set up AJAX Load for the "Click to paginate" action
Amazon.com uses AJAX for its pagination button, so the next page of results loads without a full page reload. Therefore, we need to set up AJAX Load for the "Click to paginate" action.
- Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
- Check the box for "Load the page with AJAX" and set up AJAX Timeout as "10" seconds
- Click "OK" to save
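To make the idea of an AJAX timeout more concrete, here is a minimal Python sketch of waiting for page content to change, with an upper limit of 10 seconds. It is not how Octoparse implements the setting internally; `wait_for_change` and `FakePage` are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_for_change(read_state: Callable[[], str], previous: str,
                    timeout: float = 10.0, poll_interval: float = 0.5) -> bool:
    """Poll read_state() until it differs from `previous` or `timeout` seconds pass.

    Returns True if the content changed in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_state() != previous:
            return True
        time.sleep(poll_interval)
    return False


# Hypothetical stand-in for a page whose result list updates after a few polls.
class FakePage:
    def __init__(self, polls_before_update: int) -> None:
        self.polls_left = polls_before_update

    def results_text(self) -> str:
        if self.polls_left > 0:
            self.polls_left -= 1
            return "page 1 results"
        return "page 2 results"


page = FakePage(polls_before_update=3)
changed = wait_for_change(page.results_text, "page 1 results",
                          timeout=10.0, poll_interval=0.01)
print(changed)
```

A timeout that is too short risks extracting data before the new results have arrived, while a timeout that is too long slows down every page turn.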
Tips! If you want to learn more about AJAX, here are the related tutorials you might need:
3. Build a "Loop Item" - loop extract each URL on the search results pages
- Click "Go To Web Page" to go back to the first page and then click the "Pagination" box
When extracting data throughout multiple pages, you should always begin your task building on the first page.
- Select the title of the first item in the list
- Click the "A" tab at the bottom of the "Action Tips" panel
Because we need to extract the URLs in a loop, make sure the "A" tag is selected before you extract the URL. (The "A" tag stands for anchor.)
- Click "Select All"
- Click "Extract the URLs of the selected elements"
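For readers curious about what "extract the URLs of the selected elements" amounts to, the following Python sketch collects the `href` attribute of every "A" (anchor) tag in a piece of HTML. The markup, the `LinkCollector` class, and the example.com URLs are made up for illustration and are far simpler than a real search results page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls = []  # absolute URLs, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links such as "/dp/..." become absolute URLs.
            self.urls.append(urljoin(self.base_url, href))


# Hypothetical search-result markup; real pages are far more complex.
SAMPLE_HTML = """
<div class="result"><h2><a href="/dp/EXAMPLE0001">First product</a></h2></div>
<div class="result"><h2><a href="/dp/EXAMPLE0002">Second product</a></h2></div>
<div class="result"><h2>No link here</h2></div>
"""

collector = LinkCollector("https://www.example.com")
collector.feed(SAMPLE_HTML)
for url in collector.urls:
    print(url)
```

The third result has a title but no anchor, so it contributes no URL. This is why selecting the "A" tag, rather than the title text, matters when the goal is a list of links.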
4. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is the sample output for Task 1.
Tips! When the page content is already displayed but the page is still loading, you can click the "X" button at the right end of the navigation bar to stop loading.
Task 2: Collect the product information from scraped URLs
1. Input a batch of the scraped URLs - loop open the detail pages
In Task 1, we have already got a batch of URLs.
- Click "+ Task" to start a task using Advanced Mode to build Task 2
- Input batch URL
There are three ways to batch import URLs into a single task/crawler (up to one million URLs). In this case, we will batch import URLs from a local file, namely the output file of Task 1. You can also copy the URLs from that file and paste them into the "Input URL" text box. For further details, please refer to Batch Import URLs
- Select "Input from file", and then select the output file containing the scraped URLs
- Click "Save URL"
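Before importing, it can be worth checking the Task 1 output for blank rows and duplicate URLs. The Python sketch below shows one way to do that, assuming the output was exported as a CSV file with a column named "URL". Both the file layout and the column name are assumptions, so adjust them to match your own export.

```python
import csv
import io


def load_urls(csv_file, column: str = "URL") -> list:
    """Read one column of URLs from a CSV file object, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for row in csv.DictReader(csv_file):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical Task 1 export; with a real file, use open("task1_output.csv", newline="").
task1_output = io.StringIO(
    "Title,URL\n"
    "First product,https://www.example.com/dp/EXAMPLE0001\n"
    "Second product,https://www.example.com/dp/EXAMPLE0002\n"
    "First product,https://www.example.com/dp/EXAMPLE0001\n"
    "Broken row,\n"
)

urls = load_urls(task1_output)
print(len(urls), "unique URLs")
```

Removing duplicates up front keeps Task 2 from opening the same detail page more than once.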
2. Extract data - select the data for extraction
As we can see, we are on the detail page now.
- Click the information you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
- Modify XPath - extract data accurately (optional)
In this case, the default XPath does not locate the element accurately, so we have to enter the right XPath manually to get accurate data. The correct XPath is "//td[@class='a-span12']/span[1]".
Tips! Modifying the XPath manually in Octoparse often gives more flexibility and accuracy than the auto-generated XPath.
In this step, you can rename the fields by selecting a name from the predefined list or entering your own.
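To see what the XPath "//td[@class='a-span12']/span[1]" selects, here is a small Python sketch that runs it against a made-up table fragment. The markup and the price value are hypothetical. The standard library's ElementTree is used here as a stand-in: it supports only a subset of XPath and needs well-formed markup, so it is not a substitute for the XPath engine in Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a product details table.
SAMPLE = """
<table>
  <tr>
    <td class="a-span12">
      <span>$19.99</span>
      <span>Free delivery</span>
    </td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree needs a relative path, so the tutorial's "//td..." becomes ".//td...".
# The path reads: any <td> with class "a-span12", then its first <span> child.
price = root.find(".//td[@class='a-span12']/span[1]")
print(price.text if price is not None else "")
```

The `[1]` picks the first `span` inside the cell, which is why the second `span` ("Free delivery") is not returned.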
3. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For premium users, Cloud Extraction is highly recommended because Amazon has anti-scraping measures to prevent information from being extracted, and Cloud Extraction can solve this problem. If you run a local extraction, it is likely to be blocked by Amazon.
Now you have the data you want. Splitting the task can improve the efficiency of data extraction and minimize problems caused by small changes to the website.
Tips! By default, if Octoparse cannot find an element matching the defined pattern on the page, the field will be left blank. However, Octoparse may fail to find the element even when it is shown on the website. If you encounter this problem, here is a related tutorial you might need:
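As a rough illustration of the blank-field behavior described in the tip above, the Python sketch below extracts several fields by XPath and leaves a field empty when nothing matches. The field names, selectors, and page fragment are hypothetical, and this is not Octoparse's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical field-to-XPath mapping; the selectors are illustrative only.
FIELDS = {
    "title": ".//span[@id='productTitle']",
    "price": ".//td[@class='a-span12']/span[1]",
    "brand": ".//a[@id='bylineInfo']",
}


def extract_record(root: ET.Element) -> dict:
    """Return one row; a field whose XPath matches nothing is left blank."""
    record = {}
    for name, xpath in FIELDS.items():
        node = root.find(xpath)
        text = node.text if node is not None else None
        record[name] = (text or "").strip()
    return record


# Hypothetical page fragment that has a title and a price but no brand link.
page = ET.fromstring(
    "<div>"
    "<span id='productTitle'> Example Widget </span>"
    "<table><tr><td class='a-span12'><span>$19.99</span></td></tr></table>"
    "</div>"
)

record = extract_record(page)
blank = [name for name, value in record.items() if not value]
print(record)
print("blank fields:", blank)
```

Listing the blank fields after a run is a quick way to spot a selector that no longer matches the page.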
Japanese article: Scrape product information from Amazon
Articles about web scraping can also be read on the official site.
Spanish article: Scrape product information from Amazon
You can also read web scraping articles on the official website
Related articles:
Scrape Amazon product information with ASIN/UPC
Scrape product information from Sam's Club
Author: Vanny
Editor: Fergus