Scrape Amazon product information with ASIN/UPC
Capturing product information by ASIN is useful if you sell on Amazon. Scraping Amazon product data with ASIN/UPC can help you study similar (homogeneous) products and shape your pricing strategy.
For Amazon, you can also use the easy-to-use "Task Template" on the main screen of the Octoparse scraping tool. All you need to do is type in a few parameters and the task is ready to go. For further details, see: Task Templates
In this tutorial, we will scrape data such as the title, price, rating, and reviews from the product details page with Octoparse.
Before getting started, you’ll need to have a list of ASINs prepared in advance. Here is an example list of ASINs.
B07JJK7J3K
B00LB01FNO
B003EM8008
B07THVNSCV
B07VCWM8QD
B07VC5M21C
B07TX7PCFH
B0753GRNQZ
B07V9S26D2
B01E4A6JDI
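Before pasting a list like this into Octoparse, it can help to clean it up first. The snippet below is a minimal sketch, not part of Octoparse: it assumes an ASIN is exactly 10 letters or digits, and the function name `clean_asin_list` is made up for illustration. It drops malformed lines and duplicates while keeping the original order.

```python
import re

# Assumption: an ASIN is exactly 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def clean_asin_list(raw_text):
    """Return unique, well-formed ASINs in their original order."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        code = line.strip().upper()  # tolerate stray spaces and lowercase input
        if ASIN_PATTERN.fullmatch(code) and code not in seen:
            seen.add(code)
            cleaned.append(code)
    return cleaned


# Illustrative input: one lowercase code, one duplicate, one invalid line.
raw = """
B07JJK7J3K
B00LB01FNO
b003em8008
B07JJK7J3K
not-an-asin
"""

asins = clean_asin_list(raw)
print("\n".join(asins))
```

The printed result can be pasted directly into the "Text list" box described in step 2.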
To follow through you might want to use the URL in this tutorial:
Here are the main steps in this tutorial [Download demo task file here]
- "Go To Web Page" - open the target web page
- Build a "Loop Item" - loop search each ASIN in the list
- Extract data - select the data for extraction
- Customize data fields by refining data - improve the accuracy of a certain data field (Optional)
- Run extraction - run your task and get data
1) "Go To Web Page" - open the target web page
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Website" box
- Click "Save URL" to move on
2) Build a "Loop Item" - loop search each ASIN in the list
By pasting the ASIN list into the “Text list”, we can create a loop search action, with which Octoparse will automatically enter every ASIN in the list into the search box, one code at a time.
- Drop a "Loop Item" action into the workflow designer
- Select "Text list" under "Loop Mode"
- Click "A" bar
- Paste the ASIN list into the text box
- Click "OK" to save
Now, we can see the ASIN list is presented in the Loop Item box. Let’s start creating the loop search action.
- Click the search box on the web page
- Click "Enter text" on the "Action Tips" panel
- Input the first ASIN into the text box
- Click "Ok" to save
We need to adjust the position of the "Enter text" action in the workflow to generate the right execution order for Octoparse.
- Drag the "Enter text" action into the "Loop Item"
- Check "Use the text in Loop Item to fill in the text box"
- Click "Ok" to save
Then we need to locate where the text should be typed in.
- Click "Enter Text"
- Click "Customize" and then change the XPath to "//input[@id='twotabsearchtextbox']"
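To see what this XPath selects, here is a minimal sketch using Python's standard library. The HTML fragment is a hypothetical, heavily simplified stand-in for the search form (only the id "twotabsearchtextbox" comes from this tutorial; the other ids are invented), and the real page is far larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search form markup.
# Only the id "twotabsearchtextbox" is taken from the tutorial.
page = """
<html><body>
  <form>
    <input id="example-hidden-field" type="hidden"/>
    <input id="twotabsearchtextbox" type="text"/>
    <input id="example-submit-button" type="submit"/>
  </form>
</body></html>
""".strip()

root = ET.fromstring(page)

# ElementTree only accepts relative paths, so "//input[...]" becomes ".//input[...]".
matches = root.findall(".//input[@id='twotabsearchtextbox']")

print(len(matches), matches[0].get("type"))
```

Because the expression filters on a unique id, it picks exactly one input no matter how many other inputs the form contains.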
After setting up the "Loop Item" and "Enter text" actions, we need to add a "Click Item" action to trigger the search.
- Click the search button on the web page
- Click "Click button" on the "Action Tips" panel
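The execution order this workflow produces can be pictured as a plain loop. This is only a conceptual sketch of the ordering, not how Octoparse is implemented; `run_workflow`, `enter_text` and `click_search` are hypothetical names that stand in for the workflow actions.

```python
def run_workflow(asins, enter_text, click_search):
    """Mimic the Loop Item: one search per ASIN, in list order."""
    for asin in asins:
        enter_text(asin)   # "Enter text", filled from the current loop item
        click_search()     # "Click Item" on the search button


# Record the actions instead of driving a browser.
log = []
run_workflow(
    ["B07JJK7J3K", "B00LB01FNO"],
    enter_text=lambda asin: log.append(f"enter {asin}"),
    click_search=lambda: log.append("click search"),
)
print(log)
```

If "Enter text" sat outside the loop, it would run only once, which is why the action is dragged inside the "Loop Item".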
Since Amazon loads the search results with AJAX, we need to set up "AJAX Load" to prevent the software from getting stuck.
- Uncheck "Auto retry"
- Check "AJAX Load" and set the timeout to 5s
- Click "Save" to move on
Tips! AJAX timeout can often be used as a web page timeout for a Click action. For example, when a page takes forever to finish loading, long after the data you need has loaded, you can use AJAX timeout to tell Octoparse to move on to the next action once the set time is reached.
Tips 2! A manually modified XPath in Octoparse is often more flexible and accurate than the auto-generated one.
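The general idea behind such a timeout (stop waiting after a fixed time and move on) can be sketched as below. This illustrates the concept only and makes no claim about Octoparse's internals; `wait_for` and its parameters are hypothetical names.

```python
import time


def wait_for(condition, timeout_s, poll_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return condition()  # one last check at the deadline


# Stand-in for a page that never reports "finished loading".
start = time.monotonic()
loaded = wait_for(lambda: False, timeout_s=0.3)
elapsed = time.monotonic() - start

# Whether or not the page finished, the workflow continues after the timeout.
print(loaded, round(elapsed, 1))
```

With a 5s setting, the same logic would wait at most about five seconds before the next action runs.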
3) Extract data - select the data for extraction
- Click the information you need on the page
- Select "Extract data" on the "Action Tips" panel
- Rename the fields by selecting from the predefined list or inputting on your own
4) Customize data fields by refining data - improve the accuracy of a certain data field (Optional)
In this case, the extracted "Price" element contains more than the single number we want, so we use a regular expression to clean the data.
- Select the "Price" data field
- Click "Customize data field"
- Select "Refine extracted data"
- Click "Add step" and then click "Match with Regular Expression"
- Paste the regular expression in "Matching XPath"
The regular expression for the "Price" field is "[0-9.]{5}"
- Click "Evaluate" to make sure the regular expression works
- Click "OK" to move on
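To check how the pattern behaves before pasting it in, you can try it in Python. The sketch below uses made-up sample price strings. It shows that "[0-9.]{5}" matches exactly five digit-or-dot characters, so it fits prices like 12.99 but cuts off or misses prices of other lengths. The second pattern is a hypothetical alternative, not something from Octoparse, and its regex engine may differ slightly from Python's.

```python
import re

SOURCE_PATTERN = re.compile(r"[0-9.]{5}")             # pattern from this tutorial
FLEXIBLE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")  # hypothetical alternative


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Illustrative price strings, not real scraped data.
for sample in ["$12.99", "$129.99", "$9.99"]:
    print(
        sample,
        first_match(SOURCE_PATTERN, sample),
        first_match(FLEXIBLE_PATTERN, sample),
    )
```

If your products span a wide price range, a pattern that allows a variable number of digits is likely the safer choice.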
5) Run extraction - run your task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is the output sample.
Contact us any time if you need our help!
Writer: Vanny
Editor: Fergus