Scrape product information from BestBuy (V7.3)
In this tutorial, we will show you how to scrape data from directories by using BestBuy as an example.
To follow along, you can use this URL in the tutorial:
https://www.bestbuy.com/?intl=nosplash
We will scrape data such as the product title, model, star rating, and the number of reviews from each product page with Octoparse.
This tutorial will also cover:
- Dealing with AJAX when paginating and entering text
Tips! It is recommended that you use the URL of the search result page directly whenever possible. Adding keywords/filters within Octoparse can complicate the task and lead to less efficient scraping.
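If you do start from a search-result URL, a quick way to get it is to run the search once in a normal browser and copy the address bar. The short Python sketch below just URL-encodes a keyword into that pattern; the `searchpage.jsp?st=` parameter is an assumption based on what the browser shows after a manual search, so verify it yourself.

```python
# Minimal sketch: build a BestBuy search-result URL for a keyword so the task
# can start from it directly. The "searchpage.jsp?st=" pattern is an assumption
# taken from a manual search in the browser -- double-check it before relying on it.
from urllib.parse import quote_plus

keyword = "laptop"
search_url = f"https://www.bestbuy.com/site/searchpage.jsp?st={quote_plus(keyword)}"
print(search_url)  # paste this into the "Extraction URL" box in Octoparse
```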
Here are the main steps in this tutorial: [Download task file here]
- Go To Web Page - to open the targeted web page
- Enter Text - to input the keyword in the search box
- Create a pagination loop - to scrape all the details from multiple pages
- Create a "Loop Item" - to loop click into each item on each list
- Extract data - to select the data for extraction
- Start extraction - to run the task and get data
1. Go To Web Page - to open the target web page
- Click "+ Task" to start a task using Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like BestBuy.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2. Enter Text - to input the keyword in the search box
- Enter Text "laptop" and click "OK" on the "Action Tips" panel
- Click the search button in the built-in browser and choose "Click the Button" on the "Action Tips" panel
- Set "Wait before execution" for 30 seconds (optional)
You can set "Wait before execution" to any duration you like.
- Undo "Retry when the page remains unchanged"
- Click "Open the link in new tab"
- Click "Load the page with AJAX" and set timeout as 15s (optional according to your network)
- Set up "Scroll Down" in order to load all items from one page
To fully load the listings, we need to keep scrolling the page down to the bottom. "Interval" is the time between two consecutive scrolls. In this case, we will set "Interval" to 0.5 seconds.
For "Scroll way", keep it as "Scroll down to the bottom of the page".
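For readers who want to see the same ideas outside of Octoparse, here is a minimal Selenium sketch of this step: it enters the keyword, waits for the AJAX-loaded results instead of a full page reload, and scrolls to the bottom at 0.5-second intervals. The CSS selectors are assumptions and may need adjusting against the live BestBuy page; this is not what Octoparse does internally.

```python
# A rough Selenium equivalent of step 2 (assumes Python, selenium, and Chrome).
# Selectors such as "#gh-search-input" and "li.sku-item" are guesses -- check
# them in your browser's devtools before using this sketch.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.bestbuy.com/?intl=nosplash")

# type the keyword into the search box and submit
box = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#gh-search-input"))
)
box.send_keys("laptop", Keys.ENTER)

# "Load the page with AJAX, timeout 15s" ~ wait up to 15s for result items to appear
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.sku-item"))
)

# "Scroll down to the bottom of the page" with a 0.5s interval between scrolls
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
```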
3. Create a pagination loop - to scrape all the details from multiple pages
- Turn off the "Workflow" toggle for a better view
- Scroll down the page and click the next page button ">"
- Click "Loop click next page" on the "Action Tips"
- Turn the "Workflow" toggle back on to set up the advanced options
- Undo the "Retry when the page remains unchanged"
- Click "Open the link in new tab"
- Click "Load the page with Ajax" and set timeout 10s (optional according to your network)
- Set up "Scroll Down" to scroll the screen down to the bottom
"Interval" is the time interval between every two scrolls. In this case, we are going to set "Interval" as 0.5 seconds.
In this case, we set up "Scroll Down" as 20 times as an example.
- Change the scroll way to "Scroll down for one screen"
Tips! To learn more about how to deal with infinite scrolling in Octoparse, please refer to the related tutorial on handling infinite scroll.
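Conceptually, the pagination loop built in this step works like the sketch below, which continues the Selenium session from the previous sketch. The "next page" selector is a guess for illustration only; once you click "Loop click next page", Octoparse locates the button for you.

```python
# Rough sketch of the pagination loop: keep clicking the ">" button until it
# disappears. "a.sku-list-page-next" is an assumed selector for BestBuy's arrow.
import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

page = 1
while True:
    # ... extract the data you need from the current results page here ...
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "a.sku-list-page-next")
    except NoSuchElementException:
        break  # no ">" button left, so this is the last page
    driver.execute_script("arguments[0].click();", next_btn)  # AJAX pagination, no full reload
    time.sleep(10)  # crude stand-in for the 10s AJAX timeout set above
    page += 1
```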
4. Create a "Loop Item"- to scrape all the reviews on one page
- Click the title of the first listing on the current page
- Click "Select All" on the Action Tips panel
- Click "Extract link text"
Octoparse will automatically select all the listings on the current page and loop through them, clicking into each one.
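The "Loop Item" is equivalent to collecting every listing link on the results page and visiting each product page in turn, roughly like the sketch below (continuing the Selenium session above; the `h4.sku-title a` selector is an assumption).

```python
# Conceptual equivalent of the "Loop Item": gather all listing links, then open
# each product page. "h4.sku-title a" is a guessed selector for the title link.
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "h4.sku-title a")]
for url in links:
    driver.get(url)  # open the product page
    # ... extract title, model, star rating, and review count here (see next step) ...
```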
5. Extract data - to select the data for extraction
- Click on the data you need on the page
- Select "Extract text of the selected element" from the "Action Tips"
- Rename the fields by selecting from the pre-defined list or typing in your own names
Tips! If you want to learn more about XPath and how to generate it, check out the related XPath tutorial.
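Under the hood, each field is simply an XPath applied to the product page HTML. The sketch below uses lxml to show the idea; the XPath expressions are placeholder assumptions for illustration, while Octoparse generates the real ones for you (and you can paste your own into a field's XPath setting if the auto-generated one misses).

```python
# Minimal illustration of "Extract text of the selected element": apply an XPath
# to the page HTML and take the first text node. The XPaths below are assumptions.
from lxml import html

tree = html.fromstring(driver.page_source)  # product page loaded in the loop above

def first_text(xpath):
    nodes = tree.xpath(xpath)
    return nodes[0].strip() if nodes else None

record = {
    "title":   first_text("//div[contains(@class,'sku-title')]//h1/text()"),
    "model":   first_text("//div[contains(@class,'model')]//span[contains(@class,'product-data-value')]/text()"),
    "rating":  first_text("//span[contains(@class,'ugc-c-review-average')]/text()"),
    "reviews": first_text("//span[contains(@class,'c-reviews')]/text()"),
}
print(record)
```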
6. Start extraction - to run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is a sample of the output for your reference.
Author: Cathy
Editor: Yvonne