Scraping product information from Target.com
Target.com is one of the largest online retailers in the United States. In this tutorial, we will show you how to scrape product information from Target.com.
If you would like to know how to build the task from scratch, you may read the following tutorial.
Because of the website structure, we need to use two tasks to achieve the goal. In Task 1, we will scrape the URL of each item page; in Task 2, we will scrape the detailed product information, such as the product title, price, and description, from each product detail page. Splitting one task into two improves extraction speed to a certain degree, especially when using Octoparse Cloud Extraction (see the conceptual sketch after the example URL below).
To follow through, you may want to use this URL in the tutorial:
https://www.target.com/c/milk-substitutes-dairy-grocery/-/N-5xszh?lnk=MilkMilkSubstit
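Conceptually, the two tasks form a simple collect-then-scrape pipeline. The plain-Python sketch below (not part of Octoparse) only illustrates that structure; the function names are hypothetical, and their bodies are left unimplemented because the actual work is built visually in Octoparse.

```python
# A conceptual sketch, outside Octoparse, of why the job is split in two:
# Task 1 collects the detail-page URLs from the listing pages, and Task 2
# visits each collected URL to extract the product details.
START_URL = (
    "https://www.target.com/c/milk-substitutes-dairy-grocery/-/N-5xszh"
    "?lnk=MilkMilkSubstit"
)

def task1_collect_urls(listing_url: str) -> list[str]:
    """Paginate through the search results and return every detail-page URL."""
    raise NotImplementedError  # built visually in Octoparse as Task 1

def task2_scrape_details(detail_urls: list[str]) -> list[dict]:
    """Open each detail page and extract title, price, description, etc."""
    raise NotImplementedError  # built visually in Octoparse as Task 2

if __name__ == "__main__":
    urls = task1_collect_urls(START_URL)
    products = task2_scrape_details(urls)
```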
This tutorial will also cover:
- Dealing with AJAX for pagination
Here are the main steps in this tutorial: [Download task file here ]
Task 1: Extract all the URLs of detail pages on the search result pages
- "Go to Web Page" - open the target web page
- Create a pagination loop - scrape all the results from multiple search results pages
- Build a "Loop Item"- loop extract each URL on the search results pages
- Start extraction - run the task and get data
Task 2: Collect the product information from scraped URLs
- Input a batch of the scraped URLs - loop open the detail pages
- Extract data - select the data for extraction
- Start extraction - run the task and get data
Task 1: Extract all the URLs of detail pages on the search result pages
1. "Go To Web Page" - open the target web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For websites with complex structures, like Target.com, we strongly recommend starting your data extraction project with Advanced Mode.
- Paste the URL into the "Website" box and click "Save URL" to move on
2. Create a pagination loop - scrape all the results from multiple pages
- Scroll down and click the ">" button
- Click "Loop click the selected link" on the "Action Tips" panel
- Set up AJAX Load for the "Click to paginate" action
Target.com applies the AJAX technique to the pagination button, so we need to set up AJAX Load for the "Click to paginate" action (a conceptual sketch follows at the end of this step).
- Uncheck the box for "Retry when page remains unchanged (use discreetly for AJAX loading)"
When you have set up the AJAX load, remember to uncheck "Auto Retry".
- Check the box for "Load the page with AJAX" and set up AJAX Timeout as "3" seconds
You can set up a longer timeout to make sure the page loads well.
- Click "OK" to save
Tips! If you want to learn more about AJAX, here are the related tutorials you might need:
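If you are curious what the AJAX Load setting corresponds to outside Octoparse, here is a minimal, hedged Selenium sketch: click the pagination button, then wait up to a timeout for the new results to render in place instead of waiting for a full page reload. The CSS selectors are assumptions and may not match Target.com's current markup.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(
    "https://www.target.com/c/milk-substitutes-dairy-grocery/-/N-5xszh"
    "?lnk=MilkMilkSubstit"
)

wait = WebDriverWait(driver, 10)

# Click the ">" pagination button once it is clickable (selector is assumed).
next_button = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button[aria-label='next page']"))
)
next_button.click()

# Equivalent to Octoparse's AJAX timeout: wait a few seconds for the next
# page of results to render in place (selector is assumed).
wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li[class*='h-padding-a-none']"))
)
driver.quit()
```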
3. Build a "Loop Item"- loop extract each URL on the search results pages
- Click "Go To Web Page" to go back to the first page and then click the "Pagination" box
When extracting data across multiple pages, you should always begin building your task on the first page.
- Select the title of the first item in the list
- Click "Select All"
- Click "Extract the URLs of the selected elements"
You may notice that not all the items on the list are selected correctly. In this case, we need to revise the default XPath of the Loop Item to locate all the items (a sketch for testing these XPaths follows at the end of this step).
- Select "Loop Item" in the workflow
- Select "Variable list", and then enter the revised XPath below:
- //li[contains(@class,'h-padding-a-none')]
- Click "OK" to save
You can also add a wait time to this step so that the website will have enough time to load.
- Go to "Wait before execution" and select a time from the drop-down menu based on your Internet condition. For the demonstration, we set "10" seconds.
In addition, for some items in the Loop Item, Octoparse cannot find a corresponding URL. In this case, we need to customize the XPath of the data field.
- Click the icon, and then select "Customize XPath"
- Enter the revised XPath below to the text box of "Relative XPath":
- //A[1]
- Click "OK" to save
Tips! 1. "Variable list" is a loop mode in Octoparse. For more about loop modes in Octoparse: 2. If you want to learn more about XPath and how to generate it, here are the related tutorials you might need: |
4. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is the sample output for Task 1.
Tips! If the page content has already appeared but the page is still loading, you can click the "X" button at the right end of the navigation bar to stop loading.
Task 2: Collect the product information from scraped URLs
1. Input a batch of the scraped URLs - loop open the detail pages
In Task 1, we have already obtained a batch of URLs.
- Click "+ Task" to start a task using Advanced Mode to build Task 2
- Input batch URL
There are three ways to batch import URLs to any single task/crawler (up to one million URLs). In this case, we will show you how to batch import URLs from a local file.
You can also copy the URLs from the Task 1 output file and paste them into the "Website" text box.
For further study, please refer to Batch Import URLs. A hedged Python sketch of the same batching idea follows at the end of this step.
- Select "Input from file", and then select the output file of the scrape URLs
- Click "Save URL"
2. Extract data - select the data for extraction
As we can see, we are on the detail page now.
- Click the information you need on the page
- Select "Extract text of the selected element" on the "Action Tips" panel
In this step, we can rename the fields by selecting from the predefined list or entering names of our own.
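Here is a hedged sketch of what "Extract text of the selected element" amounts to conceptually. The XPath selectors below are hypothetical placeholders; Octoparse generates the real ones when you click elements on the page.

```python
from lxml import html

def extract_product_fields(page_html: str) -> dict:
    tree = html.fromstring(page_html)

    def first_text(xpath_expr: str) -> str:
        nodes = tree.xpath(xpath_expr)
        # Mirror Octoparse's default behavior: leave the field blank when no
        # matching element is found (e.g., products without a rating).
        return nodes[0].text_content().strip() if nodes else ""

    return {
        "title": first_text("//h1"),                             # hypothetical selector
        "price": first_text("//*[@data-test='product-price']"),  # hypothetical selector
    }
```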
3. Start extraction - run the task and get data
- Click "Save"
- Click "Start Extraction" on the upper left side
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
For premium users, Cloud Extraction is highly recommended.
Now you have the data you want. There are some blank fields in the output because some product items have no rating values.
Splitting the task helps improve the efficiency of data extraction and minimizes problems caused by small changes to the website.
Tips! By default, if Octoparse cannot find an element matching the defined pattern on the page, the field will be left blank. However, Octoparse may fail to find the element even when it is shown on the website. If you encounter this problem, here is a related tutorial you might need:
Article in Spanish: Extracción de información del producto de Target.com
You can also read web scraping articles on the official website.
Related articles:
Scrape Amazon product information with ASIN/UPC
Scrape product information from Sam's Club
Author: Vanny
Editor: Kara
Contact us at any time if you need our help!