Scraping product information from Target.com
Target.com is one of the largest online retailers in the United States. In this tutorial, we will show you how to scrape product information from Target.com.
If you would like to know how to build the task from scratch, you may read the following tutorial.
Because of the website structure, we need to use two tasks to achieve the goal. We will scrape the URL of each item page in Task 1, and then scrape the detailed product information, such as the product title, price, and description, from the product detail page in Task 2. Splitting the job into two tasks can improve extraction speed to a certain degree, especially when using Octoparse Cloud Extraction.
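Conceptually, the two-task split is a two-stage pipeline: stage one harvests detail-page URLs from the listing pages, and stage two visits each URL to extract fields. Here is a minimal Python sketch of stage one using only the standard library. The HTML sample, class names, and the assumption that product links contain a "/p/" path segment are all illustrative; Target.com renders its listings with JavaScript, so a real scraper would need a browser-based tool such as Octoparse rather than plain HTTP.

```python
from html.parser import HTMLParser

# Stand-in for one search-result page; a real run would need the rendered
# HTML (Target.com loads listings via JavaScript, so plain HTTP won't see it).
LISTING_HTML = """
<div class="grid">
  <a href="/p/oat-milk-32oz/-/A-111">Oat Milk</a>
  <a href="/p/almond-milk-64oz/-/A-222">Almond Milk</a>
  <a href="/help/returns">Returns</a>
</div>
"""

class ProductLinkParser(HTMLParser):
    """Task 1 idea: collect detail-page URLs.

    The '/p/' prefix as a marker for product pages is an assumption
    made for this sketch, not a documented Target.com contract.
    """
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/p/"):
                self.urls.append("https://www.target.com" + href)

def task1_collect_urls(html):
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.urls

urls = task1_collect_urls(LISTING_HTML)
print(urls)
```

Task 2 would then loop over `urls` and extract the title, price, and description from each detail page, which is exactly what the two Octoparse tasks below do through the point-and-click interface.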
To follow through, you may want to use this URL in the tutorial:
https://www.target.com/c/milk-substitutes-dairy-grocery/-/N-5xszh?lnk=MilkMilkSubstit
Here are the main steps in this tutorial: [Download task file here]
Task 1: Extract all the URLs of detail pages on the search result pages
- "Go to Web Page" - open the target web page
- Auto-detect the web page data - create the workflow
- Set up AJAX timeout for the "Click to Paginate" action
- Run extraction - run your task and get data
Task 2: Collect the product information from scraped URLs: [Download task file here]
- Input a batch of the scraped URLs - loop open the detail pages
- Extract data - select the data for extraction
- Set up wait time to slow down the scraping
- Start extraction - run the task and get data
Task 1: Extract all the URLs of detail pages on the search result pages
1. Go To Web Page - open the target web page
- Enter the URL on the home page and click Start
2. Auto-detect the web page data - create the workflow
- Click "Auto-detect web page data" and wait for the detection to complete
- Go to "Data preview" to see if you're okay with the current data output (remember to keep the URL field of the product)
- You can delete unnecessary data fields directly by clicking the icon
- You can also modify the data field names here directly by clicking the icon
- Click "Create workflow"
3. Set up AJAX timeout for the "Click to Paginate" action
Target uses AJAX to load the next pages, so we need to set up an AJAX timeout.
- Open the settings of the "Click to Paginate" action
- Tick "Load with AJAX"
- Set the AJAX timeout to 7-10s
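The AJAX timeout is essentially a bounded poll: the scraper proceeds as soon as the newly loaded content appears, but never waits longer than the limit. A small Python sketch of that pattern (the helper name and the simulated page check are our own, not part of Octoparse):

```python
import time

def wait_for(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    This mirrors what an AJAX timeout does: continue once the page has
    loaded its new content, but give up after the time limit so a slow
    or failed request cannot stall the whole run.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Simulated page that "finishes loading" after a few polls.
polls = {"count": 0}
def page_loaded():
    polls["count"] += 1
    return polls["count"] >= 3

ready = wait_for(page_loaded, timeout=2.0, interval=0.01)
print(ready)
```

A 7-10 second limit is generous enough for Target's pagination requests to complete while keeping a hard upper bound on how long each page turn can take.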
4. Run extraction - run your task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run on your device" to run the task on your computer, or select "Run in the Cloud" to run the task in the Cloud (for premium users only)
Here is the sample output for Task 1.
Task 2: Collect the product information from scraped URLs
1. Input a batch of the scraped URLs - loop open the detail pages
In Task 1, we have already obtained a batch of URLs.
- Click "New +" to start a task using Advanced Mode to build Task 2
- Select "Enter manually", and then paste the URLs scraped
- Click "Save URL"
Tips! There are three ways to batch import URLs into a single task/crawler (up to one million URLs). You can also copy the URLs from the Task 1 extraction output file and paste them into the "Website" text box. For further study, please refer to Batch Import URLs.
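If you export Task 1's results to a CSV file, pulling out just the URL column for pasting is a one-liner with the standard library. In this sketch the column name `Product_URL` and the sample rows are assumptions; use whatever field name you kept in Task 1:

```python
import csv
import io

# Hypothetical export from Task 1: one row per product, with the URL in a
# column we named "Product_URL" (the field name is an assumption).
TASK1_EXPORT = """\
Title,Product_URL
Oat Milk,https://www.target.com/p/oat-milk/-/A-111
Almond Milk,https://www.target.com/p/almond-milk/-/A-222
"""

def read_urls(csv_text, column="Product_URL"):
    """Return the values of one column from CSV text, in row order."""
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]

urls = read_urls(TASK1_EXPORT)
# One URL per line, ready to paste into the "Website" text box:
print("\n".join(urls))
```

With a real file you would pass `open("task1_export.csv", newline="")` to `csv.DictReader` instead of the in-memory sample.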
2. Extract data - select the data for extraction
- Select information on the web page
- Choose "Extract text of the selected element"
- Repeat the above steps to extract all the data you need
- Click the edit icon to modify the field names if needed
3. Set up wait time to slow down the scraping
- Open the settings of the "Extract Data" action
- Tick "Wait before action"
- Set the wait time to 7-10s
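Unlike the AJAX timeout (an upper bound on how long to wait for content), this wait time is deliberate throttling: pausing between page visits keeps the request rate low. A randomized pause in the same 7-10 second range, sketched in Python (the helper is our own illustration, not an Octoparse API):

```python
import random

def polite_delay(low=7.0, high=10.0):
    """Return a randomized pause length in seconds.

    Varying the wait between requests, rather than using a fixed
    interval, makes the traffic pattern less mechanical and keeps
    the load on the site low.
    """
    return random.uniform(low, high)

# In a hand-rolled scraper you would sleep this long between detail pages:
#   time.sleep(polite_delay())
delay = polite_delay()
print(round(delay, 2))
```

Octoparse applies the same idea through "Wait before action"; slowing the task down this way trades speed for a much lower chance of being blocked.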
4. Start extraction - run the task and get data
- Click "Save"
- Click "Run" on the upper left side
- Select "Run task on your device" to run the task on your computer, or select "Run task in the cloud" to run the task in the Cloud (for premium users only)
Here is the sample output.
Is this article helpful? Contact us any time if you need our help!
Author: Kara
Editor: Yina