You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

SHEIN is an online fast-fashion retailer that has had a major impact on the fast-fashion industry and is a big hit on TikTok. It focuses mainly on women's wear at low prices.

To follow along, you can use the URL from this tutorial:

https://us.shein.com/women-dresses-c-1727.html?ici=us_tab01navbar06&scici=navbar_WomenHomePage~~tab01navbar06~~6~~webLink~~~~0&srctype=category&userpath=category%3EDRESSES
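If you want to see what the start URL above is made of before pasting it into Octoparse, the following minimal Python sketch splits it into host, path, and query parameters using only the standard library. It is purely illustrative and not part of the Octoparse workflow; the names `START_URL`, `parts`, and `params` are our own.

```python
from urllib.parse import urlsplit, parse_qs

# The start URL used in this tutorial, split across lines for readability.
START_URL = (
    "https://us.shein.com/women-dresses-c-1727.html"
    "?ici=us_tab01navbar06"
    "&scici=navbar_WomenHomePage~~tab01navbar06~~6~~webLink~~~~0"
    "&srctype=category"
    "&userpath=category%3EDRESSES"
)

parts = urlsplit(START_URL)

# parse_qs returns a list per key; each key appears once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print("host:", parts.netloc)
print("path:", parts.path)
for key, value in params.items():
    # Percent-encoded characters such as %3E are decoded automatically.
    print(f"{key} = {value}")
```

The path identifies the category page, while the query string carries extra parameters that the site appends when you navigate from its menu.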

We will scrape data such as the product name, price, image URL, SKU, number of reviews, and rating score.

Here are the main steps in this tutorial: [Download task file here]

  1. "Go to Web Page" - to open the targeted web page

  2. Auto-detect web page - to create a workflow

  3. Click on links to scrape the linked page - to extract detailed product information

  4. Extract data - to select the data for extraction

  5. Run task - to get the data you want


1. "Go to Web Page" - to open the targeted web page

  • Enter the URL on the home page and click Start


2. Auto-detect web page - to create a workflow

  • Choose Auto-detect web page data

  • Wait for the detection to complete

  • Check the data fields in the Data Preview, and delete or rename any fields as needed

  • Untick Add a page scroll

  • Click the Create workflow button on the Tips panel


3. Click on links to scrape the linked page - to extract detailed product information

  • Choose Click on links to scrape the linked page on the Tips panel

  • Select the "Title_URL" field from the drop-down menu (you can check the Data Preview to confirm that it is the correct link)

  • Click Confirm


4. Extract data - to select the data for extraction

  • Click on the data you want to extract on the page

  • Select Extract the text of the selected element on the Tips panel

  • Repeat these steps until you have selected all the data you need

  • Rename the data fields if needed

Scraping the product rating is a little tricky in this case because the page shows it as stars, so there is no text we can extract directly. We need to get the value from the element's HTML source instead.

  • Select the stars

  • Choose Extract outer HTML of the selected element

  • Click on the More button and choose Clean Data

  • Click Add Step and choose Match with Regular Expression

  • Select Not sure about RegEx? Try the RegEx tool!

  • Tick Start with and End with

  • Enter "Average Rating " (with a trailing space) in the Start with box

  • Enter a single space in the End with box

  • Click on Generate and Apply

  • Confirm and Apply to save
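
To make the Start with / End with idea concrete, here is a minimal Python sketch of the same kind of match outside Octoparse. It is an illustration under assumptions, not the exact expression the RegEx tool generates: the sample HTML in `SAMPLE_OUTER_HTML` is hypothetical (the real markup on the site may differ), and `RATING_PATTERN` and `extract_rating` are names made up for this example.

```python
import re
from typing import Optional

# Hypothetical outer HTML of the star element; the real page markup may differ.
SAMPLE_OUTER_HTML = (
    '<div class="rate-star" aria-label="Average Rating 4.85 Reviews 1000">'
    '<i class="star"></i></div>'
)

# "Start with" = "Average Rating " (note the trailing space),
# "End with"   = a single space.
# The lookbehind and lookahead keep both markers out of the matched text.
RATING_PATTERN = re.compile(r"(?<=Average Rating ).+?(?= )")


def extract_rating(outer_html: str) -> Optional[str]:
    """Return the text between the two markers, or None if they are absent."""
    match = RATING_PATTERN.search(outer_html)
    return match.group(0) if match else None


print(extract_rating(SAMPLE_OUTER_HTML))
```

The trailing space in the start marker matters: without it, the extracted value would begin with a space. If the markers are not found, the function returns `None`, which corresponds to an empty field in the scraped data.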


5. Run task - get the data you want

  • Click Save, then click Run in the upper right corner

  • Select Run on your device to run the task on your computer

Here is the sample output:

mceclip9.png