Lesson 7: Wrap-up! Build your first scraping task
This is the last lesson of the intro series. We hope you've had fun learning something new and useful. To put all the pieces together, let's recap with a step-by-step tutorial on how to build a scraping task from scratch. We'll walk you through the entire process, from entering the URL to downloading the extracted data. Let's dive right in.
For this example, we'll scrape product information and prices from eBay.com.
1. Start a new task
- Open the Octoparse App and enter the target URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1312.R1.TR11.TRC2.A0.H0.Xwireless.TRS1&_nkw=wireless+earbuds&_sacat=0) into the search bar. Click "Start" to create a new task.
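Before pasting a long search URL, it can help to see what it actually asks for. The sketch below is not part of Octoparse. It uses Python's standard library to split the target URL into its query parameters. Judging by its value, `_nkw` appears to carry the search keywords, but that reading is an assumption and not something the URL itself documents.

```python
from urllib.parse import urlsplit, parse_qs

# The target URL from step 1, split across two string literals for readability.
url = ("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1312"
       ".R1.TR11.TRC2.A0.H0.Xwireless.TRS1&_nkw=wireless+earbuds&_sacat=0")

parts = urlsplit(url)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.netloc)        # www.ebay.com
print(params["_nkw"][0])   # wireless earbuds ("+" decodes to a space)
```

Changing the value of `_nkw` in the URL is a quick way to point the same task at a different search.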
2. Let Octoparse do the Auto-detect
As soon as the webpage is loaded in the built-in browser, select "Auto-detect web page data" on the Action Tips. Octoparse will start detecting web page data right away. You can check the progress bar and wait patiently for it to finish.
3. Preview your data
Once the auto-detect process is completed, check your data in the preview section. You can double-click a field name to rename it, or remove the fields that are not needed.
4. Save auto-detect settings
Now, go back to "Tips" and check the settings.
4.1 Do you want to scroll down to load more data? → "No, this web page does not need to be scrolled", so uncheck the box for scroll down.
4.2 Do you want to scrape more pages? → "Yes", so check the box for pagination.
4.3 Do you have the correct Next Page button selected? → "Yes": check it and confirm that the button is highlighted.
Once you have completed all the actions on "Tips", click "Create workflow" and wait patiently while Octoparse auto-generates the workflow. It is important to ensure that each of the settings is correctly configured as these are the exact settings that Octoparse will use to generate the scraping task.
To scrape more information from each product's detail page, add one more step to the workflow so that Octoparse clicks each product link on the page automatically.
4.4 Click "Click on link(s) to scrape the linked page(s)"
Choose the option "Click on an extracted data field", select "product_url" from the dropdown menu, and click "Confirm".
Notice that an extra step, "Click URL in the list", is added to the workflow.
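If it helps to picture what the workflow now does, the logic can be read as two nested loops: an outer pagination loop and an inner loop over the product links on each page. The sketch below is purely conceptual and is not how Octoparse is implemented. The page names, the `LISTING_PAGES` and `DETAIL_PAGES` dictionaries, and the `run_workflow` function are all hypothetical stand-ins for real web pages.

```python
# Hypothetical in-memory stand-in for a paginated site: each listing page holds
# product URLs and the name of the next page (None on the last page).
LISTING_PAGES = {
    "page-1": {"product_urls": ["item-a", "item-b"], "next": "page-2"},
    "page-2": {"product_urls": ["item-c"], "next": None},
}
# Hypothetical detail pages, keyed by product URL.
DETAIL_PAGES = {
    "item-a": {"title": "Earbuds A", "price": "$19.99"},
    "item-b": {"title": "Earbuds B", "price": "$24.50"},
    "item-c": {"title": "Earbuds C", "price": "$9.99"},
}

def run_workflow(start_page):
    rows = []
    page = start_page
    while page is not None:                          # outer loop: pagination
        listing = LISTING_PAGES[page]
        for product_url in listing["product_urls"]:  # inner loop: each product link
            detail = DETAIL_PAGES[product_url]       # "Click URL in the list"
            rows.append({"product_url": product_url, **detail})  # extract data
        page = listing["next"]                       # click the Next Page button
    return rows

rows = run_workflow("page-1")
print(len(rows))  # 3
```

Reading the workflow this way also explains why nested steps are tested from the inside out: the inner steps must work on one page before the outer loop repeats them.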
5. Select data from the detail page
You will now arrive on the detail page. Once again, select "Auto-detect web page data" on the Action Tips. The auto-detection process will start automatically. You can switch between the detected results until you have the right data selected.
Click "Create workflow" to update the workflow with the detail-page steps.
You can also manually select the information on the web page to scrape it.
6. Clean the extracted data
Looking at the extracted data, there are a few things we would like to change. For example, to get rid of the preposition "from" in the "Location" field, we can use "Clean Data".
Click the "more" icon in the top right corner and select "clean data".
Then click "Add step" - "Replace". To remove "from" in every row, replace "from" with nothing, that is, leave the replacement value empty.
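The Replace step handles this inside Octoparse. For readers who prefer to clean data after export, here is a minimal Python sketch of the same idea. The sample values and the `clean_location` helper are hypothetical. Unlike a plain find-and-replace, this version only strips a leading "from", which is a deliberate choice so that place names containing those letters elsewhere are left alone.

```python
import re

# Hypothetical sample values for the "Location" field.
locations = ["from United States", "from China", "Japan"]

def clean_location(value: str) -> str:
    # Drop a leading "from" plus the whitespace after it; leave other text as-is.
    return re.sub(r"^\s*from\s+", "", value).strip()

cleaned = [clean_location(v) for v in locations]
print(cleaned)  # ['United States', 'China', 'Japan']
```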
7. Test-run the task
The scraping task is now completed. As mentioned before, it's always recommended that you test the workflow step by step, making sure that each step does what it needs to do. For example, if you click on "Go to Web Page", it should load the web page in the built-in browser without a problem.
Launch the workflow and click through all the steps from top to bottom, and from the inside out for nested steps (like pagination). Observe whether the web page responds as expected. The detailed testing methodology is described in a separate tutorial; feel free to check it out.
8. Schedule and run
Now that your task is fully tested and working, you can extract the data much faster by running the task in the Cloud, or you can schedule it to run on a recurring basis.
To start a cloud run, click "Run Now" under "Run in the Cloud".
To schedule the task, click "Schedule Local Runs" or "Schedule Cloud Runs".
Pick your desired frequency and designate a day and time for the run.
9. Export your data
Go to the Dashboard to find your task and open the task status to view the extracted data. Click "Export Data" at the bottom and choose the format in which you'd like to download the data.
Tips! Check this step-by-step tutorial for how to download the extracted data.
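Once the data is downloaded, you may want to work with it in a script. The sketch below assumes you chose CSV as the export format. The column names (`title`, `price`, `Location`), the sample rows, and the `load_rows` helper are hypothetical and will differ depending on the fields in your own task.

```python
import csv
import io

# Hypothetical sample of an exported CSV file; real exports use your own field names.
SAMPLE_EXPORT = """title,price,Location
Earbuds A,$19.99,United States
Earbuds B,$24.50,China
"""

def load_rows(file_obj):
    rows = []
    for row in csv.DictReader(file_obj):
        # Turn a price string such as "$1,299.00" into a number.
        row["price"] = float(row["price"].lstrip("$").replace(",", ""))
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE_EXPORT))
cheapest = min(rows, key=lambda r: r["price"])
print(cheapest["title"])  # Earbuds A
```

To read a real file instead of the in-memory sample, pass an opened file to `load_rows`, for example `open("export.csv", newline="", encoding="utf-8")`, where `export.csv` is a placeholder for your own file name.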
Congrats! You've done a great job making it this far and working your way toward becoming the next web scraping expert. We hope this is not the end of your learning but the beginning of your web scraping journey.
If you have any questions, whether they are task-related, web scraping-related, or service-related, let us help. The Octoparse team is proud of being part of your web scraping experience.
Author: Brian
Editor: Yina