This is the last lesson of the intro series. We hope you've had fun learning something new and useful. To put all the pieces together, let's recap with a step-by-step tutorial on building a scraping task from scratch. We'll walk you through the entire process, from entering the URL to downloading the extracted data. Let's dive right in.
For this example, we'll scrape product information and prices from this sample URL: https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1312.R1.TR11.TRC2.A0.H0.Xwireless.TRS1&_nkw=wireless+earbuds&_sacat=0
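To see what the sample URL is asking for, it can help to split its query string into parameters. The sketch below uses only Python's standard library; the variable names are ours and are not part of Octoparse. Judging by its value, the `_nkw` parameter appears to carry the search keywords.

```python
from urllib.parse import urlsplit, parse_qs

SAMPLE_URL = (
    "https://www.ebay.com/sch/i.html?_from=R40"
    "&_trksid=p2380057.m570.l1312.R1.TR11.TRC2.A0.H0.Xwireless.TRS1"
    "&_nkw=wireless+earbuds&_sacat=0"
)

parts = urlsplit(SAMPLE_URL)
# parse_qs maps each parameter name to a list of its decoded values.
params = parse_qs(parts.query)

# A "+" in a query string decodes to a space.
search_term = params["_nkw"][0]

print(parts.netloc, parts.path)
print(search_term)
```

Changing the value of `_nkw` in the URL is a quick way to point the same task at a different search.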
1. Start a new task
- Enter the target URL into the search bar, then click Start to create a new task.
2. Start the Auto-detect
As soon as the webpage is loaded in the built-in browser, select Auto-detect web page data from the Tips panel. Octoparse will start detecting web page data right away.
3. Preview your data
Once the auto-detect process is completed, go ahead and check your data in the Data preview section. Double-click the field name to rename it or click the trash icon to remove those that are not needed.
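If you think of each extracted row as a dictionary, renaming and removing fields amounts to the following. This is only an illustration in Python: the field names and the sample row are made up, and `tidy_fields` is a hypothetical helper, not an Octoparse function.

```python
def tidy_fields(rows, rename=None, drop=()):
    """Return new rows with some fields renamed and others removed."""
    rename = rename or {}
    cleaned = []
    for row in rows:
        cleaned.append(
            {rename.get(name, name): value
             for name, value in row.items()
             if name not in drop}
        )
    return cleaned


# Made-up sample row with generic, auto-generated-looking field names.
rows = [{"Field1": "Wireless Earbuds A", "Field2": "$19.99", "Field3": "Sponsored"}]

tidy = tidy_fields(
    rows,
    rename={"Field1": "title", "Field2": "price"},
    drop={"Field3"},
)
print(tidy)
```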
4. Save auto-detect settings
Go back to the Tips panel and review the settings below:
- Check the Add a page scroll box if your target website loads more items as the page scrolls
- Check the Paginate to scrape more pages box if you'd like to scrape more than one page
- Check that the correct pagination button (highlighted on the web page) has been selected
Now, click Create workflow and Octoparse will auto-generate the workflow.
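Conceptually, a paginated workflow is a loop: extract the items on the current page, follow the next-page button, and stop when there is none. The Python sketch below shows that idea on a hypothetical in-memory "site"; it is not how Octoparse is implemented, and the page keys and the `max_pages` limit are assumptions made for illustration.

```python
# Hypothetical stand-in for a website: each page lists its items and
# the key of the next page (None on the last page).
PAGES = {
    "page-1": {"items": ["item A", "item B"], "next": "page-2"},
    "page-2": {"items": ["item C"], "next": None},
}


def scrape_all(start, pages, max_pages=10):
    """Collect items page by page until there is no next page."""
    results = []
    current = start
    visited = 0
    while current is not None and visited < max_pages:
        page = pages[current]
        results.extend(page["items"])   # extract data from this page
        current = page["next"]          # "click" the pagination button
        visited += 1
    return results


all_items = scrape_all("page-1", PAGES)
print(all_items)
```

The `max_pages` cap plays the same role as limiting how many pages a task visits: it keeps a loop from running longer than intended.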
If you also want to scrape more data from the product detail page, in addition to the listing page, follow the steps below:
- Select Click on link(s) to scrape the linked page(s)
- Choose the option Click on an extracted data field, select product_url from the dropdown menu, and click Confirm
Notice that an extra step, Click URL in the list, has been added to the workflow.
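The added step can be pictured as a second loop: for every listing row, open the page behind its product_url and merge the detail fields into the row. Here is a minimal Python sketch of that idea. The URLs, the detail fields, and the `fetch_detail` stub are all hypothetical; a real task would load the page in the browser instead.

```python
# Hypothetical detail-page data keyed by URL (a stub for loading the page).
DETAIL_PAGES = {
    "https://example.com/item/1": {"location": "from United States"},
    "https://example.com/item/2": {"location": "from China"},
}


def fetch_detail(url):
    """Stub: return the detail fields for one product URL."""
    return DETAIL_PAGES[url]


def enrich_with_details(rows, fetch):
    """Visit each row's product_url and merge the detail fields into the row."""
    enriched = []
    for row in rows:
        detail = fetch(row["product_url"])
        enriched.append({**row, **detail})
    return enriched


listing_rows = [
    {"title": "Wireless Earbuds A", "product_url": "https://example.com/item/1"},
    {"title": "Wireless Earbuds B", "product_url": "https://example.com/item/2"},
]

full_rows = enrich_with_details(listing_rows, fetch_detail)
print(full_rows[0])
```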
5. Select data from the detail page
You will now arrive on the detail page. Once again, select Auto-detect web page data from the Tips panel.
TIP: The auto-detection process will start automatically. You can switch between the detected results until you have the right data selected.
Click Create workflow and the updated workflow should look like this:
You can also manually select the information on the web page to scrape it:
6. Clean the extracted data
Looking at the extracted data, there are a few things we'd like to change. For example, to get rid of the preposition "from" in the "Location" field, we can use Clean data.
Click the more icon in the top right corner and select Clean data.
Then click Add step and choose Replace. To remove "from" from every row, replace "from" with nothing (leave the replacement empty), as shown in the GIF below.
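For comparison, here is roughly what the same cleanup looks like in Python. The sample values are invented, and `strip_from_prefix` is a hypothetical helper. A plain replace mirrors the Replace step; the regular-expression version is a slightly stricter variant that only removes "from" when it is a separate word at the start of the value.

```python
import re


def strip_from_prefix(location):
    """Remove a leading "from" (any capitalization) and the space after it."""
    return re.sub(r"^\s*from\s+", "", location, flags=re.IGNORECASE).strip()


# Plain replacement, like replacing "from" with nothing, then trimming spaces.
simple = "from United States".replace("from", "").strip()

# Stricter version: values without the prefix are left unchanged.
samples = ["from United States", "From China", "Japan"]
cleaned = [strip_from_prefix(value) for value in samples]

print(simple)
print(cleaned)
```

The stricter pattern matters if a value could contain the letters "from" somewhere other than the prefix, which a plain replace would also remove.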
7. Test-run the task
The scraping task is now complete. As mentioned before, we recommend testing the workflow step by step to make sure each step does what it should. For example, clicking Go to Web Page should load the web page in the built-in browser without a problem.
Launch the workflow and test-run it by clicking through all the steps from top to bottom, and from the inside out for nested steps (like pagination). Check that the web page responds as expected.
8. Schedule and run
Now that your task is fully tested and working, you can extract the data much faster by running the task in the Cloud, or you can schedule it to run on a recurring basis.
To start a cloud run, click Run Now under Run in the Cloud.
To schedule the task, click Schedule Local Runs or Schedule Cloud Runs.
Pick your desired frequency and designate a day and time for the run.
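To make "frequency, day and time" concrete, the sketch below computes when a weekly schedule would next fire. It is a generic Python illustration of recurring schedules, not Octoparse's scheduler; the function name and its parameters are our own assumptions.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday, hour, minute=0):
    """Return the next datetime after `now` that falls on `weekday`
    (Monday is 0) at the given hour and minute."""
    days_ahead = (weekday - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    if candidate <= now:
        # This week's slot has already passed, so use next week's.
        candidate += timedelta(days=7)
    return candidate


now = datetime(2024, 1, 3, 10, 0)  # a Wednesday, 10:00
print(next_weekly_run(now, weekday=0, hour=9))   # next Monday at 09:00
print(next_weekly_run(now, weekday=2, hour=11))  # later the same Wednesday
```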
9. Export your data
Go to the Dashboard, find your task, and click open task status to view the extracted data. Click Export Data at the bottom and choose the format in which you'd like to download the data.
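If you are unsure which format to pick, it may help to see the same rows in two common ones. The Python sketch below writes made-up rows as CSV and as JSON using only the standard library. CSV and JSON are used here as examples; check the export dialog for the formats actually offered.

```python
import csv
import io
import json

# Made-up rows standing in for extracted data.
rows = [
    {"title": "Wireless Earbuds A", "price": "$19.99", "location": "United States"},
    {"title": "Wireless Earbuds B", "price": "$24.50", "location": "China"},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows to an indented JSON array."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(rows)
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

CSV opens directly in spreadsheet tools, while JSON is usually the easier choice when another program will consume the data.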
Congrats! You've done a great job making it this far, and you're well on your way to becoming the next web scraping expert. We hope this is not the end of your learning but the beginning of your web scraping journey.