Lesson 7: Wrap-up! Build your first scraping task
This is the last lesson of the intro series. We hope you've had fun learning something new and useful. To put all the pieces together, let's recap with a step-by-step tutorial on building a scraping task from scratch. We'll walk you through the entire process, from entering the URL to downloading the extracted data. Let's dive right in.
For this example, we'll scrape product information and prices from eBay.com.
1. Start a new task
- Open the Octoparse app and enter the target URL (https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1312.R1.TR11.TRC2.A0.H0.Xwireless.TRS1&_nkw=wireless+earbuds&_sacat=0) into the search bar. Click "Start" to create a new task.
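If you're curious what the target URL contains, here is a minimal Python sketch (standard library only) that splits its query string into parameters. It is not part of the Octoparse workflow. The helper name `query_params` is ours, and reading `_nkw` as the search keyword is an assumption based on its value in this URL.

```python
from urllib.parse import parse_qs, urlsplit

# The target URL from step 1, split across lines for readability.
TARGET_URL = (
    "https://www.ebay.com/sch/i.html?_from=R40"
    "&_trksid=p2380057.m570.l1312.R1.TR11.TRC2.A0.H0.Xwireless.TRS1"
    "&_nkw=wireless+earbuds&_sacat=0"
)


def query_params(url: str) -> dict[str, str]:
    """Return the query-string parameters of a URL as a flat dict (hypothetical helper)."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per key; each key occurs once here,
    # so we keep only the first value.
    return {key: values[0] for key, values in parsed.items()}


params = query_params(TARGET_URL)
# "+" in a query string decodes to a space.
print(params["_nkw"])
```

Changing the value of `_nkw` in the URL is one way to point the same task at a different search, assuming the site keeps this URL scheme.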
2. Let Octoparse do the Auto-detect
As soon as the webpage is loaded in the built-in browser, Octoparse will start detecting web page data right away. You can check the progress bar and wait patiently for it to finish.
3. Preview your data
When the auto-detect process is completed, go ahead and check your data in the preview section. You can rename the data fields or remove those that are not needed.
4. Save auto-detect settings
Now, go back to "Tips" and check the settings.
4.1 Do you want to scroll down to load more data? → "Yes, why not?", so check the box for scrolling down.
4.2 Do you want to scrape more pages? → "Yes", so check the box for pagination.
4.3 Is the correct Next Page button selected? → "Yes"; verify that it is highlighted.
4.4 Do you need to scrape the detail pages? → "Yes", so check the box for clicking through the links.
4.5 Are the correct links to the detail pages selected? → "Yes"; verify that the correct links are highlighted.
After you've completed all the options on "Tips", click "Save Setting" to have Octoparse auto-generate the corresponding workflow. It is important to ensure that each of the settings is correctly configured as these are the exact settings that Octoparse will use to generate the scraping task.
5. Select data from the detail page
You'll now arrive on the detail page. The auto-detect process may start once again, but you can cancel it and select the data manually instead. Auto-detect works best when you have a list of data to capture, while manual selection is often more efficient for individual data fields.
Click the data you'd like to capture, then select "Extract the text of the selected element" on "Tips". Repeat the same steps for every data field.
Check the data preview for the extracted data and rename the fields if needed.
6. Clean the extracted data
Looking at the extracted data, there's something we'd like to change. We'd like to keep the feedback data numerical so it's easier to work with in an Excel sheet. The idea is to replace the words "Positive feedback" with nothing, leaving only the percentage value. Let's clean the data.
Click the Show more icon and select "clean data".
Click "Add step", then "Replace".
Replace the words "Positive feedback" with nothing. Then, click "Evaluate" and watch the original text being cleaned into "100%". Once done, click "Confirm" and "OK".
The preview data will auto-refresh to reflect the cleaned data.
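For readers who want to see the same idea outside the app, here is a minimal Python sketch of the "Replace" step. It only illustrates the text transformation and is not how Octoparse implements it. The function names and the sample strings are hypothetical.

```python
def clean_feedback(raw: str) -> str:
    """Replace the words "Positive feedback" with nothing and trim leftover spaces."""
    return raw.replace("Positive feedback", "").strip()


def feedback_to_number(raw: str) -> float:
    """Go one step further and turn the cleaned percentage text into a number."""
    return float(clean_feedback(raw).rstrip("%"))


# Hypothetical sample value, shaped like the feedback field in this lesson.
print(clean_feedback("100% Positive feedback"))
print(feedback_to_number("98.7% Positive feedback"))
```

Keeping the percent sign, as the in-app step does, is fine for reading. Converting to a number is only needed if you plan to sort or average the values later.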
7. Test-run the task
The scraping task is now complete. As mentioned before, it's always a good idea to test the workflow step by step, making sure that each step does what it needs to do. For example, if you click on "Go to Web Page", it should load the web page in the built-in browser without a problem.
Launch the workflow and click through all the steps from top to bottom, and from the inside out for nested steps (like pagination). Check whether the web page responds as expected. The detailed testing methodology is described here; feel free to check it out.
Once you have every step tested, it's a perfect time to perform a test run. Click "Run" and select "Run task on your device".
Watch your data get extracted live!
8. Schedule and run
Now that your task is fully tested and working, you can extract the data much faster by running the task in the Cloud, or you can schedule it to run on a recurring basis.
To start a cloud run, click "Run" and select "Run task in the Cloud".
To schedule the task, click "Run" and select "Schedule task (Cloud)".
Pick your desired frequency and designate the day and time for the run.
9. Export your data
Go to the Dashboard, find your task, and open the task status to view the extracted data. Click "Export Data" at the bottom and choose the format in which you'd like to download the data.
Tips! Check this step-by-step tutorial for how to download the extracted data.
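Once the data is downloaded, you may want to work with it in code. The sketch below assumes you chose CSV as the export format and reads the rows with Python's standard `csv` module. The column names (`title`, `price`, `feedback`) and the sample rows are made up for illustration; your export will use whatever field names you set in the data preview.

```python
import csv
import io


def read_export(fileobj) -> list[dict[str, str]]:
    """Read an exported CSV (any open text file object) into a list of row dicts."""
    return list(csv.DictReader(fileobj))


# Hypothetical sample standing in for a real export file.
SAMPLE = (
    "title,price,feedback\n"
    "Wireless Earbuds A,$19.99,100%\n"
    "Wireless Earbuds B,$24.50,98.7%\n"
)

rows = read_export(io.StringIO(SAMPLE))
# Strip the currency symbol so prices can be compared as numbers.
prices = [float(row["price"].lstrip("$")) for row in rows]
print(len(rows), min(prices))
```

To read a real file, pass an open file object instead of the sample, for example one opened with `open("export.csv", newline="", encoding="utf-8")`, where `export.csv` is a placeholder file name.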
Congrats! You've done a good job making it this far and working your way toward becoming the next web scraping expert. We hope this is not the end of your learning but the beginning of your web scraping journey.
If you have any questions, whether they are task-related, web scraping related, or service related, let us help. The Octoparse team is proud of being part of your web scraping experience.
Article in Spanish: Lección 7-¡Conclusión! Crea tu primera tarea de scraping
You can also read web scraping articles on the official website.