Isn't it exciting that you are about to finish your first scraping task? There is just one more thing you should do before officially running your task: test your workflow step by step to make sure everything works as expected. With a test run, you can check whether you need to adjust your task settings to get the data correctly.
To demonstrate the process, we'll keep using the test site (http://test-sites.octoparse.com/?product_cat=e-commerce-category-1) as an example.
Test-run workflow steps
The steps of the workflow should always be read from top to bottom, and from inside to outside for nested steps.
So for our example, we should test the steps in this order:
- "Go to Web Page" → test if the web page loads properly
- "Pagination" → test if the Next Page button is located correctly
- "Click to Paginate" → test if the web page paginates properly
- "Loop Item" → test if the list of items is complete and correct
- "Extract Data" → test if the data is selected and extracted correctly
It's worth mentioning that not all tasks are created the same. You may have a completely different task to test, but the testing methodology can generally be extended to tasks of all kinds. Let's get started!
1. Click on "Go to Web Page"
Once you click on the step, it should load the web page in the built-in browser. If the web page loads well, there isn't much you need to adjust; however, there are a few things you should always watch out for.
1.1 If the web page loads with infinite scrolling → select "Scroll down the page after it is loaded" and complete the proper settings.
1.2 If the web page is taking longer than usual to load → you may want to increase the page timeout.
2. Click the "Pagination" box
In order for pagination to work consistently, there are two things we need to check for sure.
- Whether the Next Page button/arrow is being located correctly.
- Whether the paginating process works well on all pages, i.e. it needs to paginate correctly going from page-1 to page-2, page-2 to page-3, page-3 to page-4, and so on.
After you click on the pagination box, go to the highlighted element on the web page and confirm that it is the correct Next Page button. If the wrong element is highlighted, you may need to fix it manually by editing the corresponding XPath.
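To get a feel for how an XPath can be too broad or just right, here is a minimal Python sketch. It is not part of Octoparse. The markup and class names are made up for illustration and may not match the test site, and Python's built-in ElementTree supports only a subset of XPath, so treat it as a way to reason about the expression rather than a replacement for testing in the built-in browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup (an assumption, not copied from the test site).
PAGINATION_HTML = """
<ul class="pagination">
  <li><a class="page-numbers" href="/page/1">1</a></li>
  <li><a class="page-numbers" href="/page/2">2</a></li>
  <li><a class="next page-numbers" href="/page/2">Next</a></li>
</ul>
"""

def find_matches(markup, xpath):
    """Return (text, href) for every element the XPath matches."""
    root = ET.fromstring(markup)
    return [(el.text, el.get("href")) for el in root.findall(xpath)]

# Too broad: matches every link in the pagination bar.
broad = find_matches(PAGINATION_HTML, ".//a")

# Narrowed: matches only the link whose class attribute is exactly
# "next page-numbers", i.e. the Next Page button in this sample.
narrow = find_matches(PAGINATION_HTML, ".//a[@class='next page-numbers']")

print(len(broad), "links matched by the broad XPath")
print("Narrow XPath matched:", narrow)
```

If the expression matches more than one element, or an element that is not the Next Page button, that is usually the sign that the XPath needs tightening.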
3. Click on "Click to Paginate"
When you click on "Click to Paginate", you are instructing Octoparse to click on the Next Page button defined in step 2. If things are working right, it should go from page-1 to page-2. Repeat this two-step process (click the "Pagination" box, then click "Click to Paginate") as many times as needed to make sure pagination is working correctly on all sequential pages. If the web page is not paginating properly on any of the pages, fix the element XPath in step 2 and test again.
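The check you are doing by hand here can be pictured as a small Python sketch. It is only an illustration of the logic, not something Octoparse runs: the `NEXT_PAGE` mapping is a hypothetical stand-in for "which page did the Next Page button take me to", and the function names are made up for this example.

```python
# Hypothetical record of where the Next Page button leads from each page.
# None means the page has no Next Page button (the last page).
NEXT_PAGE = {1: 2, 2: 3, 3: 4, 4: None}

def walk_pages(next_page, start=1, max_clicks=50):
    """Follow the Next Page button from `start`; return pages visited in order."""
    visited = [start]
    current = start
    for _ in range(max_clicks):
        target = next_page.get(current)
        if target is None:
            break  # no Next Page button: pagination has ended
        if target in visited:
            raise ValueError(f"pagination loops back to page {target}")
        visited.append(target)
        current = target
    return visited

def pagination_is_sequential(visited):
    """True when every click advanced by exactly one page."""
    return all(b == a + 1 for a, b in zip(visited, visited[1:]))

pages = walk_pages(NEXT_PAGE)
print(pages, pagination_is_sequential(pages))
```

A skipped page or a click that lands back on an earlier page is the kind of problem that usually traces back to the Next Page XPath.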
Check out these pagination troubleshooting ideas:
4. Click on the "Loop Item" box
Testing the "Loop Item" is essentially confirming if all the desired items have been selected correctly.
Once clicked, go to the web page in the built-in browser and make sure all the items you need are being highlighted.
Alternatively, click the list icon to open the list of items and confirm that the list is complete.
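One common reason a list comes back incomplete is an XPath that is too specific. The Python sketch below illustrates the idea with made-up markup (the class names and item titles are assumptions, not the test site's real structure) and Python's built-in ElementTree, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical product grid; one item carries an extra class.
PAGE = """
<div class="products">
  <div class="product"><h2>Item A</h2></div>
  <div class="product"><h2>Item B</h2></div>
  <div class="product featured"><h2>Item C</h2></div>
</div>
"""

def loop_items(markup, xpath):
    """Return the title of every item the XPath selects."""
    root = ET.fromstring(markup)
    return [el.findtext("h2") for el in root.findall(xpath)]

def missing_items(expected, found):
    """Return the expected titles that the XPath failed to select."""
    return [title for title in expected if title not in found]

expected = ["Item A", "Item B", "Item C"]

# Exact class match: skips the item whose class is "product featured".
strict = loop_items(PAGE, "./div[@class='product']")
# Looser match: selects every direct child div of the grid.
relaxed = loop_items(PAGE, "./div")

print("strict missed:", missing_items(expected, strict))
print("relaxed missed:", missing_items(expected, relaxed))
```

Comparing the number of highlighted items with the number you can see on the page is the manual equivalent of `missing_items`.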
If your list is not complete when tested, you can check out the troubleshooting ideas below:
5. Click on "Extract Data"
Here is the final step - check if the data is being extracted as needed.
Once clicked, check the data in the preview section and confirm if this is the data that you need.
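When the preview contains many rows, blank fields are easy to overlook. As a rough sketch of what to look for, the Python snippet below scans rows for empty values. The field names and sample rows are invented for illustration and do not come from the test site or from Octoparse's export format.

```python
# Hypothetical preview rows (field names and values are assumptions).
ROWS = [
    {"title": "Item A", "price": "$10.00"},
    {"title": "Item B", "price": ""},
    {"title": "", "price": "$7.50"},
]

def blank_fields(rows):
    """Return (row_index, field_name) for every empty or whitespace-only value."""
    return [
        (index, name)
        for index, row in enumerate(rows)
        for name, value in row.items()
        if not (value or "").strip()
    ]

for index, name in blank_fields(ROWS):
    print(f"row {index}: field '{name}' is blank")
```

A field that is blank in every row usually points to a selection problem, while a field that is blank only in some rows may simply be missing on those items.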
If you see any blank fields or if you find misplaced data, you can check out these troubleshooting ideas:
Perform a test run
After you’ve gone through each step in the task workflow, it is the perfect time to perform a test run on your local device. Click "Run" and select "Run task on your device".
Now watch your data get extracted live!
If you are not getting the data you need, check out the FAQs below.
Now that you know your task is working right, it's time to get data for real! >> Lesson 5: Get data