How to deal with missing data in a Cloud run?

Causes of and solutions for missing data in Cloud runs

Updated over 6 months ago

A Cloud run is a convenient and fast way to scrape data with Octoparse. However, like any technology, Cloud extraction can sometimes run into issues. One common issue is missing data: in local runs the task collects all the data as expected, but in Cloud runs some URLs are skipped or some items are not scraped. This article explains what causes missing data in a Cloud run and how to deal with it.
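Before troubleshooting, it can help to confirm exactly which URLs the Cloud run skipped. The following is a minimal sketch, not an Octoparse feature: it assumes you have exported both the local and the Cloud results and that each row carries the source URL in a column named `Page_URL` (a hypothetical name; use whatever your task's URL field is called).

```python
def missing_urls(local_rows, cloud_rows, url_field="Page_URL"):
    """Return URLs found in the local export but absent from the Cloud export.

    `local_rows` and `cloud_rows` are iterables of dicts, one per exported row.
    `url_field` is a hypothetical column name; adjust it to match your export.
    """
    cloud_seen = {row[url_field] for row in cloud_rows}
    missing = []
    for row in local_rows:
        url = row[url_field]
        # Keep the first occurrence only, preserving the local export's order.
        if url not in cloud_seen and url not in missing:
            missing.append(url)
    return missing
```

The rows can come from `csv.DictReader` applied to each exported CSV file. The resulting list shows which pages to inspect first.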

The task is splittable and runs too fast, so some data is skipped

Tasks that use the "Fixed List," "List of URLs," or "Text List" loop mode are splittable in Octoparse. The main task is split into sub-tasks that multiple cloud nodes execute simultaneously. As a result, each step runs very quickly, and some pages may not finish loading before the task moves on to the next step.
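To picture what splitting means, here is a minimal sketch that divides a URL list among several nodes. The round-robin strategy and the function name `split_into_subtasks` are assumptions made for illustration; this article does not describe how Octoparse actually assigns URLs to cloud nodes.

```python
def split_into_subtasks(urls, node_count):
    """Distribute URLs across `node_count` sub-tasks in round-robin order.

    Illustrative only: the real splitting strategy is not documented here.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    subtasks = [[] for _ in range(node_count)]
    for index, url in enumerate(urls):
        subtasks[index % node_count].append(url)
    # Drop empty sub-tasks when there are fewer URLs than nodes.
    return [subtask for subtask in subtasks if subtask]
```

Because each sub-task handles only part of the list and all of them run at once, the pages are requested much faster than in a single local run, which is why load-time problems tend to surface only in the Cloud.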

To ensure the web page is loaded completely in the Cloud, you can try the following solutions.

1. Increase the timeout for the Go to Web Page step

2. Set a longer Wait before action for all steps

  • Click Options

  • Tick Wait before action

  • Set an anchor element to find before the action, so that extraction starts only after that specific element has been found. You can use the XPath of any element from the desired fields.

  • Tick Wait until a designated element appears

  • Enter a matching XPath for the element and change "Wait before action" to "30s".

Note: To learn more about Wait before action, please see the Wait before action tutorial.
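The idea behind "Wait until a designated element appears" combined with a 30s limit can be sketched as a polling loop with a timeout. This is a generic illustration, not Octoparse's implementation: `element_present` is a hypothetical callable standing in for "does the XPath match anything yet?", and the polling interval is an assumed value.

```python
import time


def wait_for_element(element_present, timeout_s=30.0, poll_interval_s=0.5):
    """Poll until `element_present()` returns True or the timeout elapses.

    Returns True if the element appeared in time, otherwise False.
    `element_present` is a hypothetical check supplied by the caller.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if element_present():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

The benefit over a fixed wait is that the step proceeds as soon as the element is found, so a generous limit such as 30 seconds costs extra time only on slow pages.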


The website has enabled anti-scraping technology to block scrapers

Many websites use anti-scraping technology to stop scrapers from accessing their pages. A captcha or an IP restriction can interrupt the scraping process and cause data to be missing.

To check whether the website is blocking the task, open the Event Log and look at the screenshot captured by Octoparse. It will usually show a captcha or a message such as "Access Denied", "403 Forbidden", or "The website is not supported in your country".

Note: For more details about the Event Log, refer to this tutorial: What is cloud live log & history?
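If you have the page text available (for example, captured as a field in the task), the same check can be sketched as a simple keyword scan. This is an illustrative helper, not part of Octoparse; the marker list is taken from the messages mentioned above and is far from exhaustive.

```python
# Phrases mentioned above that typically indicate a blocked page (lowercase).
BLOCK_MARKERS = (
    "captcha",
    "access denied",
    "403 forbidden",
    "not supported in your country",
)


def find_block_markers(page_text):
    """Return the known block phrases that occur in `page_text`, ignoring case."""
    lowered = page_text.lower()
    return [marker for marker in BLOCK_MARKERS if marker in lowered]
```

An empty result does not prove the page loaded correctly; it only means none of the listed phrases were found.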

To get around such blocks, you can:

1. Set up IP proxies in Task Settings

  • Go to Task Settings > Anti-blocking

  • Tick Access website via proxies and select a Country/Region you'd like for the IP address (Default means to use IPs from random countries)

2. Set up Bypass Cloudflare with credits if a Cloudflare captcha appears

If you cannot figure out why your Cloud runs miss data, always feel free to contact support with your task file for help!