Data can go missing in Cloud Extraction when:
1. The task is splittable and runs so fast that some data is skipped
Tasks that use the "Fixed List," "List of URLs," or "Text List" loop mode are splittable in Octoparse. The main task is split into sub-tasks that run on multiple cloud servers simultaneously. As a result, every step of the task runs very quickly, and some pages may not finish loading before the task moves on to the next step.
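As a rough illustration of why list-based loops can be split, the sketch below divides a list of URLs into chunks that could each be handed to a separate worker. It is not Octoparse's actual scheduler: the `split_into_subtasks` helper, the example URLs, and the worker count are all hypothetical and chosen only for this example.

```python
def split_into_subtasks(items, workers):
    """Split a list of loop items into at most `workers` contiguous chunks."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Every chunk gets `size` items; the first `extra` chunks get one more.
    size, extra = divmod(len(items), workers)
    chunks, start = [], 0
    for i in range(workers):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are more workers than items
            chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical URL list standing in for a "List of URLs" loop.
urls = [f"https://example.com/page/{n}" for n in range(1, 8)]
for number, chunk in enumerate(split_into_subtasks(urls, 3), start=1):
    print(f"sub-task {number}: {len(chunk)} URLs")
```

Because each chunk is processed independently and in parallel, no sub-task waits for another, which is why page loads can fall behind the steps that follow them.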
To ensure the web page is loaded completely in the Cloud:
1. Increase the timeout for the Go to Web Page step
2. Set up Wait before action for all steps
- Click Options
- Tick Wait before action
3. Set an anchor element to wait for before the action, so that extraction starts only after that element has been found. You can use the XPath of any of the fields you want to extract.
- Tick Wait until a designated element appears
- Enter the element's Matching XPath and set "Wait before action" to "30s".
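The idea behind "Wait until a designated element appears" can be sketched as a polling loop: keep checking for the anchor element until it shows up or a time limit passes. This is a minimal sketch, not Octoparse's implementation. The names `wait_for_element` and `find_element`, and the polling interval, are hypothetical and used only for illustration; the 30-second default mirrors the "30s" setting above.

```python
import time


def wait_for_element(find_element, timeout=30.0, interval=0.5):
    """Poll find_element() until it returns a non-None value or `timeout` seconds pass.

    Returns the found element, or None if the time limit is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        element = find_element()
        if element is not None:
            return element
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a page whose anchor element only "appears" on the third check.
attempts = {"count": 0}


def fake_find():
    attempts["count"] += 1
    return "<div id='price'>" if attempts["count"] >= 3 else None


print(wait_for_element(fake_find, timeout=2.0, interval=0.01))
```

The wait ends as soon as the element is found, so a generous limit only costs time on pages where the element never appears.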
2. Your target website is multi-regional
A multi-regional website may serve different page structures to visitors from different countries. When a task is set to run in the Cloud, it is executed from our IPs based in America. For tasks targeting websites outside of America, some data may therefore be skipped because it cannot be found on the version of the website opened in the Cloud.
To identify if a website is multi-regional:
- Test the task with Local Extraction. If no data is missing locally but data is missing in Cloud Extraction, the website is most likely multi-regional. In this case, because the targeted content can be found only when the website is opened with your own IP, we suggest you use Local Extraction to get the data instead.
- Extract the outer HTML of the whole page. The extracted HTML may contain a message, such as "Access denied," that shows what caused the data to go missing.
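The extracted HTML can be checked by eye, or with a small script like the sketch below. The marker phrases and the `find_block_markers` helper are assumptions for illustration only; the actual wording of such messages depends on the target website.

```python
# Hypothetical phrases that may indicate a blocked or region-restricted page.
BLOCK_MARKERS = ("access denied", "not available in your region", "forbidden")


def find_block_markers(html, markers=BLOCK_MARKERS):
    """Return the marker phrases that occur in the HTML, ignoring case."""
    lowered = html.lower()
    return [marker for marker in markers if marker in lowered]


# Made-up sample of outer HTML as it might be extracted in the Cloud.
sample = "<html><body><h1>Access Denied</h1><p>Reference #18.2</p></body></html>"
print(find_block_markers(sample))  # prints ['access denied']
```

An empty result does not prove the page loaded normally; it only means none of the listed phrases were present.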