Missing data in Cloud Extraction can occur when:
1. The task is splittable and runs too fast in the cloud, so some elements are skipped.
Tasks with the "Fixed List", "List of URLs", or "Text List" loop mode are splittable. The main task is split into sub-tasks that run on multiple cloud servers simultaneously. In this case, every step of the task executes very quickly, so some pages may not be fully loaded before the task moves on to the next step.
2. The target website is multi-regional.
A multi-regional website may serve different page structures to visitors from different countries. When a task is set to run in the cloud, it is executed with our US-based IP addresses. For tasks targeting websites outside the US, some data may be skipped because it cannot be found on the version of the page opened in the cloud.
3. Both situations 1 and 2 apply to the task.
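To make the splitting behavior above concrete, here is a minimal sketch of how a splittable task's URL list might be partitioned into sub-tasks for parallel cloud servers. The `split_for_workers` helper is a hypothetical illustration, not Octoparse's actual implementation:

```python
# Minimal sketch: partition a "splittable" task's URL list into
# sub-tasks for parallel workers (round-robin assignment).
# split_for_workers is a hypothetical helper, not an Octoparse API.

def split_for_workers(urls, n_workers):
    """Distribute URLs round-robin across n_workers sub-tasks."""
    chunks = [[] for _ in range(n_workers)]
    for i, url in enumerate(urls):
        chunks[i % n_workers].append(url)
    return chunks

urls = [f"https://example.com/page/{i}" for i in range(1, 8)]
subtasks = split_for_workers(urls, 3)
# Each sub-task then runs on its own cloud server; a page that loads
# slowly on one server can be skipped, which is how data goes missing.
```

Because each sub-task races through its own chunk independently, a slow-loading page on any one server is enough to produce gaps in the combined result.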
Here are common solutions for missing data in cloud extraction.
- To ensure the web page loads completely in the cloud, you can:
1. Increase the timeout for the "Go To Web Page" step
Advanced Options > Timeout
2. Set up "Wait before execution"
Every step created in the workflow, except "Go To Web Page", can be given a waiting time.
Advanced Options > Wait before execution
3. Set up an anchor element to find before execution
This guarantees that extraction starts only after a certain element has been found. You can use the XPath of any of the desired fields.
First, click the "Extract Data" step. Then fill in the element's XPath and change "Wait before extraction" to "Random".
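Outside Octoparse, the same "find an anchor element before execution" idea can be expressed as a small polling loop. The following is a hedged sketch: `wait_for` and its timings are our own illustration, not an Octoparse API:

```python
import time

# Generic polling wait: retry a check until it succeeds or a timeout
# expires. This mirrors the idea of waiting for an anchor element to
# appear before extraction begins.
# wait_for is an illustrative helper, not an Octoparse API.

def wait_for(condition, timeout=10.0, interval=0.5):
    """Return True once condition() is truthy, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Example: pretend the anchor element appears after a short delay.
appeared_at = time.monotonic() + 0.2
found = wait_for(lambda: time.monotonic() >= appeared_at,
                 timeout=2.0, interval=0.05)
```

The key design point is the same as in Octoparse: the workflow blocks on the element's presence rather than on a fixed sleep, so fast servers do not race ahead of a slow-loading page.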
Tips: How to get the XPath of a certain element on the page.
Octoparse locates the elements you want through your clicks on the page. Once a field has been generated by your click, you can get its XPath. For example, suppose you want the XPath of "Field 3" in the following case.
Select the data field that needs to be modified, then select "Customize data field".
Select "Customize XPath".
This is the XPath of Field 3. You can now copy and paste it wherever it is needed.
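As a rough illustration of what a copied XPath does, here is a minimal sketch using Python's standard library. The sample markup and XPath are made up for this example; Octoparse evaluates XPath inside the browser, not with this library:

```python
import xml.etree.ElementTree as ET

# Made-up page fragment; real pages are located in the browser.
html = """<div>
  <span class="field">Field 1</span>
  <span class="field">Field 2</span>
  <span class="field" id="f3">Field 3</span>
</div>"""

root = ET.fromstring(html)
# ElementTree supports a limited XPath subset; this predicate selects
# the element whose id attribute equals "f3".
field3 = root.find(".//span[@id='f3']")
print(field3.text)  # Field 3
```

The predicate in square brackets is what makes the path unambiguous: without it, `.//span` would match all three fields.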
- To identify whether the website is multi-regional, you can:
- Test the task with local extraction. If no data is missing locally, unlike in cloud extraction, the website is most likely multi-regional. In that case, since the targeted content can only be found when the website is opened with your own IP, we suggest using Local Extraction to get the data instead.
- Extract the outer HTML of the whole page. By checking the extracted HTML, you may find the cause of the missing data from prompts in the source code, such as "Access denied".
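Checking the extracted outer HTML for such prompts can be automated with a simple text scan. The marker list below is an assumption for illustration, not an exhaustive catalogue of blocking messages:

```python
# Scan captured page HTML for phrases that typically indicate the
# server blocked or redirected the cloud IP. The marker list is
# illustrative only; adjust it for the site you are scraping.

BLOCK_MARKERS = (
    "access denied",
    "forbidden",
    "captcha",
    "not available in your region",
)

def find_block_reason(outer_html):
    """Return the first blocking phrase found in the HTML, or None."""
    lowered = outer_html.lower()
    for marker in BLOCK_MARKERS:
        if marker in lowered:
            return marker
    return None

sample = "<html><body><h1>Access Denied</h1></body></html>"
reason = find_block_reason(sample)
print(reason)  # access denied
```

If the scan finds a marker, the missing data is a region or access issue rather than a page-loading issue, and Local Extraction is the appropriate workaround.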
Article in Spanish: ¿Cómo lidiar con los datos que faltan en cloud extraction?
You can also read web scraping articles on the official website.