Sometimes a Cloud run returns no data for a task that runs perfectly in local extraction. This article offers some tips for troubleshooting the problem.
Below are some of the main reasons why no data is returned:
1. The target website fails to load completely, or the data to be extracted is not loaded
Page loading time depends on network conditions and the website's response time. When you test the website on a local computer, the page may load faster than it does in the Cloud.
If no data is extracted, try increasing the timeout for the "Go to Web Page" action. You can find the timeout option at the bottom.
If a longer timeout does not help, try changing the browser UA (user agent) in the task settings to Chrome 91.0 for Linux or Safari 15.1, which can help the web page load.
- Click Task Settings in the upper right corner of the screen
- Select Chrome 91.0 for Linux or Safari 15.1
- You can also try switching the IP pool to another option, such as JP or UK 2
2. Cloud IPs are restricted from accessing the website due to heavy scraping frequency
Many websites apply anti-scraping techniques. They may limit how many times an IP can access the site within a certain period and block any IP that exceeds the limit.
Some websites even block all IPs from a given location; for example, a Japanese website may not open from Canada.
An IP that is blacklisted for scraping too frequently can usually be resolved by adding wait time to slow down the extraction. Restrictions by IP location, however, remain an open issue, as we currently only have IPs from the US, Japan, Germany, and the UK.
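To see why adding wait time helps, here is a minimal Python sketch of how a website might count requests per IP. It is only an illustration: the limit of 5 requests per 10 seconds and the function name `count_blocked` are assumptions made for this example, not the rules of any real website and not part of Octoparse.

```python
from collections import deque


def count_blocked(request_times, max_requests, window_seconds):
    """Return how many requests a sliding-window limit would reject.

    request_times: ascending timestamps (in seconds) of requests from one IP.
    """
    recent = deque()  # timestamps of accepted requests still inside the window
    blocked = 0
    for t in request_times:
        # Forget accepted requests that have aged out of the window.
        while recent and t - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) < max_requests:
            recent.append(t)
        else:
            blocked += 1
    return blocked


# Hypothetical limit: at most 5 requests per 10 seconds from one IP.
fast = [i * 1.0 for i in range(20)]  # one request per second
slow = [i * 3.0 for i in range(20)]  # one request every 3 seconds

print("blocked without wait time:", count_blocked(fast, 5, 10))
print("blocked with wait time:", count_blocked(slow, 5, 10))
```

With the slower schedule, no request exceeds the assumed limit, which is the effect that adding wait time between steps is meant to achieve.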
3. A CAPTCHA needs to be solved before accessing the web page
CAPTCHA is another method websites frequently use to prevent scraping. The website might recognize that the pages are being accessed from a Cloud server IP rather than a residential IP. In many cases, the CAPTCHA is shown as soon as the first page of the website opens, which breaks the whole scraping process. CAPTCHAs are hard to solve in the Cloud. If this happens, please contact us and we will try to find a workaround for you.
4. Logging into the target website fails
If you set up login steps or save cookies in a task to scrape a website, local extraction may work perfectly while Cloud extraction fails, because different IPs rotate during execution.
Many websites ask for verification before you log in. Verification such as CAPTCHA cannot be solved in Cloud extraction.
Saved cookies are only valid for a limited time and stop working once they expire. To resolve this, go through the login steps again to obtain and save updated cookies. (Check out how to save cookies)
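If you can see your cookies outside Octoparse (for example, in your browser's developer tools), a short Python sketch like the one below can tell you which ones have already expired. The list-of-dictionaries format with `name` and `expires` fields is an assumption made for this example; it is not the format Octoparse uses to store cookies.

```python
import time


def expired_cookies(cookies, now=None):
    """Return the names of cookies whose expiry time has passed.

    cookies: list of dicts with a "name" and an optional "expires" field
             (Unix timestamp in seconds). Cookies without "expires" are
             treated as session cookies and skipped.
    """
    if now is None:
        now = time.time()
    return [
        c["name"]
        for c in cookies
        if c.get("expires") is not None and c["expires"] <= now
    ]


# Hypothetical cookies, for illustration only.
cookies = [
    {"name": "session_id", "expires": 1_600_000_000},  # a date in 2020
    {"name": "prefs", "expires": 4_000_000_000},       # far in the future
    {"name": "tmp"},                                   # session cookie
]
print(expired_cookies(cookies))
```

If a cookie that the login depends on appears in the result, saving the cookies again is likely to be needed.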
5. The website HTML design is different when opened in the cloud
Octoparse extracts web data by picking up content from the page's source code (HTML). It needs to recognize the HTML code to know what data to extract. In some cases the website's design is different in the Cloud, which causes the extraction to fail.
For example, when you open Sephora.com with an IP from China, the page is redirected to Sephora.cn. The sites for different locations have completely different designs. So when using Octoparse Cloud extraction, please make sure you are extracting a site that will not be redirected based on IP location.
Even if the website is not redirected, the source code can differ slightly in a different browser or under different network conditions.
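One rough way to spot such differences is to compare the tags, classes, and IDs found in a locally saved copy of the page with those in a copy obtained from the Cloud. The Python sketch below uses only the standard library; the function names and the sample HTML are hypothetical, and it is a simple heuristic rather than a full HTML comparison.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect "tag.class" and "tag#id" signatures from an HTML document."""

    def __init__(self):
        super().__init__()
        self.signatures = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for cls in (attr_map.get("class") or "").split():
            self.signatures.add(f"{tag}.{cls}")
        if attr_map.get("id"):
            self.signatures.add(f"{tag}#{attr_map['id']}")


def signatures(html):
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.signatures


def missing_in_cloud(local_html, cloud_html):
    """Return signatures present in the local page but absent in the Cloud page."""
    return sorted(signatures(local_html) - signatures(cloud_html))


# Hypothetical snippets standing in for the two saved pages.
local_html = '<div id="list"><span class="price big">9</span></div>'
cloud_html = '<div id="list"><span class="amount">9</span></div>'
print(missing_in_cloud(local_html, cloud_html))
```

Any signature in the result that your XPath relies on is a likely reason the Cloud run finds nothing.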
How do I identify the cause of Cloud extraction failure?
Unlike local extraction, the Octoparse Cloud extraction process cannot be watched as it runs. There is a simple way to see what happens in the Cloud: extract the outer HTML code of the whole web page.
You can follow these steps to extract the HTML code:
Step 1. After opening the web page, click anywhere to trigger the "Tips" panel; for instance, we click "Octoparse" on this page
Step 2. Click on the HTML Tag at the bottom of the Tips panel and then extract the outer HTML code
Step 3. Run the task in the Cloud and get the HTML code
Step 4. Copy the extracted data into a text file and save it as an HTML file
Step 5. Open the HTML file with Chrome or Firefox to see how the website page loads in the Cloud
Step 6. Check the web page to find the reason for the extraction failure
For example, if the page shows "Access Denied", the Cloud IP is blocked. If the page looks the same as it does locally, inspect the HTML code carefully to get the correct XPath for extraction.
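If you have many saved HTML files to check, a first pass can be automated. The Python sketch below looks for a few text markers in the HTML and suggests a likely cause. Only the "Access Denied" marker comes from this article; the CAPTCHA and login markers, the labels, and the function name `diagnose` are assumptions for illustration, and real pages may word things differently.

```python
# Each entry maps a suggested cause to lowercase text markers.
# Only "access denied" is taken from the article; the rest are assumptions.
MARKERS = [
    ("ip_blocked", ("access denied",)),
    ("captcha", ("captcha",)),
    ("login_required", ('type="password"',)),
]


def diagnose(html):
    """Return a rough guess at why a Cloud-extracted page yielded no data."""
    text = html.lower()
    for label, needles in MARKERS:
        if any(needle in text for needle in needles):
            return label
    # Nothing obvious found: the page probably loaded, so check the XPath.
    return "inspect_xpath"


# Hypothetical page contents, for illustration only.
print(diagnose("<h1>Access Denied</h1>"))
print(diagnose("<p>Product list</p>"))
```

To check a real file, read it first, for example with `open("cloud_page.html", encoding="utf-8").read()`, where the file name is whatever you chose in Step 4. A result of `inspect_xpath` only means none of the markers were found; opening the file in a browser as described above is still the more reliable check.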