Why does a task return no data in the Cloud but work well when run locally? (Version 8)
Sometimes a Cloud run returns no data for a task that runs perfectly on a local computer. In this article, we will show you some tips for troubleshooting this problem.
Below are some of the main reasons why no data is returned:
1) The target website fails to load completely or the data to be extracted are not loaded
How long a website takes to load depends on network conditions and on the website itself. When you test the website on a local computer, it may load faster than it does in the Cloud.
So if you find no data extracted, please try increasing the timeout for the "Go to Web Page" action.
2) Cloud IPs are blocked from accessing the website because of heavy scraping frequency
Many websites use anti-scraping techniques to avoid being scraped. They may limit how many times an IP can access the site within a certain period and block any IP that exceeds the limit.
Some websites may even block all the IPs from one location. For example, a Japanese website may not open in Canada.
An IP that is blacklisted for scraping too frequently can be resolved by adding wait time to slow down the extraction. The restriction by IP location, however, currently remains an issue, as we only have IPs in the US, Japan, Germany, and the UK.
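To illustrate the idea of adding wait time, here is a minimal Python sketch of pausing between consecutive requests. It is not Octoparse code: the function names, the default delay values, and the `fetch` callback are all assumptions made for this example, and the right delay depends on the target website.

```python
import random
import time


def polite_delays(n_requests, base_seconds=2.0, jitter_seconds=1.0, seed=None):
    """Return the pause (in seconds) to take before each request after the first.

    A small random jitter is added so the requests are not evenly spaced.
    """
    rng = random.Random(seed)
    return [
        base_seconds + rng.uniform(0, jitter_seconds)
        for _ in range(max(n_requests - 1, 0))
    ]


def run_throttled(urls, fetch, sleep=time.sleep, **delay_options):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    `fetch` is a hypothetical callback that downloads one page.
    `sleep` can be replaced in tests so nothing actually waits.
    """
    delays = polite_delays(len(urls), **delay_options)
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delays[index - 1])
        results.append(fetch(url))
    return results
```

With the default values, each pause is between 2 and 3 seconds. A larger `base_seconds` slows the extraction further and lowers the chance of exceeding a site's access limit.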
3) A CAPTCHA needs to be solved before accessing the web page
CAPTCHA is also a method websites frequently use to prevent scraping. The website might recognize that a Cloud server IP, rather than a residential IP, is accessing the pages. In many cases, the CAPTCHA is shown as soon as the first page of the website opens, which breaks the whole scraping process. It is hard to solve a CAPTCHA in the Cloud. If you run into this problem, please contact us and we will try to find a workaround for you.
4) Logging into the target website fails
If you set up login steps or save cookies in a task to scrape a website, local extraction may work perfectly while Cloud extraction fails, because different IPs are rotated during execution.
Many websites ask for verification before you log in. Such verification, like a CAPTCHA, cannot be solved in Cloud extraction.
A saved cookie is only valid for a limited time and no longer works once it expires. To solve this, you will need to go through the login steps again to obtain and save updated cookies. (Check out how to save cookie)
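If you have the saved cookies available outside Octoparse, a quick check like the following Python sketch can show which ones have expired. The cookie layout used here (a list of dictionaries with `name` and an `expires` Unix timestamp) is an assumption for illustration, not the format Octoparse stores.

```python
import time


def expired_cookies(cookies, now=None):
    """Return the names of cookies whose 'expires' Unix timestamp has passed.

    Cookies without an 'expires' value are treated as session cookies
    and are skipped, because they carry no expiry time to compare.
    """
    if now is None:
        now = time.time()
    return [
        cookie["name"]
        for cookie in cookies
        if cookie.get("expires") is not None and cookie["expires"] <= now
    ]
```

If the list that comes back includes the cookie that keeps you logged in, going through the login steps again and saving fresh cookies is the likely fix.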
5) The website HTML design is different when opened in the cloud
For Octoparse, extracting web data means picking up content from the page's source code (the HTML file). It needs to recognize the HTML code to know what data to extract. In some cases the website design is different in the Cloud, which causes the extraction to fail.
For example, when you open Sephora.com with an IP from China, the page is redirected to Sephora.cn. The designs of the sites for different locations are totally different. So when using Octoparse Cloud extraction, please make sure you are extracting a site that will not be redirected according to IP location.
Even if the website is not redirected, the source code can also change slightly in a different browser or under different network conditions.
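One way to spot a location-based redirect is to compare the host you requested with the host the page actually ended up on. The Python sketch below assumes you have some way to record the final URL of the Cloud run. The function name is hypothetical and the check only compares host names.

```python
from urllib.parse import urlsplit


def redirected_to_other_site(requested_url, final_url):
    """Return True when the final URL's host differs from the requested one.

    A leading 'www.' is ignored so that www.example.com and example.com
    count as the same site.
    """
    def host(url):
        name = (urlsplit(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name

    return host(requested_url) != host(final_url)
```

For the Sephora example, `redirected_to_other_site("https://www.sephora.com/", "https://www.sephora.cn/")` returns `True`, which signals that the task would be running against a different page design.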
How do I know what causes the Cloud extraction failure?
The Octoparse Cloud extraction process cannot be watched the way local extraction can. There is a simple way to test what happens in the Cloud: extract the outer HTML code of the whole web page.
You can follow the next steps to extract the HTML code:
Step 1. After opening the website page, click anywhere to trigger the "Tips"
Step 2. Click the HTML Tag at the bottom of the "Tips" and then extract the outer HTML code
Step 3. Run the task in the Cloud and get the HTML code
Step 4. Copy the extracted data into a text file and save it as an HTML file
Step 5. Open the HTML file with Chrome or Firefox to see what the web page looks like when loaded in the Cloud
Step 6. Check the web page to find out the reason for the extraction failure.
For example, if the page shows "Access Denied", it means the cloud IP is blocked. If the page looks the same as it does locally, you can inspect the HTML code carefully to get the right XPath for extraction.
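The check in Step 6 can also be done roughly in code. The Python sketch below scans the saved HTML for phrases that hint at the causes described in this article. The marker phrases and labels are assumptions chosen for illustration, so adjust them to what the target website actually displays. A match is only a hint, since ordinary pages can contain the same words.

```python
# Hypothetical marker phrases; adjust them to the target website.
MARKERS = {
    "blocked": ("access denied", "forbidden", "too many requests"),
    "captcha": ("captcha", "verify you are human"),
    "login": ('type="password"', "sign in", "log in"),
}


def diagnose_html(html):
    """Return a list of labels hinting at why the Cloud run got no data.

    Returns ["empty"] when the page has no content at all, and an empty
    list when none of the marker phrases are found.
    """
    text = html.lower()
    if not text.strip():
        return ["empty"]
    return [
        label
        for label, phrases in MARKERS.items()
        if any(phrase in text for phrase in phrases)
    ]
```

An empty result suggests the page loaded normally in the Cloud, in which case the XPath is the next thing to inspect.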
If you still cannot tell what is happening with your task, feel free to leave us a message.
Author: Kara
Editor: Yina