Sometimes, after a successful test run with local extraction, running the same task in the cloud returns no data.
Below are the main reasons why no data is returned:
1) The target website fails to load completely, or the data to be extracted is not loaded
Loading time depends on the network condition and the website itself. When you test the website on a local computer, the page may load faster than it does in the cloud.
So if you find no data extracted, try increasing the timeout for the "Go To Web Page" action.
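The idea behind raising the timeout can be sketched in Python. This is a hypothetical helper, not Octoparse's implementation: `load_page` retries a fetch with progressively longer timeouts, the way a higher "Go To Web Page" timeout gives a slower cloud connection more time to finish loading.

```python
def load_page(fetch, timeouts=(10, 30, 60)):
    """Try fetching with progressively longer timeouts (seconds).

    Mirrors the fix of raising the 'Go To Web Page' timeout:
    a page that times out locally at 10 s may still load at 30 s
    on a slower cloud connection.
    """
    for t in timeouts:
        try:
            return fetch(timeout=t)
        except TimeoutError:
            continue  # too short -- retry with a longer timeout
    raise TimeoutError("page never finished loading")

# Simulated fetch for illustration: only succeeds when
# given at least 30 seconds to load.
def slow_fetch(timeout):
    if timeout < 30:
        raise TimeoutError
    return "<html>page content</html>"

print(load_page(slow_fetch))  # the 30 s attempt succeeds
```

With the shortest timeout alone the fetch would always fail; allowing a longer attempt is what recovers the page.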
2) Cloud IPs are blocked from accessing the website due to heavy scraping frequency
Many websites apply anti-scraping techniques to avoid being scraped. They may limit how many times an IP can access the site within a certain period and block any IP that exceeds the limit.
Some websites may even block all IPs from one location; for example, a Japanese website may not open in Canada.
An IP blacklisted for scraping too frequently can be dealt with by adding wait time to slow down the extraction, but location-based restrictions remain an open issue, as all Octoparse cloud IPs are based in the United States.
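The "add wait time" fix amounts to pacing requests so the site's per-IP rate limit is never tripped. A minimal sketch, with a hypothetical `polite_scrape` helper and a caller-supplied fetch function:

```python
import time

def polite_scrape(urls, fetch, delay=5.0):
    """Fetch each URL with a pause in between.

    Spacing requests out keeps the request rate under the
    site's per-IP limit, which is what adding wait time to
    a task accomplishes.
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # wait before the next request
    return results
```

A 5-second delay is only a placeholder; the right value depends on how aggressive the target site's rate limiting is.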
3) Logging into the target website fails
If a task logs into a website or relies on saved cookies, local extraction may work perfectly while cloud extraction fails, because the cloud rotates through different IPs during execution.
Many websites ask for verification before you log in. Such verification, like a captcha, cannot be resolved in cloud extraction.
A saved cookie has a limited validity period and stops working once it expires. To solve this, go through the login steps again by adding the proper actions to obtain and save an updated cookie. (Check out how to save cookie)
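The expiry behavior can be illustrated with a small Python sketch. The helpers, file name, and validity window are all hypothetical; Octoparse manages cookies internally, and this only mirrors the idea that an expired cookie must be replaced by logging in again:

```python
import json
import time

def save_cookies(cookies, max_age=3600, path="cookies.json"):
    """Store cookies with a timestamp and a validity window (seconds)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"saved_at": time.time(), "max_age": max_age,
                   "cookies": cookies}, f)

def load_cookies(path="cookies.json"):
    """Return the saved cookies, or None if they have expired,
    meaning the login steps must be run again."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if time.time() - data["saved_at"] > data["max_age"]:
        return None  # expired -- log in again and re-save
    return data["cookies"]
```

When `load_cookies` returns None, the task would need to repeat the login actions to obtain a fresh cookie before scraping can continue.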
4) The website HTML design is different when opened in the cloud
For Octoparse, extracting web data means picking content out of the page's source code (HTML). It relies on recognizing the HTML structure to know what data to extract.
Sometimes the website's design is different when opened in the cloud, which causes the extraction to fail.
For example, when you open Sephora.com with an IP from China, the page is redirected to Sephora.cn, and the designs of the two sites are completely different. So when using Octoparse cloud extraction, please make sure the site you are extracting will not be redirected based on IP location.
Even if the website is not redirected, the source code can still differ slightly in a different browser or under different network conditions.
How do I know what causes the cloud extraction failure?
The Octoparse cloud extraction process is not visible the way local extraction is. There is a simple way to test what happens in the cloud: extract the outer HTML code of the whole page.
You can follow the next steps to extract the HTML code:
Step 1. After opening the website page, click anywhere to trigger the "Action Tips"
Step 2. Click the HTML Tag on the bottom of the "Action Tips"
Step 3. Run the task in the cloud and get the HTML code
Step 4. Copy the data extracted into a text file and save as HTML
Step 5. Open the HTML file with Chrome or Firefox to see what the page looks like when loaded in the cloud
Step 6. Check the web page to find out the reason for the extraction failure.
For example, if the page shows "Access Denied", it means the cloud IP is blocked.
If the page looks the same, you can inspect the HTML code carefully to get the right XPath for extraction.
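Steps 4 to 6 can be sketched in Python. The `diagnose` helper, file name, and marker list are assumptions for illustration: it saves the extracted outer HTML to a file you can open in a browser, and flags common signs of a blocked cloud IP:

```python
def diagnose(html, out_path="cloud_page.html"):
    """Save extracted outer HTML to a file and flag block indicators.

    Open out_path in Chrome or Firefox to compare the cloud view
    of the page with what you see locally.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
    # Common (case-insensitive) signs that the cloud IP was blocked.
    markers = ("Access Denied", "Forbidden", "captcha")
    return [m for m in markers if m.lower() in html.lower()]
```

An empty result suggests the page loaded normally in the cloud, so the next thing to check is whether the XPath still matches the cloud version of the HTML.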
You can also read web scraping articles on the official website.