More and more website owners equip their sites with anti-scraping techniques to block scrapers, which makes web scraping more difficult. In this article, we introduce some techniques for avoiding blocks in Octoparse.
1. Set up wait time to slow down the scraping
Many websites block scrapers by detecting how fast a single IP address sends requests. You can set a wait time for any step in the Workflow to control the scraping speed. There is also a "random" option that varies the wait time so the scraping looks more human-like.
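Octoparse configures wait times through its Workflow settings, so no code is needed there. For readers who want to see the idea outside of Octoparse, here is a minimal Python sketch of a randomized wait between requests. The function names (random_wait, crawl), the fetch callback, and the 2 to 6 second range are all assumptions made for illustration, not Octoparse settings.

```python
import random
import time


def random_wait(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, waiting a random time between calls.

    `fetch` is a hypothetical callback supplied by the caller; it stands in
    for whatever actually downloads the page.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No wait before the first request, a random wait before the rest.
            random_wait(min_seconds, max_seconds, sleep=sleep)
        results.append(fetch(url))
    return results
```

A fixed delay produces a perfectly regular request pattern, which is itself easy to detect. Drawing each delay from a range is roughly what the "random" option is meant to achieve.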
2. Set up IP rotation (local extraction only)
When a site detects a large number of requests coming from a single IP address, it can easily block that address. To avoid sending all of your requests through the same IP address, you can use proxy servers.
Octoparse local extraction allows users to set up proxies to avoid being blocked.
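In Octoparse the proxies are entered in the task settings. To illustrate what rotation means in practice, the following Python sketch cycles through a list of proxies in round-robin order using only the standard library. The ProxyRotator class is hypothetical, and the proxy addresses are placeholders from a reserved documentation range, so replace them with proxies you actually control.

```python
import itertools
import urllib.request

# Placeholder addresses (reserved documentation range), not real proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


class ProxyRotator:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL in the rotation."""
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy, opener); the opener routes HTTP and HTTPS through the proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

Calling build_opener() before each request (or each batch of requests) gives an opener whose open(url) call goes out through the next proxy in the list, so consecutive requests do not share one IP address.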
3. Switch user-agents and clear cookies
Every request made by a web browser includes a user-agent string that identifies the browser. Sending an abnormally large number of requests with the same user-agent can get you blocked. To avoid this, switch user-agents frequently instead of sticking to one.
With Octoparse, you can easily enable automatic UA rotation in your crawler to reduce the risk of being blocked.
Some websites also use cookies to recognize a returning visitor. Octoparse can clear the cookies automatically so that each visit looks like a first-time visit to the page.
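Both settings are toggles in Octoparse. As a rough illustration of what they do, the Python sketch below picks a random user-agent for each new session and starts that session with an empty cookie jar. The helper names (new_session, clear_cookies) are hypothetical, and the user-agent strings are only sample values that should be replaced with current ones.

```python
import http.cookiejar
import random
import urllib.request

# Sample user-agent strings for illustration; use up-to-date values in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def new_session(user_agents=USER_AGENTS):
    """Return (opener, jar, user_agent) with a random UA and no cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    user_agent = random.choice(user_agents)
    # Replace the default "Python-urllib/x.y" header with the chosen UA.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar, user_agent


def clear_cookies(jar):
    """Forget every stored cookie so the next request looks like a first visit."""
    jar.clear()
```

Creating a new session every few requests changes the user-agent and drops the cookies together. Calling clear_cookies(jar) alone keeps the same user-agent and only resets the cookies.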
You can check more details about Octoparse anti-blocking settings here: Octoparse Anti-Blocking Settings
Article in Spanish: ¿Cómo scrape sitios web sin ser bloqueado? (How to scrape websites without being blocked)
You can also read web scraping articles on the official website