Octoparse Anti-Blocking Settings
Some websites are very sensitive to web scraping and take anti-scraping measures, such as IP blocking, to prevent it.
In this tutorial, I will show you how to configure Octoparse Anti-Blocking under "Settings", above the Workflow in a task. Do this before running the task to reduce the chance of being blocked.
Use IP proxies (for local extraction only)
You can set up proxies manually in Octoparse if you want to access the website through external proxies (e.g. from a specific country), or if you prefer to use your own proxies instead of the auto IP rotation feature of Cloud Extraction. For more information about how to set up proxies, please refer to Set up proxies.
- Check the box for "Use IP proxies" and click "Settings".
- Enter the proxies and the number of seconds for switching proxies.
- Click "OK" to save the change.
Octoparse will automatically switch proxies at the interval you set while the task is running locally.
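Octoparse handles the switching for you, so no code is needed. For readers curious about the general technique, here is a minimal Python sketch of time-based proxy rotation. It is not Octoparse's implementation: the class name `ProxyRotator`, its parameters, and the proxy addresses are all hypothetical and only illustrate the idea of moving to the next proxy after a set number of seconds.

```python
import itertools
import time


class ProxyRotator:
    """Cycle through a proxy list, switching after a fixed number of seconds.

    Hypothetical helper for illustration only; not part of Octoparse.
    """

    def __init__(self, proxies, switch_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._switch_seconds = switch_seconds
        self._clock = clock  # injectable so the timing can be tested
        self._current = next(self._cycle)
        self._last_switch = clock()

    def current(self):
        """Return the proxy to use now, advancing if the interval has elapsed."""
        now = self._clock()
        if now - self._last_switch >= self._switch_seconds:
            self._current = next(self._cycle)
            self._last_switch = now
        return self._current


# Made-up addresses from a documentation-only IP range.
rotator = ProxyRotator(
    ["203.0.113.10:8080", "203.0.113.11:8080"], switch_seconds=30
)
proxy = rotator.current()  # pass this to whatever HTTP client you use
```

A shorter interval spreads requests across more IPs but uses up the proxy list faster, which is the same trade-off you make when choosing the number of seconds in the dialog.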
Auto switch browser (User-agent)
Your browser sends what is known as a user agent to every web page you visit. This string tells the target website what kind of browser and device you are using to access the page. When you scrape a website repeatedly with the same user agent, the traffic is easy to detect as bot activity. Switching the user agent with this feature reduces the chance of being blocked.
To set up the auto switch browser:
- Check the box for "Auto switch browser (User-agent)".
- Click "Settings" to set up the type of user agent.
Not every user agent works for every website, so you may need to do some testing. If you want Octoparse to visit the website "via PC" when scraping, check the box for "Select all" and then uncheck the box for "Firefox for mobile 29.0". If you want Octoparse to visit the website "via mobile", check only the box for "Firefox for mobile 29.0".
- Click "OK" to save the change.
- Either check the box for "Custom interval" and select the number of minutes for switching the user agent, or check the box for "Switch IPs concurrently".
Octoparse will automatically switch the user agent according to your settings while the task is running locally or in the cloud.
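To make the idea concrete, the following Python sketch shows what user agent switching amounts to at the HTTP level: each request carries a `User-Agent` header picked from a pool of desktop or mobile strings. This is only an illustration under assumptions. The pool contents, the `USER_AGENTS` dictionary, and the function names are hypothetical and do not reflect Octoparse's built-in list or internals.

```python
import random

# Sample user agent strings, grouped by device type (illustrative only).
USER_AGENTS = {
    "desktop": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) "
        "Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ],
    "mobile": [
        "Mozilla/5.0 (Android; Mobile; rv:29.0) Gecko/29.0 Firefox/29.0",
    ],
}


def pick_user_agent(device="desktop", rng=random):
    """Choose one user agent at random from the pool for the given device."""
    return rng.choice(USER_AGENTS[device])


def build_headers(device="desktop", rng=random):
    """Build request headers that carry the chosen user agent."""
    return {"User-Agent": pick_user_agent(device, rng)}


# Headers for a "via mobile" visit; pass them to any HTTP client.
headers = build_headers("mobile")
```

Restricting the pool to one device type mirrors the "via PC" and "via mobile" choice described above: a site that serves a different layout to phones will return different pages depending on which pool the string comes from.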
Auto clear cookies
When you scrape a website repeatedly with the same cookies, the traffic is easy to detect as bot activity. Clearing cookies with this feature reduces the chance of being blocked.
- Check the box for "Auto clear cookies".
- Either check the box for "Custom interval" and select the number of minutes for clearing cookies, or check the box for "Clear cookies when IPs switch".
Octoparse will automatically clear cookies according to your settings while the task is running locally or in the cloud.
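The two options above can be pictured with a small Python sketch: cookies are dropped either after a fixed number of minutes or whenever the IP changes. This is an assumption-laden illustration, not Octoparse's code. The `CookieSession` class, its method names, and the plain dictionary used as a cookie store are all hypothetical.

```python
import time


class CookieSession:
    """Keep cookies in a dict and clear them on a timer or on an IP switch.

    Hypothetical helper for illustration only; not part of Octoparse.
    """

    def __init__(self, clear_minutes, clock=time.monotonic):
        self.cookies = {}
        self._clear_seconds = clear_minutes * 60
        self._clock = clock  # injectable so the timing can be tested
        self._last_clear = clock()

    def clear(self):
        """Forget every stored cookie and restart the interval timer."""
        self.cookies.clear()
        self._last_clear = self._clock()

    def before_request(self):
        """'Custom interval': clear cookies once the interval has elapsed."""
        if self._clock() - self._last_clear >= self._clear_seconds:
            self.clear()

    def on_ip_switch(self):
        """'Clear cookies when IPs switch': clear right after a proxy change."""
        self.clear()


session = CookieSession(clear_minutes=5)
session.cookies["session_id"] = "abc123"  # made-up cookie value
session.before_request()  # interval not yet elapsed, so the cookie stays
```

Clearing cookies together with the IP switch keeps the two signals consistent, since a site would otherwise see the same session cookie arriving from a new address.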
After setting up Octoparse Anti-Blocking, you can click "Save" to save the settings.
Article in Spanish: Octoparse Configuración de Anti-Bloqueo
You can also read web scraping articles on the official website.
Author: Yvonne
Editor: Kara