Octoparse is a powerful web scraping tool that allows users to extract data from websites without any coding knowledge. With its user-friendly interface and advanced features, Octoparse has become a popular choice for data extraction among businesses and individuals alike. In this article, we will introduce some of the most useful tools, features and tips that help you get data easily.
1. Browse Mode
The Octoparse built-in browser has two modes: Select Mode and Browse Mode.
By default, it is in Select Mode. When you move your cursor over the page elements, you will see them highlighted in blue. If you click on an element, Octoparse will not execute the click action right away. Instead, it will just select the element and show you Tips.
Browse Mode turns the built-in browser into a normal browser: the blue highlight disappears, and clicking on an element performs a real click.
Click on the Browse button in the upper right corner to enable Browse Mode. You can use it to close any unwanted pop-ups (e.g., cookie pop-ups) or solve captchas.
2. Show browser in local run
When you run a task on your device, the scraping process is visible if you click on the Show Browser button.
You will see the web pages open in the window and can watch the process to check that every step works well.
Show Browser can also be enabled automatically if you choose this option in the task settings.
3. Event logs
When a task runs locally or in the Cloud, every step executed is recorded. You can find the event logs in the extraction window. These logs can help us find which steps are not working well.
Related tutorial: What is cloud live log & history?
4. Regenerate XPath
Octoparse sometimes fails to work because the website has changed. In this case, we need to update the XPath. Here is a little tip for updating an XPath quickly.
Just click on the icon after the XPath box.
Then go to the web page and select the target element; the XPath will be updated.
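To illustrate why an XPath can stop matching after a site change, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. The page snippets, class names and XPath expressions are invented for illustration; they do not come from Octoparse or from any real website.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup before and after a site redesign (illustrative only).
OLD_PAGE = "<html><body><div class='price'>19.99</div></body></html>"
NEW_PAGE = "<html><body><span class='product-price'>19.99</span></body></html>"

OLD_XPATH = ".//div[@class='price']"
NEW_XPATH = ".//span[@class='product-price']"


def extract(page: str, xpath: str):
    """Return the text of the first element matching xpath, or None if nothing matches."""
    root = ET.fromstring(page)
    node = root.find(xpath)
    return node.text if node is not None else None


old_result = extract(OLD_PAGE, OLD_XPATH)     # the old XPath matches the old layout
broken_result = extract(NEW_PAGE, OLD_XPATH)  # the old XPath finds nothing after the change
fixed_result = extract(NEW_PAGE, NEW_XPATH)   # the updated XPath matches again

print(old_result, broken_result, fixed_result)
```

Regenerating the XPath in Octoparse plays the role of swapping OLD_XPATH for NEW_XPATH in this sketch.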
5. Customize field
Octoparse can scrape the text info, URL, HTML code or attribute values of one element. If we have a field that scrapes the text, and we need to change it to scrape the HTML code, what should we do?
We can go to More -> Customize field, then select the target info to scrape.
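The difference between these kinds of field values can be sketched in Python with the standard library's xml.etree.ElementTree. The link element below is a made-up example, and the variable names are only labels for this sketch; they are not Octoparse settings.

```python
import xml.etree.ElementTree as ET

# Hypothetical element, invented for illustration.
SNIPPET = '<a href="https://example.com/item/1" title="Item 1">Blue <b>mug</b></a>'

link = ET.fromstring(SNIPPET)

text_value = "".join(link.itertext())               # the text inside the element
url_value = link.get("href")                        # the link URL
title_value = link.get("title")                     # any other attribute value
html_value = ET.tostring(link, encoding="unicode")  # the element's HTML code

print(text_value)
print(url_value)
print(title_value)
print(html_value)
```

The same element yields four different values, which is why one field can be switched from text to HTML without selecting the element again.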
6. Enter subpage
Enter subpage helps you quickly select the detail page link to open.
Once you've set up the scrape from the listing page, you can click on Enter subpage to choose a link to open (when you cannot find the option in Tips). Octoparse will add a click step to help get data from each link.
7. Add steps from the workflow
Besides selecting an element and choosing an action from Tips, you can add any step directly from the workflow.
Move your cursor over the arrows in the workflow and a + button will show up. Click it to select a step to add.
8. Expand selection area
If you cannot select an entire area by moving your cursor over the page elements, try clicking on an element inside this area, then click on the Expand selection area button.
9. Split the task into 2 tasks
In many cases, we suggest you split one task into two tasks to speed up the scraping or to get more complete data.
For example, if you need to scrape an e-commerce search result page and click on each product link to get information, you can create one task to scrape all the product links first, then create another task to scrape data from those product links.
Here are some common situations in which you can try to split the task into two.
You need to click on each product link to get the data and the data amount is quite large. Scraping the links first makes it easier to re-scrape any missing products, because you already have the product links.
This is especially useful when the website uses infinite scroll or load more.
The website uses AJAX load when you click on the link directly.
If the scraping process is long, you can consider splitting the task to make sure it works well.
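The two-task idea can be sketched in plain Python. This is only an illustration of the approach, not how Octoparse works internally: the function names, the sample links and the fake_fetch stand-in (which pretends one page fails on the first attempt) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def collect_links(listing_pages: List[List[str]]) -> List[str]:
    """Task 1: gather every product link, dropping duplicates but keeping order."""
    seen = set()
    links = []
    for page in listing_pages:
        for url in page:
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


def scrape_details(links: List[str],
                   fetch: Callable[[str], Optional[dict]]) -> Dict[str, dict]:
    """Task 2: visit each saved link; links that fail are left out of the result."""
    results = {}
    for url in links:
        record = fetch(url)
        if record is not None:
            results[url] = record
    return results


def missing_links(links: List[str], results: Dict[str, dict]) -> List[str]:
    """Compare the saved links with the scraped data to find what is missing."""
    return [url for url in links if url not in results]


# Stand-in data: two listing pages that share one link.
pages = [["/p/1", "/p/2"], ["/p/2", "/p/3"]]
flaky = {"/p/2"}  # pretend this page fails on the first attempt


def fake_fetch(url: str) -> Optional[dict]:
    if url in flaky:
        flaky.discard(url)  # succeed on the next attempt
        return None
    return {"url": url}


links = collect_links(pages)
results = scrape_details(links, fake_fetch)
retry = missing_links(links, results)
results.update(scrape_details(retry, fake_fetch))  # re-scrape only the missing links
```

Because the links are saved by the first task, the second pass only has to revisit the links in retry instead of walking through the listing pages again.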
10. Click Loop Item to see if Octoparse can go back to the previous page
You can check to see if the Loop Item is working correctly by manually clicking through the actions in the workflow.
Click Loop Item after the listing page loads
Click on Click Item and wait for the new page to load
Click on Loop Item again to see if Octoparse shows the listing page
If you can see the listing page, that means Octoparse is able to return to the listing page and the Loop should be working. Otherwise, you will need to modify the workflow to make it work.
Here is a related tutorial: Why does Octoparse only click the first item and stop?
11. Switch the browser User Agent to Edge 121.0 in Settings to bypass CAPTCHA
Switching the User Agent to Edge 121.0 can be an effective way to bypass CAPTCHA or Cloudflare security checks. If you encounter loading issues with web pages, this simple tweak may also solve them.
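For readers curious what a User Agent switch means at the HTTP level, here is a small Python sketch using the standard library's urllib.request. The User-Agent string follows the usual format for Edge 121 on Windows, but the exact string Octoparse sends is an assumption, and build_request is a hypothetical helper. No request is actually sent.

```python
from urllib.request import Request

# Typical Edge 121 User-Agent format on Windows (assumed, not taken from Octoparse).
EDGE_121_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0"
)


def build_request(url: str, user_agent: str = EDGE_121_UA) -> Request:
    """Prepare a request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
sent_ua = req.get_header("User-agent")  # urllib stores the header name capitalized this way
print(sent_ua)
```

A website only sees this header value, so changing it changes which browser the site believes it is talking to.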