In this web scraping tutorial, we will scrape the content of news articles: the title, the article body, the published date, the author, and the article URL. Before we get started, open https://www.reuters.com/finance/markets in your own browser and navigate to the page you'd like to scrape. In this case, we are looking for "Market News". After the page finishes loading, copy the URL. This is the URL we will use in this demonstration.
Note: To learn more about AJAX, watch https://youtu.be/MuOC1yCKai0
To learn more about XPath, watch https://youtu.be/kZwD6szlvas
Step One: Enter the URL of the website you would like to scrape
- Build a new task by clicking “Advanced Mode”, and enter the URL
- Then click “Save URL” in the left corner. This will open the news article listing page in Octoparse’s built-in browser.
Step Two: Create a pagination loop
- Scroll down to the bottom of the page and find the pagination bar, then click the “Next Page” button. A command panel called “Action Tips” will show up whenever you interact with the website by clicking. It shows you what you can do with the selected element.
- Choose "Loop click the selected link"
- Go back to the settings area once the actions are set. Reuters doesn't use the AJAX technique for pagination, so we need to uncheck this option. If you want to know what pagination is, you can watch the video here; I have also attached the tutorial below for your reference. As you may notice, Octoparse selects auto-retry and uses a loop by default.
- Fix the XPath. You don’t need to fix the XPath every time, but you do need to check that it has been set up correctly. If you click through the workflow and the built-in browser shows the corresponding reaction, the setting is correct. In this case, as you click through the workflow, you will notice that the settings area says “Cannot locate the element with existing XPath expression”, and the built-in browser doesn’t paginate to the second page. This means we need to fix the XPath to tell Octoparse how to locate the “Next Page” button precisely. [If you are wondering what XPath is, watch: https://www.youtube.com/watch?v=kZwD6szlvas&t=19s] I have also attached it below for your reference. I have the XPath prepared for the purpose of this demonstration:
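To see what a pagination XPath actually does, here is a minimal sketch using Python's standard library. The HTML snippet and the class name `control-nav-next` are invented for illustration; the real XPath for Reuters is the one prepared in Octoparse, not shown here.

```python
import xml.etree.ElementTree as ET

# A made-up pagination bar, standing in for the one at the bottom
# of the Reuters listing page.
pagination_bar = """
<div class="pagination">
  <a class="control-nav-prev" href="?page=1">Previous</a>
  <a class="control-nav-next" href="?page=3">Next Page</a>
</div>
"""

root = ET.fromstring(pagination_bar)
# An XPath that targets the "Next Page" button precisely by its class
# attribute, rather than by its position on the page.
next_link = root.find(".//a[@class='control-nav-next']")
print(next_link.get("href"))  # the URL of the next listing page
```

Octoparse's "fix the XPath" step amounts to the same idea: replacing an expression that no longer matches anything with one that pins down the "Next Page" element unambiguously.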
- Click “OK” to save the steps.
Step Three: Create a “Loop Item”
- We need to click through each detail page and get the detailed information, so we create a “Loop Item”. To create a loop item, select an element, in this case an article from the listing. Click the first article; the selected item is highlighted in green, and Octoparse should be able to find the other similar items and highlight them in red. Choose “Select All”, and then choose “Loop click each selected link” from the “Action Tips”.
- Now we need to come back and check whether things have been set as expected. As before, click through each step of the workflow; if the web page in the built-in browser shows the corresponding reaction, the setting is correct. You may notice the same thing happens here: the loop item shows “Cannot find any element using this XPath expression”. We need to fix the expression. Copy and paste the correct XPath into the Variable List:
- Then click the element, uncheck “Ajax load”, and click “Save” to save the step.
Now we have a loop list with 10 articles from the listing page. Octoparse will click through each page and repeat exactly the same actions we took on the first page.
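Conceptually, the "Loop Item" is one XPath that matches every article link on the listing page, plus a loop over the matches. The sketch below illustrates this with Python's standard library; the markup, URLs, and headlines are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# A made-up article listing, standing in for the Reuters "Market News" page.
listing = """
<div class="news-headline-list">
  <article><a href="/article/market-news-1">Stocks rally</a></article>
  <article><a href="/article/market-news-2">Oil slips</a></article>
  <article><a href="/article/market-news-3">Dollar steadies</a></article>
</div>
"""

root = ET.fromstring(listing)
# One expression matches the whole list of similar items (the ones
# Octoparse highlights in red); the loop then visits each match in turn.
links = root.findall(".//article/a")
for link in links:
    print(link.get("href"), "-", link.text)
```

This is why fixing the loop's XPath matters: if the expression matches nothing, the loop has nothing to click through.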
Step Four: Data extraction
- To extract the data, click an element, in this case the article title, and choose “Extract text of the selected element” from the Action Tips.
- You can preview the extracted data in the data fields and edit it accordingly.
- That finishes setting up one extraction field. If you want to extract more elements, just repeat the steps.
- For example, click the published date and choose “Extract text of the selected element” from the Action Tips.
- Click “Save” to save the steps.
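"Extract text of the selected element" boils down to reading the text of one node per field. Here is a sketch for the fields this tutorial targets (title, author, published date); the detail-page markup, class names, and values are a made-up stand-in for a real article page.

```python
import xml.etree.ElementTree as ET

# A made-up article detail page with the three fields we want.
detail_page = """
<article>
  <h1 class="headline">Wall Street ends higher</h1>
  <p class="byline">By Jane Doe</p>
  <time class="date">2020-03-02</time>
</article>
"""

root = ET.fromstring(detail_page)
# One extraction field per XPath: each expression selects a single
# element, and we keep its text content.
record = {
    "title": root.find(".//h1[@class='headline']").text,
    "author": root.find(".//p[@class='byline']").text,
    "date": root.find(".//time[@class='date']").text,
}
print(record)
```

Each extraction field you add in Octoparse corresponds to one such element-to-text mapping, repeated for every article in the loop.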
Step Five: Run the task and get data
- After finishing the setup, run the task by clicking “Start Extraction”.
- Then select “Local Extraction” to run the task. You can switch views to check the scraping status and the extracted data.
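Once the extraction finishes, the result is a table with one row per article, which you can export to formats such as CSV. As a rough equivalent, this sketch writes a few hypothetical scraped records to CSV with Python's standard library; the records themselves are invented examples, not real scraped data.

```python
import csv
import io

# Hypothetical extracted records, one per article.
rows = [
    {"title": "Stocks rally", "author": "By Jane Doe",
     "date": "2020-03-02", "url": "/article/market-news-1"},
    {"title": "Oil slips", "author": "By John Roe",
     "date": "2020-03-02", "url": "/article/market-news-2"},
]

# Write a header row plus one CSV row per extracted record.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author", "date", "url"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```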