You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier, and more robust! Download and upgrade here if you haven't already done so!
Bloomberg, one of the largest global financial websites, delivers business and market news, data, analysis, and video, featuring stories from Businessweek and Bloomberg News. The website covers markets, technology, politics, and wealth. In this tutorial, we will use Octoparse to scrape Covid-related news from Bloomberg, collecting the image URL, news title, author, and summary of each news item.
The case URL is provided below:
Here are the major steps in this tutorial:
- Go to Web Page - to open the target website
- Auto-detect the web page - to create a workflow
- Modify the XPath of Loop Item - to locate the news item accurately
- Run the task - to get the final data
1. Go to Web Page - to open the target website
To start web scraping, we need to first enter the website URL.
- Enter the Bloomberg search URL into the search box at the center of the home screen, and click Start to create a new task with Advanced Mode.
2. Auto-detect the web page - to create a workflow
On this page, the auto-detect function can identify the data for us.
- Click Auto-detect web page in the Tips panel and wait for the detection to complete
- Check the data fields in the Data Preview, then delete or rename any fields as needed
- Click Create workflow to generate a workflow
The workflow will be created as shown below:
3. Modify the XPath of Loop Item - to locate the news item accurately
- Click Loop Item 1 to open its settings
- Input the Matching XPath for each news section, which would be
- Click Apply to save the settings
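To illustrate why the Loop Item XPath matters, here is a minimal Python sketch of the same idea: one XPath selects each repeating news block, and the fields are then read relative to that block. The markup, class names, and XPath below are hypothetical and simplified. They are not Bloomberg's real page structure and not the XPath used in this task.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. NOT Bloomberg's real page structure.
PAGE = """
<div class="results">
  <div class="story">
    <a class="headline" href="/news/1">First headline</a>
    <p class="summary">First summary.</p>
  </div>
  <div class="ad">Sponsored</div>
  <div class="story">
    <a class="headline" href="/news/2">Second headline</a>
    <p class="summary">Second summary.</p>
  </div>
</div>
"""

# Hypothetical loop XPath: one match per news item, skipping the ad block.
LOOP_XPATH = ".//div[@class='story']"


def extract_items(page, loop_xpath):
    """Return one dict per element matched by the loop XPath."""
    root = ET.fromstring(page.strip())
    rows = []
    for item in root.findall(loop_xpath):
        # Field paths are evaluated relative to the current loop item.
        rows.append({
            "title": item.findtext("a[@class='headline']"),
            "summary": item.findtext("p[@class='summary']"),
        })
    return rows


rows = extract_items(PAGE, LOOP_XPATH)
for row in rows:
    print(row)
```

If the loop XPath is too broad (for example, matching every `div`), unrelated blocks such as the ad would appear as rows with empty fields, which is the kind of problem a precise Loop Item XPath avoids.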
4. Run the task - to get the final data
- Click the Save button first to save all the settings you have made
- Click Run to run the task either locally or in the cloud
- Here we select Run on your device to run the task locally, then wait for it to complete
Here is the sample output from the local run.
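After a run, it can be worth checking the results for rows with missing fields. The sketch below assumes the data has been exported to CSV and that the fields were named `image_url`, `title`, `author`, and `summary`. These column names and the sample rows are hypothetical; the real names depend on how the fields were renamed in the Data Preview.

```python
import csv
import io

# Hypothetical export content; real column names depend on your task settings.
SAMPLE_EXPORT = """image_url,title,author,summary
https://example.com/a.jpg,Headline one,Jane Doe,Summary one
,Headline two,,Summary two
"""


def rows_missing_fields(csv_text, required):
    """Return (line_number, missing_field_names) for each incomplete row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    incomplete = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        missing = [name for name in required if not (row.get(name) or "").strip()]
        if missing:
            incomplete.append((line_no, missing))
    return incomplete


incomplete = rows_missing_fields(
    SAMPLE_EXPORT, ["image_url", "title", "author", "summary"]
)
print(incomplete)
```

Many empty values in one column usually suggest that the field's XPath, or the Loop Item XPath, does not match every news item.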
If you have further issues with the task or any suggestions, we’d love to hear about them. Submit a request here.