You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

Bloomberg, one of the biggest global financial websites, delivers business and market news, data, analysis, and video to the world, featuring stories from Businessweek and Bloomberg News. The website covers markets, technology, politics, and wealth. In this tutorial, we will use Octoparse to scrape Bloomberg's search results for Covid news and extract the image URL, title, author, and summary of each news item.

INFO.jpg

The sample URL used in this tutorial is provided below:

https://www.bloomberg.com/search?query=covid
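
The search keyword is carried in the `query` parameter of this URL. If you want to prepare search URLs for other keywords before pasting them into Octoparse, a minimal Python sketch might look like the following. The helper name `build_search_url` is hypothetical, and the sketch assumes the search page accepts other keywords through the same `query` parameter shown above.

```python
from urllib.parse import urlencode

# Base of the sample URL used in this tutorial.
BASE = "https://www.bloomberg.com/search"


def build_search_url(keyword: str) -> str:
    """Hypothetical helper: URL-encode the keyword as the `query` parameter."""
    return f"{BASE}?{urlencode({'query': keyword})}"


# Reproduces the sample URL from this tutorial.
print(build_search_url("covid"))
# Spaces and special characters in a keyword are encoded automatically.
print(build_search_url("interest rates"))
```

Encoding the keyword this way keeps the URL valid when the keyword contains spaces or punctuation.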

Here are the main steps in this tutorial: [Download task file here]

  1. Go to Web Page - to open the target website

  2. Auto-detect the web page - to create a workflow

  3. Modify the XPath of Loop Item - to locate the news item accurately

  4. Run the task - to get the final data


1. Go to Web Page - to open the target website

To start web scraping, we need to first enter the website URL.

  • Enter the Bloomberg search URL into the search box at the center of the home screen, and click Start to create a new task with Advanced Mode.

1.png

Note: If you encounter a robot verification, complete it in browse mode, and remember to turn browse mode off before continuing with the next steps.


2. Auto-detect the web page - to create a workflow

On this page, the auto-detect function can identify the data fields for us.

  • Click Auto-detect web page in the Tips panel and wait for the detection to complete

detec.jpg
  • Check the data fields in the Data Preview, then delete or rename fields as needed

data_edit.jpg
  • Click Create workflow to generate a workflow

4.jpg

The generated workflow looks like this:

workflow1.jpg

3. Modify the XPath of Loop Item - to locate the news item accurately

  • Click Loop Item 1 to open its settings

  • Enter the following Matching XPath, which matches each news item:

    • //div[contains(@class,'storyItem')]

  • Click Apply to save the settings

Xpath.jpg
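
To see why `contains(@class,'storyItem')` is used instead of an exact class match, here is a minimal Python sketch. It is not how Octoparse evaluates XPath internally; it only applies the same substring test to a small, made-up HTML fragment. The class names in the fragment (such as `storyItem__x1` and `adSlot`) are hypothetical and are not taken from the Bloomberg page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; the class names are assumptions for illustration.
SAMPLE = """
<html><body>
  <div class="storyItem__x1 wide"><a class="headline">Story one</a></div>
  <div class="adSlot">Sponsored</div>
  <div class="storyItem__x2"><a class="headline">Story two</a></div>
</body></html>
""".strip()


def match_story_items(root):
    """Mimic //div[contains(@class,'storyItem')]: keep every <div> whose
    class attribute contains the substring 'storyItem'."""
    return [div for div in root.iter("div") if "storyItem" in div.get("class", "")]


root = ET.fromstring(SAMPLE)
items = match_story_items(root)
titles = [item.find("a").text for item in items]
print(titles)
```

Because the test is a substring match, both story blocks are selected even though their full class values differ, while the unrelated `adSlot` block is skipped.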

4. Run the task - to get the final data

  • Click the Save button first to save all the settings you have made

  • Click Run to run your task either locally or in the cloud

mceclip8.png
  • Here we select Run on your device to run the task locally, then wait for it to complete

mceclip9.png

Here is the sample output from the local run.

data_preview.jpg

TIP: Local runs are great for quick runs and small amounts of data. If you are dealing with more complicated tasks or large amounts of data, Run in the Cloud is recommended for higher speed. You are welcome to try this premium feature by signing up for the 14-day free trial here. Cloud tasks can be scheduled hourly, daily, or weekly, with data delivered regularly.
