You are browsing a tutorial guide for the latest version of Octoparse. If you are running an older version, we strongly recommend upgrading, as the latest version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!

As one of the most popular news websites in the U.S., the Washington Post covers news not only from America but also from around the world, spanning almost every field, including politics, opinions, the coronavirus, and sports.

In this tutorial, we will use Octoparse to scrape data such as the title URL, news title, and published date for the Covid-related news posted on the Washington Post.

info.jpg

Target URL used below:

https://www.washingtonpost.com/search/?query=Covid&btn-search=&facets=%7B%22time%22%3A%22all%22%2C%22sort%22%3A%22relevancy%22%2C%22section%22%3A%5B%5D%2C%22author%22%3A%5B%5D%7D
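Note that the long `facets` parameter in this URL is just URL-encoded JSON describing the search filters (time range, sort order, section, and author). If you ever need to build a similar search URL for a different keyword, it can be assembled with Python's standard library; the filter values below are taken directly from the URL above:

```python
import json
from urllib.parse import urlencode

# The search filters as they appear (URL-encoded) in the target URL above
facets = {"time": "all", "sort": "relevancy", "section": [], "author": []}

params = {
    "query": "Covid",   # the search keyword
    "btn-search": "",
    "facets": json.dumps(facets, separators=(",", ":")),  # compact JSON
}

# urlencode percent-encodes the JSON, reproducing the target URL
url = "https://www.washingtonpost.com/search/?" + urlencode(params)
print(url)
```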

Here are the main steps in this tutorial: [Download task file here]

  1. Enter the URL on the home page - to open the target page

  2. Start auto-detection - to create a workflow

  3. Add pagination - to scrape more results from multiple pages

  4. Run the task - to get the wanted data


1. Enter the URL on the home page - to open the target page

To start scraping, we first need to enter the target URL.

  • Input the web page URL in the search box at the center of the home screen

  • Click Start to create a new task with Advanced Mode

____.jpg

2. Start auto-detection - to create a workflow

Octoparse's auto-detect function can identify the page structure and help to create a workflow quickly.

  • Click on Auto-detect web page data to start the detection automatically and wait for it to complete

detec.jpg
  • Check the data fields in the Data Preview and delete unwanted fields or rename them if needed

FIELD.jpg
  • Once the auto-detect is done, click Create workflow to generate a workflow

create_workflow.jpg

The automatically generated workflow for this task will appear as below:

workflow.jpg

3. Add pagination - to scrape more results from multiple pages

To scrape results beyond the first page, we need to set up pagination so that more results are loaded.

  • Click the Load more results button at the bottom of the page first

  • Click Loop click single button in the Tips box to generate pagination

Loadmore.jpg
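Conceptually, the Loop click action repeats the same cycle until the button no longer appears: extract the items currently loaded, click Load more results, and repeat. A minimal sketch of that loop, with the page and button simulated here since Octoparse handles the real clicking:

```python
def paginate(batches):
    """Simulate "Load more results" pagination.

    batches: a list of result batches; each simulated click of the
    "Load more results" button loads the next batch. The loop stops
    when there is nothing left to load (the button disappears).
    """
    results = []
    page = 0
    while page < len(batches):         # "Load more results" still shown
        results.extend(batches[page])  # extract the newly loaded items
        page += 1                      # click the button
    return results

print(paginate([["headline 1", "headline 2"], ["headline 3"]]))
```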

4. Run the task - to get the wanted data

The final workflow will look as below:

final_workflow.jpg
  • Click the Save button first to save all the settings you have made

  • Click Run to run your task either locally or in the cloud

mceclip8.png
  • Here we select Run on your device to run the task on your local device and wait for completion

mceclip9.png

Below is sample data from the local run. The data can be exported in Excel, CSV, HTML, and JSON formats.

mceclip0.png
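If you later want to reproduce the CSV or JSON exports outside Octoparse, the same three fields can be written with Python's standard library. This is only an illustrative sketch: the row below uses made-up sample values, not real scraped data.

```python
import csv
import json

# Hypothetical sample rows with the three fields scraped in this tutorial
rows = [
    {
        "title": "Example Covid headline",
        "title_url": "https://www.washingtonpost.com/example",
        "published_date": "2022-01-01",
    },
]

# CSV export
with open("wapo_covid.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "title_url", "published_date"])
    writer.writeheader()
    writer.writerows(rows)

# JSON export
with open("wapo_covid.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```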

Tip: Local runs are great for quick runs and small amounts of data. If you are dealing with more complicated tasks or large amounts of data, Run in the Cloud is recommended for higher speed. You are very welcome to try this premium feature by signing up for the 14-day free trial here. Tasks can be scheduled hourly, daily, or weekly, with data delivered regularly.
