As one of the most popular news websites in the U.S., the Washington Post covers news not only from America but also from around the world. The website carries news in almost every field, including politics, opinions, coronavirus, and sports.
To collect a large amount of news at once, this tutorial uses Octoparse to scrape the title URL, news title, and published date of the Covid-related news posted on the Washington Post.
Target URL used below:
Here are the main steps in this tutorial:
- Enter the URL on the home page - to open the target page
- Start auto-detection - to create a workflow
- Add pagination - to scrape more results from multiple pages
- Run the task - to get the wanted data
1. Enter the URL on the home page - to open the target page
To begin scraping, first enter the target URL.
- Input the web page URL in the search box at the center of the home screen
- Click Start to create a new task with Advanced Mode
2. Start auto-detection - to create a workflow
Octoparse's auto-detect function identifies the page structure and helps create a workflow quickly.
- Click on Auto-detect web page data to start the detection automatically and wait for it to complete
- Check the data fields in the Data Preview and delete unwanted fields or rename them if needed
- Once the auto-detect is done, click Create workflow to generate a workflow
The automatically generated workflow for this task is shown below:
3. Add pagination - to scrape more results from multiple pages
To get more results, add pagination so that the task keeps loading more results.
- Click the "Load more results" button at the bottom of the page first
- Click Loop click single button in the Tips box to generate pagination
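Conceptually, a loop-click pagination step keeps clicking the button and collecting the newly loaded rows until nothing more loads. The following is a minimal Python sketch of that idea, not Octoparse's actual implementation. The names `collect_with_load_more`, `make_fake_site`, and the `max_clicks` cap are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def collect_with_load_more(
    first_batch: List[Row],
    load_more: Callable[[], Optional[List[Row]]],
    max_clicks: int = 50,
) -> List[Row]:
    """Accumulate rows, calling load_more() until it yields nothing
    or the click cap is reached."""
    rows = list(first_batch)
    for _ in range(max_clicks):
        batch = load_more()
        if not batch:  # button gone, or the click loaded nothing new
            break
        rows.extend(batch)
    return rows


def make_fake_site(pages: List[List[Row]]) -> Callable[[], Optional[List[Row]]]:
    """Hypothetical stand-in for a page with a 'Load more results' button.

    Each call simulates one click and returns the next batch of rows,
    or None once no more results are available."""
    remaining = iter(pages)

    def load_more() -> Optional[List[Row]]:
        return next(remaining, None)

    return load_more


if __name__ == "__main__":
    # Placeholder rows, not real scraped data.
    click = make_fake_site([[{"title": "second batch"}], [{"title": "third batch"}]])
    print(collect_with_load_more([{"title": "first batch"}], click))
```

The cap on clicks is a design choice for the sketch: a "load more" button can keep returning results for a very long time, so bounding the loop keeps a run from going on indefinitely.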
4. Run the task - to get the wanted data
The final workflow will look like the one below:
- Click the Save button first to save all the settings you have made
- Click Run to run your task either locally or in the cloud
- Here we select Run on your device to run the task on the local device, then wait for it to complete
Below is sample data from the local run. The data can be exported in Excel, CSV, HTML, or JSON format.
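Once the data is exported, it can be post-processed outside Octoparse. The sketch below assumes a CSV export and shows one way to load it and drop rows that share the same URL, using only the Python standard library. The column names `Title_URL`, `Title`, and `Published_Date` are assumptions (the real names depend on how the fields were named in the Data Preview), and the sample rows are placeholders rather than real scraped results.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def read_export(csv_text: str) -> List[Row]:
    """Parse CSV text (header row first) into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def dedupe_by(rows: List[Row], key: str) -> List[Row]:
    """Keep the first row seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Placeholder export content; column names are assumed, not taken from Octoparse.
SAMPLE_CSV = """Title_URL,Title,Published_Date
https://example.com/a,Placeholder headline A,2021-01-01
https://example.com/b,Placeholder headline B,2021-01-02
https://example.com/a,Placeholder headline A,2021-01-01
"""

if __name__ == "__main__":
    rows = dedupe_by(read_export(SAMPLE_CSV), "Title_URL")
    for row in rows:
        print(row["Published_Date"], row["Title"], row["Title_URL"])
```

For a real export, the CSV text would come from the exported file, for example by reading it with `open(path, newline="", encoding="utf-8")`.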