You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
As one of the most popular news websites in the U.S., the Washington Post covers news from America and from around the world. The site carries news in almost every field, including politics, opinions, coronavirus, and sports.
In this tutorial, we will use Octoparse to scrape the title URL, news title, and published date of the Covid-related news posted on the Washington Post, so that many articles can be collected in a single run.
Target URL used below:
The main steps are shown in the menu on the right. [Download task file here]
1. Create a Go to Web Page - to open the target page
To begin, enter the target URL.
Enter the web page URL in the search box at the center of the home screen
Click Start to create a new task with Custom Task
2. Create a Loop Item - to scrape the list of articles
Select the first article content block
Choose Select all similar elements
Select Text + Link
A Loop Item will be generated in the workflow.
Delete unwanted data fields directly by clicking More and Delete field
Modify the data field names by double-clicking the headers
Add a field by selecting the text on the page and choosing Text
3. Set up Pagination Loop - to scrape more results from multiple pages
The page shows more articles only after the Load more results button is clicked, so a pagination loop is needed to collect them.
Click the Load more results button at the bottom of the page first
Click Loop click on the Tips panel to generate the pagination loop
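For readers who want a mental model of a "load more" pagination loop, the idea can be sketched in Python. This is only an illustration of the general click-until-gone pattern, not Octoparse's internal implementation. The FakePage class, its methods, and the sample titles are hypothetical stand-ins for a real page.

```python
class FakePage:
    """Hypothetical stand-in for a page with a 'Load more results' button."""

    def __init__(self, batches):
        self._batches = batches  # each batch is a list of article titles
        self._shown = 1          # the first batch is visible on page load

    def visible_items(self):
        # Everything loaded so far, in page order.
        return [t for batch in self._batches[:self._shown] for t in batch]

    def has_load_more(self):
        # The button exists only while unloaded batches remain.
        return self._shown < len(self._batches)

    def click_load_more(self):
        self._shown += 1


def scrape_all(page, max_clicks=50):
    """Click 'load more' until the button disappears or max_clicks is reached."""
    clicks = 0
    while page.has_load_more() and clicks < max_clicks:
        page.click_load_more()
        clicks += 1
    return page.visible_items()


page = FakePage([["A", "B"], ["C"], ["D", "E"]])
titles = scrape_all(page)
print(titles)
```

The max_clicks cap is a design choice in this sketch: it keeps the loop from running indefinitely on a page that always offers more results.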
4. Run the task - to get the wanted data
The final workflow will look like this:
Click the Save button first to save all the settings you have made
Click Run next to it and wait for a Run Task window to pop up
Select Standard mode under the Run on your device section to run the task, then wait for it to complete
Here is the sample output data, which can be exported in Excel, CSV, HTML and JSON formats.
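Once the data is exported, it can be processed outside Octoparse. As a minimal sketch, assuming a CSV export with columns named Title, Title_URL, and Published_Date (hypothetical names; use whatever field names you set in step 2), the following Python snippet reads the rows and drops duplicates by URL. The inline sample data stands in for a real export file.

```python
import csv
import io

# Hypothetical sample standing in for an exported CSV file. The column
# names are assumptions and should match the field names set in step 2.
SAMPLE_CSV = """Title,Title_URL,Published_Date
Story one,https://example.com/a,2021-01-05
Story two,https://example.com/b,2021-01-06
Story one,https://example.com/a,2021-01-05
"""


def load_unique_rows(csv_file, key="Title_URL"):
    """Read rows from a CSV file object, keeping the first row per key."""
    seen = set()
    rows = []
    for row in csv.DictReader(csv_file):
        if row[key] in seen:
            continue  # skip articles already collected
        seen.add(row[key])
        rows.append(row)
    return rows


rows = load_unique_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["Published_Date"], row["Title"], row["Title_URL"])
```

To read a real export instead of the sample, pass a file opened with open(path, newline="", encoding="utf-8-sig"), where path is the location you chose when exporting.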
Note: Local runs work well for quick jobs and small amounts of data. For more complicated tasks or large volumes of data, Run in the Cloud is recommended for higher speed. Tasks can be scheduled hourly, daily, or weekly, with data delivered regularly. You can try this premium feature by signing up for the 14-day free trial here.