You are browsing a tutorial guide for the latest version of Octoparse. If you are running an older version, we strongly recommend you upgrade because the latest version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!

Medium is an open platform where readers find dynamic thinking, and where expert and undiscovered voices can share their writing on any topic.

This tutorial will show you how to scrape articles from Medium.

1.png

The URL being used in this tutorial is: https://medium.com/search?q=covid

NOTE: If you want to check whether your workflow is set up correctly, you can download the Task OTD file for this case at the bottom of this page.

Here are the main steps in this tutorial: [Download task file here]

  1. Create a Go to Web Page step - to open the target website

  2. Set up pagination - to scrape more articles

  3. Create a loop click step - to click into articles

  4. Extract data - to choose the target data

  5. Modify the XPath for the data fields - to locate elements accurately on every detail page

  6. Back to the previous page - to return to the listing page

  7. Run the task - to get the target data


1. Create a Go to Web Page step - to open the target website

  • Enter the target URL into the search bar on the home screen and click Start

OPEN.png

2. Set up pagination - to scrape more articles

  • Click on the Show more button

  • Click Loop click single button in the Tips box

pagination.png
  • Input the XPath for pagination as: //button[contains(text(),'Show more')]

  • Click Apply

xpath.png
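If you want to sanity-check this pagination XPath outside of Octoparse, you can evaluate it against a small HTML fragment, for example with Python and lxml. The markup below is a simplified stand-in for Medium's real page, which is more complex and may change:

```python
from lxml import etree

# Simplified stand-in for Medium's search results page;
# the real markup is more complex and may change over time.
page = etree.fromstring("""
<div>
  <button>Show more</button>
  <button>Follow</button>
</div>
""")

# The pagination XPath used in this tutorial
buttons = page.xpath("//button[contains(text(),'Show more')]")
print(len(buttons))  # → 1: only the 'Show more' button matches
```

Because the XPath filters on the button's own text, only the pagination button is matched, which is what lets Octoparse click it reliably on every load.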

Set up page scroll after new content is loaded

  • Click the Click to paginate step

  • Click Options

  • Tick Scroll down the page after it is loaded

  • Choose Scroll for one screen

  • Set Scroll times to 100

  • Click APPLY

SCROLL.png

3. Create a loop click step - to click into articles

  • Click on one title

  • Click Select All in the Tips box after the title turns green; Octoparse will then select all the titles

LOOP_CLICK.png
  • Click Loop click each element in the Tips box

CLICK.png

Modify Loop Item settings

  • Click Loop Item frame

  • Choose Variable List as Loop Mode

  • Input the Matching XPath as: //a[@aria-label="Post Preview Title"]/div/h2

  • Click Apply

LOOP.png
  • Untick Load with AJAX in the click settings

  • Click Apply

UNTICK.png
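The Variable List XPath above can be checked the same way. In the sketch below (Python with lxml), the aria-label attribute comes from the tutorial's XPath, but the surrounding markup is hypothetical; it shows the XPath selecting one h2 per article preview while skipping other links:

```python
from lxml import etree

# Hypothetical listing markup built around the tutorial's loop XPath;
# Medium's actual structure may differ or change.
listing = etree.fromstring("""
<main>
  <a aria-label="Post Preview Title" href="/p/1"><div><h2>Article A</h2></div></a>
  <a aria-label="Post Preview Title" href="/p/2"><div><h2>Article B</h2></div></a>
  <a aria-label="Author" href="/@someone"><div><h2>Not a title</h2></div></a>
</main>
""")

# The Matching XPath for the Loop Item
titles = listing.xpath('//a[@aria-label="Post Preview Title"]/div/h2')
print([t.text for t in titles])  # → ['Article A', 'Article B']
```

Each matched element becomes one iteration of the Loop Item, so a title-only XPath like this is what keeps the loop from clicking author links or other noise.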

4. Extract data - to choose the target data

  • Click on the wanted data

choose.png
  • Delete unwanted data by clicking the delete icon

DELETE.png
  • Click Extract data from the Tips box

EXTRACT.png
  • Untick Extract data in the loop under the Extract Data settings

  • Click Apply to save the settings

un.png
  • Double-click the header of the field to rename it

RENAME.png

5. Modify the XPath for the data fields - to locate elements accurately on every detail page

The XPath Octoparse auto-generates for data fields may not work on every page. We can rewrite the XPath for these elements to make sure they are detected on every detail page.

  • Change the Data Preview to a vertical view

  • Input the XPath for each data field as below:

    • author: //div[contains(@class,'author')]//a

    • published_time: //p[contains(@class,'published-date')]/span

    • title: //h1[contains(@class,'post-title')]

    • sub_title: //h2[contains(@class,'subtitle')]

    • article: //article[@class="meteredContent"]/div

data.png
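These field XPaths can likewise be checked against a sample detail page. In the sketch below (Python with lxml), the fragment is a hypothetical page built to match the tutorial's XPaths; Medium's real class names differ and change often, which is exactly why the auto-generated XPath needs rewriting:

```python
from lxml import etree

# Hypothetical detail-page fragment matching the tutorial's field XPaths;
# Medium's real markup differs and changes over time.
detail = etree.fromstring("""
<body>
  <div class="author-info"><a href="/@jane">Jane Doe</a></div>
  <p class="published-date"><span>Mar 1, 2021</span></p>
  <h1 class="post-title">Living with Covid</h1>
  <h2 class="subtitle">A year of data</h2>
  <article class="meteredContent"><div>Full article body...</div></article>
</body>
""")

# The five field XPaths from this tutorial
fields = {
    "author": "//div[contains(@class,'author')]//a",
    "published_time": "//p[contains(@class,'published-date')]/span",
    "title": "//h1[contains(@class,'post-title')]",
    "sub_title": "//h2[contains(@class,'subtitle')]",
    "article": '//article[@class="meteredContent"]/div',
}
for name, xpath in fields.items():
    nodes = detail.xpath(xpath)
    print(name, "->", nodes[0].text if nodes else None)
```

Note the difference in matching styles: contains(@class, ...) tolerates extra class names on the element, while @class="meteredContent" requires the attribute to match exactly.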

6. Back to the previous page - to return to the listing page

The Medium website loads the article detail page with AJAX, so the article page replaces the listing page once we click open an article. In this case, we need to add a step to get back to the listing page.

  • Click the "+" icon under the Extract Data step to add a step

  • Click Back to Previous Page

back.png

The final workflow will look like this:

WORKFLOW.png
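Conceptually, the finished workflow runs as two nested loops: the pagination loop loads more results, and the Loop Item opens each article, extracts data, and goes back. The plain-Python sketch below is illustrative only (the names and data are made up, not an Octoparse API):

```python
# Illustrative sketch of the control flow Octoparse executes for this
# workflow; the list data and names here are hypothetical.
listing_pages = [
    ["Article A", "Article B"],   # initial results
    ["Article C"],                # loaded after clicking "Show more"
]

extracted = []
for page in listing_pages:        # outer loop: "Click to paginate"
    for title in page:            # inner loop: "Loop Item" clicks each title
        # the detail page opens here in Octoparse...
        extracted.append({"title": title})
        # ...then "Back to Previous Page" returns to the listing
print(len(extracted))  # → 3 articles extracted
```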

7. Run the task - to get the target data

  • Click the Save button first to save all the settings you have made

  • Then click Run to run your task either locally or in the cloud

mceclip8.png
  • Select Run on your device and click Run Now to run the task on your local device

  • Wait for the task to complete

mceclip9.png

Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.

RESULT.png

TIP: Medium requires a premium account to view more articles. You may need to log in to your account to get more data. Here is the related tutorial: Scrape data behind a login
