Extract element text/URL/image/HTML/attribute
Web pages contain different kinds of information, such as text, links, and images, and Octoparse can scrape each of them. In this tutorial, we will show you how to use Octoparse to extract text, URLs, image URLs, HTML, and attribute values.
1) Extract Text
Most data on the web is presented as human-readable text, such as news articles, product information, and blog posts.
Let's see how to select and extract the text data with Octoparse.
1. Click on the target data you want
When you click on the element you need, the selection area will be highlighted in green.
2. Extract text
Click "Extract text of the selected element" to fetch the text.
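For readers who want to see what "extracting the text of an element" means at the HTML level, here is a minimal Python sketch. It is not how Octoparse works internally; it only illustrates the idea with the standard library's `html.parser`. The sample markup and the names `TextExtractor` and `extract_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only text nodes that contain something besides whitespace.
        if data.strip():
            self.parts.append(data.strip())


def extract_text(html_fragment):
    """Return the text of the fragment, with tags removed."""
    parser = TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical markup for a selected element.
sample = '<div class="title"><h2>Example Book</h2> <span>by A. Writer</span></div>'
title_text = extract_text(sample)
print(title_text)
```

The tags are dropped and only the readable text remains, which is roughly what you get in the data field after choosing the extract-text option.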
2) Extract the URL of a link or an image
A URL is the web address that a hyperlink points to. With a single click on a link, you can open a new web page or go to a new website, just like what happens when you click on the title of a book on Amazon.
Besides web pages, a URL can also point to a specific file on the Internet, such as an image or a PDF document. Once you have the URL, you can download the corresponding file or image from it.
Let's see how to select and extract the URL of a link or an image with Octoparse.
Extract the URL of a link
1. Click on the link you want
When you click on the link you need, the selection area will be highlighted with a green box.
Tips! When you select an item with a URL, the selected tag shown at the bottom of the "Tips" panel should be "A", which stands for an anchor, the element that usually links one page to another. Please make sure you select the right area.
2. Extract the URL
Click "Extract the URL of the selected element" on the "Tips" panel to get the URL.
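To make the role of the "A" tag concrete, the sketch below reads the `href` value of each anchor in a small HTML fragment and resolves relative links against the page address. It uses only the Python standard library and is an illustration, not Octoparse's own implementation. The sample markup, the `example.com` addresses, and the names `LinkExtractor` and `extract_links` are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html_fragment, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_fragment)
    parser.close()
    return parser.links


# Hypothetical markup and page address.
sample = (
    '<a href="/books/123">Example Book</a> '
    '<a href="https://example.org/about">About</a>'
)
links = extract_links(sample, "https://example.com/list")
print(links)
```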
Extract the image URL
1. Click on the image you want
Tips! Can I use Octoparse to get an image itself, not its URL, from the web page? Unfortunately, you can't use Octoparse to extract the image itself. If you want to download images, you can scrape the image URLs with Octoparse first, and then bulk download the images with a "download from URL" tool.
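If you do not have a "download from URL" tool at hand, a few lines of Python can do the same job. The following is a minimal sketch, assuming you already have a list of image URLs exported from your task. The function names `filename_for` and `download_images`, the `images` folder, and the example URL are hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_for(url, index):
    """Derive a local file name from the URL path, ignoring the query string."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a numbered name when the URL path has no file name.
    return name or f"image_{index}"


def download_images(urls, folder="images"):
    """Download every URL into the folder and return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        target = os.path.join(folder, filename_for(url, index))
        urlretrieve(url, target)  # performs the actual HTTP download
        saved.append(target)
    return saved


# Example call (needs network access, so it is left commented out):
# download_images(["https://example.com/img/cover.jpg?size=large"])
```

This sketch does no retrying or error handling, and two URLs that end in the same file name would overwrite each other, so treat it as a starting point only.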
3) Extract inner/outer HTML
Unlike text and URLs, data such as icons cannot be extracted directly. When you want to extract visual, non-text content, like a star rating, you have to extract the inner/outer HTML of that content.
Besides icons, you can also scrape hidden text, charts, and graphs from a web page by extracting the HTML of these elements first. After getting the HTML code, you need to apply regular expressions to clean up the data.
First, let's see how to select and extract inner/outer HTML with Octoparse.
1. Click on the target data you want
When you click on the element you need, the selection area will be highlighted in green.
2. Extract inner/outer HTML
Click "Extract inner/outer HTML of the selected" on the "Tips" panel.
Tips! Octoparse provides built-in features and tools for applying regular expressions.
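As an illustration of the clean-up step, the sketch below applies two regular expressions to a piece of extracted outer HTML: one pulls a star rating out of a class name, the other strips the tags to leave the text. The markup and the `stars-4` class name are invented for this example, and real pages will need their own patterns.

```python
import re

# Hypothetical outer HTML of a rating element.
outer_html = (
    '<span class="rating stars-4">'
    '<i class="icon-star"></i><b> 128 reviews </b>'
    "</span>"
)

# Pull the number out of a class name such as "stars-4".
match = re.search(r"stars-(\d+)", outer_html)
rating = int(match.group(1)) if match else None

# Remove every tag and keep only the remaining text.
text_only = re.sub(r"<[^>]+>", "", outer_html).strip()

print(rating)
print(text_only)
```

Regular expressions work well on small, predictable fragments like this one, but they are brittle on deeply nested or irregular HTML.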
4) Extract attribute value
Attributes sit inside the HTML code and provide additional information about HTML elements. They usually come in name/value pairs like name="value". For example, a star rating is often stored in an attribute. Octoparse can scrape the attribute value directly.
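To show what name="value" pairs look like in practice, here is a small Python sketch that reads the attributes of an element with the standard library's `html.parser`. The element, its `data-rating` and `aria-label` attributes, and the names `AttributeReader` and `read_attributes` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class AttributeReader(HTMLParser):
    """Stores the attributes of the first tag it encounters."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first tag's.
        if self.attributes is None:
            self.attributes = dict(attrs)


def read_attributes(html_fragment):
    parser = AttributeReader()
    parser.feed(html_fragment)
    parser.close()
    return parser.attributes or {}


# Hypothetical element whose rating lives in an attribute, not in its text.
sample = '<div class="star-rating" data-rating="4.5" aria-label="4.5 out of 5 stars"></div>'
attributes = read_attributes(sample)
print(attributes["data-rating"])
```

The element has no visible text at all, which is why extracting the attribute value is the only way to get the rating here.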
1. Select the element (here we take the star rating as an example)
2. Extract the text or HTML of the element
3. Hover over the field name and an icon will appear. Click on it, then go to "Customize field" and choose "Extract attribute"
Tips! 1. You can switch to extracting other types of information from the element by using "Customize data field". For example, you may have chosen to extract the text, but later want to scrape the HTML code of the element. You can just go to "Customize data field" and select "Extract the outer HTML".
2. All kinds of data are stored in text format when exported to a file.
Author: Brian
Editor: Yina