In this tutorial, we will show you how to extract text, URLs, image URLs, HTML, and other attribute values.

  1. Extract Text

  2. Extract the URL (of a link or an image)

  3. Extract the inner/outer HTML

  4. Extract other attribute value


1. Extract Text

Click on your target data, then select "Extract text of the selected element" from the tips panel.

45645.gif
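If it helps to see what "the text of an element" means at the HTML level, here is a minimal Python sketch using only the standard library. It is an illustration only, not how Octoparse works internally; the `SAMPLE` snippet and the `TextCollector` and `extract_text` names are made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text, with surrounding whitespace trimmed.
        if data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text of the fragment, pieces joined by single spaces."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical product snippet, not taken from any real page.
SAMPLE = '<div class="item"><h2>Blue Kettle</h2><span class="price">$24.99</span></div>'

print(extract_text(SAMPLE))  # prints: Blue Kettle $24.99
```

The tags and their attributes are dropped and only the text between them is kept.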

2. Extract the URL (of a link or an image)

A URL is the web address that a hyperlink points to. With a single click on a link, you can open a new web page or go to a new website, just like what happens when you click on the title of a book on Amazon.

Besides web pages, a URL can also point to a specific file on the Internet, such as an image or a PDF document. Once you have the URL, you can use it to download the corresponding file or image.

2.1 Extract the URL of a link

Click on your target data, then select "Extract the URL of the selected element" from the tips panel.

39.gif

TIP: When you select an item with a URL, the selected tag at the bottom of "Tips" should be "A", which stands for anchor, the HTML element that usually links one page to another. Please make sure you select the right area.

14.png
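To see why the "A" tag matters, the sketch below shows where a link's URL lives in the HTML: in the `href` attribute of the anchor. This is a minimal Python illustration under stated assumptions, not Octoparse's own method; the sample HTML, the `example.com` base URL, and the `LinkCollector` and `extract_links` names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase, so "A" arrives as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links such as "/books/42" become absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical listing snippet and page address.
SAMPLE = (
    '<li><a href="/books/42">A Sample Book</a></li>'
    '<li><a href="https://example.com/about">About</a></li>'
)

for link in extract_links(SAMPLE, "https://example.com/catalog"):
    print(link)
```

If the selected element is not an anchor (for example, a `span` inside the link), it has no `href` of its own, which is why selecting the right area matters.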

2.2 Extract the image URL

Click on your target data, then select "Extract the URL of the selected image" from the tips panel.

97.gif

FAQ: Can I use Octoparse to directly get an image, not its URL, from the web page?

A: Yes! With the brand new scrape and download feature introduced in version 8.5.4, you can now download the image directly while scraping.


3. Extract the inner/outer HTML

Unlike text and URLs, some data, such as icons, cannot be extracted directly. If you want to extract visual, non-text content like a star rating, you have to extract the inner/outer HTML of that content.

Besides icons, you can also scrape hidden text, charts, and graphs from a web page by extracting the HTML of these elements first. After getting the HTML code, you need to apply regular expressions to clean up the data.
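As a rough idea of what that clean-up step can look like, here is a minimal Python sketch that pulls a star rating out of extracted HTML with a regular expression. The HTML snippet, the `title` wording, and the `rating_from_html` name are assumptions made for this example; the pattern you need depends on the HTML of your target page.

```python
import re

# Hypothetical extracted HTML of a star-rating icon.
SAMPLE = '<i class="icon-star" title="4.5 out of 5 stars"></i>'

# Capture a whole or decimal number that appears before "out of 5 stars".
RATING_PATTERN = re.compile(r'title="(\d+(?:\.\d+)?) out of 5 stars"')


def rating_from_html(html):
    """Return the rating as a float, or None if the pattern is not found."""
    match = RATING_PATTERN.search(html)
    return float(match.group(1)) if match else None


print(rating_from_html(SAMPLE))  # prints: 4.5
```

Returning `None` when nothing matches makes it easy to spot rows where the page used a different layout.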

To extract the inner/outer HTML, click on your target data, then select "Extract the inner/outer HTML of the selected element" from the tips panel.

6666.gif
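The difference between the two options is that outer HTML includes the selected element's own tag, while inner HTML contains only what is inside it. The Python sketch below illustrates this on a made-up, well-formed snippet. It uses the standard library's XML parser for brevity, which is an assumption of this example: real-world HTML is often not well-formed and would need an HTML-aware parser. The `outer_html` and `inner_html` helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def outer_html(element):
    """The element itself, including its own opening and closing tags."""
    return ET.tostring(element, encoding="unicode")


def inner_html(element):
    """Only the content between the element's opening and closing tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() on a child also includes the text that follows it.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical, well-formed rating snippet.
SAMPLE = '<div class="rating"><b>4.5</b> out of 5</div>'
root = ET.fromstring(SAMPLE)

print(outer_html(root))  # prints: <div class="rating"><b>4.5</b> out of 5</div>
print(inner_html(root))  # prints: <b>4.5</b> out of 5
```

Choose outer HTML when the data you need sits in the selected element's own attributes, and inner HTML when it sits in the nested content.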

TIP: To refine the extracted inner/outer HTML into useful data, you might want to check out these tutorials -


4. Extract other attribute value

Attributes sit inside the HTML tags and provide additional information about HTML elements. They usually come in name/value pairs like name="value". For example, a star rating is often stored in an attribute. Octoparse can help to scrape the value directly.
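To make the name/value idea concrete, here is a minimal Python sketch that lists the attributes of an element and reads one value. It is an illustration only, not Octoparse's implementation; the sample HTML, the `data-rating` attribute, and the `AttributeCollector` and `extract_attributes` names are made up for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of every occurrence of one tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            # attrs is a list of (name, value) pairs; store it as a dict.
            self.attributes.append(dict(attrs))


def extract_attributes(html, target_tag):
    parser = AttributeCollector(target_tag)
    parser.feed(html)
    parser.close()
    return parser.attributes


# Hypothetical star-rating element with three attributes.
SAMPLE = '<span class="stars" data-rating="4.5" aria-label="4.5 out of 5 stars"></span>'

attrs = extract_attributes(SAMPLE, "span")[0]
for name, value in attrs.items():
    print(f'{name}="{value}"')

print(attrs["data-rating"])  # prints: 4.5
```

Each printed line is one name/value pair; picking a target attribute amounts to choosing one of these names.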

Click on the target element (here we take the star rating as an example) and select "Extract the text or HTML of the element".

1879.gif

Go to the Data Preview section, hover over the field name, click on the "..." (more) button, select "Customize field", then choose your target attribute under "Extract attribute".

17777.png