A web page is an HTML document. An HTML tag is a piece of the markup language used to indicate the beginning and end of a web element in an HTML document.
To select the correct HTML tag, let's look at the tags we usually encounter in a task. Knowing what each tag means helps us decide which tag to select in different cases.
A: defines a hyperlink, which opens a new page when clicked
P: defines a paragraph when organizing text content
DIV: defines a block or section that separates different areas of the page
LI: defines a list item
IMG: defines an image element of the page
TABLE: defines an HTML table element
TR: defines a row in an HTML table
TD: defines a standard data cell in an HTML table
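To see how these tags appear in an actual page, here is a minimal sketch using Python's built-in html.parser module. The page fragment SAMPLE_HTML and the TagLister class are made up for illustration and are not part of Octoparse; the sketch only lists which of the common tags occur in the fragment and what each one means.

```python
from html.parser import HTMLParser

# Short descriptions of the common tags (standard HTML meanings).
DESCRIPTIONS = {
    "a": "hyperlink",
    "p": "paragraph",
    "div": "block that segments an area of the page",
    "li": "list item",
    "img": "image",
    "table": "table",
    "tr": "table row",
    "td": "table data cell",
}

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<div>
  <p>Featured item</p>
  <ul><li><a href="/item/1"><img src="/img/1.png"></a></li></ul>
  <table><tr><td>Price</td><td>9.99</td></tr></table>
</div>
"""


class TagLister(HTMLParser):
    """Collects every opening tag that has an entry in DESCRIPTIONS."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag in DESCRIPTIONS:
            self.found.append((tag.upper(), DESCRIPTIONS[tag]))


lister = TagLister()
lister.feed(SAMPLE_HTML)
for name, meaning in lister.found:
    print(f"{name}: {meaning}")
```

The UL tag in the fragment is skipped only because it has no entry in DESCRIPTIONS; it is still a valid tag.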
Octoparse shows different options on the Tips depending on which tag is located. At the bottom of the Tips, you can see an HTML path; the last tag in the path is the one currently located.
If the tag currently located is not the one you want, you can click on another tag in the path.
If you cannot find the correct one on the current path, you can also click the > to open it and find more tags inside.
There is also an Expand selection area button that helps you expand the selected area. If your target area is hard to select directly, you can select part of it first and keep clicking the Expand selection area button until the whole target area is selected.
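The idea of an HTML path can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PathFinder class, the VOID_TAGS set, the id "pic", and the sample markup are all hypothetical, and the sketch assumes that expanding a selection corresponds to moving one step up to the parent element. It does not show how Octoparse works internally.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}

# Hypothetical markup, invented for this example.
SAMPLE = (
    '<html><body><div><ul><li><a href="/item/1">'
    '<img id="pic" src="/img/1.png"></a></li></ul></div></body></html>'
)


class PathFinder(HTMLParser):
    """Records the tag path from the root to the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.stack = []   # tags that are currently open
        self.path = None  # filled in once the target element is found

    def handle_starttag(self, tag, attrs):
        current = self.stack + [tag]
        if self.path is None and dict(attrs).get("id") == self.target_id:
            self.path = current
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the matching open tag and anything left open inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


finder = PathFinder("pic")
finder.feed(SAMPLE)

path = finder.path
print(" > ".join(t.upper() for t in path))      # HTML > BODY > DIV > UL > LI > A > IMG

# Assumption: expanding the selection moves to the parent element.
expanded = path[:-1]
print(" > ".join(t.upper() for t in expanded))  # HTML > BODY > DIV > UL > LI > A
```

The last entry of the path plays the role of the tag currently located; dropping it leaves the path ending at the enclosing A tag.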
Let's look at some elements as examples:
1. Image extraction
If you want to scrape an image URL, you need to locate the IMG tag, because this tag contains the image URL.
Click on the image and you will see that the IMG tag is the last one in the path, which means you have located the correct tag.
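The reason the IMG tag matters is that the image URL is stored in its src attribute. The following is a minimal sketch with Python's built-in html.parser; the ImageSrcCollector class, the markup, and the example.com URLs are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup: an image wrapped in a link, inside a block.
SAMPLE = (
    '<div><a href="/product/42">'
    '<img src="https://example.com/images/42.jpg" alt="Product">'
    '</a></div>'
)


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every IMG tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Only the IMG tag carries the image URL; DIV and A do not.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


collector = ImageSrcCollector()
collector.feed(SAMPLE)
print(collector.sources)
```

Selecting the surrounding DIV or A tag instead would give access to the link or the block, but not to the src value directly.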
2. Link extraction
To get the link of an element, you need to make sure the located element contains the URL. Usually, the A tag contains the URL you want.
The option Extract URL of the selected link appears only when you click on the A tag.
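The same principle applies to links: the URL lives in the href attribute of the A tag, not in the text or SPAN elements nested inside it. Here is a minimal sketch, again with Python's standard library; the LinkCollector class, the markup, and the base URL https://example.com/shop/ are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup: the visible text sits in a SPAN inside the A tag.
SAMPLE = '<li><a href="/product/42"><span>Blue mug</span></a></li>'


class LinkCollector(HTMLParser):
    """Collects the href of every A tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The SPAN has no href, so only the A tag yields a URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn a relative link into an absolute URL.
                self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("https://example.com/shop/")
collector.feed(SAMPLE)
print(collector.links)
```

If the SPAN were the element located, there would be no URL to extract, which is why moving up the path to the A tag is needed.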