1. What is text() in XPath?
In XPath, text() (often called a function, though it is technically a node test) selects the text nodes that sit directly inside an element. Normally, when you locate an element (e.g., //address), you get all the text inside that node combined as one string.
However, if the element contains line breaks (<br> tags), each text segment split by <br> is treated as a separate text node. By using text()[1], text()[2], text()[3], etc., you can extract each line individually.
2. When should you use text()?
Use text() when the content you want to scrape is split into multiple text nodes, for example multi-line addresses, product descriptions with line breaks, or text separated by inline tags. It is useful when you want each part in its own field rather than the entire block in one field, or when you only need one specific line of information.
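To make the inline-tag case concrete, here is a minimal Python sketch. The <p> markup is a hypothetical example, not taken from a real page. Python's standard-library ElementTree does not evaluate text() itself, so the sketch reads the same child text nodes through the element's text attribute and each child's tail attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one inline <b> tag splits the paragraph's own text in two.
SNIPPET = "<p>Price: <b>$10</b> per unit</p>"

p = ET.fromstring(SNIPPET)

# Whole element, comparable to selecting //p: text of all descendants joined.
whole = "".join(p.itertext())

# Direct child text nodes, comparable to //p/text():
# the text inside <b> belongs to <b>, so it is not included.
own_text = [p.text] + [child.tail for child in p]
own_text = [t for t in own_text if t]

print(whole)     # Price: $10 per unit
print(own_text)  # ['Price: ', ' per unit']
```

Here the first list item corresponds to text()[1] and the second to text()[2]. The text inside the inline tag is skipped because it is a child of <b>, not of <p>.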
3. How to use text()?
Example: Extracting a Multi-line Address
Imagine you want to scrape the following address from a webpage:
<address> 7700 Irvine Center Dr
<br> Suite 270
<br> Irvine, CA 92618
</address>
If you use //address as the field XPath, Octoparse captures the entire text as one field:
7700 Irvine Center Dr Suite 270 Irvine, CA 92618
This makes it difficult to separate the street, suite, and city information.
Now you can extract each line individually using:
//address/text()[1] → 7700 Irvine Center Dr
//address/text()[2] → Suite 270
//address/text()[3] → Irvine, CA 92618
This way, each part of the address is captured into a separate field, making the scraped data cleaner and easier to use.
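If you want to check this behavior outside Octoparse, the following Python sketch reproduces it with the standard library only. It is an illustration under a few assumptions: the <br> tags are written as <br/> so the XML parser accepts the snippet, the helper name child_text_nodes is hypothetical, and the helper trims whitespace and skips blank nodes, which a raw XPath text() step would not do.

```python
import xml.etree.ElementTree as ET

# Sample address from above; <br> is written as <br/> only so that the
# standard-library XML parser accepts it (an assumption of this sketch).
SNIPPET = """<address> 7700 Irvine Center Dr
<br/> Suite 270
<br/> Irvine, CA 92618
</address>"""


def child_text_nodes(element):
    """Return the non-blank text nodes directly inside element, in order.

    This approximates what element/text() selects in XPath, with
    surrounding whitespace trimmed for readability.
    """
    nodes = [element.text]                      # text before the first child tag
    nodes += [child.tail for child in element]  # text following each child tag
    return [n.strip() for n in nodes if n and n.strip()]


address = ET.fromstring(SNIPPET)
lines = child_text_nodes(address)

# XPath positions start at 1, Python list indexes start at 0,
# so text()[1] corresponds to lines[0], and so on.
street, suite, city = lines[0], lines[1], lines[2]
print(street)
print(suite)
print(city)
```

Each printed value matches one of the text()[n] results listed above.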
In short, text() in XPath helps you target specific text nodes inside an element. Whenever you face multi-line text content separated by <br> or inline tags, text()[n] lets you extract each part individually.
Tip:
To learn more about XPath, you can also refer to this tutorial: