I need to extract data where the XPath is not constant, and the only way to extract it consistently is to use regular expressions to find where the field is and then extract the data.
I cannot use regular expressions directly on the <tr> tag that I am extracting data from, because the only place that states the field type is the <tr> above my data. Since a given field will not always be in the same place, this seems to be the best approach I can come up with. What should I do?
-
For future reference: because the titles of these items stay the same even when their positions change, you can anchor the XPath on the title text, using the method introduced in this tutorial:
How to associate data with nearby text?
To learn more about XPath, this tutorial can be very helpful: What is XPath and how to use it in Octoparse
In the attached task, I have already revised several data fields. If you need to add more, you can use the XPath patterns below.
When the title is a single word, replace the quoted word ('Type' in this example) with your title:
//*[contains(text(),'Type')]/../following-sibling::tr[1]//td[1]/div[1]
When the title is two words, replace the quoted words ('Publisher' and 'Location' in this example), using one contains() condition per word:
//*[contains(text(),'Publisher') and contains(text(),'Location')]/../following-sibling::tr[1]//td[1]/div[1]
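The same pattern extends to longer titles: add one more contains(text(),'...') condition, joined with "and", for each additional word.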
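If you have many fields to add, the pattern above can be generated rather than typed by hand. The following is only a minimal sketch in Python, not part of Octoparse: the function name build_title_xpath is hypothetical, and it assumes the title words contain no single quotes and that the data always sits in the first div of the first td of the row directly below the title row, as in the two XPaths above.

```python
def build_title_xpath(title: str) -> str:
    """Build an XPath for the data row directly below the row holding `title`.

    Hypothetical helper: one contains(text(), ...) condition is emitted per
    whitespace-separated word of the title, joined with "and".
    """
    words = title.split()
    if not words:
        raise ValueError("title must contain at least one word")
    if any("'" in word for word in words):
        # Quoting rules for apostrophes are left out of this sketch.
        raise ValueError("single quotes in the title are not handled here")

    conditions = " and ".join(f"contains(text(),'{word}')" for word in words)
    # Same tail as the XPaths in the answer: go up to the title's row,
    # step to the next row, then take the first div in the first cell.
    return f"//*[{conditions}]/../following-sibling::tr[1]//td[1]/div[1]"


if __name__ == "__main__":
    print(build_title_xpath("Type"))
    print(build_title_xpath("Publisher Location"))
```

Running it prints the same two expressions given above for 'Type' and for 'Publisher' plus 'Location'; the resulting string can then be pasted into the XPath field of the task.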