Python – XPath for p inside h3 returns an empty result

I started using XPath in Python 3 and ran into the behavior below, which looks wrong to me. Why does //h3//text() match the text inside a span but not the text inside a p within an h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thank you very much!

Solution

Your first XPath correctly returns no results because <h3> does not contain any text nodes in the corresponding tree. You can use tostring() to see the actual contents of the tree:

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'

The parser probably did this (turned h3 into an empty element and moved the p next to it) because it considers a paragraph inside a heading tag invalid, whereas a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?

To keep the p element inside h3, you can try a different parser, e.g. BeautifulSoup's parser (this requires the BeautifulSoup package to be installed):

>>> from lxml import etree
>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
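
If the markup happens to be well-formed XML, as the snippet in the question is, another option is to parse it with an XML parser, which applies no HTML nesting rules and therefore leaves the p inside the h3. This is not part of the original answer. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without extra packages; the helper name heading_texts is hypothetical. lxml's etree.fromstring returns elements with the same iter() and itertext() methods, so the same approach should carry over to lxml.

```python
import xml.etree.ElementTree as ET


def heading_texts(markup, tag="h3"):
    """Return the text inside each <tag> element, nested children included.

    Hypothetical helper for illustration; assumes `markup` is well-formed XML.
    """
    # An XML parser has no notion of HTML content models,
    # so <p> stays nested inside <h3> exactly as written.
    root = ET.fromstring(markup)
    # iter() walks the whole tree and also yields the root when it matches.
    return ["".join(el.itertext()) for el in root.iter(tag)]


print(heading_texts("<h3><p>Hallo</p></h3>"))
print(heading_texts("<h3><span>Hallo</span></h3>"))
```

Both calls print ['Hallo']. Real-world HTML is often not well-formed XML (unclosed tags, unquoted attributes), and ET.fromstring raises a ParseError in that case, so this only suits input you know to be clean.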
