XPath is empty
I started using XPath in Python 3 and ran into this behavior, which seems wrong to me. Why does the query match the text of a span inside h3 but not the text of a p inside h3?
>>> from lxml import etree
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]
>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']
Thank you very much!
Solution
Your first XPath correctly returns no results because <h3> does not contain any text nodes in the corresponding tree. You can use tostring() to see the actual contents of the tree:
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'
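The same check can be wrapped in a small helper that parses a snippet and returns both the serialized tree and the XPath result. This is only a sketch: describe() is a hypothetical name, not part of lxml, and the exact tree produced for the p case may depend on the libxml2 version in use.

```python
from lxml import etree

def describe(markup, query="//h3//text()"):
    """Hypothetical helper: parse with lxml's HTML parser, return (tree as str, XPath result)."""
    tree = etree.HTML(markup)
    # encoding="unicode" makes tostring() return a str instead of bytes
    serialized = etree.tostring(tree, encoding="unicode")
    return serialized, tree.xpath(query)

# Compare how the parser treats a p and a span inside h3
for markup in ("<h3><p>Hallo</p></h3>", "<h3><span>Hallo</span></h3>"):
    serialized, texts = describe(markup)
    print(markup, "->", serialized, texts)
```

Printing the serialized tree next to the query result makes it easy to see whether the text node actually ended up under h3.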
The parser probably did this (turned h3 into an empty element) because it considers a paragraph inside a heading tag invalid, while a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?
To keep the p element inside h3, you can try a different parser, for example BeautifulSoup's parser:
>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
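If the markup happens to be well-formed XML, another option (not part of the original answer, shown here only as a sketch) is lxml's XML parser, which does not apply HTML nesting rules and so leaves p inside h3. This assumes the input is well-formed; real-world HTML often is not, and the XML parser will raise an error on it.

```python
from lxml import etree

# etree.fromstring() uses the XML parser, so no HTML content-model fixes are applied
root = etree.fromstring("<h3><p>Hallo</p></h3>")

# The same query as in the question now finds the text node under h3
texts = root.xpath("//h3//text()")
print(texts)  # ['Hallo']
```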