Get only direct child elements in Beautiful Soup
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
....
<b>
A tale
</b>
</p>
</body>
I need to get all the direct children of the <body> tag, but not its grandchildren. So in this case, the output should contain only <p class="title"> and <p class="story">.
The closest method I found returns both the tags and all of their subtags. How can I get it right?
Solution
First, you can use find_all(recursive=False) to get the subtags. recursive=False restricts the search to the direct children of the tag. After that, the only thing I did was format the data as a string.
I added more attributes to the tags to show that it also handles several attributes and multiple class values.
html = '''
<body>
<p class="title" id="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story stories">
....
<b>
A tale
</b>
</p>
</body>
'''
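from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# recursive=False limits the search to the direct children of <body>
for tag in soup.body.find_all(recursive=False):
    # Multi-valued attributes such as class come back as lists, so join them
    attributes = ' '.join('{}="{}"'.format(
        key,
        ' '.join(value) if isinstance(value, list) else value
    ) for key, value in tag.attrs.items())
    tag_string = '<{} {}>'.format(tag.name, attributes)
    print(tag_string)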
Output:
<p class="title" id="title">
<p class="story stories">
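To see what recursive=False changes, here is a minimal sketch that runs the same search with and without it on a compact version of the question's HTML. It uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra dependency; that swap and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Compact version of the HTML from the question (whitespace removed)
html = ('<body><p class="title"><b>The Dormouse\'s story</b></p>'
        '<p class="story">....<b>A tale</b></p></body>')
soup = BeautifulSoup(html, 'html.parser')

# Default behaviour: the search descends through every level below <body>
all_names = [tag.name for tag in soup.body.find_all()]

# recursive=False: only the direct children of <body> are considered
direct_names = [tag.name for tag in soup.body.find_all(recursive=False)]

print(all_names)     # ['p', 'b', 'p', 'b']
print(direct_names)  # ['p', 'p']
```

The nested <b> tags show up only in the first list, which is the behaviour the question was trying to avoid.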
The reason I use ' '.join(value) if isinstance(value, list) else value instead of using value directly is that Beautiful Soup returns the class attribute as a list.
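A small sketch of that difference, assuming the built-in 'html.parser' and a hypothetical helper named attr_to_text that is not part of the original answer:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story stories" id="title"></p>', 'html.parser')
attrs = soup.p.attrs

# class is a multi-valued attribute, so its value is a list of strings
print(attrs['class'])  # ['story', 'stories']
# id is single-valued, so its value is a plain string
print(attrs['id'])     # title


def attr_to_text(value):
    """Hypothetical helper: turn an attribute value back into its HTML text."""
    return ' '.join(value) if isinstance(value, list) else value


print(attr_to_text(attrs['class']))  # story stories
print(attr_to_text(attrs['id']))     # title
```

Without the isinstance check, formatting the class value directly would print the Python list representation rather than the space-separated text found in the HTML.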