Python – Get only direct child elements in BeautifulSoup

<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
....
<b>
A tale
</b>
</p>
</body>

I need to get all the direct children of the <body> tag, but not the grandchildren. So in this case, it should only output <p class="title"> and <p class="story">.

The closest method I found outputs both the tags and all their subtags. How can I get it right?

Solution

First, use find_all(recursive=False) to get the child tags. Passing recursive=False restricts the search to the direct children of the tag it is called on. After that, the only thing I did was format the data as a string.
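To make the effect of the recursive argument concrete, here is a minimal sketch that compares the default search with recursive=False on the question's markup. It assumes the bs4 package is installed and uses the standard-library 'html.parser' rather than 'lxml' only so that it runs without an extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">....<b>A tale</b></p>
</body>
"""

# Assumption: 'html.parser' (standard library) is swapped in for 'lxml'
# so the sketch needs no extra install.
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): every descendant tag of <body> is returned.
all_descendants = [tag.name for tag in soup.body.find_all()]

# recursive=False: only the tags directly under <body> are returned.
direct_children = [tag.name for tag in soup.body.find_all(recursive=False)]

print(all_descendants)  # ['p', 'b', 'p', 'b']
print(direct_children)  # ['p', 'p']
```

The nested <b> tags show up only in the default search, which is the behaviour the question is trying to avoid.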

I added more attributes to the tags to show that it works in all cases.

from bs4 import BeautifulSoup

html = '''
<body>
  <p class="title" id="title">
    <b>
      The Dormouse's story
    </b>
  </p>
  <p class="story stories">
    ....
    <b>
      A tale
    </b>
  </p>  
</body>
'''

soup = BeautifulSoup(html, 'lxml')

for tag in soup.body.find_all(recursive=False):
    # Rebuild each attribute as key="value"; list values (e.g. class) are joined with spaces
    attributes = ' '.join('{}="{}"'.format(
        key,
        ' '.join(value) if isinstance(value, list) else value
    ) for key, value in tag.attrs.items())

    tag_string = '<{} {}>'.format(tag.name, attributes)
    print(tag_string)

Output:

<p class="title" id="title">
<p class="story stories">

The reason I use ' '.join(value) if isinstance(value, list) else value instead of using value directly is that BeautifulSoup returns the value of class as a list.
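A short sketch of why the list check matters: BeautifulSoup treats class as a multi-valued attribute and returns a list, while id comes back as a plain string. The helper name attr_to_text and the id value "intro" are hypothetical, added only for illustration, and 'html.parser' is again assumed so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Assumption: 'html.parser' is used instead of 'lxml' to avoid an extra install.
soup = BeautifulSoup('<p class="story stories" id="intro">text</p>', "html.parser")
p = soup.p

class_value = p["class"]  # multi-valued attribute: a list of strings
id_value = p["id"]        # single-valued attribute: a plain string


def attr_to_text(value):
    """Hypothetical helper: join list values with spaces, pass strings through."""
    return " ".join(value) if isinstance(value, list) else value


# Formatting the list directly would embed its Python repr in the output.
naive = 'class="{}"'.format(class_value)
fixed = 'class="{}"'.format(attr_to_text(class_value))

print(naive)  # class="['story', 'stories']"
print(fixed)  # class="story stories"
```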
