Python – Extract text from HTML tags and plain text (not included in tags)

<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a> 
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a> 
from one's 
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>

I’m trying to reconstruct the sentence “to pay charges from one’s bank account”, which is split across the HTML above. My problem is that part of the sentence is not wrapped in any tag. When I use

BeautifulSoup.find_all()

I only get the text inside the link tags, and when I use

BeautifulSoup.contents

I only get “from one’s” and not the text inside the link tags.

Is there a way to traverse this HTML and reconstruct the sentence?

Edit:
The HTML above is just an example. I’m scraping a dictionary, so the order of the strings, and which parts fall inside or outside the tags, will be arbitrary.

Solution

from bs4 import BeautifulSoup

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" title="to payの意味" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" title="chargesの意味" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" title="bank accountの意味" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the default-parser warning

print(soup.text)
# to pay
# charges
# from one's
# bank account

# Collapse newlines and any surrounding whitespace into single spaces
print(' '.join(soup.text.split()))
# to pay charges from one's bank account
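
Since the edit to the question says the mix of linked and unlinked text is arbitrary, it may help to limit extraction to the one paragraph and to keep track of which fragments were links. The following is only a minimal sketch: it assumes the target is the `<p class="qotCJE">` element from the example (the real page may use other class names), and the `title` attributes are omitted for brevity.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="qotCJE">
<a href="https://ejje.weblio.jp/content/to+pay" class="crosslink">to pay</a>
<a href="https://ejje.weblio.jp/content/charges" class="crosslink">charges</a>
from one's
<a href="https://ejje.weblio.jp/content/bank+account" class="crosslink">bank account</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: the sentence lives in the paragraph with this class.
p = soup.find("p", class_="qotCJE")

# get_text() with a separator and strip=True joins every text fragment,
# linked or not, in document order and skips whitespace-only pieces.
sentence = p.get_text(" ", strip=True)
print(sentence)

# Walk the direct children to record which fragments were links.
parts = []
for child in p.children:
    if isinstance(child, Tag):
        kind, text = "link", child.get_text(strip=True)
    else:  # a bare string sitting between the tags
        kind, text = "plain", child.strip()
    if text:
        parts.append((kind, text))
print(parts)
```

Here `sentence` holds the whole sentence as one string, while `parts` keeps each fragment labelled as `"link"` or `"plain"` in case the linked words need separate handling.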
