Python – How to encode/decode this BeautifulSoup string in Python in order to output non-standard Latin characters?

I’m scraping a page with Beautiful Soup, and the non-standard Latin characters in the output are displayed as hexadecimal escapes.

I’m scraping https://www.archchinese.com, which contains pinyin words that use non-standard Latin characters such as ǎ and ā. I’ve been trying to loop through a series of links containing pinyin, using BeautifulSoup’s .string attribute together with UTF-8 encoding to output these words. Each word is printed with hexadecimal escapes where the non-standard characters should be: the word “hǎo” comes out as “h\xc7\x8eo”. I’m sure I’m doing something wrong with the encoding, but I don’t know what to fix. I first tried decoding with UTF-8, but got an error saying that the element has no decode method. Printing the strings without encoding them gives me an error about characters not being defined, which I think is because they need to be encoded into something first.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re

url = "https://www.archchinese.com/"

driver = webdriver.Chrome()  # Set up Selenium to open the page with Chrome.
driver.implicitly_wait(30)
driver.get(url)

driver.find_element_by_id('dictSearch').send_keys('好') # This character is hǎo.

python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click() # Look for submit button and click it.

soup=BeautifulSoup(driver.page_source, 'lxml')

div = soup.find(id='charDef') # Find div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print(a.string.encode('utf-8'))  # Loop through all pinyin links and attempt to encode each one.

Actual Results:
b'h\xc7\x8eo'
b'h\xc3\xa0o'

Expected Results:
hǎo
hào

EDIT: The problem seems to be related to a UnicodeEncodeError on Windows. I tried installing win-unicode-console, but it didn’t work. Thanks to snakecharmerb for the information.

Solution

You don’t need to encode the value when printing: the print function handles the encoding automatically. As written, your code prints the representation of the bytes that make up the encoded value, not the string itself.

>>> s = 'hǎo'
>>> print(s)
hǎo

>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'
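
The round trip between text and bytes can be sketched as follows. This is a minimal illustration, not part of the original answer. It assumes Python 3.7 or later and uses a plain str in place of the value returned by .string (which, in Beautiful Soup, is a str subclass and therefore has encode but no decode). The call to sys.stdout.reconfigure is one possible workaround for a console whose default encoding cannot represent characters such as ǎ; whether it resolves the Windows UnicodeEncodeError mentioned in the question depends on the terminal and is an assumption here.

```python
import sys

# Assumption: Python 3.7+, where text streams expose reconfigure().
# Forcing UTF-8 output is one possible workaround for consoles whose
# default encoding cannot represent "ǎ"; it may not be needed on your system.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

text = "hǎo"                         # a str, standing in for a.string
encoded = text.encode("utf-8")       # bytes: the UTF-8 encoding of the text
decoded = encoded.decode("utf-8")    # decoding the bytes restores the str

print(text)             # prints the readable word
print(repr(encoded))    # prints the escaped byte representation
print(decoded == text)  # the round trip is lossless
```

In the scraping loop from the question, this corresponds to calling print(a.string) directly and dropping the .encode('utf-8') call.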
