Python - Remove "characters with encodings larger than 3 bytes" with Python 3

I want to remove characters whose encoding is larger than 3 bytes,
because the Amazon Mechanical Turk system requires it when I upload my CSV data:

Your CSV file needs to be UTF-8 encoded and cannot contain characters
with encodings larger than 3 bytes. For example, some non-English
characters are not allowed (learn more).

To overcome this problem,
I want to write a remove_max3bytes function that removes those characters in Python 3.

x = 'below ð\x9f~\x83,'
y = remove_max3bytes(x)  # y=="below ~,"

Then I’ll apply the function before saving it to a UTF-8 encoded CSV file.

This post is related to my issue, but it uses Python 2 and that solution doesn’t work for me.

Thanks!

Solution

None of the characters in your string seems to take 3 bytes or more in UTF-8:

x = 'below ð\x9f~\x83,'

Anyway, the way to remove such characters is to keep only those whose UTF-8 encoding is at most 3 bytes long:

filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) <= 3)

For example, with a string that does contain such a character (the emoji U+1F603 takes 4 bytes in UTF-8):

>>> x = 'abcd\U0001F603efg'
>>> ''.join(char for char in x if len(char.encode('utf-8')) <= 3)
'abcdefg'
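
As a rough sketch of how this could be wrapped into the function the question asks for and applied before writing the CSV: the version below keeps every character whose UTF-8 encoding is at most 3 bytes, which is what the Mechanical Turk message requires. The sample rows and the remove_non_bmp helper are illustrative assumptions, not part of the original post. The second helper relies on the fact that only code points above U+FFFF need 4 bytes in UTF-8.

```python
import csv
import io


def remove_max3bytes(text):
    """Drop every character whose UTF-8 encoding is longer than 3 bytes."""
    return ''.join(ch for ch in text if len(ch.encode('utf-8')) <= 3)


def remove_non_bmp(text):
    """Same filter by code point: only code points above U+FFFF take 4 bytes."""
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


# Hypothetical sample rows; the second data cell contains a 4-byte emoji (U+1F603).
rows = [['id', 'text'], ['1', 'below \U0001F603,']]
cleaned = [[remove_max3bytes(cell) for cell in row] for row in rows]

# Written to an in-memory buffer here; for a real file one would use
# open(path, 'w', encoding='utf-8', newline='') instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(cleaned)
print(buffer.getvalue())
```

Three-byte characters such as most Chinese characters pass through both helpers unchanged, so only text outside the Basic Multilingual Plane (many emoji, for instance) is dropped.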

By the way, you can verify that no character in your original string has an encoding of 3 bytes or more by doing the following:

>>> for char in 'below ð\x9f~\x83,':
...     print(char, [hex(b) for b in char.encode('utf-8')])
...
b ['0x62']
e ['0x65']
l ['0x6c']
o ['0x6f']
w ['0x77']
  ['0x20']
ð ['0xc3', '0xb0']
  ['0xc2', '0x9f']
~ ['0x7e']
  ['0xc2', '0x83']
, ['0x2c']
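
Rather than reading the hex dump by eye, the same check can be automated. The helper below is a hypothetical sketch (the name find_long_chars and its max_bytes parameter are assumptions for illustration) that reports the position and encoded length of every character exceeding the limit.

```python
def find_long_chars(text, max_bytes=3):
    """Return (index, character, byte_length) for each character over max_bytes."""
    return [
        (i, ch, len(ch.encode('utf-8')))
        for i, ch in enumerate(text)
        if len(ch.encode('utf-8')) > max_bytes
    ]


# The string from the question: nothing exceeds 3 bytes, so the list is empty.
print(find_long_chars('below ð\x9f~\x83,'))

# A string with a 4-byte emoji (U+1F603) at index 3.
print(find_long_chars('ok \U0001F603'))
```

An empty result for the question's string is consistent with the dump above, where the longest encoding is 2 bytes.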

EDIT: A wild guess

I believe the OP asked the wrong question, and the real question is whether the characters are printable or not. I’m assuming that any character Python displays as \x<number> is not printable, so this solution should work:

x = 'below ð\x9f~\x83,'
filtered_x = ''.join(char for char in x if not repr(char).startswith("'\\x"))

Result:

'below ð~,'
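
If the goal really is to drop non-printable characters, a possible alternative to inspecting repr() is the built-in str.isprintable() method. This is a sketch of a swapped-in technique, not the approach used above. It gives the same result for this string, and it also catches characters that repr() shows with \u or \U escapes, which the startswith("'\\x") test would let through.

```python
x = 'below ð\x9f~\x83,'

# str.isprintable() is False for control characters such as \x9f and \x83,
# and True for letters, punctuation and the ASCII space.
filtered_x = ''.join(ch for ch in x if ch.isprintable())
print(filtered_x)

# A zero-width space (U+200B) is shown by repr() as '\u200b', so the repr-based
# test keeps it, while isprintable() removes it.
y = 'a\u200bb'
print(''.join(ch for ch in y if ch.isprintable()))
```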
