Python – How to get reasonable results from len(), str.format(), and zero-width spaces?


I’m trying to format text as a kind of table and write the result to a file, but the alignment breaks because my source text sometimes contains the Unicode character ZERO WIDTH SPACE (\u200b in Python).
Consider the following code example:

str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
format_str = "|{string:<{width}}| output of len(): {length}\n"

max_width = 0
for item in str_list:
    if len(item) > max_width:
        max_width = len(item)

with open("tmp", mode='w', encoding="utf-8") as file:
    for item in str_list:
        file.write(format_str.format(string=item,
                                     width=max_width,
                                     length=len(item)))

The contents of “tmp” after running the above script:

|a​​           | output of len(): 3
|b             | output of len(): 1
|longest entry​| output of len(): 14

So it looks like len() doesn’t return the “print width” of the string, and str.format() doesn’t know how to handle zero-width characters.

Or the behavior is intentional and I need to do something else.

To be clear, I’m looking for a way to get results like this:

|a​​            | output of len(): 1
|b            | output of len(): 1
|longest entry​| output of len(): 13

I would prefer to do this without altering my source strings.

Solution

The wcwidth package has a function wcswidth() that returns the width of a string in character cells:

from wcwidth import wcswidth

length = len('sneaky\u200bPete')      # 11
width = wcswidth('sneaky\u200bPete')  # 10
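
To see where the alignment error comes from, here is a small stdlib-only sketch showing that str.format() pads to a number of code points rather than display cells. The target width of 5 and the variable names are chosen only for illustration.

```python
# "a" followed by two zero-width spaces: 3 code points, 1 visible cell.
s = "a\u200b\u200b"

# Ask for a field 5 wide; str.format() counts code points, like len().
padded = "{:<5}".format(s)

# Only 2 spaces are appended, because 3 "characters" are already present.
spaces_added = len(padded) - len(s)

print(len(padded))    # 5
print(spaces_added)   # 2
```

The field therefore occupies 1 + 2 = 3 visible cells instead of 5. The shortfall equals the number of zero-width characters, which is the quantity len(s) - wcswidth(s).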

The difference between wcswidth(s) and len(s) can then be used to correct the error introduced by str.format(). Modifying the code above:

from wcwidth import wcswidth

str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
format_str = "|{s:<{fmt_width}}| width: {width}, error: {fmt_error}\n"

max_width = max(wcswidth(s) for s in str_list)

with open("tmp", mode='w', encoding="utf-8") as file:
    for s in str_list:
        width = wcswidth(s)
        fmt_error = len(s) - width
        fmt_width = max_width + fmt_error
        file.write(format_str.format(s=s,
                                     fmt_width=fmt_width,
                                     width=width,
                                     fmt_error=fmt_error))

… produces this output:

|a​​            | width: 1, error: 2
|b            | width: 1, error: 0
|longest entry​| width: 13, error: 1

It also produces correct output for strings that contain double-width characters:

str_list = ("a\u200b\u200b", "b", "㓵", "longest entry\u200b")

|a​​            | width: 1, error: 2
|b            | width: 1, error: 0
|㓵           | width: 2, error: -1
|longest entry​| width: 13, error: 1
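
If installing wcwidth is not an option, a rough approximation can be built from the standard library's unicodedata module. This is only a sketch under stated assumptions: the helper names approx_width and ljust_cells are hypothetical, it treats the categories Mn, Me and Cf as zero-width and East Asian Wide/Fullwidth characters as two cells, and it will not match wcswidth() in every case (control characters, for example, are counted as one cell here).

```python
import unicodedata


def approx_width(s: str) -> int:
    """Approximate the display width of s in terminal cells (stdlib only)."""
    width = 0
    for ch in s:
        # Combining marks and format characters (U+200B is category Cf)
        # are assumed to occupy no cell.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide and Fullwidth characters are assumed to take 2 cells.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def ljust_cells(s: str, cells: int) -> str:
    """Left-justify s to `cells` display cells by appending spaces."""
    return s + " " * max(0, cells - approx_width(s))


rows = ("a\u200b\u200b", "b", "㓵", "longest entry\u200b")
target = max(approx_width(r) for r in rows)

for r in rows:
    print("|{}| width: {}".format(ljust_cells(r, target), approx_width(r)))
```

Padding by hand like this avoids passing an adjusted width to str.format(), at the cost of depending on the approximation above rather than on wcwidth's tables.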
