Python - Parsing JSON containing "\u as UTF-8 bytes"

I have a JSON file from Facebook’s “Download Your Data” feature that, instead of escaping Unicode characters by their code point number, escapes them as UTF-8 byte sequences.

For example, the letter á (U+00E1) is escaped in a JSON file as \u00c3\u00a1 instead of \u00e1. 0xC3 0xA1 is the UTF-8 encoding of U+00E1.

The json library in Python 3 decodes it into Ã¡, which corresponds to U+00C3 and U+00A1.

Is there a way to properly parse such a file in Python (so that I get the letter á)?
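
For reference, the behaviour described above can be reproduced with a few lines. This is only a minimal sketch: the key name "letter" is made up for illustration and is not taken from the Facebook export.

```python
import json

# Raw string: the backslashes reach the JSON parser exactly as they appear in the file.
raw = r'{"letter": "\u00c3\u00a1"}'  # "letter" is a hypothetical key
parsed = json.loads(raw)
letter = parsed["letter"]

# Two separate code points come back instead of the single letter U+00E1.
print(len(letter), [hex(ord(c)) for c in letter])  # 2 ['0xc3', '0xa1']
```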

Solution

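They seem to encode the Unicode strings into bytes with UTF-8 and then write each byte into the JSON as if it were a code point. That is incorrect behavior on their side. Because every byte ends up as a code point in the range U+0000 to U+00FF, encoding the parsed string as latin1 gives back the original bytes, which can then be decoded as UTF-8.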

Python 3 example:

>>> '\u00c3\u00a1'.encode('latin1').decode('utf-8')
'á'
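
To see why this round trip works, the sketch below builds the broken string step by step and then reverses it. The variable names are illustrative only.

```python
original = '\u00e1'                     # the letter á
utf8_bytes = original.encode('utf-8')   # two bytes: 0xC3 0xA1

# Treating each byte as a code point (which is what latin1 decoding does)
# produces the two-character string that the json library returns.
mojibake = utf8_bytes.decode('latin1')

# latin1 encoding maps U+0000..U+00FF back to single bytes,
# so the original UTF-8 bytes are recovered and can be decoded properly.
repaired = mojibake.encode('latin1').decode('utf-8')

print(utf8_bytes, [hex(ord(c)) for c in mojibake], repaired == original)
```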

You need to parse the JSON and then walk the entire data structure to fix every string:

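import json

def visit_list(l):
    # Repair every element of a list.
    return [visit(item) for item in l]

def visit_dict(d):
    # Repair both the keys and the values of a dict.
    return {visit(k): visit(v) for k, v in d.items()}

def visit_str(s):
    # Recover the raw bytes via latin1, then decode them as UTF-8.
    return s.encode('latin1').decode('utf-8')

def visit(node):
    # Dispatch on the exact type; numbers, booleans and None pass through unchanged.
    funcs = {
        list: visit_list,
        dict: visit_dict,
        str: visit_str,
    }
    func = funcs.get(type(node))
    if func:
        return func(node)
    else:
        return node

# Raw string, so the \u escapes are handled by the JSON parser rather than by Python.
incorrect = r'{"foo": ["\u00c3\u00a1", 123, true]}'
correct_obj = visit(json.loads(incorrect))
# correct_obj == {'foo': ['á', 123, True]}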
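
Note that encoding as latin1 raises UnicodeEncodeError for any character above U+00FF, and decoding as UTF-8 raises UnicodeDecodeError if the bytes are not valid UTF-8. If the file might mix broken and correct strings, a more tolerant variant could leave such strings untouched. The following is a sketch under that assumption, not part of the original answer; the names fix_str and fix and the sample keys are hypothetical.

```python
import json

def fix_str(s):
    """Undo the byte-as-code-point escaping when possible, else return s unchanged."""
    try:
        return s.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character above U+00FF, or bytes that are not valid UTF-8.
        return s

def fix(node):
    # Recurse through lists and dicts; anything that is not a string passes through.
    if isinstance(node, str):
        return fix_str(node)
    if isinstance(node, list):
        return [fix(item) for item in node]
    if isinstance(node, dict):
        return {fix(k): fix(v) for k, v in node.items()}
    return node

# Sample input: one broken string, one plain ASCII string, one correctly escaped euro sign.
sample = r'{"name": "\u00c3\u00a1", "plain": "abc", "euro": "\u20ac", "n": 1}'
fixed = fix(json.loads(sample))
print(fixed)
```

The trade-off is that a string that happens to be valid in both readings is always converted, so this variant is only a heuristic.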
