Python – Which of these encoding methods is the most reliable?


I’m new to Python, but because my native language contains some nasty umlauts, I ran into the encoding nightmare right away.
I read Joel on Software’s article on encodings and understood the difference between code points and the actual rendering of letters (and the connection between Unicode and encodings).
To get myself out of trouble, I found three ways to deal with diacritics, but I can’t decide which of them is suitable for which situation.
Could someone shed some light on this? I would like to be able to write text to a file, read it back (from the file or from sqlite3), and print it, all with readable diacritics.
Thank you very much!

# -*- coding: utf-8 -*-
import codecs

# using just u + string
# (note: this fails in Python 2; see the solution below)
with open("testutf8.txt", "w") as f:
    f.write(u"Österreichs Kapitän")

with open("testutf8.txt", "r") as f:
    print f.read()

# using encode/decode
s = u'Österreichs Kapitän'
sutf8 = s.encode('UTF-8')
with open('encode_utf-8.txt', 'w') as f2:
    f2.write(sutf8)
with open('encode_utf-8.txt', 'r') as f2:
    print f2.read().decode('UTF-8')

# using codecs
with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(u"Österreichs Kapitän")

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    print f3.read()

Edit:
I tested this (the file contains “Österreichs Kapitän”):

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()
    print s
    s = s.replace(u"ä", u"ü")
    print s

Do I have to use u'string' (unicode) everywhere in my code? I’ve found that the diacritic replacement doesn’t work if I use a plain string (without the “u” prefix).

Solution

As a general rule of thumb, you want to decode an encoded string as early as possible, manipulate it as a Unicode object, and then encode it as late as possible (for example, just before writing it to a file).

For example:

with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    s = f3.read()

# modify s here

with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(s)
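
The same rule applies to sqlite3, which you also mention: in Python 2 the sqlite3 module accepts unicode parameters and returns TEXT columns as unicode by default, so you don’t need to encode or decode anything yourself. A minimal sketch (the in-memory database and table name are just placeholders):

# -*- coding: utf-8 -*-
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (?)", (u"Österreichs Kapitän",))
row = conn.execute("SELECT name FROM t").fetchone()
print type(row[0])   # <type 'unicode'>: sqlite3 decodes TEXT for you
print row[0]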

Regarding your question of which way is best: I don’t think there is a real difference between using the codecs module and encoding/decoding manually. It’s a matter of preference, and both work.
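
To illustrate that the two are interchangeable, here is a minimal sketch (the file names are just placeholders) showing that both approaches write the same UTF-8 bytes:

# -*- coding: utf-8 -*-
import codecs

s = u"Österreichs Kapitän"

# manual: encode to UTF-8 yourself and write the resulting bytes
with open("manual.txt", "w") as f:
    f.write(s.encode("utf-8"))

# codecs: codecs.open() performs the same encoding for you on write
with codecs.open("codec.txt", "w", "utf-8") as f:
    f.write(s)

# the two files are byte-for-byte identical
with open("manual.txt", "rb") as f1:
    with open("codec.txt", "rb") as f2:
        print f1.read() == f2.read()   # True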

Simply using open like in your first example won’t work, because Python will try to encode the string using the default codec (ASCII, if you haven’t changed it).
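
You can see the failure directly (a minimal sketch; the file name is a placeholder):

# -*- coding: utf-8 -*-
# file.write() implicitly encodes the unicode object with the default
# ASCII codec, which cannot represent the umlauts:
with open("plain.txt", "w") as f:
    try:
        f.write(u"Österreichs Kapitän")
    except UnicodeEncodeError as e:
        print e   # 'ascii' codec can't encode character u'\xd6' ...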

On the question of whether Unicode strings should be used everywhere:
in principle, yes. If you create a string s = 'asdf', its type is str (you can check with type(s)), and if you write s2 = u'asdf', its type is unicode.
Since it is best to always manipulate Unicode objects, the latter is recommended.
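
This is also the answer to your edit: replacing diacritics only works when both the string and the replacement arguments are unicode. A short sketch (assuming a UTF-8 source file, like in your examples):

# -*- coding: utf-8 -*-
s = u"Österreichs Kapitän"

print type('asdf')            # <type 'str'>     (bytes)
print type(u'asdf')           # <type 'unicode'> (code points)

print s.replace(u"ä", u"ü")   # works: both sides are unicode

# mixing the types forces an implicit ASCII decode of the byte string,
# which fails because "ä" is not ASCII:
try:
    print s.replace("ä", "ü")
except UnicodeDecodeError as e:
    print e   # 'ascii' codec can't decode byte 0xc3 ...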

If you don’t want to always prepend “u” to the string, you can use the following import:

from __future__ import unicode_literals

Then you can write s = 'asdf' and s will have type unicode. This is the default in Python 3, so the import is only required in Python 2.
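
A short sketch of the effect (again assuming a UTF-8 source file):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'Österreichs Kapitän'    # now a unicode object, no u prefix needed
print type(s)                # <type 'unicode'>
print s.replace('ä', 'ü')    # works, because all literals here are unicode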

For potential pitfalls, have a look at “Any gotchas using unicode_literals in Python 2.6?”. Basically, you don’t want to mix UTF-8-encoded byte strings and Unicode strings.
