Specify text encoding when writing files best practice

Omitting the encoding when writing a file can cause a UnicodeEncodeError: Python assumes the string's characters fit in the OS's default text encoding, and that assumption is often wrong.

Files are stored as bytes. Before a Python string can be saved to disk it must be serialised to bytes, and conversely those bytes must be decoded back into a string when the file is read. A variety of codecs handle this encoding and decoding, and they are collectively referred to as text encodings. To make sense of the bytes and decode them correctly, it is necessary to know which text encoding was used when the file was saved.
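
As a small illustration of the paragraph above, the sketch below encodes the example string to bytes and decodes it again. The choice of cp1252 as the "wrong" codec is only an assumption made for demonstration; any codec that differs from the one used for writing would show the same kind of problem.

```python
text = "Witaj świecie"

# Serialise the string to bytes. In UTF-8 the character ś (U+015B)
# takes two bytes, 0xC5 0x9B, so the result is one byte longer than the string.
utf8_bytes = text.encode("utf-8")

# Decoding with the same codec recovers the original string.
decoded = utf8_bytes.decode("utf-8")

# Decoding with a different codec does not fail here, but it silently
# produces the wrong characters for the two bytes that made up ś.
wrong = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(decoded)
print(wrong)
```

The bytes on disk carry no label saying which codec produced them, which is why the reader has to know (or be told) the encoding.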

By default Python assumes the file is encoded with the OS's default text encoding. Take this example:

with open('/tmp/polish.txt', 'w') as f:
    f.write('Witaj świecie')

That may look OK, and on Mac and Linux it will probably work. It is nevertheless a common mistake that introduces a bug primarily affecting Windows (aka 50% of all Python developers). On Windows, Python serialises the content using the system's legacy code page, typically an ASCII-based 8-bit encoding such as cp1252 (Windows-1252, a close relative of ISO-8859-1). What happens to the Unicode character ś? Python tries to encode ś into that code page and raises a UnicodeEncodeError, because the character has no representation there. The problem is less likely on Mac and Linux, where the default text encoding is usually utf-8, which can represent every Unicode character, including ś.
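
The failure can be reproduced on any platform by naming a legacy code page explicitly. The sketch below assumes cp1252 as a stand-in for a typical Windows default; the actual default on a given machine may differ, and `locale.getpreferredencoding(False)` reports the locale's preferred encoding there.

```python
import locale
import tempfile
from pathlib import Path

text = "Witaj świecie"

# The locale's preferred encoding, which open() traditionally falls back to
# when no encoding is given. The value depends on the OS and its settings.
default_encoding = locale.getpreferredencoding(False)
print("Default encoding on this machine:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "polish.txt"
    try:
        # cp1252 is an assumption here, standing in for a Windows default.
        with open(path, "w", encoding="cp1252") as f:
            f.write(text)
        outcome = "written"
    except UnicodeEncodeError as exc:
        # ś has no slot in cp1252, so the write is rejected.
        outcome = f"failed: {exc.reason}"

print(outcome)
```

Code that relies on the default behaves like this only on machines whose default happens to lack the character, which is what makes the bug easy to miss during development.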

The encoding problem can be solved by passing the encoding explicitly:

with open('/tmp/polish.txt', 'w', encoding='utf-8') as f:
    f.write('Witaj świecie')
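
The same rule applies when reading the file back: the reader should name the encoding the writer used. A minimal round-trip sketch follows; it uses a temporary directory rather than /tmp purely so that it runs unchanged on Windows.

```python
import tempfile
from pathlib import Path

text = "Witaj świecie"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "polish.txt"

    # Write with an explicit encoding...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and read with the same explicit encoding.
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

    # The bytes on disk are the UTF-8 encoding of the string,
    # regardless of the OS default.
    raw = path.read_bytes()

print(round_tripped == text)
```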

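This problem has been recognised by the Python community, and PEP 597 addresses it: from Python 3.10 onwards, an EncodingWarning can be emitted when encoding is not specified. The warning is opt-in, enabled with the -X warn_default_encoding command-line option or the PYTHONWARNDEFAULTENCODING environment variable.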
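
One way to see the warning in action is to run a snippet in a child interpreter with the option switched on. This is a sketch that assumes Python 3.10 or newer; the helper name `stderr_with_encoding_warning` is hypothetical and exists only for this illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def stderr_with_encoding_warning(snippet: str) -> str:
    """Run `snippet` in a child interpreter with EncodingWarning enabled; return its stderr."""
    completed = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", snippet],
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return completed.stderr


with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "demo.txt")

    # No encoding argument: the child interpreter is expected to warn.
    implicit = stderr_with_encoding_warning(f"open({target!r}, 'w').close()")

    # Explicit encoding: no warning is expected.
    explicit = stderr_with_encoding_warning(
        f"open({target!r}, 'w', encoding='utf-8').close()"
    )

print("without encoding:", implicit.strip())
print("with encoding:", explicit.strip())
```

Running a test suite with the same option (or with PYTHONWARNDEFAULTENCODING=1 set in the environment) is one way to locate calls that still rely on the default.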

If our GitHub code review bot spots this issue in your pull request it gives this advice: