Specify text encoding when reading files best practice

Not specifying encoding when reading a file can cause UnicodeDecodeError because Python assumes the file is encoded with the OS's default text encoding, but that's often an invalid assumption.

Files are stored as bytes. Therefore, before we can save a Python string to disk, the string must be serialised to bytes, and conversely those bytes must be decoded back to a string when the file is read from disk. There are a variety of text serialisation codecs that handle this encoding and decoding, and they are collectively referred to as text encodings. To make sense of the bytes and decode them correctly, it's necessary to know which text encoding was used when the file was saved to disk.
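As a minimal sketch of that round trip, the snippet below encodes the same one-character string with two codecs and then decodes the result. The choice of iso-8859-2 as the second codec is only an illustration; any codec that can represent ś would do.

```python
text = "ś"

# Encoding: the same string becomes different bytes under different codecs.
utf8_bytes = text.encode("utf-8")
latin2_bytes = text.encode("iso-8859-2")
print(utf8_bytes, latin2_bytes)

# Decoding with the codec that was used for writing recovers the string.
round_tripped = utf8_bytes.decode("utf-8")

# Decoding with a different codec does not: here it succeeds but yields
# different characters, so the mistake goes unnoticed.
mismatched = utf8_bytes.decode("iso-8859-2")
print(round_tripped == text, mismatched == text)
```

The bytes themselves carry no label saying which codec produced them, which is why the reader has to be told.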

By default, Python assumes the file is encoded with the OS's default text encoding. According to PEP 597, 12% of the most popular packages on PyPI fail during installation on Windows because of this assumption. Those packages have setup.py files that do:

setup(
    ...
    long_description=open("README.md").read(),
    ...
)

That may look OK, and on Mac and Linux it will probably work fine, but it is a common mistake that introduces a bug primarily affecting Windows (aka 50% of all Python developers). Python running on Windows opens README.md using the system's legacy ANSI code page (for example cp1252 on Western European systems) rather than UTF-8. What if README.md is saved as UTF-8 and contains non-ASCII characters like ś? Python then tries to decode the UTF-8 bytes representing ś with the wrong codec, and either a UnicodeDecodeError is raised because a byte has no mapping in that code page, or the text is silently decoded into the wrong characters. This problem is less likely to happen on Mac and Linux because the default text encoding is usually UTF-8 on those systems, which can represent any Unicode character, including ś.
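The failure can be reproduced on any OS by passing the "wrong" encoding explicitly, which stands in for an unsuitable platform default. This is a sketch under assumptions: the temporary README.md and its contents are made up for the demonstration, and ascii and cp1252 are just two examples of non-UTF-8 defaults.

```python
import tempfile
from pathlib import Path

original = "README with ś"

with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_text(original, encoding="utf-8")

    # A strict ASCII default cannot decode the non-ASCII bytes at all.
    try:
        with open(readme, encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # cp1252 happens to map both bytes of "ś", so reading succeeds
    # but the text comes back as different characters.
    with open(readme, encoding="cp1252") as f:
        garbled = f.read()

print(type(ascii_error).__name__)
print(garbled == original)
```

Both outcomes are bugs: the first is loud, the second is quiet and arguably worse.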

The encoding problem can be solved by changing setup.py to instead do:

setup(
    ...
    long_description=open("README.md", encoding="utf-8").read(),
    ...
)
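One possible variation, assuming the README is saved as UTF-8: a small helper (the name read_long_description is hypothetical) that reads the file through pathlib, which also closes the file handle promptly instead of leaving that to the garbage collector.

```python
from pathlib import Path


def read_long_description(path="README.md"):
    """Return the file's text, always decoding it as UTF-8."""
    # read_text opens the file, reads it, and closes it again.
    return Path(path).read_text(encoding="utf-8")
```

In setup.py this could then be used as long_description=read_long_description().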

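This problem has been recognised by the Python community, and PEP 597 highlights the issue. As a result, from Python 3.10 an EncodingWarning can be emitted when encoding is not specified; the warning is opt-in, enabled with the -X warn_default_encoding command-line option or the PYTHONWARNDEFAULTENCODING environment variable. You can view the raw stats for the 12% failures here, and see an example of an affected library being fixed here.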
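To see the warning in practice, the sketch below starts a child interpreter with -X warn_default_encoding and turns EncodingWarning into an error, so a missing encoding argument shows up as a failed process. The helper name open_triggers_encoding_warning is made up for this example, and it assumes Python 3.10 or newer.

```python
import subprocess
import sys


def open_triggers_encoding_warning(path, explicit_encoding=None):
    """Return True if open() on `path` raises EncodingWarning in a child process."""
    if explicit_encoding is None:
        child_code = "import sys; open(sys.argv[1]).close()"
        extra_args = []
    else:
        child_code = "import sys; open(sys.argv[1], encoding=sys.argv[2]).close()"
        extra_args = [explicit_encoding]

    result = subprocess.run(
        [
            sys.executable,
            "-X", "warn_default_encoding",     # opt in to the PEP 597 warning
            "-W", "error::EncodingWarning",    # escalate it to an exception
            "-c", child_code,
            str(path),
            *extra_args,
        ],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.returncode != 0 and "EncodingWarning" in result.stderr
```

Calling it with only a path should report True on Python 3.10+, while passing explicit_encoding="utf-8" should report False, because the child then names its encoding.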

Our code best practice checker infers the file's encoding and suggests passing it as the encoding argument to open.

If our GitHub code review bot spots this issue in your pull request it gives this advice:

code-review-doctor bot suggested changes just now

catalogue.py

with open('some/path.txt') as f:

Not specifying encoding when reading a file can cause UnicodeDecodeError because Python assumes the file is encoded with the OS's default text encoding, but that's often an invalid assumption.

Suggested changes