Use correct file encoding best practice

Use correct file encoding

Specifying an incorrect encoding when reading a file can cause UnicodeDecodeError if the contents of the file are incompatible with the specified encoding.

Files are stored as bytes. Therefore, before we can save a Python string to disk, the string must be serialised to bytes, and conversely those bytes must be decoded back to a string when the file is read from disk. A variety of codecs handle this encoding and decoding, and they are collectively referred to as text encodings. In order to make sense of bytes and decode them correctly, it's necessary to know which text encoding was used when the file was saved to disk.
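As a minimal sketch of that round trip, the snippet below encodes one string with two different codecs and decodes each result with its matching codec. The sample text and the choice of gb18030 as the second codec are assumptions made purely for illustration.

```python
text = "你好, world"

# Encoding: str -> bytes. The same text produces different bytes per codec.
utf8_bytes = text.encode("utf_8")
gb_bytes = text.encode("gb18030")

# Decoding: bytes -> str. The round trip only works with the codec
# that was used to encode the bytes in the first place.
utf8_round_trip = utf8_bytes.decode("utf_8")
gb_round_trip = gb_bytes.decode("gb18030")

print(utf8_bytes != gb_bytes)
print(utf8_round_trip == text and gb_round_trip == text)
```

The bytes on disk carry no label saying which codec produced them, which is why the reader has to supply that information.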

It's important to specify an encoding when reading and writing files, but it's more important to specify the correct one, as otherwise a UnicodeDecodeError can occur at runtime. It's far too easy to mistakenly assume all files are encoded as utf_8.
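To illustrate the failure, here is a small sketch that writes a temporary file as gb18030 and then reads it back twice: once assuming utf_8 and once with the codec that was actually used. The temporary file and the sample text are assumptions for the example, not part of any real project.

```python
import os
import tempfile

text = "你好"

# Write a file using gb18030, one of the encodings used for Chinese text.
with tempfile.NamedTemporaryFile(
    "w", encoding="gb18030", suffix=".txt", delete=False
) as f:
    f.write(text)
    path = f.name

try:
    # Reading with the wrong codec fails at runtime.
    try:
        with open(path, encoding="utf_8") as f:
            f.read()
        error = None
    except UnicodeDecodeError as exc:
        error = exc

    # Reading with the codec the file was written in succeeds.
    with open(path, encoding="gb18030") as f:
        recovered = f.read()
finally:
    os.remove(path)

print(type(error).__name__)
print(recovered == text)
```

Note that a wrong encoding does not always raise: some codecs accept any byte sequence and quietly return the wrong characters instead.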

Our checks can infer the encoding of the file, detect when the specified encoding is wrong, and suggest the fix.
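How the checks infer the encoding is not described here. Purely as an illustration of the general idea, one naive approach is to try a list of candidate codecs and keep the first that decodes without error. The function name guess_encoding and the candidate list below are hypothetical, and this sketch should not be read as the method our checks actually use.

```python
def guess_encoding(data, candidates=("utf_8", "gb18030", "latin_1")):
    """Return the first candidate codec that decodes data cleanly, else None.

    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = "你好".encode("gb18030")
guessed = guess_encoding(sample)
print(guessed)
```

Candidate order matters in a sketch like this: latin_1 accepts every byte sequence, so it has to come last, and a clean decode only shows that an encoding is possible, not that it is the right one.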

If our GitHub code review bot spots this issue in your pull request, it gives this advice:

code-review-doctor bot suggested changes just now

catalogue.py

with open('sample-chinese.txt', encoding='utf_8') as f:

Specifying an incorrect encoding when reading a file can cause UnicodeDecodeError if the contents of the file are incompatible with the specified encoding.

Suggested changes