Specifying an incorrect encoding when reading a file can cause a UnicodeDecodeError if the contents of the file are incompatible with the specified encoding.
Files are stored as bytes. Therefore, before a Python string can be saved to disk it must be serialised to bytes, and conversely those bytes must be decoded back to a string when the file is read from disk. A variety of text serialisation codecs handle this encoding and decoding, and they are collectively referred to as text encodings. To make sense of the bytes and decode them correctly, it's necessary to know which text encoding was used when the file was saved to disk.
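As a minimal sketch of that round trip, the snippet below serialises the same string with two codecs and then decodes it again. The sample text and variable names are illustrative assumptions, not taken from any real file.

```python
text = "中文"  # illustrative sample text

# Serialise the string to bytes with two different codecs.
utf8_bytes = text.encode("utf_8")
big5_bytes = text.encode("big5")

# The same text produces different byte sequences under each codec.
print(utf8_bytes != big5_bytes)

# Decoding with the codec that was used for encoding recovers the text.
round_tripped = big5_bytes.decode("big5")

# Decoding with a different codec can fail outright.
try:
    big5_bytes.decode("utf_8")
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print(f"decode failed: {error.reason}")
```

The bytes alone do not say which codec produced them, which is why the encoding has to be known (or inferred) before reading.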
It's important to specify encoding when reading and writing files, but it's more important to specify the correct encoding, as otherwise a UnicodeDecodeError can occur at runtime. It's far too easy to mistakenly assume all files are encoded as utf_8.
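The failure can be reproduced at the file level with a short sketch. It assumes a temporary file written as big5; the file name simply mirrors the example used on this page, and the variable names are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file, created only for this demonstration.
    path = Path(tmp_dir) / "sample-chinese.txt"

    # Write the file using the big5 codec.
    with open(path, "w", encoding="big5") as f:
        f.write("中文")

    # Reading it back while assuming utf_8 raises at runtime.
    try:
        with open(path, encoding="utf_8") as f:
            f.read()
        wrong_encoding_error = None
    except UnicodeDecodeError as error:
        wrong_encoding_error = error

    # Reading with the encoding the file was written in succeeds.
    with open(path, encoding="big5") as f:
        content = f.read()
```

Here `wrong_encoding_error` holds the exception raised by the utf_8 read, while `content` holds the text recovered by the big5 read.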
Our checks can infer the encoding of the file, detect when the encoding specified is wrong, and suggest the fix.
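How the checks infer the encoding is not described here. Purely as an illustration of the idea, one naive approach is to try a list of candidate codecs and keep the first that decodes the raw bytes without error. The function name `guess_encoding` and the candidate list are assumptions made for this sketch, not part of the tool.

```python
def guess_encoding(raw, candidates=("utf_8", "big5")):
    """Return the first candidate codec that decodes raw without error, else None.

    Hypothetical helper for illustration only. A successful decode does not
    prove the codec is the right one, so the result is a guess.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes as they would be read from a big5 file opened in binary mode.
sample = "中文".encode("big5")
guessed = guess_encoding(sample)
print(guessed)  # big5
```

The order of candidates matters in this sketch: permissive codecs that accept almost any byte sequence would need to come last, or they would mask the better match.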
If our GitHub code review bot spots this issue in your pull request it gives this advice:
1 | + | with open('sample-chinese.txt', encoding='utf_8') as f: |
Specifying an incorrect encoding when reading a file can cause a UnicodeDecodeError if the contents of the file are incompatible with the specified encoding.
- | with open('sample-chinese.txt', encoding='utf_8') as f: |
+ | with open('sample-chinese.txt', encoding='big5') as f: |
2 | + |     content = f.read() |