Encoding

You will notice that all lower-level functions in Dulwich take byte strings rather than unicode strings. This is intentional.

Although C git recommends the use of UTF-8 for encoding, this is not strictly enforced, and C git treats filenames as sequences of non-NUL bytes. There are repositories in the wild that use non-UTF-8 encodings for filenames and commit messages.

The library should be able to read all existing git repositories, regardless of what encoding they use. This is the main reason why Dulwich does not convert paths to unicode strings.
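As a rough illustration of why decoding paths is risky, the sketch below uses only the Python standard library and does not call any Dulwich API. The filename and the helper `try_decode_utf8` are hypothetical, chosen for the example.

```python
# Hypothetical filename as it might appear in a tree entry:
# "café.txt" encoded as Latin-1 rather than UTF-8.
raw_name = b"caf\xe9.txt"


def try_decode_utf8(name):
    """Return name decoded as UTF-8, or None if the bytes are not valid UTF-8."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Strict UTF-8 decoding rejects the name outright.
utf8_name = try_decode_utf8(raw_name)

# A lossy decode succeeds, but encoding the result again yields
# different bytes, so the original path could no longer be looked up.
lossy_name = raw_name.decode("utf-8", errors="replace")
round_tripped = lossy_name.encode("utf-8")
```

Keeping the name as `bytes` avoids both failure modes: nothing is rejected and nothing is silently altered.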

A further consideration is that converting back and forth between byte strings and unicode strings carries a performance cost. For example, if you are just iterating over file contents, there is no need to decode anything. Users of the library may also be able to make specific assumptions about the encoding - e.g. they could just decide that all their data is latin-1, or that it uses the default Python encoding.
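An application that knows its own data can decode at its boundary instead. The following is a minimal sketch under the assumption that every path in the repository is latin-1; `decode_path` and `encode_path` are hypothetical helpers, not part of Dulwich.

```python
def decode_path(path, encoding="latin-1"):
    """Decode a raw path for display (hypothetical application helper)."""
    return path.decode(encoding)


def encode_path(path, encoding="latin-1"):
    """Encode a path back to bytes before handing it to lower-level code."""
    return path.encode(encoding)


# Assumed example path; the byte 0xE9 is "é" in latin-1.
raw_path = b"docs/caf\xe9.txt"

text_path = decode_path(raw_path)
restored = encode_path(text_path)
```

Because latin-1 maps every byte value to exactly one code point, this particular choice round-trips any byte string without loss. Other encodings may raise `UnicodeDecodeError` on data that does not match the assumption.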

Higher-level functions, such as the porcelain in dulwich.porcelain, will automatically convert unicode strings to UTF-8 byte strings.
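The kind of conversion a higher-level wrapper performs can be sketched roughly as follows. `to_bytes` is a hypothetical helper written for illustration and is not taken from dulwich.porcelain; it only shows the general pattern of accepting either type and passing bytes through untouched.

```python
def to_bytes(value, encoding="utf-8"):
    """Encode unicode strings; return byte strings unchanged (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


# A unicode string is encoded as UTF-8 ...
from_text = to_bytes("caf\u00e9.txt")

# ... while bytes in some other encoding are left exactly as given.
from_bytes = to_bytes(b"caf\xe9.txt")
```

With this pattern, callers who need a non-UTF-8 name can still supply it by passing bytes directly.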