# Motivating example

Let’s say that we have a program that does some processing of filesystem paths. On the Mac platform you might expect to see all the following types of paths, in different parts of the same program:

• Normal Unix paths like /Applications/Adobe Photoshop CC 2014.app
• File URIs such as file:///Applications/Adobe Photoshop CC 2014.app, there are myriad variations of these, e.g. file://localhost/Applications/Adobe Photoshop CC 2014.app is equivalent wikipedia

There are variations of encoding within those types of paths like:

• Paths containing relative components, like “.” or “..”.
• Paths containing environment variables.
• Different types of string encoding of strings containing non-Latin characters, e.g. é can be either a single Unicode 00e9 or two Unicode characters 0065 (e) and 0301 (combining acute accent) in sequence. Different system calls on the same OS can produce different forms of string more details.

OK. So we’ve established that there are a lot of variations of data in path strings. But at the end of the day they can all be encoded by UTF-8 strings (or whatever your preferred encoding is). Hence we can always use a string type for them.

I’ve seen plenty of programs that will use a string type for all string data. But I think it’s a bad idea. Why?

That looks OK’ish. But there may be a problem in the calling code.

Now if at any point we forget whether oldPath is an NFD-normalised Unix path and that defaultPath is something else, our comparison makes no sense any longer. It will be false whether the data is semantically different or not.

There are other examples:

• std::string can be a first name, a surname, a path, an HTML document
• int can be a football score, a financial quantity, a count of people
• std::pair<bool, std::string> can be an error status and an error message; or a flag indicating whether a light is on, plus who last flicked the switch.

At least 99 times out of 100, you don’t want to compare/overwrite or apply almost any operation to a first name and a second name1. Similarly you don’t ever want to do an operation on two different types of path. Even with the best documentation and the best memory, doing it in your head is hard. Moreover it’s not necessary because the compiler or runtime is great at stopping this type of this from happening, if you give it the chance…

# The solution

The solution is pretty simple:

never ever ever use the same type for semantically different data


So don’t do:

Don’t even do:

because as far as the compiler is concerned that’s exactly the same thing!

I think this is the thing to do:

It’s quite verbose, but usually we don’t need so many new data structures once a project is established. The advantages are myriad though:

1. You cannot easily assign, compare or otherwise improperly mix multiple types that are incompatible.
2. If you are mixing types the code looks different and you are doing it deliberately, i.e. you need name.str() and you know you’re doing something dangerous.
3. If you decide to change the meaning of a function later, the signature will probably change and the compiler/interpreter will re-check that you’re not misusing types.
4. It’s a form of documentation in your function signatures (assuming the language requires function types to be declared.)

# Compilers as theorem provers

I am a big fan of statically typed languages and this type of thing is exactly why. When a program in a statically compiled language like C++, Haskell or Go is compiled, the compiler has just proved that types cannot be used in a way they were not designed to be used anyhere in your program (unless you’ve been sneaky, and you usually know you have been).

Interpreters for many dynamically typed languages will stop you from doing it but not until the code path runs and that might conceivably not happen until it’s running on a user’s computer in a galaxy far far away.

By using strings for everything, we have essentially defeated the type system because it cannot distinguish different types of data and help us to use them properly.

The idea of compilers as theorem provers is pretty fundamental to programming languages but easy to forget about when it usually manifests as pesky compiler errors.

If you manage to encode a desirable property into the type system, the compiler will prove it for you each and every time the program runs.

By creating distinct types for name, Unix path, Unix path normalised, etc. the compiler will statically prove that the program doesn’t improperly mix them together at any time. That’s a theorem, you proved it!!! This is a significantly more useful property than the compiler just saying that you don’t ever mix an integer and a string, so let the compiler do the hard work.