How to make even more incorrect states even more inexpressible
Not so long ago on Habré translated an article about how using algebraic data types to ensure that incorrect states were ineffable. Today we will look at a slightly more generalized, scalable and secure way to express the inexpressible, and haskel will help us in this.
In short, in that article, an entity is considered with a postal address and an e-mail address, as well as with the additional condition that there must be at least one of these addresses. How is it proposed to express this condition at the type level? It is proposed to record addresses as follows:
typeContactInfo =
| EmailOnlyofEmailContactInfo
| PostOnlyofPostalContactInfo
| EmailAndPostofEmailContactInfo * PostalContactInfo
What are the problems with this approach?
The most obvious (and it was noted several times in the comments to that article) - this approach is not scalable at all. Imagine that we have not two types of addresses, but three or five, and the condition of correctness looks like “there must be either a postal address or an email address and an office address at the same time, and there should not be several addresses of the same type”. Anyone can write the appropriate type as an exercise for self-checking. The task with an asterisk is to rewrite this type for the case when the condition about the absence of duplicates disappeared from the TZ.
We divide
How can this problem be solved? Let's try to fantasize. Only first we will decompose and divide the address class (for example, mail / email / table number in the office) and the contents corresponding to this class:
data AddrType = Post
| Email| Office
We will not think about the content yet, because there is nothing about it in the statement of work on the condition of the validity of the address list.
If we checked the corresponding condition in the runtime of some constructor of some ordinary OOP-language, then we would just write a function like
valid :: [AddrType] -> Boolvalid xs = let hasNoDups = nub xs == xs -- не делайте так в продакшене
hasPost = Post `elem` xs
hasEmail = Email `elem` xs
hasOffice = Office `elem` xs
in hasNoDups && (hasPost || (hasEmail && hasOffice))
and would throw any action if it returns False
.
Can we instead check for a similar condition using a tip checker when compiling? It turns out that yes, we can, if the type system of the language is sufficiently expressive, and the rest of the article we will pick this approach.
Here dependent types will help us a lot, and since the most adequate way to write seized code on a Haskel is to first write it on aqda or Idris, we will sharply change our shoes and we will write on Idris. Idris syntax is close enough to a Haskel: for example, the above function only needs to change the signature a little:
valid : List AddrType -> Bool
Now remember that in addition to the address classes, we also need their contents, and encode the dependence of the fields on the address class as GADT:
data AddrFields : AddrType -> Type where
PostFields : (city : String) -> (street : String) -> AddrFields Post
EmailFields : (email : String) -> AddrFields Email
OfficeFields : (floor : Int) -> (desk : Nat) -> AddrFields Office
That is, if we are given a value of fields
type AddrFields t
, then we know that t
- this is a kind of class AddrType
, and that fields
the set of fields corresponding to this particular class lies in it.
This is not the type-safe coding as GADT need not be injective, and it would be better to announce three separate data type PostFields
, EmailFields
, OfficeFields
and write function
addrFields : AddrType -> Type
addrFields Post = PostFields
addrFields Email = EmailFields
addrFields Office = OfficeFields
but this is too much writing, which for the prototype does not give a significant gain, but in Haskel there are still shorter and more pleasant mechanisms for this.
What is the entire address in this model? This is a pair from the address class and the corresponding fields:
Addr : Type
Addr = (t : AddrType ** AddrFields t)
Fans of the theory of types will say that this is an existential dependent type: if we are given a certain value of type Addr
, then this means that there is a value of t
type AddrType
and a corresponding t
set of fields AddrFields t
. Naturally, the addresses of different classes are of the same type:
someEmailAddr : Addr
someEmailAddr = (Email ** EmailFields "that@feel.bro")
someOfficeAddr : Addr
someOfficeAddr = (Office ** OfficeFields (-2) 762)
Moreover, if fields are given to us EmailFields
, then the only address class that fits is Email
, so you can omit it, the tipchecker will output it himself:
someEmailAddr : Addr
someEmailAddr = (_ ** EmailFields "that@feel.bro")
someOfficeAddr : Addr
someOfficeAddr = (_ ** OfficeFields (-2) 762)
Let us write an auxiliary function that, according to the list of addresses, gives the corresponding list of classes of addresses, and immediately summarizes it before working on an arbitrary functor:
types : Functor f => f Addr -> f AddrType
types = map fst
Here, the existential type Addr
behaves like a familiar couple: in particular, you can ask for its first component AddrType
(task with an asterisk: why ask for the second component so it won't work?).
Raise
Now we come to the key part of our story. So, we have a list of addresses List Addr
and some predicate valid : List AddrType -> Bool
, the execution of which for this list we want to guarantee at the level of types. How can we combine them? Of course, another type!
data ValidatedAddrList : List Addr -> Typewhere
MkValidatedAddrList : (lst : List Addr) ->
(prf : valid (types lst) = True) ->
ValidatedAddrList lst
Now we will understand what we have written here.
data ValidatedAddrList : List Addr -> Type where
means that the type is ValidatedAddrList
parameterized by the address list itself.
Let's look at the signature of the sole designer MkValidatedAddrList
of this type: (lst : List Addr) -> (prf : valid (types lst) = True) -> ValidatedAddrList lst
. That is, it takes some address list lst
and one more prf
type argument valid (types lst) = True
. What does this type mean? So it means that the value to the left of =
is equal to the value to the right of =
, that is, that valid (types lst)
, in fact, is equal to True.
How it works? The signature =
looks like (x : A) -> (y : B) -> Type
. That is, it =
takes two arbitrary values x
and y
(perhaps even different types A
and B
, which means that the inequality in Idris is heterogeneous, and that is somewhat ambiguous from the point of view of the theory of types, but this is a topic for a separate discussion). Due to what then equality is demonstrated? And due to the fact that the only constructor =
- Refl
with a signature almost(x : A) -> x = x
. That is, if we have a value of type x = y
, then we know that it was built using Refl
(because there are no other constructors), which means that it is x
in fact equal y
.
Note that this is why we will always pretend at best in Haskel that we are proving something, because there is one in Haskel undefined
that inhabits any type, so the above reasoning doesn’t work there: for any x
, the y
term type x = y
could be created through undefined
(or through infinite recursion, let's say that by and large the same thing from the point of view of type theory).
Note also that equality here is not meant in the sense of Haskele Eq
or some operator==
in C ++, but much more rigorous: structural, which, simplifying, means that two values have the same form . That is, to deceive him so simply will not work. But equality issues are traditionally pulled into a separate article.
In order to consolidate our understanding of equality, we write unit tests for the function valid
:
testPostValid : valid [Post] = True
testPostValid = Refl
testEmptyInvalid : valid [] = False
testEmptyInvalid = Refl
testDupsInvalid : valid [Post, Post] = False
testDupsInvalid = Refl
testPostEmailValid : valid [Post, Email] = True
testPostEmailValid = Refl
These tests are good because you don’t even need to run them; it’s enough that the TIP checker checked them. Indeed, let's replace True
with False
, for example, in the very first of them and see what will happen:
testPostValid : valid [Post] = False
testPostValid = Refl
Typcheker cuss
as expected. Wonderful.
Simplified
Now a little refute our ValidatedAddrList
.
First, the comparison pattern of a certain value with is True
quite common, so there is a special type for this So
: you can be taken So x
as a synonym for x = True
. Let's correct the definition ValidatedAddrList
:
data ValidatedAddrList : List Addr -> Typewhere
MkValidatedAddrList : (lst : List Addr) ->
(prf : So (valid $ types lst)) ->
ValidatedAddrList lst
In addition, So
there is a convenient helper function choose
, meaning to raise the check for the type level:
> :doc choose
Data.So.choose : (b : Bool) -> Either (So b) (So (not b))
Perform a case analysis on a Boolean, providing clients with a So proof
We need it when we write functions that modify this type.
Secondly, sometimes (especially with interactive development), the idris can find the appropriate value on prf
its own. In order in such cases it was not necessary to design it by hand, there is an appropriate syntactic sugar:
data ValidatedAddrList : List Addr -> Typewhere
MkValidatedAddrList : (lst : List Addr) ->
{auto prf : So (valid $ types lst)} ->
ValidatedAddrList lst
Braces mean that this is an implicit argument that idris will try to pull out of context, and auto
means that he will also try to construct it himself.
So what does this new one give us ValidatedAddrList
? And it gives such a chain of reasoning: let val
- the value of the type ValidatedAddrList lst
. This means that lst
- some address list, and, moreover, val
it was created with the help of the constructor MkValidatedAddrList
, to which we passed this same lst
and one more value of the prf
type So (valid $ types lst)
, which is almost valid (types lst) = True
. And in order for us to build prf
, we need, in fact, to prove that this equality holds.
And the most beautiful thing is that everything is checked by the tickpicker. Yes, it will be valid
necessary to perform the check at runtime (because the addresses can be read from the file or from the network), but the type checker guarantees that this check will be done: it is impossible to create without it ValidatedAddrList
. At least in Idris. In Haskel it is possible, alas.
We insert
To make sure that verification is inevitable, we will try to write a function to add an address to the list. First try:
insert : (addr : Addr) ->
ValidatedAddrList lst ->
ValidatedAddrList (addr :: lst)
insert addr (MkValidatedAddrList lst) = MkValidatedAddrList (addr :: lst)
Nope, the ticker gives on the fingers (although not very readable, the costs of being valid
too complicated):
How do we get a copy of this here So
? None other than the above choose
. Second attempt:
insert : (addr : Addr) ->
ValidatedAddrList lst ->
ValidatedAddrList (addr :: lst)
insert addr (MkValidatedAddrList lst) =
casechoose (valid $ types (addr :: lst)) ofLeft l => MkValidatedAddrList (addr :: lst)
Right r => ?rhs
It is almost secret check. “Almost” because it is not clear what to substitute instead rhs
. Rather, it is clear: in this case, the function must somehow report the error. So, we need to change the signature and wrap the return value, for example, in Maybe
:
insert : (addr : Addr) ->
ValidatedAddrList lst ->
Maybe (ValidatedAddrList (addr :: lst))
insert addr (MkValidatedAddrList lst) =
casechoose (valid $ types (addr :: lst)) ofLeft l => Just $ MkValidatedAddrList (addr :: lst)
Right r => Nothing
This is tipped and works as it should.
But now the following is not a very obvious problem, which was, in fact, in the original article. The type of this function does not interfere with writing such an implementation:
insert : (addr : Addr) ->
ValidatedAddrList lst ->
Maybe (ValidatedAddrList (addr :: lst))
insert addr (MkValidatedAddrList lst) = Nothing
That is, we always say that we could not build a new address list. Tips? Yes. Correctly? Well, hardly. Can this be avoided?
It turns out that it is possible, and we have all the necessary tools for this. If successful, insert
returns ValidatedAddrList
, which contains evidence of this very success. So add elegant symmetry and ask the function to return more and proof of failure!
insert : (addr : Addr) ->
ValidatedAddrList lst ->
Either
(So (not $ valid $ types (addr :: lst)))
(ValidatedAddrList (addr :: lst))
insert addr (MkValidatedAddrList lst) =
casechoose (valid $ types (addr :: lst)) ofLeft l => Right $ MkValidatedAddrList (addr :: lst)
Right r => Left r
Now we can not just take and always return Nothing
.
Similarly, you can do for address removal functions and the like.
Let's see how it all looks in the end now.
Let's try to create an empty address list:
It is impossible, the empty list is not valid.
How about a list of just one email address?
OK, let's try to insert the postal address into the list, in which there is already an postal address:
Let's try to insert an email:
In the end, everything works exactly as expected.
Ffuh. I thought it would be three lines, but it worked a little longer. So explore how far we can go to Haskel, we will be in the next article. In the meantime, a little
Ponder
What is the result of the profit of such a decision in comparison with the one given in the article to which we referred at the very beginning?
- It is, again, much more scalable. Complex validation functions are easier to write.
- It is more isolated. The client code is generally not obliged to know what is inside the validation function, whereas the form
ContactInfo
from the original article requires it to be tied up. - The logic of validation is written in the form of ordinary and familiar functions, so that you can immediately check it with thoughtful reading and test with time-tests, and not deduce the meaning of the test from the form of the data type representing the already verified result.
- It becomes possible to slightly more accurately specify the behavior of functions that work with the type of data that we are interested in, especially in the case of failure to check. For example, written as a result, it is
insert
simply impossible to write wrong . Similarly, one could writeinsertOrReplace
,insertOrIgnore
and the like, whose behavior is fully specified in the type.
What is the profit compared to the OOP solution in this spirit?
classValidatedAddrListClass
{
public:
ValidatedAddrListClass(std::vector<Addr> addrs)
{
if (!valid(addrs))
throw ValidationError {};
}
};
- The code is more modularized and secure. In the case of the above, verification is an action that is checked once and which was later forgotten. Everything rests on the word of honor and agreement that, if you have it
ValidatedAddrListClass
, its implementation once there made a check. The very fact of this check from the class as some value can not be picked out. In the case of values of some type, this value can be transferred between different parts of the program, used to build more complex values (for example, again, negating this test), explore (see the next paragraph) and generally do the same thing that we used to do. with values. - Such checks can be used with (dependent) pattern matching. True, not in the case of this function
valid
and not in the case of Idris, it is painfully complicated, and Idris is painfully stupid so thatvalid
information useful for patterns can be extracted from the structure . Nevertheless,valid
it is possible to rewrite it in a slightly more pattern-friendly style, but this is beyond the scope of this article and is generally nontrivial in itself.
What are the disadvantages?
I see only one serious fundamental flaw: valid
- too stupid function. It returns only one bit of information - validated or not. In the case of smarter types, we could achieve something more interesting.
For example, imagine that the requirement of unique addresses is missing from the TK. In this case, it is obvious that adding a new address to an existing list of addresses will not make the list invalid, so you could prove this theorem by writing a function with a type So (valid $ types lst) -> So (valid $ types $ addr :: lst)
and use it, for example, to write a type safe always successful
insert : (addr : Addr) ->
ValidatedAddrList lst ->
ValidatedAddrList (addr :: lst)
But, alas, theorems like recursion and induction, and in our problem there is no elegant inductive structure, therefore, in my opinion, the code with oak boolean is valid
also not bad.