A TypedList in Scala

Introduction

Let's try to write a typed list in Scala: a list whose size is known at compile time.
(all the source code is available on GitHub)

While the idea is not exactly new, there aren’t so many tutorials on how to go about this. Above all though, it is a great excuse to practice with the type system, implicits and recursion.

So, a typed list, besides being parameterized by the type of the elements it holds, like the common List[+A], is also parameterized by its size, as in TypedList[+A, Size].
This means some verifications can be transferred from run time to compile time – basically, everything related to the list's size. Some examples:

Normal list vs typed list:

EmptyList.head (likewise max, min, last)
  Normal list: run-time exception
  Typed list: does not compile

listA(index), i.e. get element at index
  Normal list: run-time IndexOutOfBoundsException if index > list size (or index < 0)
  Typed list: does not compile

listA zip listB
  Normal list: drops the extra elements of the bigger list
    List(1, 2, 3) zip List(10, 20)
    // res: List( (1, 10), (2, 20) )
  Typed list: does not compile if the lists are of different size

listA splitAt index
  Normal list: returns an empty list when the index is out of bounds
    List(1, 2, 3).splitAt(4)
    // res: ( List(1, 2, 3) , List() )
  Typed list: does not compile if the index is greater than the list size

listA filter predicate
  Normal list: possible
  Typed list: impossible (unless we return a normal list)

Notice there are some operations impossible to construct on typed lists, or rather, which will invariably have a different "api". For example, we cannot expect filter to return a typed list, as the filtering is done at run time and there is no information at compile time about how many elements will satisfy the predicate. The signature of filter must therefore return a normal list.
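
As a minimal sketch of that point (a hypothetical signature, not necessarily the exact one in the repository):

def filter(predicate: Element => Boolean): List[Element]   // the size of the result is only known at run time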

In the end, the api should be something along the lines of:

val foo: TypedList[Int, Nat5] = 11 :: 20 :: 34 :: 49 :: 54 :: TypedNil

foo.split[Nat2]
// res: (TypedList[Int, Nat2], TypedList[Int, Nat3]) = (TypedList(11, 20), TypedList(34, 49, 54) )

foo.get[Nat4]
// res: Int = 49

val bar: TypedList[Int, Nat2] = 1000 :: 2000 :: TypedNil
foo concat bar
// res: TypedList[Int, Nat7] = TypedList(11, 20, 34, 49, 54, 1000, 2000)

foo.map(elem => s"Foo-" + elem)
// res: TypedList[String, Nat5] = TypedList(Foo-11,Foo-20,Foo-34,Foo-49,Foo-54)

The Natural numbers

The size of a list, being the number of elements it holds, can only be 0 or greater. Since we want to parameterize a list on its size, the first step is to encode the natural numbers at the type level, such that there is a type for every natural number.

The following is a pivotal step of the whole process. The baseline was the encoding of natural numbers as presented here and here. It starts with:

sealed trait Natural

case object Zero extends Natural
case class Suc[N <: Natural]() extends Natural

The following diagram captures the type hierarchy.

Fig.1 – Type hierarchy of the encoding of natural numbers

This encoding is recursive. Suc (as in successor) is a type constructor that must take a concrete natural type to "construct" yet another natural type.
This infinite chain can only start at Zero, as Zero.type is the only concrete natural (and so it can be passed to Suc) that is not built from a previous one.

Throughout the post we will be using aliases for the natural types:


type Nat0 = Zero.type  
type Nat1 = Suc[Nat0]
type Nat2 = Suc[Nat1]
type Nat3 = Suc[Nat2]  
...

This is cleaner and more intelligible than using Suc[Suc[Suc[Suc[Suc[Suc[Suc[..... to represent some number.
One might wonder, then, what the advantage of such a fancy formulation is if we need to manually alias every single number.
A simpler formulation would be just:


sealed trait Natural
case object Zero extends Natural
case object One extends Natural
case object Two extends Natural
case object Three extends Natural
....

But notice that in this formulation the numbers have no relation to one another. We would be unable to sum, subtract or multiply two numbers, as we will see further down. These operations on naturals are essential for a generic formulation.

Raw signature of the TypedList

With the above we now have:

sealed trait TypedList[+Element, Size <: Natural]

Where we have covariance on Element as in a normal list.
Having covariance on Size as well was my first approach, but it led to an array of problems downstream, as discussed in this SO post. I settled on invariance, as Joe Barnes does (minute 24:40).

sealed trait TypedList[+Element, Size <: Natural]{
  def head: Element
}

case object TypedNil extends TypedList[Nothing, Zero.type] {
  override val head: Nothing = throw new Exception("Boom!")
}

case class TypedCons[Element, N <: Natural](
  head: Element,
  tail: TypedList[Element, N]
) extends TypedList[Element, Suc[N]]

The following is another fundamental concept:
TypedCons[_, N] extends TypedList[_, Suc[N]].
Why?
Wouldn't it be more intuitive to extend TypedList[_, N] instead?
Because the tail's size must be one less than that of the current list!
With the current formulation, a TypedCons[_, Nat4] is a TypedList[_, Nat5/*(= Suc[Nat4])*/] with a tail of size Nat4. Very smart! If the extension were on TypedList[_, N], how would we encode that property?

Fig.2 – TypedCons[A, Size] is a (subtype of) TypedList[A, Suc[Size]] with a tail: TypedList[A, Size]
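
As a small sanity check (a sketch using the aliases above; the :: syntax only appears later), a two-element list built with TypedCons is indeed a TypedList of size Nat2:

val two: TypedList[String, Nat2] = TypedCons("Foo", TypedCons("Bar", TypedNil))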

Continuing, notice we are still able to access the head of an empty list:


val emptyList = TypedNil
val foo = emptyList.head       // compiles and 
java.lang.Exception: Boom!     // blows up at run time

But now we are powerful enough to solve this; with the help of phantom types:

sealed trait TypedList[+Element, Size <: Natural]{
  def head[PhantomType >: Size <: Suc[_ <: Natural]]: Element
}

PhantomType (the name is irrelevant, it could be BlaBla) above exists for a single purpose: to encode a type constraint.
We demand the compiler find a type (the PhantomType) that has Size (of the list) as lower bound and Suc[_ <: Natural] as upper bound. With the help of figure 1 you will conclude this is impossible if Size is Zero.type, as in the TypedNil case; now the compiler fails:

val emptyList = TypedNil
val foo = emptyList.head     // Does not compile with error:  

error: type arguments [Zero.type] do not conform to method head's type parameter bounds [PhantomType >: Zero.type <: Suc[_ <: Natural]]
       TypedNil.head
                ^
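
For contrast, a minimal sketch of the non-empty case (built directly with TypedCons):

val nonEmpty: TypedList[Int, Nat1] = TypedCons(42, TypedNil)
nonEmpty.head   // compiles: the compiler can pick PhantomType = Nat1, which respects both bounds
// res: Int = 42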

The .map and .zip operations

These are the simplest methods to implement and do not require any reasoning more advanced than that of their counterparts on a normal list.

The .map

We are interested in developing a .map method with the same meaning as on a normal list: apply the same function to every element. In our case, the returned list should have the same length as the original:

sealed trait TypedList[+Element, Size <: Natural] {
  def map[OtherType](f: Element => OtherType): TypedList[OtherType, Size]
}
case object TypedNil extends TypedList[Nothing, Zero.type] {
   override def map[OtherType](f: Nothing => OtherType): TypedList[Nothing, Zero.type] = this
}  
case class TypedCons[Element, Size <: Natural](
  override val _head: Element,
  tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {
  override def map[OtherType](f: Element => OtherType): TypedList[OtherType, Suc[Size]] = TypedCons(f(head), tail.map(f))
}

The .zip

This case is slightly more interesting.
So, you have two lists and you want to perform an operation on every pair – each element (of the pair) being provided by a different list but at the same index.*
On a normal list you cannot enforce the two lists to have the same size; when one is longer, its additional elements are ignored:


val foo = List(1, 2, 3, 4)
val bar = List(1, 2, 3)

foo.zip(bar)
//res: List((1,1), (2,2), (3,3))

Now, not only can we enforce that, it comes out very naturally.
In the signature on TypedList, zip must receive another list whose size is the same as that of the current list:

sealed trait TypedList[+Element, Size <: Natural]{
  def zip[OtherType, C](that: TypedList[OtherType, Size], f: (Element, OtherType) => C): TypedList[C, Size]
}
case object TypedNil extends TypedList[Nothing, Zero.type]{
  override def zip[OtherType, C](that: TypedList[OtherType, Zero.type], f: (Nothing, OtherType) => C): TypedList[C, Zero.type] = this
}
case class TypedCons[Element, Size <: Natural](
  override val _head: Element,
  tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {
  override def zip[OtherType, C](that: TypedList[OtherType, Suc[Size]], f: (Element, OtherType) => C): TypedList[C, Suc[Size]] = that match {
    case TypedCons(thatHead, thatTail) => TypedCons(f(head, thatHead), tail.zip(thatTail, f))
  }
}

An interesting detour
Above, we must pattern match on the that list in the TypedCons implementation. Not only that, but the match has only 1 case and is still exhaustive! How come?
We must pattern match because that is a TypedList and there is no tail defined there; only on TypedCons. At the same time, we need access to the tail in order to achieve the semantics of zip via recursion.
One would think we would then have to specify logic for the 2 cases (TypedNil and TypedCons). Not necessary, since the compiler knows that a TypedList[OtherType, Suc[Size]] can never be TypedNil for any given Size <: Natural. This is kinda funny behavior from the compiler.

But could we take out the pattern match? Yes.
The difficulty is in defining a tail on the trait TypedList, since we must restrict it (the tail) to have a size 1 less than that of the TypedList. Recall that for TypedCons this was embedded in the definition of TypedCons itself and in the way it extends TypedList.
In other words, what should we substitute for ??? below?

sealed trait TypedList[+Element, Size <: Natural] {
  def head: Element
  def tail: TypedList[Element, ???]
}

The answer lies in defining a type Previous for every Natural type.

sealed trait Natural {
  type Previous <: Natural
}

case object Zero extends Natural {
  override type Previous = Zero.type
}

case class Suc[N <: Natural]() extends Natural {
  override type Previous = N
}

Which allows us then to do:

sealed trait TypedList[+Element, Size <: Natural]{
  protected def _tail: TypedList[Element, Size#Previous]
  def tail[PhantomType >: Size <: Suc[_ <: Natural]] = _tail
}
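
A quick check of how Previous resolves, using the aliases from before:

implicitly[Nat3#Previous =:= Nat2]   // compiles
implicitly[Nat0#Previous =:= Nat0]   // compiles: Zero is defined as its own predecessor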

With a tail now on the super trait, we don't need the annoying pattern match on that:

case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {
  override def zip[OtherType, C](that: TypedList[OtherType, Suc[Size]], f: (Element, OtherType) => C): TypedList[C, Suc[Size]] =
    TypedCons(f(head, that.head), tail.zip(that.tail, f))
}

The .concat method

This is where things get more interesting.
The .concat should give us a new list with the elements from the left followed by the elements of the right. Something like

val foo: TypedList[String, Nat2] = "Foo" :: "Bar" :: TypedNil
val bar: TypedList[String, Nat2] = "Baz" :: "Qux" :: TypedNil

foo.concat(bar)
// res: TypedList[String, Nat4] = TypedList(Foo, Bar, Baz, Qux)

The resulting list should naturally also be typed and so the compiler must know how to sum two natural numbers:

sealed trait Natural {
  type Plus[That <: Natural] <: Natural
}

which is trivial for Zero.type:

case object Zero extends Natural {
  override type Plus[That <: Natural] = That
}

It took me a few minutes and pen and paper to convince myself that the recursive formulation embodies summation:

case class Suc[N <: Natural]() extends Natural {
  override type Plus[That <: Natural] = Suc[N#Plus[That]]
}

The recursion (at the type level) eventually hits Zero.type, ending and returning the desired type.
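
To see the recursion unfold, expanding Nat2#Plus[Nat3] with the definitions above gives:

Nat2#Plus[Nat3] =
Suc[Nat1#Plus[Nat3]] =
Suc[Suc[Nat0#Plus[Nat3]]] =
Suc[Suc[Nat3]] = Nat5

implicitly[Nat2#Plus[Nat3] =:= Nat5]   // compiles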

With this we can define the signature:

sealed trait TypedList[+Element, Size <: Natural]{
  def concat[OtherElement >: Element, OtherSize <: Natural](that: TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Size#Plus[OtherSize]]
}

With the following implementations:

case object TypedNil extends TypedList[Nothing, Zero.type]{
  override def concat[OtherElement >: Nothing, OtherSize <: Natural](that: TypedList[OtherElement, OtherSize]): TypedList[OtherElement, OtherSize] = that
}
case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {
  override def concat[OtherElement >: Element, OtherSize <: Natural](that: TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Suc[Size#Plus[OtherSize]]] =
    TypedCons(head, tail.concat(that))
}

The .flatMap

This method is useful to explain another operation on natural numbers: multiplication.
On a .flatMap, we apply a function to every element of the left list, such that the result (for each element) is another list, flattening everything at the end.

val foo: List[Int] = List(1, 2, 3)
def bar(i: Int): List[String] = i match {
  case 1 => List("Foo")
  case 2 => List("Bar", "Baz")
  case 3 => List("Qux", "Quux", "FooBar", "BarBaz")
  case _ => List()
}
foo.flatMap(bar)
//res: List[String] = List(Foo, Bar, Baz, Qux, Quux, FooBar, BarBaz)

The above cannot be achieved on our typed list. The size of the lists returned in each case depends on values! Therefore the resulting list's size cannot be known at compile time.
We can however, formulate a .flatMap such that each returned list is of a constant/pre-determined size. Whether or not that is useful for anything is another question.
Before spoiling with the signature, understand that we need multiplication of naturals: If the original list has n elements, each of which is mapped to a list of m elements, the result after flattening shall have n*m elements.

sealed trait Natural {
  type Plus[That <: Natural] <: Natural
  type Mult[That <: Natural] <: Natural
}

Again the trivial case for Zero:

case object Zero extends Natural {
  override type Plus[That <: Natural] = That
  override type Mult[That <: Natural] = Zero.type
}

and the more elaborate

case class Suc[N <: Natural]() extends Natural {
  override type Plus[That <: Natural] = Suc[N#Plus[That]]
  override type Mult[That <: Natural] = (N#Mult[That])#Plus[That]
}

which leverages the distributive property of multiplication; note that at every iterative step one uses (N + 1) x M = N x M + M (where N + 1 is to be interpreted as Suc[N]).
Again, the formulation is recursive. It stops when, on a step, N corresponds to Zero.type.
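
The same exercise for multiplication, expanding Nat2#Mult[Nat3] with the definitions above:

Nat2#Mult[Nat3] =
(Nat1#Mult[Nat3])#Plus[Nat3] =
((Nat0#Mult[Nat3])#Plus[Nat3])#Plus[Nat3] =
(Nat0#Plus[Nat3])#Plus[Nat3] =
Nat3#Plus[Nat3] = Nat6

implicitly[Nat2#Mult[Nat3] =:= Nat6]   // compiles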

We are now capable of knowing at the type level the result of the multiplication of two natural numbers. This is essential and enables us to easily come up with a signature to .flatMap:

sealed trait TypedList[+Element, Size <: Natural] {
  def flatMap[OtherElement, OtherSize <: Natural](f: Element => TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Size#Mult[OtherSize]]
}
case object TypedNil extends TypedList[Nothing, Zero.type]{
  override def flatMap[OtherElement, OtherSize <: Natural](f: Nothing => TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Zero.type] = this
}

But it remains a challenge to come up with the actual implementation for the TypedCons case. After some thinking I came up with:

case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {

 override def concat[OtherElement >: Element, OtherSize <: Natural](that: TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Suc[Size#Plus[OtherSize]]] =
    TypedCons(head, tail.concat(that))

 override def flatMap[OtherElement, OtherSize <: Natural](f: Element => TypedList[OtherElement, OtherSize]): TypedList[OtherElement, (Size#Mult[OtherSize])#Plus[OtherSize]] =  
    f(head) concat tail.flatMap(f)
}

This recursive formulation is simple and elegant. It takes advantage of .concat and would do exactly what we want. Would! To my initial surprise, the compiler rejects it:

[error] type mismatch;
[error]  found   : TypedList[OtherElement,OtherSize#Plus[Suc[Size]#Previous#Mult[OtherSize]]]
[error]     (which expands to)  TypedList[OtherElement,OtherSize#Plus[Size#Mult[OtherSize]]]
[error]  required: TypedList[OtherElement,Size#Mult[OtherSize]#Plus[OtherSize]]
[error]     f(head) concat tail.flatMap(f)
[error]             ^

All these chained types might be a bit too convoluted for someone scrolling down.
Using more legible syntax, the compiler complains that it expects something of type (Size x OtherSize) + OtherSize, but f(head) concat tail.flatMap(f) actually resolves to type OtherSize + (Size x OtherSize).
Summation is commutative, so because we are sure we encoded summation of Nats correctly, these two should resolve, for any Size and OtherSize, to the same type. Well, it turns out the compiler doesn't know that!

I found two ways to solve this:

1. Acceptable trick

While f(head) concat tail.flatMap(f) above resolves to OtherSize + (Size x OtherSize), if we reverse the order to tail.flatMap(f) concat f(head) we get the type the compiler expects: (Size x OtherSize) + OtherSize.
But tail.flatMap(f) concat f(head) does not do solely what we want; it also reverses the order of the elements of the list – as is apparent from the tail appearing first:

val foo: TypedList[Int, Nat3] = 1 :: 2 :: 3 :: TypedNil
def bar(i: Int): TypedList[Int, Nat2] = TypedCons(i*10, TypedCons(i*10, TypedNil))
foo.flatMap(bar)
// res: TypedList[Int, Nat6] = TypedList(30, 30, 20, 20, 10, 10)

The trick, therefore, consists in reversing the list either before or after the .flatMap operation. Inefficient but effective.

package foo
case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {

  override def concat[OtherType >: Element, OtherSize <: Natural](that: TypedList[OtherType, OtherSize]): TypedList[OtherType, Suc[Size#Plus[OtherSize]]] =
    TypedCons(head, tail.concat(that))

  override def flatMap[OtherType, OtherSize <: Natural](f: Element => TypedList[OtherType, OtherSize]): TypedList[OtherType, Size#Mult2[OtherSize]#Plus[OtherSize]] =
    _flatMapInternal(f).reverse
  override private [foo] def _flatMapInternal[OtherType, OtherSize <: Natural](f: Element => TypedList[OtherType, OtherSize]): TypedList[OtherType, Size#Mult2[OtherSize]#Plus[OtherSize]] =
    tail._flatMapInternal(f) concat f(head)

  override def reverse: TypedList[Element, Suc[Size]] = tail.reverse :+ head

  override def :+[A1 >: Element](elem: A1): TypedCons[A1, Suc[Size]] = TypedCons(head, tail.:+(elem))
}

Notice it was necessary to develop a reverse operation which in turn required the append operation :+ as well.
More importantly, the actual "flatMapping" is done at _flatMapInternal. The api method flatMap is then only responsible for calling it and reversing the result it gets. Now we obtain:

val foo: TypedList[Int, Nat3] = 1 :: 2 :: 3 :: TypedNil
def bar(i: Int): TypedList[Int, Nat2] = TypedCons(i*10, TypedCons(i*10, TypedNil))
foo.flatMap(bar)
// res: TypedList[Int, Nat6] = TypedList(10, 10, 20, 20, 30, 30)

2. Formulate multiplication at the type level differently

The formulation of multiplication above is correct. In fact, it is the same as the one from these guys on slide 27, who do appear to know a lot about the issue.
To solve our problem, however, I just tried changing the order of the operands:

case class Suc[N <: Natural]() extends Natural {
  override type Mult[M <: Natural] = M#Plus[N#Mult[M]]
// instead of
// override type Mult[M <: Natural] = (N#Mult[M])#Plus[M]
}

The question is whether this formulation is equivalent. Does it also embody multiplication as required? Yes, it does. All of the below compile:

implicitly[Nat3#Mult[Nat2] =:= Nat6]
implicitly[Nat5#Mult[Nat2] =:= Nat10]

implicitly[Nat2#Mult[Nat9] =:= Nat2#Mult2[Nat9]]
implicitly[Nat3#Mult[Nat8] =:= Nat3#Mult2[Nat8]]

Where Mult2 corresponds to the initial "encoding" of multiplication.

With this new formulation for the type constructor member Mult, the compiler will gladly accept:

case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {

 override def concat[OtherElement >: Element, OtherSize <: Natural](that: TypedList[OtherElement, OtherSize]): TypedList[OtherElement, Suc[Size#Plus[OtherSize]]] =
    TypedCons(head, tail.concat(that))

 override def flatMap[OtherType, OtherSize <: Natural](f: Element => TypedList[OtherType, OtherSize]): TypedList[OtherType, OtherSize#Plus[Size#Mult[OtherSize]]] =
    f(head) concat tail.flatMap(f)
}

There is no need for the internal flatMap function, nor for the reverse operation. This approach is not only more elegant, but also more efficient.

The .split

At last the .split operation. The most difficult to implement.
We are interested in the following behaviour:

val foo: TypedList[String, Nat5] = "Foo" :: "Bar" :: "Baz" :: "Qux" :: "Quux" :: TypedNil

val (left, right) = foo.split[Nat2]
// res (left): TypedList[String, Nat2] = TypedList(Foo, Bar)
// res (right): TypedList[String, Nat3] = TypedList(Baz, Qux, Quux)

If we want the resulting lists to be typed, then we need to encode subtraction of natural numbers. Above, the compiler needed to know how to subtract 2 from 5.

Encoding multiplication is easy once you read how to encode summation.
Subtraction was trickier. With the help of @Alexey Romanov in this SO question, we have:

package foo
sealed trait Natural {
  type Previous <: Natural
  private [foo] type _Minus[M <: Natural] <: Natural
  type Minus[M <: Natural] = M#_Minus[Suc[Previous]]
}

This encoding might feel confusing. Reading the SO thread might shed some light.
This hack is very much the same as the "1st solution" for the flatMap problem discussed previously, where we had an internal _flatMapInternal. This is its type-level counterpart.
The type constructor to be exposed is Minus[M <: Natural]. _Minus is internal and a way to achieve the semantics of subtraction.

case object Zero extends Natural {
  override type Previous = Zero.type
  override private [foo] type _Minus[M <: Natural] = M
}
case class Suc[N <: Natural]() extends Natural {
  override type Previous = N
  override private [foo] type _Minus[M <: Natural] = Previous#_Minus[M#Previous]
}

Notice that _Minus above, on its own, only works as expected if M is greater than Suc[N]:

Nat2#_Minus[Nat3] = 
Nat1#_Minus[Nat2] = 
Nat0#_Minus[Nat1] = Nat1
// Good

But

Nat3#_Minus[Nat2] = 
Nat2#_Minus[Nat1] = 
Nat1#_Minus[Nat0] = 
Nat0#_Minus[Nat0] = Nat0
// Bad

Therefore, we hide _Minus and wrap it within Minus, which inverts the operands (swapping the n and m in n - m) before calling _Minus.
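
To see the wrapping in action, this is how Nat5#Minus[Nat2] expands with the definitions above:

Nat5#Minus[Nat2] =
Nat2#_Minus[Suc[Nat4]] =
Nat2#_Minus[Nat5] =
Nat1#_Minus[Nat4] =
Nat0#_Minus[Nat3] = Nat3

implicitly[Nat5#Minus[Nat2] =:= Nat3]   // compiles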

We then obtain the desired behaviour:
// Subtracting n-m:

// Works when n>m
implicitly[Nat9#Minus[Nat5] =:= Nat4]   // compiles

// Gives 0 when n<m (more than fine; we neither are interested in nor model the negative numbers)
implicitly[Nat0#Minus[Nat5] =:= Nat0]   // compiles
implicitly[Nat2#Minus[Nat5] =:= Nat0]   // compiles

We have defined the types to be returned by the split method. Let us proceed to defining the method itself.

sealed trait TypedList[+Element, Size <: Natural] {
  def split[At <: Suc[_ <: Natural]]: (TypedList[Element, At], TypedList[Element, Size#Minus[At]])
}

For the TypedNil case we will just throw an exception, because further down we will be protecting the method from being called at all.
For TypedCons, my idea was as follows.
To split TypedList("Foo", "Bar", "Baz", "Qux", "Quux") at index Nat3:

  1. Construct a TypedList[Natural, Nat3](nat1, nat2, nat3).
  2. Construct a TypedList[Natural, Nat2](nat4, nat5).
  3. Develop a get(index: Natural) method, on our TypedList, such that it will retrieve the element of the list at said index.
  4. Map lists 1. and 2., passing them the function get.
case class TypedCons[Element, Size <: Natural](
  override protected val _head: Element,
  override protected val _tail: TypedList[Element, Size]
) extends TypedList[Element, Suc[Size]] {
  override def split[At <: Suc[_ <: Natural]](
    implicit n: At,
    leftList: TypedList[Natural, At],
    rightList: TypedList[Natural, Suc[Size]#Minus[At]]
): (TypedList[Element, At], TypedList[Element, Suc[Size]#Minus[At]]) = 
  (
    leftList.map(_get),
    rightList.map(i => _get(i.plus(n)))
  )
}

Requiring 3 implicits, this signature seems daunting.
Why are leftList and rightList relevant?

  1. Because they provide typed lists of the same size as the ones the function has to return! – half the job is done right here.
    And why are they implicit parameters?
    • It is a way to have the compiler recursively build something, as we will see below (is there any other way, even?)
  2. Because their elements are the sequence of natural numbers starting at nat1.
    And why is this relevant?
    • Because we also have a function _get that retrieves an element from a typed list when given an index!

We don't show the implementation of _get here – it is not difficult to formulate.
But how is the compiler able to find these implicits? Because we defined, in the implicit search scope, and for this usage only, specifically "engineered" implicits:


implicit def typedListOfNats[N <: Natural](implicit previousNatTypedList: TypedList[Natural, N], thisNat: Suc[N]): TypedList[Natural, Suc[N]] =
    previousNatTypedList :+ thisNat

implicit def emptyList[A]: TypedList[A, Zero.type] = TypedNil

It is hopefully not difficult to see that the above implicits allow for the behaviour:

implicitly[TypedList[Natural, Nat4]]
//res: TypedList[Natural, Nat4] = TypedList(nat1, nat2, nat3, nat4)

In retrospect, maybe I should have wrapped the 3 implicits within something like SplittableOps[At <: Natural, Size <: Natural] for more control. Imagine someone provided implicit instances of TypedList[Natural, Suc[N]] with a different usage in mind. We would then run the risk of a faulty implementation of split, since the wrong leftList and rightList might be injected.
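
A rough sketch of that idea (hypothetical; it is not in the repository) would be to bundle the three implicits into a single, purpose-built wrapper, so that only instances assembled here can be injected into split:

case class SplittableOps[At <: Natural, Size <: Natural](
  at: At,
  leftList: TypedList[Natural, At],
  rightList: TypedList[Natural, Size#Minus[At]]
)

object SplittableOps {
  // the single place where the three pieces are assembled
  implicit def instance[At <: Natural, Size <: Natural](
    implicit at: At,
    leftList: TypedList[Natural, At],
    rightList: TypedList[Natural, Size#Minus[At]]
  ): SplittableOps[At, Size] = SplittableOps(at, leftList, rightList)
}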

Lastly, notice that we are still able to split a TypedList by an index greater than the list’s size:

val foo: TypedList[String, Nat5] = "Foo" :: "Bar" :: "Baz" :: "Qux" :: "Quux" :: TypedNil

foo.split[Nat6]
// res: java.lang.Exception: Boom!

Restricting calls to the split method for indexes greater than the list's size is more demanding than using a phantom type, as done to protect calls to head and tail on an empty list. Instead, we need to protect these calls with an implicit evidence.

def split[At <: Suc[_ <: Natural]](implicit ev: LowerOrEqual[At, Size])

The rationale goes as follows: every time split[At] is called (from within a given TypedList[Element, Size]), the compiler must be able to find an instance of the concrete type LowerOrEqual[At, Size], where At is the concrete type with which split was called and Size is the concrete type of the list. If it is not able to, it will fail with a compilation error.
Our job is therefore to come up with LowerOrEqual[N <: Natural, M <: Natural] and corresponding implicits such that they (the implicits) can only be found if N <= M.

import scala.annotation.implicitNotFound

@implicitNotFound("${N} is not lower or equal to ${M}")
sealed trait LowerOrEqual[N <: Natural, M <: Natural]

object LowerOrEqual {
  implicit def foo[N <: Natural, M <: Natural](implicit ev: LowerOrEqual[N, M]): LowerOrEqual[Suc[N], Suc[M]] = new LowerOrEqual[Suc[N], Suc[M]] {}
  implicit def bar[M <: Natural]: LowerOrEqual[Zero.type, M] = new LowerOrEqual[Zero.type, M] {}
}

Notice that:

  1. trait LowerOrEqual doesn't need to have any members (vals, vars, defs, ...). It is the type itself (in this case a type constructor), represented by LowerOrEqual, and the implicit defs that matter for what we want to achieve.
  2. The names foo and bar are also pretty irrelevant.

We can "unit test" that the above encapsulates the constraint we want:

implicitly[LowerOrEqual[Nat5, Nat6]]    // compiles (i.e. implicit found)
implicitly[LowerOrEqual[Nat2, Nat2]]    // compiles (i.e. implicit found)

implicitly[LowerOrEqual[Nat6, Nat5]]    // does not compile (i.e. implicit not found)
// error: Nat6 is not lower or equal to Nat5

How exactly do those two implicit defs, foo and bar, achieve this behavior? Via recursive implicit resolution and clever engineering of the return types of the two functions. Let's study the possible cases:

case 1. LowerOrEqual[Nat0, NatY] (for some NatY)
The compiler realizes function bar can return the desired type (for the appropriate M in bar[M <: Natural]). Problem solved.
case 2. LowerOrEqual[NatX, Nat0] (for some NatX other than Nat0)
The compiler realizes neither function bar nor foo can ever return the necessary type. Compiler error.
case 3. LowerOrEqual[NatX, NatY] (where neither NatX nor NatY is Nat0)
The compiler understands function foo may potentially return the desired type. For that, however, it needs to find yet another implicit: the argument ev in foo. This gives rise to a recursive implicit resolution by the compiler, which ultimately leads to either case 1, thus compiling, or case 2, thus not compiling.
You should be able to notice, by looking at the return type of function foo and the type of ev, that:
a. Case 1 is reached if and only if N <= M
b. Case 2 is reached if and only if N > M
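
Tracing the implicit resolution for two small examples makes this concrete:

// implicitly[LowerOrEqual[Nat2, Nat3]]
// foo requires LowerOrEqual[Nat1, Nat2]
// foo requires LowerOrEqual[Nat0, Nat1]
// bar provides LowerOrEqual[Zero.type, Nat1] – found, so the whole chain compiles (case 1)

// implicitly[LowerOrEqual[Nat2, Nat1]]
// foo requires LowerOrEqual[Nat1, Nat0]
// neither foo (Nat0 is not a Suc[_]) nor bar (Nat1 is not Zero.type) applies – compilation error (case 2)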

The purpose of the implicitNotFound annotation is to give more informative errors to clients of the library. Otherwise we would get: could not find implicit value for parameter e: LowerOrEqual[Nat6,Nat5] – which actually sheds more light on the internals of what is going on.


The above methods are the most interesting to develop and explain.
In the code on GitHub there are a couple more, but there was no attempt to recreate all the methods of the normal List.

If you have any questions or comments about the post, I would be happy to hear them. Just write in the comment section.

Footnotes

  1. Here I extend the behavior of zip to applying any given function to a pair of elements, instead of just joining the two elements in a tuple.
  2. The motivation for developing this typed list and blog post was a very nice talk by Joe Barnes.
    Another essential source of knowledge was this blog post from Yao Li, which I highly recommend. That said, I derived everything in the library on my own and added several new features, the most important of these being the split operation and the type constructor Minus on naturals.

References

1. Barnes, Joe. Typelevel Programming 101: The Subspace of Scala.
https://www.youtube.com/watch?v=_-J4YRI1rAw&t=5s

2. Li, Yao. Dependent Types in Scala.
https://lastland.github.io/pl-blog/posts/2017-09-03-dtscala.html

3. Zeiger, Stefan. Type-Level Computations in Scala.
http://slick.lightbend.com/talks/scalaio2014/Type-Level_Computations.pdf

Imposing invariants in Scala

Imagine an opera house allows reservations for a show. For each reservation two people can come: the buyer of the reservation himself, and his guest.
As in any show-house, there are many types of seats; some are standing, some normal and some on VIP tribunes.

As a restriction from the opera house, both buyer and guest must be in the same seating area. For example, it is not permitted for one ticket to be standing while the other is for the normal seated area.
I believe this is what people call an invariant: a requirement that must hold for the data to have meaning.
How, then, to enforce this constraint?
Let us first define the data types relevant to the problem:  

sealed trait Seat

case class NormalSeat(row: Int, column: Int) extends Seat
case class VipTribune(tribuneId: String) extends Seat
case object Standing extends Seat

In what follows it will be useful to keep in mind the type relations above, shown in the following diagram.

Fig.1 – Type hierarchy for different “types” of seat available. 

We also have:

case class Ticket[SeatType <: Seat](
  verificationCode: String,
  price: Int,
  seat: SeatType
)
case class Reservation[SeatType <: Seat](
  userTicket: Ticket[SeatType],
  guestTicket: Ticket[SeatType]
)

The first thing to notice is that the above does not deliver what we want. The following compiles: 

val reservation = Reservation(
  userTicket = Ticket("id1", 123, VipTribune("east")),
  guestTicket = Ticket("id2", 123, Standing)
)

The compiler tries to infer the type SeatType for reservation. From the user ticket it thinks it is VipTribune. The compiler is satisfied since this is a sub-type of Seat, which is the upper type bound on Reservation.
This is contradicted by the guest ticket, where we encounter type Standing. Bad news, since SeatType on the user ticket and SeatType on the guest ticket refer, well, to the same type!
Before raising an error, the compiler realizes both Standing and VipTribune are actually sub-types of Seat; this means they can pretend to be their super-type (Seat). Because Seat respects the type bound Seat <: Seat, the compilation is successful with val reservation: Reservation[Seat].

There are at least 4 ways to enforce the restriction of the opera-house:  

  1. using require(condition) on the class constructor of Reservation
  2. Hiding the public constructor of the class, defining a public one that validates and returns Option[Reservation]
  3. Using generalized type constraints
  4. Using OOP – subtyping

1. Using require(condition) on the class constructor of Reservation

case class Reservation[SeatType <: Seat](
  userTicket: Ticket[SeatType],
  guestTicket: Ticket[SeatType]
){

  private def verifyTypeRestriction: Boolean = (userTicket.seat, guestTicket.seat) match {
    case (NormalSeat(_, _), NormalSeat(_, _)) => true
    case (VipTribune(_), VipTribune(_)) => true
    case (_: Standing.type , _: Standing.type ) => true
    case _ => false
  }

  require(verifyTypeRestriction)
}

On this approach we delegate to the users of case class Reservation the responsibility of never trying to instantiate a value that doesn’t conform to the rules.
If they do breach the rule, the code compiles but throws an exception at run time.  

While this is the least adequate of the solutions, we can have the peace of mind that, if it does blow up, it will do so on instantiation of the object. This means some other block of code further down can expect the invariant to be respected and avoid dealing with the inconsistent cases (since if that block of code is being executed, the previous instantiation succeeded). We have centralized the point of failure at the instantiation of the object.

Notice lastly that extending Reservation does not bypass the constructor, so that the require(condition) will always be executed.
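
As a hypothetical usage, the unlawful reservation from before still compiles, but now fails as soon as it is instantiated:

val mixed = Reservation(
  userTicket = Ticket("id1", 123, VipTribune("east")),
  guestTicket = Ticket("id2", 123, Standing)
)
// compiles, but at run time:
// java.lang.IllegalArgumentException: requirement failed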

2. Hiding the public constructor of the class, defining a public one that validates and returns Option[Reservation]

The idea here is to make it impossible for future users of Reservation to create an instance directly; they can do so only through one single method, engineered by us to take into account unlawful constructor parameters. The method ought to return a Reservation boxed in a data type that can represent that something went wrong, like Option/Try/Either.
It is then the responsibility of the user to deal with the returned boxed type accordingly.  

There are two ways to achieve this. Subchapter 19.2 in [1] is useful in what follows.

The first is by adding private before the parameter list of the class, and then defining an apply method on the companion object. 

class Reservation[SeatType <: Seat] private (
  userTicket: Ticket[SeatType],
  guestTicket: Ticket[SeatType]
)

While one doesn’t need to name the method apply, it must be on the companion object; otherwise one does not have access to the constructor which is defined as private.  Recall that the companion object has the same access rights as the class. 

object Reservation {
  def apply[SeatType <: Seat]( userTicket: Ticket[SeatType], guestTicket: Ticket[SeatType]): Option[Reservation[SeatType]] = (userTicket.seat, guestTicket.seat) match {
    case (NormalSeat(_, _), NormalSeat(_, _)) => Some(new Reservation(userTicket, guestTicket))
    case (VipTribune(_), VipTribune(_)) => Some(new Reservation(userTicket, guestTicket))
    case (_: Standing.type , _: Standing.type ) => Some(new Reservation(userTicket, guestTicket))
    case _ => None
  }
}

Notice that trying to extend Reservation on another file, possibly bypassing our apply method, will not work.
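
As a hypothetical usage of the factory above, the invalid combination now yields None at run time instead of blowing up:

val ok = Reservation(Ticket("id1", 123, NormalSeat(1, 2)), Ticket("id2", 123, NormalSeat(1, 3)))
// res: Option[Reservation[NormalSeat]] = Some(...)

val bad = Reservation(Ticket("id1", 123, VipTribune("east")), Ticket("id2", 123, Standing))
// res: Option[Reservation[Seat]] = None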

As an alternative, one can hide the case class altogether, i.e. not only its constructor but also its name (so that client code cannot use it as a type). To that end one defines a trait which exposes only what we want:

sealed trait Reservation[SeatType <: Seat] {
  val userTicket: Ticket[SeatType]
  val guestTicket: Ticket[SeatType]
}

and on the companion object we have the actual implementation:

object Reservation {

  private case class ReservationImpl[SeatType <: Seat](
    userTicket: Ticket[SeatType],
    guestTicket: Ticket[SeatType]
  ) extends Reservation[SeatType]

  def apply[SeatType <: Seat](userTicket: Ticket[SeatType], guestTicket: Ticket[SeatType]): Option[Reservation[SeatType]] = (userTicket.seat, guestTicket.seat) match {
    case (NormalSeat(_, _), NormalSeat(_, _)) => Some(ReservationImpl(userTicket, guestTicket))
    case (VipTribune(_), VipTribune(_)) => Some(ReservationImpl(userTicket, guestTicket))
    case (_: Standing.type , _: Standing.type ) => Some(ReservationImpl(userTicket, guestTicket))
    case _ => None
  }
}

Notice the relevance of declaring the trait sealed. It prevents the user from extending the trait by another class on some other file, which would allow him to bypass the factory method apply above. Equally important is to have ReservationImpl as private so that it cannot be extended on client code. In fact, I guess we could even ignore ReservationImpl altogether and instantiate an anonymous class instead.

This approach has a disadvantage concerning type erasure which also plagues the next approach using generalized type constraints. We will address the issue there.

3. Using generalized type constraints

This is the most novel of the approaches and, as far as I know, could not be achieved in Java, unlike the remaining solutions.

Let us recall the original problem. On

case class Reservation[SeatType <: Seat](
  userTicket: Ticket[SeatType],
  guestTicket: Ticket[SeatType]
)

the compiler can impose that SeatType has Seat as its upper bound. It cannot however impose that SeatType is exclusively one of the sub-types. Meaning that, given userTicket: Ticket[VipTribune] and guestTicket: Ticket[NormalSeat], both VipTribune and NormalSeat can pretend to be type Seat, which results in userTicket: Ticket[Seat] and guestTicket: Ticket[Seat], which in turn conforms to the upper bound constraint (Reservation[SeatType <: Seat]). The issue really is that a given type A is a subtype of itself.

Fig. 1 (revisited) – Type hierarchy for different “types” of seat available. 

The trick in this approach is to use a generalized type constraint. These are not built-in features of the language (like type bounds) but rather constructs that some smart people have engineered to leverage Scala's type system and accomplish more impressive things.

There are 2 or 3 generalized type constraints in the Scala standard library, in package scala.Predef, which is imported by default. One of the most famous is <:<.

The one we are going to use is found on the external library Shapeless by Miles Sabin.

It is written =:!=[A, B]; this is a trait named =:!= parameterized by two types. It can also be written A =:!= B.

I will not try to explain the intricacies of these things. A detailed explanation on related type constraints can be found here. I hope to transmit a feeling on how to use them. 

Each of these type constraints is used with a specific goal. The goal of A =:!= B is to impose that types A and B are different, whatever they might be. This is accomplished via implicits. Specifically, you declare an implicit parameter of type A =:!= B, where A and B are two types you want to make sure are different. Normally, either A, B or both are abstract/parameterized types, like type Ole below.

  import shapeless.=:!=
  def foo[Ole](a: Ole)(implicit ev: =:!=[Ole, Int]): Ole = a

  foo[String]("concurrency rules.")  /* Compiles */
  foo[Int](5)                        /* Does not compile */
  foo[Boolean](true)                 /* Compiles */

Briefly, if Ole and Int are different, the compiler will always be able to find the required implicit and compilation succeeds. If they are the same, something else happens and a compilation error is raised. How exactly the syntax (implicit ev: =:!=[Ole, Int]) achieves that behavior is beyond this post; it should be accepted as the truth.

This can however be used to tackle our problem:

import shapeless.=:!=
case class Reservation[SeatType <: Seat](
  userTicket: Ticket[SeatType],
  guestTicket: Ticket[SeatType]
)(implicit ev: SeatType =:!= Seat)

The implicit evidence ev demands the compiler find proof that SeatType, whatever type it might be, is not type Seat. If it cannot, it raises a compilation error. With figure 1 as guidance: because the abstract type SeatType has the concrete type Seat as upper bound, and since, given the generalized type constraint, SeatType must not be Seat, it must surely be one of Standing, VipTribune or NormalSeat. Problem solved. So, in contrast with the 2 solutions seen so far, an attempt to instantiate a reservation with unsound parameters is caught during compilation:

val standing = Standing
val normalSeated = NormalSeat(4, 5)
val tribuneSeat = VipTribune("east")

//  Compiles
val bar = Reservation(
  Ticket("id1", 123, normalSeated),
  Ticket("id2", 456, normalSeated)
)

//  Will never compile
val foo = Reservation(
  Ticket("id1", 123,  standing),
  Ticket("id2", 456, tribuneSeat)
)

//  Will never compile
val baz = Reservation(
  Ticket("id1", 123, normalSeated),
  Ticket("id2", 456, tribuneSeat)
)

When it does not compile we get the error:

ambiguous implicit values:
[error]  both method neqAmbig1 in package shapeless of type [A]=> A =:!= A
[error]  and method neqAmbig2 in package shapeless of type [A]=> A =:!= A
[error]  match expected type com.github.cmhteixeira.invariants.Seat =:!= com.github.cmhteixeira.invariants.Seat
[error] val foo = Reservation(
[error]                           ^

The error message is not very helpful, but one must live with it.¹

This approach, like the 2 previous ones, suffers from an inconvenience: the type SeatType of a given reservation is lost at run time due to type erasure. We cannot pattern match on a given reservation as in the following example:

  def guideUsersToTheirSeat(reservation: Reservation[Seat]): String = reservation match {
    case _: Reservation[Standing.type] => "Go to gate 1. You and your guest must stand."
    case reser: Reservation[VipTribune] => s"Go to gate 2. Your tribune is named ${reser.guestTicket.seat.tribuneId}"
    case reser: Reservation[NormalSeat] => s"Go to gate 3. You and your guest are in seats ${reser.userTicket.seat.column} and ${reser.guestTicket.seat.column} respectively"
  }

The method guideUsersToTheirSeat is useless, as it will always follow the first case of the match.
The compiler will in fact warn us of this:

non-variable type argument com.github.cmhteixeira.invariants.Standing.type in type pattern com.github.cmhteixeira.invariants.Reservation[com.github.cmhteixeira.invariants.Standing.type] is unchecked since it is eliminated by erasure
[warn]     case a: Reservation[Standing.type] => "Go to gate 1"

Fortunately the type parameter SeatType on Reservation is not a phantom type. It is associated with the value of seat within each Ticket (userTicket and guestTicket) of the reservation. To avoid the above problem we must therefore pattern match on the seats of each ticket.

  def guideUsersToSeat(reservation: Reservation[Seat]): String = {
    val userSeat: Seat = reservation.userTicket.seat
    val guestTicket: Seat = reservation.guestTicket.seat

    (userSeat, guestTicket) match {
      case (_: Standing.type, _: Standing.type ) => "Go to gate 1"
      case (userSeat: VipTribune, _: VipTribune ) => s"Go to gate 2. Your tribune is named ${userSeat.tribuneId}"
      case (userSeat: NormalSeat, guestSeat: NormalSeat ) => s"Go to gate 3. You and your guest are in seats ${userSeat.column} and ${guestSeat.column} respectively"
    }
  }

The code above works well and does what we are looking for. But we are not quite there yet. The compiler still emits a warning:

match may not be exhaustive.
[warn] It would fail on the following inputs: (NormalSeat(_, _), Standing), (NormalSeat(_, _), VipTribune(_)), (Standing, NormalSeat(_, _)), (Standing, VipTribune(_)), (VipTribune(_), NormalSeat(_, _)), (VipTribune(_), Standing)
[warn]     (userSeat, guestTicket) match {
[warn]     ^

The compiler is concerned we might not be considering all possible combinations of seats within a reservation.
We are not considering the cross cases (e.g. (Standing, NormalSeat)) because, with the above tricks, we are certain that any created instance of Reservation will always respect the invariant.
The compiler isn't. For it, a single Reservation[Seat] might contain tickets in different seating areas. We have to match on both tickets (user and guest), each of which, as far as the compiler goes, can be of any type (Standing, VipTribune, NormalSeat, Seat). Because we (arguably) know more than the compiler in this case, we can tell it not to be concerned with the non-exhaustive pattern match via an annotation (chapter 15.5 in [1]):

((userSeat, guestTicket): @unchecked) match {

Which will prevent the warning during compilation.

4. Using OOP – Subtyping

This is the approach most of us would follow. Probably the most pragmatic.

It requires having a bare trait Reservation and extending it for every sub-type of Seat. When there are many sub-types, this would become very repetitive.

sealed trait Reservation[+SeatType <: Seat]

case class ReservationSeated(
  userTicket: Ticket[NormalSeat],
  guestTicket: Ticket[NormalSeat]
) extends Reservation[NormalSeat]

case class ReservationStanding(
  userTicket: Ticket[Standing.type],
  guestTicket: Ticket[Standing.type ]
) extends Reservation[Standing.type]

case class ReservationVipTribune(
  userTicket: Ticket[VipTribune],
  guestTicket: Ticket[VipTribune]
) extends Reservation[VipTribune]

On the other hand, all the type information is preserved at run time. We can, for example, pattern match with confidence:

def guideUsersToSeat(reservation: Reservation[Seat]): String = reservation match {
  case ReservationStanding(_, _) => s"Go to gate 1."
  case ReservationSeated(userTicket, guestTicket) => s"Go to gate 2. You and your guest are in seats ${userTicket.seat.column} and ${guestTicket.seat.column} respectively "
  case ReservationVipTribune(userTicket, _) => s"Go to gate 3. You are on tribune ${userTicket.seat.tribuneId}"
  }

And because none of our sub-typed Reservation classes has tickets with seats of different types, the invariant is always assured.

Do you know any other approach? I would be happy to hear it. Just write in the comment section.

Footnotes

1. There is an answer on StackOverflow by Régis Jean-Gilles that attempts to solve that problem. I did not analyse it in depth. I am also curious why Shapeless does not include it (assuming that it works) – https://stackoverflow.com/questions/6909053/enforce-type-difference

References

1. Odersky M., Spoon L., Venners B. Programming in Scala, 3rd Edition

2. Sabin M. https://github.com/milessabin/shapeless

3. Sabin M. https://milessabin.com/blog/2011/06/09/scala-union-types-curry-howard/

Play Framework in Scala – Action Composition

I write this post as a way to consolidate my knowledge on Actions.
Most of what I write is explained in the Play Framework docs, although I believe you can benefit from my explanation. The last chapter, Chaining ActionBuilders, is the most interesting. This topic is under-explained in the docs, making it the most valuable.

Overview

In an MVC framework, the controllers are one of the 3 layers. They are a bridge between the incoming requests and the business logic.
In the Play Framework, each controller is a class which is able to generate various Actions. An Action defines how a request should be dealt with; it can be viewed as a function from a Request to a Result.
Action composition is useful when many of the Actions you want to build share part of their logic.

Imagine, for instance, that no matter what request you receive on any provided endpoint, you want to log this request and perhaps persist it in a database.
This logic should be centralized and easily reusable by each Action that needs it:

Logging_ActionBuilder

The ActionBuilder

The trait ActionBuilder facilitates the creation of these Actions that share logic.
Referring to the figure above, for logging to be easily reused by any other Action, one creates a custom ActionBuilder:

class LoggingActionBuilder[B](val bodyParser: BodyParser[B])
                             (implicit val executionContext: ExecutionContext) extends ActionBuilder[Request, B]

Don't pay attention to bodyParser, executionContext or the type parameter B.
The main thing about this LoggingActionBuilder (and to any other ActionBuilder) is the method:

def invokeBlock[A](request: Request[A],
                   block: Request[A] => Future[Result]): Future[Result] = {
  Logger.debug(s"Just received the following request: $request")
  block(request)
}

This method is unimplemented on the ActionBuilder trait. It must be defined by us for each custom ActionBuilder we create.
The method is important because it is where you describe the common functionality that the Actions built through it must have.
In this case, the common functionality is:
First: log the Request.
Then: do something else with the request (the call to block).

The “something else” is not yet defined on the LoggingActionBuilder. It is each Action stemming from this ActionBuilder which, by providing a specific block function, materializes that second step.

Consider the following 2 examples, which are methods that return an Action and that could live inside a given controller. They would also be associated, in the conf/routes configuration file, with an HTTP method and URI pattern.

 
def action1(): Action[_] = LoggingActionBuilder { request =>
  val userId: Option[String] = request.headers.get("userId")
  userId.fold{
    BadRequest("The request did not have a userId on the headers")
  }{ userId =>
    Ok(s"The user id is $userId")
  }
}

def action2(): Action[_]  = LoggingActionBuilder { request =>
  val itemToBuy = request.body.\("itemToBuy").validate[String]
  itemToBuy.fold({
    _ => BadRequest("No valid item to buy on the body of request.")
  },{ item =>
    Ok(s"You are buying $item")
  })
}

About them, we can say:
1. They are Actions.
2. They are different Actions, i.e., they do different things in response to a given request.
3. They are built from the LoggingActionBuilder by providing the block function referred to in the invokeBlock method of the LoggingActionBuilder.
4. Most importantly, both action1() and action2(), when called, will have their request logged.

Explanation of the syntax

Creating an Action from an ActionBuilder consists in calling the apply method of the ActionBuilder.

trait ActionBuilder[+R[_], B] {
  final def apply(block: R[B] => Result): Action[B] = ...
  ...
}

This apply method takes as argument the block function which “materializes” the Action. In the examples above, the curly braces are a call to this apply method, and the function inside is the specific block of the Action at hand.

Changing the type parameter R[_]

The left type parameter of an ActionBuilder constrains the signature of the block function in the invokeBlock method:

trait ActionBuilder[+R[_], B] extends ActionFunction[Request, R] {
  def invokeBlock[A](request: Request[A],
                     block: R[A] => Future[Result]): Future[Result]

On the custom LoggingActionBuilder we used type Request, where we log a request and then apply the block function to that same request object:

def invokeBlock[A](request: Request[A],
                   block: Request[A] => Future[Result]): Future[Result] = {
  Logger.debug(s"Just received the following request: $request")
  block(request)
}

Often it is convenient that a custom ActionBuilder, via its invokeBlock method,
takes a request, does something with it, builds an instance of another type, and only then applies the block function to that new instance.
Authentication of requests is a good example. Here, a particular request, if authenticated, is normally associated with a User. It is therefore useful to obtain the corresponding instance of User and pass it along, meaning, applying the block function to that instance.
This is useful because the Actions built through this ActionBuilder may now count on the existence of an instance of User instead of a raw request.
This of course changes the signature of the block function that must be provided for the subsequent creation of each Action.
A naive AuthenticationActionBuilder could be:
 


class AuthenticatedAction[B](val parser: BodyParser[B])
                            (implicit val executionContext: ExecutionContext) 
                            extends ActionBuilder[AuthenticatedUser, B] {

override def invokeBlock[A](request: Request[A],
                            block: AuthenticatedUser[A] => Future[Result]): Future[Result] = {
 val userId: Option[Int] = request.headers.get("user_id")
 val password: Option[String] = request.headers.get("password")

 val user: Option[User] = DataBase.validate(userId, password)

 user.fold {
   Future.successful(BadRequest("Authentication failed"))
 }{ user =>
   val authenticatedUser = AuthenticatedUser(user, request)
   block(authenticatedUser)
 }
 }
}

Notice above, that the type parameter of the ActionBuilder changed to AuthenticatedUser and correspondingly the signature of the block function on the invokeBlock method is block: AuthenticatedUser[A] => Future[Result] .
It is up to us to define type AuthenticatedUser[_]. In this case, it could be something along the lines of:

case class AuthenticatedUser[A](user: User, request: Request[A]) extends WrappedRequest(request)

Where type User is our internal representation of a user.
AuthenticatedUser[_] must be a type constructor, as demanded by the signature of the invokeBlock method of the trait ActionBuilder.
Also, as mentioned in the references and explained further below, it is advantageous for this type parameter R[_] to extend WrappedRequest[_], which is a type provided by Play.

As a side observation, you can use a trick to avoid needing a genuine type constructor:

case class AuthenticatedUserHelper(user: User)
type AuthenticatedUser[_] = AuthenticatedUserHelper

With this discussion, I have gone through the most important points in creating custom ActionBuilders. This is summarized on the following sketch.

ActionBuilder

With the image above as guide, an ActionBuilder says:
– I make Actions. The Actions that I make share a common logic.
– That logic is described in my invokeBlock method.
– invokeBlock is a higher-order method, which takes a Request[A] and a function called block.
– The block function is what distinguishes every Action made by me. Defining a block function is therefore creating an Action. Although I do not know what the block function will be, I demand it have the signature R[A] => Result.
– The syntax through which I create an Action is ThisCustomActionBuilder.apply(block), which, given the syntactic sugar surrounding the apply method in Scala, is the same as ThisCustomActionBuilder(block) or ThisCustomActionBuilder { request: R[A] => Result }.

The syntax for creating default Actions

When you don’t need to define a custom ActionBuilder, Play allows you to write Actions with a very intuitive syntax.
 

def index(): Action[AnyContent] = Action { request =>
  Ok("`Action` is actually an ActionBuilder with no common functionality.")
}

But Action above is actually a call to a method with a suggestive name (def Action), which returns a default ActionBuilder[Request, AnyContent] that comes along with Play.
Its invokeBlock is just:

def invokeBlock[A](request: Request[A],
                   block: (Request[A]) => Future[Result]) = block(request)

Because the invokeBlock consists of an immediate call to block, this ActionBuilder describes no common functionality for the Actions created through it. It is just a helper for when you want to create an Action straight away.

Chaining ActionBuilders

So, we have a LoggingActionBuilder that logs the requests, and we have an AuthenticationActionBuilder that authenticates the requests.

What if I want to build Actions whose requests I want to both log and authenticate?

1. Do I need to create a third ActionBuilder that does both?
This has two downsides. Firstly, it duplicates logic: if the way we authenticate a request changes, we would have to remember to change it in both ActionBuilders.
Secondly, we would violate the principle of single responsibility, by having the third ActionBuilder both logging and authenticating.
2. Can I join ActionBuilders ad-hoc?
This is what we want. It would mean we develop custom ActionBuilders, which can be used separately and independently, but that could also be combined together at will.

Jumping to the result, it should be something like the following sketch:

ChainingActionBuilders

From my contact with Actions, I get the impression Play’s Actions were not planned for composition. The documentation is lacking and there are not many developers discussing it on StackOverflow.
Still, the solution is underpinned by the andThen method of the ActionBuilder.

trait ActionBuilder[+R[_], B] extends ActionFunction[Request, R] {
  ......
  override def andThen[Q[_]](other: ActionFunction[R, Q]): ActionBuilder[Q, B] = new ActionBuilder[Q, B] {
    ....
    def invokeBlock[A](request: Request[A], block: Q[A] => Future[Result]) =
      self.invokeBlock[A](request, other.invokeBlock[A](_, block))
  }
}

The first thing to notice is that andThen returns another ActionBuilder. Meaning, no matter how many ActionBuilders you chain together, the result will also be an ActionBuilder.
This is fortunate since we already know how to deal with ActionBuilders.

The second thing is that the argument of andThen is an ActionFunction. [Probably indicating ActionBuilder chaining is not supposed to be a thing.]
But what we are really interested in is chaining an ActionBuilder with another ActionBuilder, since our motivation is to be able to group two components that can also work independently and alone (namely the Logging and Authentication ActionBuilders).

But of course ActionBuilder extends ActionFunction:

trait ActionBuilder[+L[_], B] extends ActionFunction[Request, L]

So here is the main point:
Because, in the signature of andThen above, the left type parameter of the ActionFunction argument must be R[_], i.e. the type of the original ActionBuilder, AND because an ActionBuilder is an ActionFunction with its left type parameter set to Request[_], then:

type R[A] = Request[A]

In other words, for the argument other of the andThen method to be an ActionBuilder, the original/base ActionBuilder must have its type parameter R[_] restricted to a sub-type of Request[_].

Similarly, and at the same time, in that same signature the ActionBuilder that results from this chaining has type parameter Q[_], i.e. the type of the second ActionBuilder. Therefore, if the result is also intended to be chained with yet a third ActionBuilder, then its type parameter must also be Request[_].

This means all your ActionBuilders must have their type parameter R[_] set to (a subtype of) Request[_] for chaining.

The above is condensed in the picture below.

ChainingActionBuildersFinal

Notice in particular that the block function that must be provided by an Action created via the resulting ActionBuilder has its signature defined by the last ActionBuilder in the chain.

Each of the Actions built via the resulting chained ActionBuilder will, upon a request, execute the logic of all the ActionBuilders in the order they appear in the chain. The flow might stop at any given stage, in which case it does not continue forward. For example, if the incoming request is not authenticated, the logic downstream will not be executed (although the logic upstream will).
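As a sketch of how the chaining could look in practice (the LoggingAction below is a hypothetical logging builder, kept with type parameter Request[_] precisely so that it can be chained, and the controller wiring is an assumption):

import javax.inject.Inject
import play.api.mvc._
import scala.concurrent.{ExecutionContext, Future}

// A builder whose only common logic is logging; its type parameter stays Request[_]
class LoggingAction[B](val parser: BodyParser[B])
                      (implicit val executionContext: ExecutionContext)
  extends ActionBuilder[Request, B] {

  override def invokeBlock[A](request: Request[A],
                              block: Request[A] => Future[Result]): Future[Result] = {
    println(s"Incoming request: ${request.method} ${request.uri}")
    block(request)
  }
}

class SecretController @Inject()(cc: ControllerComponents)
                                (implicit ec: ExecutionContext) extends AbstractController(cc) {

  val loggingAction       = new LoggingAction(cc.parsers.default)
  val authenticatedAction = new AuthenticatedAction(cc.parsers.default)

  // andThen returns another ActionBuilder; the block signature comes from the last builder in the chain
  val loggedAndAuthenticated: ActionBuilder[AuthenticatedUser, AnyContent] =
    loggingAction andThen authenticatedAction

  def secret: Action[AnyContent] = loggedAndAuthenticated { request: AuthenticatedUser[AnyContent] =>
    Ok(s"Logged and authenticated as ${request.user}")
  }
}

If the authentication fails, the logging still happens (it sits upstream in the chain), but the block is never invoked.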

Limitation

An ActionBuilder, whether simple or the result of chaining several others, does not have access to the body of the request. It abstracts over the body of the request via the type parameter A of the invokeBlock method.
Only the block function has access to it. This means no common functionality can be described that needs access to the body of the request; only the headers, parameters and URI are available.

If you have any questions or comments about the post, I would be happy to hear. Just write on the comment section.

References

1. Action Composition Play Framework Scala 2.6.x Documentation

Random hyperplanes

In the previous post, when trying to explain LSH, there was a need to generate random vectors in a d-dimensional space. The requirement was that every direction should be equally likely.

To achieve this, the references used a normal distribution to obtain the weights along each of the d dimensions of the random \vec{r} vector.
At first I didn’t understand why they didn’t just draw a random number uniformly from [-1, 1] for each weight. Intuitively one would assume that procedure would generate vectors evenly distributed across all directions.

I decided to understand why the normal Gaussian was required. For simplicity I considered a 2d space.
I drew a large number of random pairs (X, Y) such that X and Y were uniformly distributed over [-1, 1]. For each pair I computed the angle and plotted the distribution of all the angles. I did the same thing with a normal distribution (a sketch of the sampling follows below). This is what I got:
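A minimal sketch of that experiment (here in Scala; the plotting itself is omitted and a crude histogram stands in for the density fits):

import scala.util.Random

object AngleDistribution extends App {
  val rng     = new Random()
  val samples = 100000

  // Angle of a random 2d vector, in degrees, folded into [0, 90) so the 45-degree bias is easy to see
  def angle(x: Double, y: Double): Double =
    math.toDegrees(math.atan2(math.abs(y), math.abs(x)))

  val uniformAngles  = Seq.fill(samples)(angle(rng.nextDouble() * 2 - 1, rng.nextDouble() * 2 - 1))
  val gaussianAngles = Seq.fill(samples)(angle(rng.nextGaussian(), rng.nextGaussian()))

  // Crude 10-degree histogram: uniform weights pile up around 45 degrees, Gaussian weights do not
  def histogram(angles: Seq[Double]): Seq[(Int, Int)] =
    angles.groupBy(a => (a / 10).toInt * 10).map { case (bin, xs) => bin -> xs.size }.toSeq.sorted

  println(s"Uniform weights:  ${histogram(uniformAngles)}")
  println(s"Gaussian weights: ${histogram(gaussianAngles)}")
}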

PDF_Angle_direction_uniformDist
Distribution of the angle. Left: when X and Y are drawn from a uniform distribution. Right: when X and Y are drawn from a normal distribution. Continuous lines are ‘Seaborn’ (Python) density fits.

On the left the vectors are biased towards 45º. This can’t be used.
After a quick search, I found someone who had had a similar question:
https://stackoverflow.com/questions/6283080/random-unit-vector-in-multi-dimensional-space

Out of curiosity I tried to derive this distribution analytically for the left case, where:

f_{X}(x) = \begin{cases} \frac{1}{2} & \text{if } -1 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \qquad \textrm{and} \qquad f_{Y}(y) = \begin{cases} \frac{1}{2} & \text{if } -1 \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}

The angle is a function of the two random variables X and Y:

\Theta = \arctan\left(\frac{Y}{X}\right)

Following the method of transformations [1], the p.d.f. of \Theta can be computed:

\begin{cases} \Theta = \arctan(\frac{Y}{X}) \\ W = X \end{cases} \Leftrightarrow \begin{cases} Y = W\tan(\Theta) = h_1(\Theta, W) \\ X = W = h_2(\Theta, W)\end{cases}

Where W = X is defined so that the method can be applied.
The method of transformations states:

f_{\Theta W} = f_{XY}(h_1(\theta, w), h_2(\theta, w))|J|

Where:
f_{\Theta W} is the joint PDF of the random variables \Theta and W.
f_{XY} is the joint PDF of the random variables X and Y.
|J| = det \left[\begin{smallmatrix}\frac{\partial h_1}{\partial \theta}&\frac{\partial h_1}{\partial w}\\\frac{\partial h_2}{\partial \theta}&\frac{\partial h_2}{\partial w}\end{smallmatrix}\right]

In this case |J| = |w|\sec^2(\theta).
Since X and Y are independent we also have:

f_{XY}(x, y) = f_{X}(x) f_Y(y)

So that we get:

f_{\Theta W} = f_{X}(w \tan\theta)\, f_{Y}(w)\,|w|\sec^2(\theta)

But what we want is the marginal PDF [2] of \Theta:

f_{\Theta} = \int_{-\infty}^{\infty}f_{\Theta W} dw

f_{\Theta} = \frac{1}{2}\sec^2(\theta)\int_{-1}^{1}f_{X}(w\tan\theta)\,\sqrt{w^2}\,dw

Using the substitution variable u = w\tan\theta we get:

f_{\Theta} = \frac{\sec^2(\theta)}{2\tan^2\theta}\int_{-\tan\theta}^{\tan\theta}f_{X} (u)\sqrt{u^2}du

Since X is uniformly distributed between -1 and 1 we get the two cases:

0 \leq \tan\theta \leq 1 or 0 \leq \theta \leq 45^{\circ}:

f_{\Theta} = \frac{\sec^2(\theta)}{2\tan^2\theta}\int_{-\tan\theta}^{\tan\theta}f_{X} (u)\sqrt{u^2}du = \frac{1}{4\cos^2\theta}

\tan\theta > 1 or \theta > 45^{\circ}:

f_{\Theta} = \frac{\sec^2(\theta)}{2\tan^2\theta}\int_{-1}^{1}\frac{1}{2}\sqrt{u^2}du = \frac{1}{4\sin^2\theta}

PDF_Angle_direction_fittedAnalytical.png
Density distribution of the angle of a vector in 2d. Computational results when X and Y are drawn from a uniform distribution. Green line represents the analytical derivation.

References

1. Functions of Two Continuous Random Variables
2. Joint Probability Density Function

Locality Sensitive Hashing (LSH)

During my work, the need arose to evaluate the similarity between text documents.
The problem had two parts. During training the documents had to be clustered and during evaluation a new document had to be “assigned” its most similar (from all the documents already on our DB).

One of the bottlenecks of the approach was the computational complexity of comparing a set of N documents between themselves for the clustering part, and also the time required to later determine, at run time of the algorithm, which document was the most similar to a new incoming document.

Comparing the similarity of every pair in N documents requires \binom{N}{2} \sim N^2 operations. If comparing one pair takes 1 microsecond, comparing 5,000 documents takes just over 10 seconds, but comparing every pair in 500,000 documents requires more than 1 day.

However, one might not be interested in determining the similarity between all pairs, but just between those that are above a certain threshold of similarity. The amazing thing is: how do we know which pairs those are without comparing them all to begin with?
LSH gives us the answer, at the cost of a small probability of error.

Regarding LSH, there are two important ideas that were badly explained in most references:
– Firstly, the threshold similarity s above which you consider two documents to be similar underpins everything else.
– Secondly, LSH is a generic approach and every similarity measure has its own specific implementation [1], as seen in the table below.

Distance/Similarity metric    LSH implementation
Euclidean Distance            Random Projection
Jaccard Similarity            MinHash
Hamming Distance              Bit Sampling
Cosine Similarity             SimHash

There are even distance measures for which no LSH implementation is possible [2].

Choosing the metric

Normally one chooses the metric that best suits the notion of similarity of the problem and then uses the corresponding LSH. What the appropriate metric should be will not be discussed here.
It also sounds reasonable, though, to disregard a specific metric if it doesn’t have an LSH, or to lean towards another metric if its LSH implementation is particularly fast or simple.

I will discuss two specific implementations: simHash and minHash.

Overall view

The figure below represents a high level view of LSH implementations.
The input is a set of documents and the output is a set of tables (only one shown) whose entries contain the documents. The documents in the same entry are supposed to have a similarity between themselves of at least a threshold value s. Documents that are in distinct entries are supposed to have a similarity below s.
LSH is not exact. It has false positives and false negatives. Hence the term ‘supposed’.
False positives occur when a pair is on the same entry but is below the threshold.
False negatives occur when a pair is above the threshold but on distinct entries.

BirdsEyeLSH
Overall view of LSH applied to documents.

LSH is therefore a function/algorithm/procedure that maps similar documents to the same value and not-similar documents to different values. Where similar depends both on the metric being used and on the threshold s.

Notice that the resulting structure considerably reduces the computational complexity of the problems described in the beginning of the post.
For example, given a new incoming document X, we can apply the same LSH procedure and immediately get the documents (already on our DB) that are similar to it – the ones present in the same bucket. Then it would only be required to compute the similarity between those.

simHash

With SimHash, documents are represented as points in a multi-dimensional space. The similarity sim(x,y) of two documents x \textrm{ and } y is the cosine of the angle between them – the cosine distance.

sim(x, y) = \cos(\theta^{x,y})

How a document is represented as a vector, while very important overall, is not relevant for understanding the LSH. In my implementations I use the tf-idf scheme to determine the weights along each dimension. That is, each dimension is a word, and the weight of a document along that axis is given by the tf-idf scheme.
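As a small sketch (assuming the tf-idf weights have already been computed, so a document is just a sparse map from term to weight):

// A document as a sparse vector: term -> tf-idf weight (weights assumed precomputed)
type DocVector = Map[String, Double]

def dot(x: DocVector, y: DocVector): Double =
  x.collect { case (term, wx) if y.contains(term) => wx * y(term) }.sum

def norm(x: DocVector): Double =
  math.sqrt(x.values.map(w => w * w).sum)

// Cosine similarity: the cosine of the angle between the two document vectors
def cosineSimilarity(x: DocVector, y: DocVector): Double =
  dot(x, y) / (norm(x) * norm(y))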

VectorSpaceModel
Representation of documents as points in space. Cosine similarity of two documents is the cosine of the angle between them.

So, now that the similarity metric has been explained, let’s move to the actual LSH.
What is required is a function such that, when applied to all the documents, documents that are close together will, with high probability, have the same value, and documents that are far apart will have different values.
Here, of course, ‘close together’ is within the context of the similarity metric: in this case, the smaller the angle between two documents, the closer they are.
More formally, this is often expressed as:

if sim(x,y) \geq S_0 then P_h[h(x)=h(y)] \geq p_1 and
if sim(x,y) \leq cS_0 then P_h[h(x)=h(y)] \leq p_2

such that c < 1 and p_2 < p_1.

For the simHash case, this is achieved by dividing the space with hyperplanes.
For a given plane, each document ends up in one of the two regions which the plane separates.
Intuitively, the closer two documents are, the more likely they will be on the same region for a random plane. For example, if two documents are almost on top of each other, then they will end up on the same region for most of the planes you can come up with.

drawing
Partitioning of the space with a random plane in two areas (0 and 1).  Several planes shown.

For every random plane, each document is assigned the value 1 – if it falls on the positive side – or 0 if it falls on the negative side. This procedure is illustrated in the table below, which refers to the image above.

d^1 d^2 d^3
h_1 \textrm{ - blue} 1 0 0
h_2 \textrm{ - red} 1 0 1
h_3 \textrm{ - green} 1 1 0

Each random plane is seen as representing a function, whose value applied to a document is either 0 or 1. This is a pivotal concept.
For planes drawn at random, it can be mathematically proven that two documents x \textrm{ and } y hash to the same value with probability:

P[h(x)=h(y)] = 1 - \frac{\theta}{\pi}

The expression encapsulates our intuition: The closer two documents are (small \theta) the greater the chances they end up in the same region.
Notice that here h(x) \textrm{ and } h(y) are either 0 or 1.

LSH function

How to create random hyperplanes?
Better still, how to automatically compute the function values of each document for a given random hyperplane?

We select a random vector \vec{r} from the d-dimensional space where all the documents are represented. The weight along each dimension is drawn from a standard Gaussian distribution \mathcal{N}(0,1).
This random vector represents a random hyperplane which cuts the dimensional space into two regions.
The LSH value of each document \vec{d} is then simply:

h_{\vec{r}}(\vec{d\,}) = \begin{cases} 1 & \text{if } \vec{r}\cdot\vec{d} \geq 0 \\ 0 & \text{if } \vec{r}\cdot\vec{d} < 0 \end{cases}

If you are curious why one must use a normal distribution for the coefficients of random vector \vec{r} check my next post.
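A minimal sketch of this hash function (documents here are dense d-dimensional vectors; in practice they would be the sparse tf-idf vectors discussed above):

import scala.util.Random

// A random hyperplane in d dimensions: each coefficient drawn from N(0, 1)
def randomHyperplane(d: Int, rng: Random): Vector[Double] =
  Vector.fill(d)(rng.nextGaussian())

// simHash value of a document for one hyperplane: 1 if on the positive side, 0 otherwise
def simHash(r: Vector[Double], doc: Vector[Double]): Int = {
  val dotProduct = r.zip(doc).map { case (ri, di) => ri * di }.sum
  if (dotProduct >= 0) 1 else 0
}

For the amplification step described further below, one would draw k × L such hyperplanes and apply all of them to every document.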

So where are we?

LSH maps similar documents to the same value and different documents to different values. Right now, we have a function that kinda does that:

simHash_singleFunctionProb_v2
Probability that documents x and y hash to the same value (0 or 1) for a given random hash function – random hyperplane.

In particular notice that this mapping function respects the formal definition of a LSH function stated above.

If this were the end, i.e. if two documents were considered similar if they had the same value for a single function (single random hyperplane) there would be a great number of false positives and false negatives.
For example, there would be a 50% probability that a pair of documents with similarity of 0 would be considered similar.
It would be better if we could magnify the difference between p_1 and p_2. That is, to decrease the probability that “dissimilar” documents hash to the same value, and at the same time, increase the probability that similar documents hash to the same value.
The best scenario would be if we could somehow reach a step function, in which all pairs above a threshold similarity s are considered candidate pairs and all pairs below it are not.

step_function
Idealized LSH, with no False Positives or False Negatives.

LSH amplification

To magnify this difference we proceed as follows:
1. Instead of using only one hash function from the hyperspace to the space {0, 1}, we apply (to every document) k \times L distinct hash functions.
2. We divide these k \times L hash functions into L bands/groups of k.
3. A pair of documents is considered a candidate pair if and only if they have the same values on all the hash functions in a band, for at least one band.

Table_amplification_edited
Applying k \times L simHash functions to all documents. Division into L bands. Magnification step.

In the example above, documents d^2 and d^N are considered candidate pairs via bands 1 and 3. Documents d^1 and d^3 hashed to the same values on all hash functions of band 2, and so they are also candidate pairs. Document d^4 has no candidate pair for the shown bands.

How does this help?

Previously, the probability that two documents hashed to the same value (and thus became similar pairs) was

\displaystyle P[h(x)=h(y)] = 1 - \frac{\theta}{\pi} = 1 - \frac{\arccos(\textbf{sim(x,y)})}{\pi} = p

With magnification, the probability is computed as follows:
p_1 = p^k \text{: probability the hashes agree for all hashes in a band (with k hashes)}
p_2 = 1- p^k \text{: probability that not all the hashes agree for all hashes in a band (with k hashes)}
p_3 = (1- p^k)^L \text{: probability that \textbf{not all} the hashes agree in each band for \textbf{all} the L bands.}
P = 1 - (1- p^k)^L \text{: probability that at least for one band all the hashes agree for a pair of documents.}

So now, the probability that two documents are considered similar pairs is:

P = 1 - (1- p^k)^L, with p = 1 - \frac{\arccos(\textbf{sim(x,y)})}{\pi}
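As a quick numerical sanity check (a sketch; k = 680/40 = 17 and L = 40 correspond to the right-hand plot below):

// Probability that a pair with a given cosine similarity becomes a candidate pair
// after amplification with L bands of k hash functions each
def candidateProbability(similarity: Double, k: Int, l: Int): Double = {
  val p = 1.0 - math.acos(similarity) / math.Pi // single-hyperplane agreement probability
  1.0 - math.pow(1.0 - math.pow(p, k), l)
}

// candidateProbability(0.4, 17, 40) ≈ 0.02   while   candidateProbability(0.9, 17, 40) ≈ 0.95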

The improvement is illustrated on the following plots:

simHash_Prob_BeforeAfter_Magn
Probability that an (x, y) pair of documents is considered a candidate pair as a function of their similarity (cosine distance).
Left: a single simHash function. Right: concatenation of 680 simHash functions divided into 40 bands (amplification step).

The new probability plot on the right is much closer to the idealized case.
It is up to you to decide the threshold s above which you consider your documents to be similar; this threshold depends greatly on context. Then you tune the parameters k and L such that the steep increase in probability occurs around that similarity s.
If the similarity threshold was 0.4, it would make no sense choosing as parameters the ones used on the right plot above. It would lead to many False Negatives, that is, many pairs of documents whose similarity was in fact greater than 0.4 but that would be ignored by the LSH.
The plot above would be adequate for a threshold of around s=0.8.

minHashing

minHash is the LSH implementation when the similarity between documents (or anything else really) is the Jaccard Similarity. In contrast with the cosine distance, documents are here represented as sets.
There are many ways one can represent a document as a set, and that choice is very relevant for the notion of similarity we want to capture.
Examples:
doc.a : “the quick brown fox jumps over the lazy dog”
1-shingles: {the, quick, brown, fox, jumps, over, lazy, dog}
2-shingles: {the quick, quick brown, brown fox, fox jumps, jumps over, over the, the lazy, lazy dog}
However, for understanding and applying minHash, the key point is that documents are sets of elements, regardless of what those elements are (they could be 1-shingles, 2-shingles or something else).

The Jaccard Similarity of two sets X and Y is the ratio of the size of the intersection between X and Y to the size of their union:

sim(x,y) = \frac{|X \cap Y|}{|X \cup Y|}

As an example:
doc.b= “the silver dog hunted a brown fox”

Documents_sets_Jaccard_Sim
Intersection between two sets.

Above, the Jaccard Similarity between documents A and B is \frac{4}{11} = 0.36
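A small sketch of the above, with 1-shingles (words) as the set elements:

// Represent a document as its set of 1-shingles (words)
def shingles(doc: String): Set[String] =
  doc.toLowerCase.split("\\s+").toSet

def jaccard(x: Set[String], y: Set[String]): Double =
  (x intersect y).size.toDouble / (x union y).size

val docA = shingles("the quick brown fox jumps over the lazy dog")
val docB = shingles("the silver dog hunted a brown fox")

// 4 shared words out of 11 distinct words in total: 4.0 / 11 ≈ 0.36
println(jaccard(docA, docB))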

So, now that the similarity metric has been explained, let’s move to the actual LSH.
Similarly to the simHash case above, and to any other LSH, we now require a function that is able to map each of our documents into another space, such that:

if sim(x,y) \geq S_0 then P_h[h(x)=h(y)] \geq p_1 and
if sim(x,y) \leq cS_0 then P_h[h(x)=h(y)] \leq p_2

with c < 1 and p_2 < p_1.

minHash_mapping_to_table
Applying minHash functions to the documents. Construction of the signature matrix.

I found it harder to understand, and especially to implement, such functions for minHashing than for simHashing.

The method for doing so can be found at [3], and I will explain it next. The result is a minHash family of functions whose mapping has the property P_h[h(x)=h(y)] = sim(x,y). That is, if documents d^1 and d^2 have a Jaccard similarity of 0.4, then for any given minHash function of this family P_h[h(d^1)=h(d^2)] = 0.4.
Notice that such a family of functions respects the aforementioned property, as the next plot illustrates:

LSH_property_minhash
Probability that documents x and y hash to the same value for a given random minHash function – random permutation.

After we have found such a family of functions, we proceed with amplification, similarly to the simHash case. That is, we use k \times L such functions, dividing them into L bands of k rows each.

How to find such a function for the minHash case?

As an example, consider documents C and D that add no new elements to documents A and B.
The procedure is as follows:
Conceptually, one builds a table where the columns are the various documents being considered and each row is an element from the universal set (the union of all the document-sets).
Each of these elements is present in some documents and not in others. We populate the table with 1’s and 0’s according to whether a given element is present or absent in the document, respectively.

CharacteristicMatrix_minHashing
Random permutations of the rows of the Characteristic Matrix of the set of documents. Computing minHash values for the documents.

Then:
1 – Maintain the columns fixed and randomly permute the order of the rows.
2 – For this permutation, the minHash value of a document is the 1st row for which the respective element is present in the document (i.e. it has a 1 in the cell).

In the image above h_1(\text{doc.C}) = \text{brown} = h_1(\text{doc.D}) , h_2(\text{doc.C}) = \text{fox} and h_2(\text{doc.D}) = \text{lazy} .
It may not yet be apparent, but for our purposes it is equivalent to say instead h_1(\text{doc.C}) = \text{5} = h_1(\text{doc.D}), h_2(\text{doc.C}) = \text{7} and h_2(\text{doc.D}) = \text{5}. There is a 1-to-1 relation in each permutation. In the actual implementation, the latter form is the more practical.

Notice that, each random permutation of the rows gives us one minHash value for each document. It therefore defines a minHash function.
But we have seen that we need many minHash functions for the magnification step. Therefore:
3 – Repeat steps 1-2 as many times as the required number of minhash functions. Each permutation defines a minHash function and gives one minHash value for each document.

Notice also that, while for simHash the possible LSH values are {0, 1}, in the minHash case the codomain of the function is an integer in {0, ..., Number-of-Elements - 1}; here the number of elements is the number of distinct words.
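A sketch of this conceptual, permutation-based procedure (docA and docB are the shingle sets from the Jaccard sketch above; real implementations use the emulated permutations described further below):

import scala.util.Random

// The universal set, indexed: each element gets a fixed row
def universalSet(docs: Seq[Set[String]]): Vector[String] =
  docs.reduce(_ union _).toVector

// One random permutation of the rows defines one minHash function.
// The minHash value of a document is the first (permuted) row whose element it contains.
def minHash(permutedRows: Vector[String])(doc: Set[String]): Int =
  permutedRows.indexWhere(doc.contains)

val docs = Seq(docA, docB)
val rows = universalSet(docs)
val rng  = new Random()

// n permutations -> n minHash functions -> a signature of n values per document
val n            = 200
val permutations = Vector.fill(n)(rng.shuffle(rows))
val signatures   = docs.map(doc => permutations.map(perm => minHash(perm)(doc)))

// The fraction of positions on which two signatures agree estimates the Jaccard similarity
val estimate = signatures(0).zip(signatures(1)).count { case (a, b) => a == b }.toDouble / n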

Why does this work?

For the simHash case, it was simply stated that using random hyperplanes (or equivalently random vectors) translated into:

P[h(x)=h(y)] = 1 - \frac{\theta}{\pi} = 1 - \frac{\arccos(\textbf{sim(x,y)})}{\pi}

In this case, why does permuting the rows and selecting, for each document, the first element present in the document translate into:

P_h[h(x)=h(y)] = sim(x,y) = J(x,y) ?

This is after-all, the core of LSH.
To avoid just copying the very good explanation, you can check sub-chapter 3.3.3 of Mining of Massive Datasets. The proof is simple and easy to understand.


Conceptually this is it. However, as explained in the reference, permuting many elements becomes infeasible as the number of rows (i.e. the universal set size) goes into the millions.
In reality a trick is used to emulate the above procedure. This is where it became trickier for me:
1 – To emulate a permutation one uses an exact hash function that “throws” each row to a different position, the same way that in a true permutation any row can change position.

Side notes:
1. The part ‘hashing’ in ‘minHashing’ doesn’t come from this exact hash function above.
2. Again, these exact hash functions are only there to emulate the permutation of the rows.
3. Locality Sensitive Hashing and Exact Hashing are separate topics. In this case however minHashing requires the help of exact hashing for its implementation.

The references explain poorly how to come up with these exact hash functions. By understanding their purpose, the following solution (which I implemented and thoroughly tested) seems adequate. It may or may not be the one actually used in production systems.

\text{(exact) Hash Family:} \quad h_{a,b}(x) = ((ax + b) \;\%\; p) \;\%\; m

Where:
\% is the modulus operation
m is the number of rows (i.e. the size of the universal set)
p is any prime number bigger than m. (I normally choose the 1st prime after m)
a and b are random integers such that 1\leq a \leq p-1 and 0 \leq b \leq p-1
x is the row index. One applies each function to all rows.

m is defined by the problem.
p depends on m and remains fixed after defined.
m and p define a Hash Function Family.
Each (a, b) random pair defines one exact hash function (from the family) that we require.
I came up with this exact hashing function by looking at [4] and [5].

Overall, if we want n minHash functions then we require n random permutations. To emulate each permutation we need an exact hash function, defined by an (a, b) pair. In total we need n (a, b) pairs.
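A sketch of this emulation (the prime search is simplified by scanning upwards from m; whether production systems pick p this way is an assumption on my part):

import scala.util.Random

// Smallest prime >= n (fine for the universal-set sizes discussed here)
def nextPrime(n: Int): Int = {
  def isPrime(k: Int) = k > 1 && (2 to math.sqrt(k.toDouble).toInt).forall(k % _ != 0)
  Iterator.from(n).find(isPrime).get
}

// One emulated permutation of the m rows: x -> ((a*x + b) % p) % m
case class PermutationHash(a: Long, b: Long, p: Long, m: Long) {
  def apply(x: Long): Long = ((a * x + b) % p) % m
}

def randomPermutationHash(m: Int, rng: Random): PermutationHash = {
  val p = nextPrime(m + 1)
  PermutationHash(a = 1 + rng.nextInt(p - 1), b = rng.nextInt(p), p = p, m = m)
}

// The minHash value of a document, given as the set of row indexes of its elements,
// is the smallest emulated-permutation position among the rows present in the document
def minHashValue(h: PermutationHash)(docRows: Set[Int]): Long =
  docRows.map(x => h(x.toLong)).min

// n emulated permutations -> n minHash functions
val rng           = new Random()
val m             = 887 // e.g. the universal-set size of the small NEWS20 example below
val hashFunctions = Vector.fill(200)(randomPermutationHash(m, rng))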

The drawback is that there is a finite probability that two different (initial) rows are “thrown” to the same index by the same exact hash function. In those cases we are not emulating a true permutation.
This probability should be inversely proportional to the size of the universal set, \frac{1}{m}. For problems with hundreds of thousands of elements this should of course not be a problem.
I found it instructive however to analyse what happens when m is within a few hundred.

Histogram_ExactHash_NewRows
Histogram. New row indexes of the initial rows when mapped by an exact hash function. Scenario when two different rows ‘hash’ to the same new row index.

Let’s see how these collisions for a given exact hash function affect the associated minHash values. To that end, I used the NEWS20 document dataset [6].

Consider only the documents 82781, 82785, 83798 and 83900 in directory ‘/NEWS20/20news-bydate-train/talk.religion.misc’ of the NEWS20 dataset. From these 4, the universal set has \text{887} elements and the pair 82781, 82785 has a Jaccard similarity of 0.17.
Running the above procedure with 200 hash functions, \text{79} of the minHash values agree. Therefore the estimated similarity is \frac{79}{200} = 0.39. Quite far from the real value.

minHashValues

By adding to these 4 all the remaining documents on that path of the NEWS20 dataset, the universal set grows to \text{22484} elements.
Now, for another 200 hash functions, the minHash values between the same two documents agree in 36. This amounts to a similarity of \frac{36}{200} = 0.18. Extremely close to the actual Jaccard similarity of 0.17.

Remember that the initial purpose was finding a function h (a minHash function) that mapped documents to another space, in such a way that for two documents x and y, the probability for their values to be the same is: P_h[h(x)=h(y)] = sim(x,y).
We summed the number of times the hash values agreed over all the minHash functions at once merely to make use of the law of large numbers. Considering just one hash function wouldn’t be sufficient to verify that the procedure does in fact translate to P_h[h(x)=h(y)] = sim(x,y).

Buckets

Like the amplification, this step is common to both procedures.
So now, for both the minHash and simHash cases, we are able to build these tables, which hold the LSH values of the k \times L LSH functions for all documents. From them, following amplification, we can identify which documents are candidate pairs. Previously, for simHash, this was done by visually inspecting the data frame.

How could the LSH algorithm recognize which pairs are candidate pairs? Looping over each band and comparing the values row by row seems inefficient and defeats the purpose of all this. The solution is to use exact hashing (again).
So, on each band, one uses a universal hash function and an associated hash table. This hash function maps the LSH values on all rows of that band (for a given document) to an integer. This integer is the index in a hash table. Two documents that share the same LSH values on all rows of the band are mapped to the same integer, and thus to the same index in the hash table. Documents at the same index in the table are candidate pairs. This procedure incorporates the previous definition of a candidate pair following the amplification step.
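A simplified sketch of this step, using Scala’s groupBy as a stand-in for an explicit hash table (the band’s slice of the signature plays the role of the bucket key):

// Full signatures: docId -> its k*L LSH values (minHash or simHash)
type Signature = Vector[Long]

def candidatePairs(signatures: Map[String, Signature], k: Int, l: Int): Set[(String, String)] = {
  val buckets = for {
    band        <- 0 until l
    (_, bucket) <- signatures.groupBy { case (_, sig) => sig.slice(band * k, (band + 1) * k) }
    if bucket.size > 1
  } yield bucket.keys.toVector.sorted

  // Every pair of documents sharing a bucket in at least one band is a candidate pair
  buckets.flatMap(ids => ids.combinations(2).map(pair => (pair(0), pair(1)))).toSet
}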

buckets_hashing
Exact hashing of each documents’ LSH values in a band. minHash case.

So the problem again comes down to using an appropriate exact hash function. To avoid discussing exact hashing again, and although this step is crucial to the LSH procedure, I refer you to page 13 of [7], which explains this topic well enough in this context.

Still, the simHash case is very suitable for explaining this part. Because the LSH values are always 0 or 1, one can pretend the sequence of LSH values in a given band is a binary number, whose decimal representation is the index in the table. This would be the same thing as a Direct-Address Table instead of a Hash Table. But I think it would be practical only if the bands were small. If we wanted k = 30, we would get tables with 2^{30} \approx 1 billion entries. For illustration purposes, however:

buckets_DirectAcessTable.png
Conceptual exact Hashing for the simHash case.

Conclusion

Imagine we have 1 million documents in a database.
For a new incoming document, we want to know which of those million is the most similar to it.
The brute force approach would be to make 1 million comparisons.
With LSH however, we would have these hash tables saved somewhere (one for each band). Their generation would have had to occur beforehand, but only once!
Then, we would apply the same LSH functions to this new document. For the simHash case this would require using the same random vectors (and in the same order, of course). For the minHash case this would require using the same permutations (also in the same order).
The new document would ultimately be mapped to a given index in the hash table. Finally, one would compare the new document only to the other documents present in that same entry of the hash table (for each hash table representing a band).

Similarly, if the objective is to compare all documents between themselves, then one would now only need to compute sim(x,y) for each (x, y) pair which shared an index in any of the hash tables (bands).

If you have any questions or comments about the post, I would be happy to hear. Just write on the comment section.

References

1. Ni, Y. & Chu, K. Locality Sensitive Hashing Design Doc
2. Charikar, M. Similarity Estimation Techniques from Rounding Algorithms
3. Mining of Massive Datasets, chapter 3
4. Coursera: Data Structures course by University of California, San Diego. Week: Hash Tables, Video: Universal Family
5. Coursera: Data Structures course by University of California, San Diego. Week: Hash Tables, Video: Hashing Integers
6. NEWS20 Dataset
7. Shrivastava, A. & Li, P. In Defense of MinHash Over SimHash
8. Andoni, A. & Indyk, P. E²LSH 0.1 User Manual