primetalk July 14, 2014 at 01:07

Strictly typed representation of incomplete data

The previous article, “Constructing Types,” described how to construct types that look like classes. This makes it possible to separate stored data from meta-information and to focus on representing the properties of entities themselves. However, that approach is quite complicated because it relies on the HList type. While developing it, we came to understand that many practical problems need neither a linearly ordered sequence of properties nor a complete set of properties. If we relax these requirements, the constructed types become much simpler and very convenient to use.

In the updated synapse-frames library, hierarchical data structures are described very simply, and arbitrary subsets of such structures can be represented.

Bilaterally typed relationships

A property of an object is usually considered in relation to the object itself, and in that case the property has one data type. This single type serves only to restrict the data the property may contain. It therefore seemed logical to represent a property as Slot[T]. However, a property is also tied, although less explicitly, to the type of the object in which it is declared. In the previous article, this connection was established by constructing a new surrogate type from a set of properties.

If we express the relation to the container type directly in the type of the property itself, we avoid creating a surrogate type and can use much more convenient tools. So, let us represent a property as a two-way relation between two types:

sealed trait Relation[-L,R]
case class Rel[-L, R](name: String) extends Relation[L, R]

(The -L annotation means “contravariance”: the property is also available on descendants of type L. The type R is declared invariant because we plan to use both getters and setters for the property.)

The Rel class allows us to describe the attributes available on type L. For example,

class Box
val width = Rel[Box, Int]("width")
val height = Rel[Box, Int]("height")

(the same properties will be available for descendants of the Box type).

In addition to the name, you can attach to the property any meta-information the application needs: a database domain, a text description of the property, a serializer / deserializer, a limit on the size of the stored data, the column width in a table, the display format (for dates), etc. If necessary, meta-information can also be attached externally, using a Map.
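External attachment of meta-information can be sketched roughly as follows. The Meta record, its fields and the describe helper are hypothetical names introduced only for illustration; they are not part of synapse-frames.

```scala
object MetaSketch {
  sealed trait Relation[-L, R]
  case class Rel[-L, R](name: String) extends Relation[L, R]

  class Box
  val width  = Rel[Box, Int]("width")
  val height = Rel[Box, Int]("height")

  // Hypothetical meta-information record; the fields are illustrative only.
  case class Meta(description: String, columnWidth: Int)

  // Meta-information lives outside the property and is looked up by the property.
  val meta: Map[Relation[_, _], Meta] = Map(
    width  -> Meta("Box width in pixels", 8),
    height -> Meta("Box height in pixels", 8)
  )

  // Returns the description, or a placeholder when nothing was attached.
  def describe(rel: Relation[_, _]): String =
    meta.get(rel).map(_.description).getOrElse("(no description)")
}
```

Keeping the Map separate means different parts of an application (storage, UI) can each attach their own meta-information to the same properties.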

For type L we need some real type. In the previous version, this type was constructed as an HList over the properties it includes. Here, any type available in Scala can serve as L: a primitive type, a type alias, a trait, an abstract or final class, or an object.type. Because L is contravariant, we can use inheritance between the types that serve as property carriers. It seems convenient to reflect the inheritance relation as a set of abstract classes, traits and final classes that follows the logic of the subject area.

abstract class Shape
trait BoundingRectangle
final class Rectangle extends Shape with BoundingRectangle
final class Circle extends Shape with BoundingRectangle
val width = Rel[BoundingRectangle, Int]("width")
val height = Rel[BoundingRectangle, Int]("height")
val radius = Rel[Circle, Int]("radius")
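The effect of contravariance on these declarations can be checked with a small sketch. The circleProperty helper is a hypothetical function added only to make the compiler perform the check; everything else repeats the declarations above.

```scala
object VarianceSketch {
  sealed trait Relation[-L, R]
  case class Rel[-L, R](name: String) extends Relation[L, R]

  abstract class Shape
  trait BoundingRectangle
  final class Rectangle extends Shape with BoundingRectangle
  final class Circle extends Shape with BoundingRectangle

  val width  = Rel[BoundingRectangle, Int]("width")
  val radius = Rel[Circle, Int]("radius")

  // Hypothetical helper: accepts only properties that are applicable to a Circle.
  def circleProperty[R](rel: Relation[Circle, R]): Relation[Circle, R] = rel

  // Compiles: L is contravariant, so a property of BoundingRectangle
  // is also a property of its descendant Circle.
  val circleWidth: Relation[Circle, Int]  = circleProperty(width)
  val circleRadius: Relation[Circle, Int] = circleProperty(radius)

  // Does not compile: radius is declared on Circle, not on Rectangle.
  // val bad: Relation[Rectangle, Int] = radius
}
```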

A single attribute can be viewed as one step that takes you from a parent to a child. If the child has attributes of its own, you can navigate further through any of them. A pair of such attributes can be combined into a path from “grandfather” to “grandson”, which yields a new relation, Rel2(attr1, attr2).

  case class Rel2[-L, M, R](_1: Relation[L, M], _2: Relation[M, R])
    extends Relation[L, R]

The `/` method has been added to the DSL; it constructs a Rel2, thereby composing the relations.
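One possible shape of such a method is sketched below. Placing `/` directly on the Relation trait is an assumption made for brevity (the library may well define it as an extension method instead); Person, Address and the path helper are hypothetical.

```scala
object ComposeSketch {
  sealed trait Relation[-L, R] {
    // Assumed placement of the DSL method: composes this relation with the next step.
    def /[R2](next: Relation[R, R2]): Relation[L, R2] = Rel2[L, R, R2](this, next)
  }
  case class Rel[-L, R](name: String) extends Relation[L, R]
  case class Rel2[-L, M, R](_1: Relation[L, M], _2: Relation[M, R])
    extends Relation[L, R]

  class Person
  class Address
  val address = Rel[Person, Address]("address")
  val city    = Rel[Address, String]("city")

  // The middle type (Address) must match, otherwise the expression does not compile.
  val personCity: Relation[Person, String] = address / city
  // val wrong = city / address   // rejected: String is not Person

  // Hypothetical helper that lists the property names along a composed path.
  def path(rel: Relation[_, _]): List[String] = rel match {
    case Rel(name)  => List(name)
    case Rel2(a, b) => path(a) ++ path(b)
  }
}
```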

I would also like to note that such relations are an integral part of the triples that form the basis of RDF / OWL ontologies. Namely, a relation is the middle component of the triple:
(identifier of an object of type L, identifier of the property Relation[L, R], identifier of the property value of type R).

Strongly typed identifiers

When an object is described incompletely, through a set of attributes, it becomes very important to be able to match different sets of attributes to the same instance. We need some way to express that an instance is identical to itself. In OOP, the fact that attribute values belong to the same object can serve this purpose. In a database, an identifier is usually used: equality of object identifiers lets us conclude that the objects in question are the same.

We can also use identifiers to map attribute sets to a single instance. Since attributes in our case are associated with the type of the object, the identifier must be associated with the same type. This lets the compiler check that the type of the identified object is consistent with the attributes assigned to it.

In the simplest case, we could use this type of identifier:

trait Id[T]

However, this method of identification is not universal. First, many objects are identified only within their parent objects; second, many types of objects can have several identification methods at once. To reflect the first point, we can use the Rel[-L, R] type described above, treating it as a way of moving from the parent to a specific child instance. Since child objects are often grouped into typed collections, the identifier of a child object turns out to be composite: first a collection is selected, and then an element of that collection is selected by an integer index:

  val children = Rel[Parent, Seq[Child]]("children")
  case class IntId[T](id: Int) extends Relation[Seq[T], T]
  val child123 = children / IntId(123)

(The `/` DSL method used here composes two relations into one.)

This identification method lets you go unambiguously from the parent to the desired child. What if we want to use an alternative method of identification? For example, we may know that some property of a child object is unique within the parent object and can therefore be used to select a child. In this case, we can identify the child through an index:

  trait IndexedCollection[TId, T]
  case class Index[TId, T](keyProperty: Relation[T, TId])
    extends Relation[Seq[T],IndexedCollection[TId, T]]
  case class IndexValue[TId, T](value:TId)
    extends Relation[IndexedCollection[TId, T], T]

For instance:

  val name = Rel[Child, String]("name")
  val childByName = name.index
  val childVasya = parent / children / childByName / IndexValue("Vasya")

Thus, the Rel[-L, R] type, extended with the position in a collection and with an index over a property of the child object, allows navigation in a hierarchical data structure.
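The following sketch puts the relation types from this section together and renders a composed path as text. It assumes `/` is defined on the Relation trait, and it builds the index with Index(...) directly instead of the name.index DSL call; the render function and its output format are hypothetical.

```scala
object PathSketch {
  sealed trait Relation[-L, R] {
    // Assumed placement of the DSL method that composes two relations.
    def /[R2](next: Relation[R, R2]): Relation[L, R2] = Rel2[L, R, R2](this, next)
  }
  case class Rel[-L, R](name: String) extends Relation[L, R]
  case class Rel2[-L, M, R](_1: Relation[L, M], _2: Relation[M, R])
    extends Relation[L, R]
  case class IntId[T](id: Int) extends Relation[Seq[T], T]
  trait IndexedCollection[TId, T]
  case class Index[TId, T](keyProperty: Relation[T, TId])
    extends Relation[Seq[T], IndexedCollection[TId, T]]
  case class IndexValue[TId, T](value: TId)
    extends Relation[IndexedCollection[TId, T], T]

  class Parent
  class Child
  val children = Rel[Parent, Seq[Child]]("children")
  val name     = Rel[Child, String]("name")

  // Identify a child by its position in the collection ...
  val child123: Relation[Parent, Child] = children / IntId[Child](123)
  // ... or by the value of a property that is unique within the parent.
  val childVasya: Relation[Parent, Child] =
    children / Index[String, Child](name) / IndexValue[String, Child]("Vasya")

  // Hypothetical renderer: turns a path into a readable string.
  def render(rel: Relation[_, _]): String = rel match {
    case Rel(n)        => n
    case Rel2(a, b)    => render(a) + "/" + render(b)
    case IntId(i)      => i.toString
    case Index(key)    => "by(" + render(key) + ")"
    case IndexValue(v) => v.toString
  }
}
```

Both paths have the same static type, Relation[Parent, Child], so code that consumes an identifier does not need to know which identification method was used.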

To identify objects that sit at the top level and have no parent object, you can introduce a special type, Global, which holds all collections of top-level objects:

  final class Global
  val persons = Rel[Global, Seq[Person]]("persons")
  val otherTopLevelObjects =
    Rel[Global, Seq[OtherTopLevelObject]]("otherTopLevelObjects")

Data schema

Relations are the bricks from which both the data structures themselves and the schemas of that data are built. To describe a data schema, you can use the relational entity-relationship approach. In this case, the schema is a collection of entity descriptions and a collection of descriptions of relationships between entities. For entities, a set of attributes is specified; for relationships, the cardinality: 1-0, 1-1, 1-*, *-*.

You can also use the object-oriented approach, which describes an entity through its properties and collections of child objects, which in turn are described by their own properties and collections.

The relational schema is, of course, well suited to representing data in a database, while the object-oriented one can be used to build object-oriented services (web services?).

To describe a type T in the object-oriented version of the schema, one of the descendants of Schema[T] is used:
SimpleSchema: for simple types that do not contain attributes;
RecordSchema: for composite types containing the specified attributes;
CollectionSchema: for types Seq[T]; lets you bind the schema of the collection's elements.
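A minimal sketch of how these three descendants might fit together is shown below. Only the names Schema, SimpleSchema, RecordSchema and CollectionSchema come from the text; their fields, the PropertySchema pairing, the Shelf type and the render function are assumptions made for illustration.

```scala
object SchemaSketch {
  sealed trait Relation[-L, R]
  case class Rel[-L, R](name: String) extends Relation[L, R]

  // Hypothetical definitions: the constructor parameters are assumptions.
  sealed trait Schema[T]
  case class SimpleSchema[T](typeName: String) extends Schema[T]
  // Pairs a property of T with the schema of the property's value.
  case class PropertySchema[T, R](rel: Rel[T, R], schema: Schema[R])
  case class RecordSchema[T](props: List[PropertySchema[T, _]]) extends Schema[T]
  case class CollectionSchema[T](element: Schema[T]) extends Schema[Seq[T]]

  class Box
  class Shelf
  val width  = Rel[Box, Int]("width")
  val height = Rel[Box, Int]("height")
  val boxes  = Rel[Shelf, Seq[Box]]("boxes")

  val intSchema = SimpleSchema[Int]("Int")
  val boxSchema = RecordSchema[Box](List(
    PropertySchema[Box, Int](width, intSchema),
    PropertySchema[Box, Int](height, intSchema)
  ))
  val shelfSchema = RecordSchema[Shelf](List(
    PropertySchema[Shelf, Seq[Box]](boxes, CollectionSchema[Box](boxSchema))
  ))

  // Renders the nested structure of a schema as text.
  def render(schema: Schema[_]): String = schema match {
    case SimpleSchema(typeName) => typeName
    case RecordSchema(props) =>
      props.map(p => p.rel.name + ": " + render(p.schema)).mkString("{", ", ", "}")
    case CollectionSchema(element) => "Seq[" + render(element) + "]"
  }
}
```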

Data storage

Meta-information itself does not contain data. For storage, you must use other structures. Such structures depend on the needs of the application:

ordinary classes with ordinary properties, accessed by reflection using the property names;
special classes for storing data that also contain meta-information: the descendants of Instance[T] (SimpleInstance, RecordInstance, CollectionInstance). These types simplify working with data described by a schema, since the data storage corresponds directly to the schema;
a linear tuple, or "list of lists" (List[Any]). The hierarchical structure of nested Records can be decomposed into a linear structure, a sequence of primitive values. Nested collections turn into lists of lists of primitive values. Such a representation can be used for transmission over the network and for interaction with a database (since the tuple corresponds directly to a table row). A pair of operations, align / unalign (flatten), converts Instances to flat lists and back;
DB tables, data from which are extracted using RecordSet;
JSON objects
XML
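The flattening of nested records into a "list of lists" can be sketched as follows. The Instance descendants are named in the text, but their fields here are assumptions, and this align function is a simplified illustration rather than the library's implementation; the reverse operation would additionally need the schema and is omitted.

```scala
object AlignSketch {
  // Hypothetical, simplified shapes of the Instance descendants.
  sealed trait Instance[T]
  case class SimpleInstance[T](value: T) extends Instance[T]
  case class RecordInstance[T](fields: List[(String, Instance[_])]) extends Instance[T]
  case class CollectionInstance[T](items: List[Instance[T]]) extends Instance[Seq[T]]

  // Nested records are inlined into the parent row;
  // a collection becomes a single cell holding a list of rows.
  def align(instance: Instance[_]): List[Any] = instance match {
    case SimpleInstance(value)     => List(value)
    case RecordInstance(fields)    => fields.flatMap { case (_, field) => align(field) }
    case CollectionInstance(items) => List(items.map(item => align(item)))
  }

  class Box
  class Shelf

  def box(w: Int, h: Int): Instance[Box] =
    RecordInstance[Box](List("width" -> SimpleInstance(w), "height" -> SimpleInstance(h)))

  val shelf: Instance[Shelf] = RecordInstance[Shelf](List(
    "label" -> SimpleInstance("top"),
    "boxes" -> CollectionInstance[Box](List(box(10, 20), box(30, 40)))
  ))

  // List("top", List(List(10, 20), List(30, 40)))
  val row: List[Any] = align(shelf)
}
```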

Data construction

When creating data instances, the most important constraint we want checked at compile time is that a property can only be set on the types for which it is declared (this is essentially why the property carries a generic type for the left side of the relation). It follows that special tools are needed to create a data instance that satisfies the schema. For instance:

  val b1 = empty[Box]
    .set(width, simple(10))
    .set(height, simple(20))

The immutable type Instance[Box] is used here, to which (property, value) pairs are added. If there is little data, this approach is sufficient. If you need to collect a lot of data, it is more efficient to use a mutable builder, inside which the required set of attributes is gradually accumulated. At the end, the builder is converted to Instance[Box]:

val boxBuilder = new Builder(boxSchema)
boxBuilder.set(width, simple(10))
boxBuilder.set(height, simple(20))
val b1 = boxBuilder.toInstance

The builder also provides two runtime checks:

it rejects properties that are not part of the schema;
it ensures that the resulting object is complete.
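A builder with these two runtime checks, plus the compile-time check on the property's left-hand type, might look roughly like the sketch below. It is a simplified stand-in under stated assumptions: values are stored directly rather than wrapped with simple(...), RecordSchema is reduced to a set of properties, and the depth and radius properties are hypothetical.

```scala
object BuilderSketch {
  sealed trait Relation[-L, R]
  case class Rel[-L, R](name: String) extends Relation[L, R]

  class Box
  class Circle
  val width  = Rel[Box, Int]("width")
  val height = Rel[Box, Int]("height")
  val depth  = Rel[Box, Int]("depth")     // declared on Box, but not in boxSchema
  val radius = Rel[Circle, Int]("radius")

  // Hypothetical, simplified stand-ins for the library's types.
  case class RecordSchema[T](props: Set[Relation[T, _]])
  case class Instance[T](values: Map[Relation[T, _], Any]) {
    def get[R](rel: Relation[T, R]): Option[R] =
      values.get(rel).map(_.asInstanceOf[R])
  }

  class Builder[T](schema: RecordSchema[T]) {
    private var values = Map.empty[Relation[T, _], Any]

    // Compile time: rel must be a property of T and value must have its type R.
    def set[R](rel: Relation[T, R], value: R): this.type = {
      // Runtime check 1: the property must be part of the schema.
      require(schema.props.contains(rel), s"$rel is not part of the schema")
      values = values.updated(rel, value)
      this
    }

    def toInstance: Instance[T] = {
      // Runtime check 2: every property of the schema must have a value.
      val missing = schema.props.filterNot(p => values.contains(p))
      require(missing.isEmpty, s"incomplete instance, missing: $missing")
      Instance(values)
    }
  }

  val boxSchema = RecordSchema[Box](Set[Relation[Box, _]](width, height))
  val b1: Instance[Box] =
    new Builder(boxSchema).set(width, 10).set(height, 20).toInstance

  // new Builder(boxSchema).set(radius, 5)   // does not compile: radius belongs to Circle
}
```

Here require throws an IllegalArgumentException when either runtime check fails, while a property declared for another type is rejected by the compiler before the code runs.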

To represent data in rows of tables in a database, it is necessary to convert nested Records into a flat structure. A pair of align / unalign methods is used for this.

Conclusion

The approach described makes it possible to

describe complex subject areas while explicitly preserving meta-information;
operate on properties in a strongly typed manner (with type checking at compile time);
represent arbitrary hierarchical data structures (like JSON) with type checking at all levels;
represent incomplete data and check its degree of completeness (for example, you can have smallSchema[T] and fullSchema[T] against which to check data instances).
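Checking one incomplete data set against two schemas of different size can be sketched as below. Schemas and data are reduced here to plain sets and maps of properties; the Person properties and the missing helper are hypothetical names used only for illustration.

```scala
object CompletenessSketch {
  sealed trait Relation[-L, R]
  case class Rel[-L, R](name: String) extends Relation[L, R]

  class Person
  val name  = Rel[Person, String]("name")
  val email = Rel[Person, String]("email")
  val age   = Rel[Person, Int]("age")

  // Simplified schemas: just the set of properties that must be present.
  val smallSchema = Set[Relation[Person, _]](name)
  val fullSchema  = Set[Relation[Person, _]](name, email, age)

  // Incomplete data about one person: the age is unknown.
  val partial = Map[Relation[Person, _], Any](
    name  -> "Vasya",
    email -> "vasya@example.org"
  )

  // Properties required by the schema that the data does not provide.
  def missing[T](data: Map[Relation[T, _], Any],
                 schema: Set[Relation[T, _]]): Set[Relation[T, _]] =
    schema.filterNot(p => data.contains(p))

  val completeForSmall = missing(partial, smallSchema).isEmpty
  val completeForFull  = missing(partial, fullSchema).isEmpty
}
```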

In contrast to the approach described in the previous article, we relax the requirement that data completeness be verified at compile time. In return, we get a much simpler and more convenient approach. The compiler checks whether a property may be used on a given type without building bulky surrogate types based on HList. At the same time, we are not constrained by the object-oriented approach when representing data or restricting the set of attributes of an entity.
