vlsergey April 9, 2010 at 19:50

Meta data. Toward Data Model Management Ideals

What this post is about

This post reviews the approaches to data model management that the author knows from experience, hearsay, and reading documentation.
It is also an attempt to classify those approaches.
Finally, it presents the idea, and the first steps toward an implementation, of a data model management system that should be free of the shortcomings of the existing ones.

Definitions and limitations

It is assumed that the reader is (or will someday become) a developer of Enterprise Applications who often needs to write code quickly and efficiently, but who is not afraid to go into the depths of JPA / JTA / RMI in order to hand-tune the particularly delicate spots.

Data is what is stored in the application database. Data about customers, users, orders, etc.

Metadata - A description of the data structure. A description of what types of objects are stored in the database, what fields they have (attributes, elements), a description of the dependencies between the objects. In general, types can inherit attributes of the parent type, and one attribute in the general case can be present in two or more types that are not related by the inheritance relation.

An Enterprise Application (most often) runs on an Application Server (WebLogic, JBoss) and some RDBMS (Oracle, Informix, MySQL). The author sees nothing wrong with assembling your own AS from Tomcat / Hibernate / JOTM / DBCP / etc.; it is very, very interesting, but beyond the scope of this post.

The RDBMS is assumed to be one of the standard ones supported by Hibernate / OpenJPA.

The post uses terms from XML Schema: namespace, type, attribute. The latter two correspond, to some extent, to the Java concepts of a class (object class, bean) and a property (a get + set pair, sometimes just a field).

Introduction. The simplest case

Large applications are not just applications with a large amount of data. Most often they work with a large amount of heterogeneous data that has different structures from the point of view of business logic. (By the way, the last point is important: the data structure can differ between the DBMS level and the application level, and even within the application.)

The simplest case is to define the data model as a set of classes and a corresponding set of tables in the database. Roughly speaking, one class is one table in the database. Each property of an object is represented by a property of a bean class and a column in the database. However, this mechanism has disadvantages that show up when developing and using an Enterprise Application:

Adding to or changing the data model requires changing both the database structure and the program code, followed by recompilation, etc.
As a result, this cannot be done on the fly.
Complex changes, such as moving an attribute from a child type to a parent type, also require hand-written scripts (DDL + DML) to update the database structure.
Changing the structure requires a specialist who knows SQL / Java.

However, the advantages of this approach must also be noted:

The best price-to-performance ratio. Literally out of the box we get the most transparent data storage structure, the one most obvious to both the JPA layer (Hibernate, etc.) and the RDBMS (and its administrator).
For a business logic programmer who does not change the data structure, we also get the most convenient API out of the box.
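
As a minimal sketch of the "one class, one table" case described above: the bean below and the table in its comment are hypothetical, and the JPA mapping is omitted so that the example needs nothing beyond the JDK.

```java
// Hypothetical bean for the "one class, one table" case.
// The matching (also hypothetical) table would look like:
//   CREATE TABLE customer (id BIGINT PRIMARY KEY, name VARCHAR(255), city VARCHAR(255))
class Customer {
    private long id;     // column: id
    private String name; // column: name
    private String city; // column: city

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
}
```

Adding, say, an email attribute to this model means a new field, a new getter and setter, an ALTER TABLE statement, and a recompilation, which is exactly the cost listed among the disadvantages.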

Note an important qualification in the last sentence: “business logic”. This is the description of how data structures interact, how they change, etc.; that is, code that must know, and does know, the data structure. But if we are talking, for example, about editing beans through a web interface (or in any other way), then to write a generalized editor, one that can edit 80% of objects without knowing their structure in advance, we will have to deal with Reflection / Beans / etc. and other words that are, in principle, not very scary. (The scary ones come at the end of the post.)
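
A generalized editor of this kind could be built on plain reflection. The sketch below is only an illustration: BeanInspector and PurchaseOrder are hypothetical names, only get/set pairs are recognized (no "is" getters), and error handling is reduced to wrapping exceptions.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for a generalized editor: it reads and writes bean
// properties without compile-time knowledge of the bean class.
class BeanInspector {

    // Returns "property name -> current value" for every public getter.
    static Map<String, Object> readProperties(Object bean) {
        Map<String, Object> values = new TreeMap<>();
        for (Method m : bean.getClass().getMethods()) {
            String n = m.getName();
            boolean getter = n.startsWith("get") && n.length() > 3
                    && m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && !Modifier.isStatic(m.getModifiers())
                    && m.getDeclaringClass() != Object.class; // skips getClass()
            if (getter) {
                String property = Character.toLowerCase(n.charAt(3)) + n.substring(4);
                try {
                    values.put(property, m.invoke(bean));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot read " + property, e);
                }
            }
        }
        return values;
    }

    // Sets one property through its public setter, looked up by name.
    static void writeProperty(Object bean, String property, Object value) {
        String setter = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                try {
                    m.invoke(bean, value);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot write " + property, e);
                }
                return;
            }
        }
        throw new IllegalArgumentException("No setter for property: " + property);
    }
}

// Sample bean, used only to demonstrate the helper.
class PurchaseOrder {
    private String number;
    private int quantity;

    public String getNumber() { return number; }
    public void setNumber(String number) { this.number = number; }

    public int getQuantity() { return quantity; }
    public void setQuantity(int quantity) { this.quantity = quantity; }
}
```

An editor built this way can render a field for each entry returned by readProperties and push the edited values back with writeProperty, without ever mentioning PurchaseOrder in its own code.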

Modern development tools can automate part of the work of keeping things in sync, for example updating the database structure from the code or, vice versa, generating or updating code from a description of the data structure. I am not sure, but I think there are tools that create both the code and the database structure at the same time from some abstract data schema written, for example, as XML Schema. (Code can certainly be generated this way; see XMLBeans, etc.) However, all these tools work offline and do not affect the running application (unless, of course, you apply the update directly to the live system, but nothing good comes of that).

By the way, some of these auxiliary utilities can also be made to generate forms for each object type.

Flexible data structures

The most flexible structure is arguably one in which each object is stored as a single database record in, say, XML form. That is, one very large table with two columns: the object ID and its content as XML. As you might guess, the main drawback of such a structure is very poor database performance whenever we need to select, for example, all customers from the city of Moscow. To do this, the database has to parse every value.
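
To make that cost concrete, here is a small in-memory sketch. XmlBlobStore is a hypothetical stand-in for the two-column table, and the element name city is an assumption; the point is only that the search has to parse every stored document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the two-column table: object ID -> XML content.
class XmlBlobStore {
    private final Map<Long, String> rows = new LinkedHashMap<>();

    void put(long id, String xml) {
        rows.put(id, xml);
    }

    // Finds the objects whose first <city> element equals the given value.
    // There is no index to use: every row is parsed on every search.
    List<Long> findByCity(String city) {
        List<Long> ids = new ArrayList<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            for (Map.Entry<Long, String> row : rows.entrySet()) {
                Document doc = builder.parse(new InputSource(new StringReader(row.getValue())));
                NodeList cities = doc.getElementsByTagName("city");
                if (cities.getLength() > 0 && city.equals(cities.item(0).getTextContent())) {
                    ids.add(row.getKey());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Malformed XML in store", e);
        }
        return ids;
    }
}
```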

To keep the structure flexible while putting less load on the database, the object is split into pieces that are moved into separate tables. For example:
- Objects: ID, required field 1, required field 2
- Values: object ID, attribute identifier, value

You can go further and, without limiting flexibility, put attributes of different types into different tables or columns. A similar scheme is successfully used in the application (cut out) to process several terabytes of data.
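
The Objects / Values layout can be sketched in memory as follows. ValueRow and ValueTable are hypothetical names, all values are kept as strings for brevity, and the SQL in the comments is only an approximation of what a real data layer would issue.

```java
import java.util.ArrayList;
import java.util.List;

// One row of the hypothetical "Values" table: (object ID, attribute ID, value).
class ValueRow {
    final long objectId;
    final int attributeId;
    final String value;

    ValueRow(long objectId, int attributeId, String value) {
        this.objectId = objectId;
        this.attributeId = attributeId;
        this.value = value;
    }
}

// In-memory stand-in for the "Values" table.
class ValueTable {
    private final List<ValueRow> rows = new ArrayList<>();

    void insert(long objectId, int attributeId, String value) {
        rows.add(new ValueRow(objectId, attributeId, value));
    }

    // Roughly: SELECT object_id FROM value_table WHERE attribute_id = ? AND value = ?
    List<Long> findObjects(int attributeId, String value) {
        List<Long> ids = new ArrayList<>();
        for (ValueRow row : rows) {
            if (row.attributeId == attributeId && row.value.equals(value)) {
                ids.add(row.objectId);
            }
        }
        return ids;
    }

    // Roughly: SELECT value FROM value_table WHERE object_id = ? AND attribute_id = ?
    // Returns null when the object has no such attribute.
    String getValue(long objectId, int attributeId) {
        for (ValueRow row : rows) {
            if (row.objectId == objectId && row.attributeId == attributeId) {
                return row.value;
            }
        }
        return null;
    }
}
```

Introducing a new attribute here only requires a new attribute ID; no table has to be altered, and the database can filter on the attribute and value columns without parsing anything.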

There are still disadvantages:
You have to pay for flexibility. First, the data layer has to be written by hand. Second, there is a strong temptation to cut corners and give business logic an API that mirrors the database structure:
- get the ID of such-and-such object
- get the ID of such-and-such attribute
- update the value
- write the attribute with that ID to that object
- update the version of the object (+1)

Of course, for the programmer of a generalized data editor it is very convenient to have methods like getAllAttributes(). For business logic, however, this is inconvenient, especially if you have to remember the IDs of all the attributes you need (and they may well be numeric).

It should be noted, however, that the API is not required to match the structure of the database. The main thing is that 80% of the actions are performed in the simplest and most obvious way. That is, if we have clients in our database, getting a client’s name or address should take one line of code like client.getAddress(). However, for flexible structures, writing such wrappers can, first, seriously hurt performance, and second, structures tend to change ...
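
A thin wrapper of this kind might look like the sketch below. GenericObject stands for whatever the flexible data layer exposes, and the numeric attribute IDs are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generic object, as a flexible data layer might expose it.
class GenericObject {
    private final Map<Integer, Object> attributes = new HashMap<>();
    private int version;

    Object getAttribute(int attributeId) {
        return attributes.get(attributeId);
    }

    void setAttribute(int attributeId, Object value) {
        attributes.put(attributeId, value);
        version++; // every write bumps the object version (+1)
    }

    int getVersion() {
        return version;
    }
}

// Typed wrapper: business logic sees getName() / getAddress(),
// and the numeric attribute IDs stay in one place.
class Client {
    private static final int ATTR_NAME = 101;    // invented ID
    private static final int ATTR_ADDRESS = 102; // invented ID

    private final GenericObject data;

    Client(GenericObject data) {
        this.data = data;
    }

    String getName() { return (String) data.getAttribute(ATTR_NAME); }
    void setName(String name) { data.setAttribute(ATTR_NAME, name); }

    String getAddress() { return (String) data.getAttribute(ATTR_ADDRESS); }
    void setAddress(String address) { data.setAttribute(ATTR_ADDRESS, address); }
}
```

The wrapper has to be revisited whenever the attribute set changes, which is the maintenance cost mentioned above.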

However, if those responsible for the data access layer do not write such wrappers, be prepared that in a couple of years you will have as many “simplified” data access wrappers as there are enterprising programmers working with the “standard” API.

Disabled Structures

In this section, I want to talk about another approach, one used in a little-known CMS.

From the code’s point of view, object attributes are accessed the same way as with flexible structures: through methods like getAttribute / getAllAttributes / etc. However, for a CMS, whose main task is to edit objects individually (without relations between objects) and to output an object as XML for further processing, this API is enough.

Interestingly, the list of data types is stored in a configuration file. For each type, the file also stores the list of attributes and their types. Based on the configuration file, the table structure is created or updated at startup. Later, when the data structure changes, the tables are updated on the fly.
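
How that CMS actually does this is not known to me in detail; the following is only a sketch of the general idea, with hypothetical class names. A type description, as it might be read from the configuration file, is turned into DDL: a CREATE TABLE when the table does not exist yet, or ALTER TABLE statements for attributes that were added later. The ADD COLUMN syntax is an assumption about the SQL dialect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory form of one type from the configuration file.
class TypeDescription {
    final String tableName;
    // attribute name -> SQL column type, in declaration order
    final Map<String, String> attributes = new LinkedHashMap<>();

    TypeDescription(String tableName) {
        this.tableName = tableName;
    }

    TypeDescription attribute(String name, String sqlType) {
        attributes.put(name, sqlType);
        return this;
    }
}

// Produces the DDL needed to bring the database in line with a description.
class SchemaUpdater {

    // Used at startup when the table does not exist yet.
    static String createTable(TypeDescription type) {
        StringBuilder sql = new StringBuilder("CREATE TABLE " + type.tableName + " (id BIGINT PRIMARY KEY");
        for (Map.Entry<String, String> a : type.attributes.entrySet()) {
            sql.append(", ").append(a.getKey()).append(' ').append(a.getValue());
        }
        return sql.append(')').toString();
    }

    // Used on the fly: one statement per attribute the table does not have yet.
    static List<String> addMissingColumns(TypeDescription type, Set<String> existingColumns) {
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, String> a : type.attributes.entrySet()) {
            if (!existingColumns.contains(a.getKey())) {
                statements.add("ALTER TABLE " + type.tableName
                        + " ADD COLUMN " + a.getKey() + ' ' + a.getValue());
            }
        }
        return statements;
    }
}
```

Removed or retyped attributes are deliberately left out of the sketch; they need data migration decisions that cannot be derived from the description alone.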

Pros:
- obvious data model for DBMS
- on-the-fly flexibility
Cons:
- from the point of view of business logic, the API is too flexible (see the previous section)
- you need to write your own data access system, which at the moment, unfortunately, unlike the one for system objects (users, groups, etc.), ignores transactions, caches, and other niceties

An attempt at classification

To describe a metamodel, you need to answer the following questions:

What is the starting point for describing the model? (Of course, there should be exactly one.) Where is the information about objects and their attributes stored?
How is data storage organized in the database?
- Are the requirements of first normal form met?
- Two different simple (single-valued) attributes are stored:
  - as two different columns
  - as two different rows
How is data access organized at the Application Server level?
- standard JPA facilities are used (EntityManager, etc.)
- custom data access classes are used
How is data access organized at the business logic level?
- standard methods like getName(), getAddress(), etc. are used
- non-standard APIs like getAttribute(ID ...) are used
Is metadata accessible from the program?
- yes, and it can even be changed
- yes
- only through Reflection / Hibernate Mapping / etc.

What I want ... the author’s ideal

The requirements for an ideal (from the author’s point of view) system for describing and operating data models follow easily from the previous section:
- the description of the data structure should be stored in the database, so that the model description can be changed quickly, possibly through the application itself
- the data itself should be stored in a normalized database (third or fourth normal form), where each type has its own table. The management system itself must keep the database schema in line with the metadata.
- data access should go through the standard JPA / EntityManager interfaces.
- from the point of view of business logic, the main fields of the main object types should be accessible through a simple API without additional resolving / casting / narrowing (i.e. immediately after loading from the EntityManager)
- but the system should also provide access to metadata, including, for a specific object, a list of all its fields.
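
The last two requirements, a typed API plus access to the full attribute list, can live on the same object. The sketch below shows this with java.lang.reflect.Proxy from the JDK. This is a substitute technique chosen to keep the example self-contained: it is not the CGLIB / ASM approach, and it involves neither Hibernate nor EntityManager. All names (MetaAware, ClientEntity, DynamicEntities) are hypothetical, and only object-typed properties are supported.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Metadata side of the API: any entity can list all of its attributes.
interface MetaAware {
    Map<String, Object> getAllAttributes();
}

// Typed side of the API, as business logic would like to see it.
interface ClientEntity extends MetaAware {
    String getName();
    void setName(String name);
    String getAddress();
    void setAddress(String address);
}

class DynamicEntities {

    // Creates an implementation of the given interface at runtime.
    // Values live in a map keyed by property name.
    static <T extends MetaAware> T create(Class<T> type) {
        final Map<String, Object> values = new TreeMap<>();
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            if (method.getDeclaringClass() == Object.class) {
                // equals / hashCode / toString are also routed through the handler.
                switch (name) {
                    case "equals": return proxy == args[0];
                    case "hashCode": return System.identityHashCode(proxy);
                    default: return type.getSimpleName() + values;
                }
            }
            if (name.equals("getAllAttributes")) {
                return Collections.unmodifiableMap(values);
            }
            boolean noArgs = args == null || args.length == 0;
            if (name.length() > 3 && name.startsWith("get") && noArgs) {
                return values.get(propertyName(name));
            }
            if (name.length() > 3 && name.startsWith("set") && args != null && args.length == 1) {
                values.put(propertyName(name), args[0]);
                return null;
            }
            throw new UnsupportedOperationException(name);
        };
        return type.cast(Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, handler));
    }

    // "getName" / "setName" -> "name"
    private static String propertyName(String accessor) {
        return Character.toLowerCase(accessor.charAt(3)) + accessor.substring(4);
    }
}
```

Business logic calls getName() directly, while a generalized editor can iterate over getAllAttributes() on the same instance.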

At present, the author is writing such a system, using:
- Hibernate - as the data access driver
- CGLIB / ASM - to build classes dynamically from their descriptions, including annotations for Hibernate
- XML Schema - to describe the data types and their attributes

But more on that next time.
