Android Database Design Methods
As an Android developer, I have had to deal with two different approaches to designing databases in mobile applications. What follows may seem obvious to some readers; to others it may offer a new idea or save them from mistakes. So, without a long introduction, let's get down to business.
Two views on the problem
As you know, universities teach us to build databases by the book: decompose the subject area into entities, choose attributes and define primary keys, define the relationships between entities, bring everything to at least third normal form, and so on. One of the “side” effects of this approach is a drop in read performance: with heavy decomposition and normalization, queries have to execute a larger number of joins, and the more records the tables contain, the longer those joins take.
Add to this the very limited hardware capabilities of mobile platforms, in particular the tiny amount of RAM. It is scarce to begin with, and on top of that Android caps the memory available to a process at, depending on the OS version, roughly 16 to 48 MB. Even out of these few megabytes, the DBMS gets only a fraction, because the application itself needs memory too. Finally, SQLite, by design, supports only two levels of transaction isolation: either serializable, or isolation switched off altogether (read uncommitted)!
When application performance starts to be limited by the performance of the DBMS, an alternative approach may come to the rescue; let's call it key-value oriented. Instead of decomposing an entity into attributes and creating a separate table column for each attribute, the entity is stored “as is” in a single field of type BLOB, in other words, it is serialized.
For complete clarity, let's look at an example. Suppose our data model in Java looks like this:
class Group {
    private Long _id;
    private String number;
    private List<Student> students;
    // getters and setters
    ...
}

class Student {
    private Long _id;
    private String name;
    private String surname;
    private Group group;
    // getters and setters
    ...
}
Thus, in the “standard” variant we get two tables with the corresponding sets of attributes:
-- GROUP is an SQL keyword, so the table name has to be quoted
create table "Group" (
    _id integer primary key autoincrement,
    number text
);

create table Student (
    _id integer primary key autoincrement,
    name text,
    surname text,
    group_id integer references "Group"(_id)
);
In the actual project there are many more entities and attributes, plus various service fields, such as the date of the last synchronization with the server, or a flag indicating whether the entity needs to be sent to the server to push the changed data, etc.
When applying the key-value approach, the tables will look like this:
create table "Group" (
    _id integer primary key autoincrement,
    value blob
);

create table Student (
    _id integer primary key autoincrement,
    value blob,
    group_id integer references "Group"(_id)
);
Here groups and students are serialized separately into their own tables. Or even like this:
create table "Group" (
    _id integer primary key autoincrement,
    value blob
);
where a group is serialized together with all of its students into a single table.
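To make this concrete, here is a minimal sketch of how a group could end up in such a value blob column using plain Java serialization and the Android SQLite API. It assumes an already opened SQLiteDatabase and that Group and Student implement java.io.Serializable; the DAO class itself is my illustration, not part of the original project.

import android.content.ContentValues;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class GroupBlobDao {
    private final SQLiteDatabase db;

    GroupBlobDao(SQLiteDatabase db) {
        this.db = db;
    }

    // Serialize the whole Group (including its students) into a byte[] and store it as a BLOB.
    long insert(Group group) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(group); // assumes Group and Student implement java.io.Serializable
        }
        ContentValues values = new ContentValues();
        values.put("value", bytes.toByteArray());
        return db.insert("\"Group\"", null, values);
    }

    // Read the BLOB back and deserialize it into a Group object.
    Group load(long id) throws IOException, ClassNotFoundException {
        try (Cursor cursor = db.rawQuery(
                "select value from \"Group\" where _id = ?",
                new String[]{String.valueOf(id)})) {
            if (!cursor.moveToFirst()) {
                return null;
            }
            byte[] blob = cursor.getBlob(0);
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
                return (Group) in.readObject();
            }
        }
    }
}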
Let's consider the advantages and disadvantages of both approaches and what conclusions can be drawn from them.
Comparison of approaches, pros and cons
Features of Relational Algebra
With the standard approach we get all the advantages we are so used to in the relational world, namely the SQL language for convenient selection, filtering and sorting of data, as well as for modifying the database schema. To get a collection of entities, we only need to formulate the required condition and take our data from the database. In the key-value approach, the work of filtering and sorting the data falls on the developer.
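As a rough illustration of that difference, the following sketch contrasts the two ways of getting the students of one group sorted by surname; the getters are the ones implied by the model above, and in the second variant the list of deserialized groups is assumed to have been loaded elsewhere.

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StudentQueries {

    // Standard approach: the DBMS does the filtering and sorting.
    static Cursor studentsOfGroup(SQLiteDatabase db, String groupNumber) {
        return db.rawQuery(
                "select s.name, s.surname "
                        + "from Student s join \"Group\" g on s.group_id = g._id "
                        + "where g.number = ? order by s.surname",
                new String[]{groupNumber});
    }

    // Key-value approach: every group has to be deserialized first,
    // then filtered and sorted by hand.
    static List<Student> studentsOfGroup(List<Group> allGroups, String groupNumber) {
        List<Student> result = new ArrayList<>();
        for (Group group : allGroups) {
            if (groupNumber.equals(group.getNumber())) {
                result.addAll(group.getStudents());
            }
        }
        Collections.sort(result, new Comparator<Student>() {
            @Override
            public int compare(Student a, Student b) {
                return a.getSurname().compareTo(b.getSurname());
            }
        });
        return result;
    }
}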
DB file size
With the standard approach, the database file is usually smaller, thanks to the low storage redundancy that normalization brings. In theory, the higher the degree of normalization, the lower the redundancy, but the read load on the database grows, since significant resources are spent on table joins. With the key-value approach, data redundancy is higher because the level of normalization is usually much lower, and the database file grows accordingly.
Flexibility when changing the database schema
As a project evolves, the database schema usually changes more than once: new fields are added, old ones are removed, entities are split into several new ones or, on the contrary, denormalized so that several tables are merged into one. If we can afford to sacrifice the data stored in the database when updating the schema, everything is simple: on every schema update we create a new database file and delete the old one. But what if the data needs to be preserved and converted to the new format?
Here the standard approach has the advantage. It is enough to write the appropriate migration scripts that convert the schema to the required form and fill the new fields with default values or compute them with whatever logic is needed. With serialization, updating the database schema is no longer such an easy task: we have to transform the schema while preserving all the data, and also update the data itself, deserializing it, initializing the new fields and serializing it back. Both the logical complexity of the operation and the time it takes grow.
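As a sketch of the two migration paths, here is what this could look like inside SQLiteOpenHelper.onUpgrade; the sync_date column is an invented example field, and the helpers in the commented-out BLOB branch are hypothetical.

import android.content.Context;
import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteOpenHelper;

class SchemaHelper extends SQLiteOpenHelper {

    SchemaHelper(Context context) {
        super(context, "app.db", null, /* version */ 2);
    }

    @Override
    public void onCreate(SQLiteDatabase db) {
        // create the tables for the current schema version ...
    }

    @Override
    public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
        if (oldVersion < 2) {
            // Standard approach: a plain DDL script plus a default value is enough.
            db.execSQL("alter table Student add column sync_date integer default 0");

            // Key-value approach: every stored object has to be
            // deserialized, migrated in memory and serialized back.
            // for (Group group : readAllGroups(db)) {   // hypothetical helpers
            //     group.setSyncDate(0L);
            //     rewriteGroup(db, group);
            // }
        }
    }
}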
Synchronization of access to entity instances
One of the main drawbacks of the key-value approach, it seems to me, is that in order to change just one field of an entity we have to deserialize the whole object, which greatly complicates access to objects. For example, when a group is serialized into the database together with all its students, changing the name of a single student means reading the entire group from the database, changing one name and saving the group back. If the application has several threads, services and/or content providers that can work with the same entities, the task becomes many times harder: the more potential “writers”, the more locking there will be and the harder it is to synchronize access to the objects. With the standard approach, this problem is solved at the DBMS level.
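To make the read-modify-write cost visible, here is a hedged comparison of renaming a single student. The BLOB variant reuses the GroupBlobDao sketched earlier; its update method and the getId()/setName() accessors are assumptions, and the whole operation has to run in a transaction to stay safe against other writers.

import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;

class RenameStudent {

    // Standard approach: one UPDATE, consistency handled by SQLite.
    static void rename(SQLiteDatabase db, long studentId, String newName) {
        ContentValues values = new ContentValues();
        values.put("name", newName);
        db.update("Student", values, "_id = ?", new String[]{String.valueOf(studentId)});
    }

    // Key-value approach: load the whole group, change one field, write everything back.
    static void rename(SQLiteDatabase db, GroupBlobDao dao,
                       long groupId, long studentId, String newName) throws Exception {
        db.beginTransaction();
        try {
            Group group = dao.load(groupId);            // deserialize the whole group
            for (Student student : group.getStudents()) {
                if (studentId == student.getId()) {     // assumes a getId() getter on Student
                    student.setName(newName);           // assumes a setName() setter
                }
            }
            dao.update(groupId, group);                 // hypothetical: serialize and overwrite the BLOB
            db.setTransactionSuccessful();
        } finally {
            db.endTransaction();
        }
    }
}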
Performance
On the one hand, the key-value approach gives higher performance when fetching small amounts of data: the number of joins goes down, so both the specific query and the DBMS as a whole run faster. On the other hand, with large amounts of data, if we need to filter or sort by a field that is serialized inside the object, we first have to read all the entities and only then throw away the excess, which can lead not to a performance gain but to a further loss. As a compromise, the fields involved in filtering or sorting can be stored as ordinary columns, and the rest of the entity as a BLOB, but such a hybrid can be hard to maintain.
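A sketch of such a hybrid table under the schema used above: the fields that appear in WHERE and ORDER BY clauses stay relational columns, and the rest of the Student is serialized into the BLOB.

import android.database.sqlite.SQLiteDatabase;

class HybridSchema {
    // Columns needed for filtering/sorting stay relational;
    // the full serialized object lives in the value BLOB.
    static void create(SQLiteDatabase db) {
        db.execSQL("create table Student ("
                + " _id integer primary key autoincrement,"
                + " surname text,"                               // used in ORDER BY
                + " group_id integer references \"Group\"(_id)," // used in WHERE / joins
                + " value blob)");                               // the serialized Student itself
    }
}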
Code volume
In the standard approach, the amount of SQL code grows: scripts for creating and modifying the schema, queries and conditions, DAO objects, and so on. In the key-value approach there is less of that code, but the amount of code that sorts, groups and filters data by various conditions grows instead, because all of this has to be done “manually”, whereas in the standard approach the DBMS does it and we only need to write the required query.
Serialization
Another minus of the key-value approach can be the performance loss caused by standard Java serialization/deserialization, which, as you know, is not very fast. As an alternative, you can use one of the libraries that solve this problem, for example protobuf from Google. Besides speed, an additional plus of protobuf is versioning: the protocol is designed so that old and new versions of a message remain compatible.
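A minimal sketch of what that might look like, assuming a hypothetical student.proto compiled by protoc into a generated StudentProto.Student class (generated protobuf messages do provide toByteArray() and parseFrom()):

// student.proto (hypothetical), compiled into StudentProto.Student:
//
//   message Student {
//     optional int64  id      = 1;
//     optional string name    = 2;
//     optional string surname = 3;
//   }

import android.content.ContentValues;
import android.database.sqlite.SQLiteDatabase;

class ProtoStudentDao {

    // Serialize with protobuf instead of java.io serialization.
    static long insert(SQLiteDatabase db, StudentProto.Student student) {
        ContentValues values = new ContentValues();
        values.put("value", student.toByteArray()); // generated messages expose toByteArray()
        return db.insert("Student", null, values);
    }

    static StudentProto.Student parse(byte[] blob) throws Exception {
        // parseFrom() is generated for every protobuf message; unknown new fields
        // are skipped and missing old fields fall back to defaults, which is
        // what makes schema evolution relatively painless.
        return StudentProto.Student.parseFrom(blob);
    }
}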
Conclusion
This came out a little scattered, but the point I wanted to make is this: both approaches are fine, and you should choose according to the situation, weighing all the pros and cons listed above. As a rule, if there are no performance problems, it is better to use the standard approach, which gives more flexibility. If such problems start to appear, try denormalization; if there are only a few critical spots in the application, that alone may solve everything. And if the performance problems persist and denormalization does not save you, it is worth taking a closer look at the key-value approach.