PHP + Java, or an in-memory cluster is now for PHP developers too

    Intro



    In a comment on the article "Write code every day," I said that I would soon show the project to which I devote an hour a day (except weekends). Since my recent work involves writing distributed Java applications that use an in-memory data grid (IMDG) as the data store, my project is related to that.

    You can read more about IMDG in my previous articles ( 1 , 2 ). In short, an IMDG is a clustered, distributed store of objects by key that holds all of its data in memory, which is what gives it its high data-access speed. It lets you not only store data but also process it without pulling it out of the cluster.
    And while every particular IMDG has its own interface for data processing, the data-access interface is usually identical to that of a hash table.

    What is this article about


    Most IMDGs are written in Java and provide APIs for Java, C++, and C#, while APIs for web programming languages (Python, Ruby, PHP) are not supported, and the protocol for writing your own clients is very limited. It is precisely this fact that I consider the main obstacle to mass adoption of IMDG: the lack of support for the most popular languages.

    Since IMDG vendors do not yet support web languages, web programmers cannot scale their applications as easily as Java server developers can. So I decided to build something of the kind myself and release it as open source, taking JBoss Infinispan as the open-source IMDG engine (JBoss, owned by Red Hat, is quite well known among Java developers). My project is called Sproot Grid; for now it is available only for PHP, but if there is interest from the community I will also integrate it with Ruby and Python.

    In this article I will once again talk about the in-memory data grid, and about how to configure, run, and use Sproot Grid.

    Why do I need an IMDG?


    The bottleneck of many high-load projects is the data store, in particular the relational database. Two approaches are mainly used to fight the shortcomings of traditional databases:

    1) Caching
    Pros:
    • high-speed data access

    Cons:
    • true cluster solutions are very rare; mostly the user has to distribute the data among the servers himself and, when accessing data, work out which server holds it (a sketch of this manual sharding follows the list). In such a system it is hard to achieve even utilization of all cluster nodes
    • it forces a trade-off between data freshness and access speed: data in the cache can become stale, and evicting old data and then caching new data means extra delays and extra load on the system
    • data is usually cached not as the domain objects the application works with but as BLOBs or strings, i.e. before using data obtained from the cache you must first reconstruct the necessary objects
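
    To illustrate the first drawback, here is a minimal sketch of the manual sharding that plain caching typically forces on the application (the server addresses and the key are made up for the example):

        <?php
        // Naive client-side sharding: the application itself picks the server.
        $servers = array('10.0.0.1:11211', '10.0.0.2:11211', '10.0.0.3:11211');
        $key = 'user:1234';
        // Modulo hashing: adding or removing a server remaps almost every key,
        // and nothing guarantees that the nodes fill up evenly.
        $server = $servers[abs(crc32($key)) % count($servers)];
        ?>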

    2) NoSQL solutions
    Pros:
    • good horizontal scalability

    Cons:
    • results are not as fast once the disk is involved
    • it is next to impossible to run internal corporate software that is built around a specific relational database on top of them

    IMDG combines the advantages of both approaches and has several further advantages over the solutions mentioned above:
    1. good horizontal scalability
    2. high-speed access
    3. true clustering (you can put data on any node and request it from any node of the cluster), with automatic balancing of data between nodes
    4. the cluster knows about all the fields of an object, so you can look objects up not only by key but also by field values
    5. indexes can be built on a field or on a combination of fields
    6. with the read-through and write-behind (or write-through) mechanisms, the data stays synchronized with the database, which lets other applications (or other modules of the application) keep using the traditional database (MySQL or Mongo, it doesn't matter)
    7. with the scheme from the previous point, the problem of refreshing the data in the cache disappears, because the cache always holds the same data as the database

    Let's take a closer look at those two interesting mechanisms: read-through and write-behind (write-through).

    read-through

    Read-through is a mechanism that pulls data from the database at request time.
    For example, you request an object from the cache by the key ' key ' and it turns out that the cluster holds no object with that key; the object is then automatically read from the database (or any other persistent store), put into the cache, and returned as the response to the request.
    If there is no such object in the database either, null is returned to the user.
    Naturally, writing the necessary SQL query, as well as mapping the query results onto the object, rests on the user's shoulders.
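
    As a minimal sketch of the idea in PHP (using the Sproot Grid client calls described below; fetchFromDb() stands in for the user's SQL query and mapping and is purely hypothetical):

        <?php
        // Read-through in miniature: on a cache miss, load the object from the
        // database, put it into the cache, and return it (null if the DB has nothing).
        function getWithReadThrough($client, $cacheName, $key) {
            if ($client->containsKey($cacheName, $key)) {
                return $client->get($cacheName, $key);     // cache hit
            }
            $value = fetchFromDb($key);                    // hypothetical: the user's SQL + mapping
            if ($value !== null) {
                $client->put($cacheName, $key, $value);    // cache it for the next request
            }
            return $value;
        }
        ?>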

    write-behind (write-through)

    To optimize write speed, you can write not to the database but directly to the cache. That sounds strange at first glance, but in practice it offloads the database nicely and speeds up the application.
    It looks roughly like this:
    1. The user calls cache.put(key, value) ; the object ' value ' is stored in the cache under the key ' key '
    2. An event handler fires in the cluster; it composes the SQL statement that writes the data to the database and executes it
    3. Control is returned to the user

    This interaction scheme is called write-through . It lets updates reach the database at the same time as they reach the cluster. As you can see, this approach does not speed up writes, but it guarantees data consistency between the cache and the database. Also, with this kind of write the data ends up in the cache, so reading it will still be faster than querying the database.
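
    In miniature it looks like this (an illustrative sketch only, not Sproot Grid code; in a real IMDG the handler runs inside the cluster):

        <?php
        // Write-through: the database write happens synchronously inside put(),
        // so control returns only after both stores are updated.
        function writeThroughPut(array &$cache, $key, $value, callable $writeToDb) {
            $cache[$key] = $value;      // 1. the object is stored in the cache
            $writeToDb($key, $value);   // 2. the event handler writes it to the database
                                        // 3. only now does control return to the user
        }
        ?>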

    If writing to the database at the same moment is not critical, you can use the more popular write-behind mechanism, which arranges a deferred write to the database (or any other store). Like this:
    1. The user calls cache.put(key, value) ; the object ' value ' is stored in the cache under the key ' key '
    2. Control is returned to the user
    3. After a while (configurable by the user), a cache-write event handler fires
    4. The handler collects the whole batch of objects that have changed since its previous run
    5. The batch is sent to the database to be written

    With write-behind the write operation is significantly faster, because the user does not wait for the update to reach the database but simply puts the data into the cache; all updates to the same object are merged into one resulting update, and writes go to the database in batches, which also has a positive effect on the load on the database server.
    Thus you can configure your IMDG so that every 3 seconds (or 2 minutes, or 50 ms) all data updates are sent to the database asynchronously.
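
    A toy sketch of that coalescing behavior (illustrative only; the real handler runs inside the cluster on a timer):

        <?php
        // Write-behind in miniature: updates accumulate in a buffer, repeated
        // updates to the same key merge, and a periodic flush writes one batch.
        class WriteBehindBuffer {
            private $dirty = array();

            public function put($key, $value) {
                $this->dirty[$key] = $value;        // later updates overwrite earlier ones
            }

            // imagine this being called by a timer every 3 s / 2 min / 50 ms
            public function flush(callable $writeBatchToDb) {
                if (!empty($this->dirty)) {
                    $writeBatchToDb($this->dirty);  // one batched write instead of many
                    $this->dirty = array();
                }
            }
        }
        ?>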

    What of this is in Sproot Grid?


    In the first version I decided not to implement everything described above at once, because that would take a lot of time, and I wanted to get feedback from users quickly.
    So, here is what is available in Sproot Grid 1.0.0:
    1. Horizontal scalability and honest clustering, with the amount of data balanced between cluster nodes
    2. The ability to store both built-in PHP types and domain objects
    3. The ability to build an index on a field and search against that index


    Getting started


    First you need to download the distribution kit from here and unzip it.

    Installing the necessary software

    Since JBoss Infinispan is a Java application, a way for Java and PHP to talk to each other had to be chosen. Apache Thrift was picked as that link (a serialization and transport protocol used, among other things, as the original client interface of Cassandra), so for Sproot Grid to work on your system you need to install the following:
    • Java
    • Thrift: not required in production; it only needs to be installed on the development machine (see "Code generation" for details). When deploying to production you only need to copy the .php files of the Thrift library and the Java library in .jar format
    • PHP (if it is not already installed)
    Installation instructions are on the project wiki.

    Configuration

    The configuration file must be located at $deploymentFolder/sproot-grid/config/definition.xml, where $deploymentFolder is the path to the directory into which you unpacked the distribution.
    Configuration Example:
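
    The exact schema is documented on the wiki; the sketch below only illustrates the overall shape of the file, and every tag and attribute name in it is an assumption reconstructed from the description that follows, not the real schema:

        <definition>
            <types>
                <!-- map each domain type to the cache that stores it -->
                <type name="User" cache="user-cache"/>
            </types>
            <cluster>
                <caches>
                    <!-- backup-count: how many nodes you can lose without losing data -->
                    <cache name="user-cache" backup-count="1"/>
                </caches>
                <!-- class D multicast address used to assemble the cluster -->
                <multicast address="224.1.2.3"/>
                <nodes>
                    <node id="1" type="storage-only"/>
                    <!-- the second node type also serves PHP clients, hence the port -->
                    <node id="2" port="9090"/>
                </nodes>
            </cluster>
        </definition>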


    You can read more about the configuration on the project wiki.

    As you can see from the configuration, a cache is a hash table distributed across the cluster; a cluster can hold as many caches as you like, but only objects of a single type can be stored in any one cache. The configuration file consists of two main sections: a description of the types to be stored in the cluster, and a description of the cluster structure with the list of caches it will hold.

    The types section describes the types that will be stored in your cluster. You can use both built-in PHP types and your own. For each object type you can specify a cache name (or omit it if you do not want to store objects of that type in a separate cache).

    The cluster section describes the cluster structure and the list of caches that will be stored in it:
    • The caches part describes the caches. A cache name must be unique; the backup-count parameter determines how many cluster nodes you can lose without losing data. The higher the backup-count, the more reliable your cluster, but the more memory it consumes. You can also configure eviction (automatic removal of objects from the cache); see the wiki page for details.
    • The multicast part defines the multicast address that will be used to assemble the cluster. As you know, only class D addresses (224.0.0.0 - 239.255.255.255) are available for multicast.
    • The nodes part describes the number and types of cluster nodes. For now there are only two node types: a storage-only node deals solely with data storage and internal service requests, while the other type not only stores data but also handles external requests, so for nodes of that type you must specify the port on which requests from PHP clients will be received.

    Code generation for integration with your application

    For efficient operation the cluster needs code generated specifically for your application (your domain model), with its Java part compiled, since this works faster than accessing objects through reflection. To generate and compile all the necessary code:
    	1) cd $deploymentFolder/sproot-grid/scripts
    	2) build.sh(or build.cmd)
    
    where $deploymentFolder is the directory into which you unpacked the distribution.
    Code generation only needs to be redone when the description of the domain model changes, i.e. if your model is stable you will perform this operation just once; after that the generated PHP sources can be kept in your code repository, and the Java part is compiled into a library. In other words, you don't need to generate anything ten times before deploying your application: it is done once, at development time.
    After code generation completes, copy the folder with the .php files from $deploymentFolder/sproot-grid/php/org to the root of your application.

    Launch

            1) cd $deploymentFolder/sproot-grid/scripts
            2) run.sh(run.cmd) nodeId memorySize
    
    where nodeId is the value of the node's id attribute in the configuration file,
    and memorySize is the amount of memory (in megabytes or gigabytes) that you want to allocate to the node.

    For example:
    run.sh 1 256m
    
    or
    run.cmd 2 2g
    

    Usage

    At the code generation step you produced everything needed to integrate with your application. All that remains is to copy that code into your application: copy everything from the $deploymentFolder/sproot-grid/php folder to the root of your application.
    That's it! Now you can use the cluster from your application.
    Code example:
    <?php
        // ...include the generated sources and create the $client object here;
        // see the project wiki for the exact client class and constructor
        echo $client->cacheSize('user-cache');
        $user = new User();
        $user->setName('SomeUser');
        $user->setId(1234);
        $client->put('user-cache', '1234', $user);
        echo $client->cacheSize('user-cache');
    ?>
    


    You can find the description of the API here , but in short, the API currently looks like this:
    • get($cacheName, $key)
    • getAll($cacheName, array $keys)
    • cacheSize($cacheName)
    • cacheKeySet($cacheName)
    • containsKey($cacheName, $key)
    • search($cacheName, $fieldName, $searchWord)
    • remove($cacheName, $key)
    • removeAll($cacheName, array $keys)
    • put($cacheName, $key, $domainObject)
    • putAll($cacheName, array $domainObjects)
    • clearCache($cacheName)
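
    And a few of the remaining calls in action, assuming the $client object from the example above and that cacheKeySet() returns a plain array of keys:

        <?php
            $user = $client->get('user-cache', '1234');                 // fetch by key
            $found = $client->search('user-cache', 'name', 'SomeUser'); // search via a field index
            $keys = $client->cacheKeySet('user-cache');                 // all keys in the cache
            $client->removeAll('user-cache', $keys);                    // remove by keys...
            $client->clearCache('user-cache');                          // ...or wipe the cache entirely
        ?>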


    Conclusion


    Sproot Grid is published under the MIT license.
    Sources
    Wiki
    Distribution


    What would you like to see in the next release (you can choose several options)?

    • 62.1%: adding read-through / write-behind functionality (46 votes)
    • 10.8%: Ruby support (8 votes)
    • 16.2%: Python support (12 votes)
    • 41.8%: distributed data processing within the cluster (31 votes)
    • 25.6%: the ability to deploy a cluster on Amazon EC2 (19 votes)
