Access ClickHouse Using JDBC
Hi Habr! Not so long ago I had the pleasure of attending a PyData Moscow meetup at Yandex. I cannot call myself a Python developer, but I am interested in analytics and data analysis. At this event I learned about the ClickHouse DBMS, developed at Yandex and published on GitHub under an open-source license. A columnar SQL DBMS with domestic roots piqued my interest. In this article I will share my experience installing and configuring ClickHouse, as well as my attempt to access it from a Spring application using Hibernate.
Building and installation
I began my acquaintance with the DBMS at the documentation page.
I was quite surprised by the RAM recommendations. The developers recommend having as much memory as you have data (at least while the data volume is under 200 GB). It is not entirely clear whether this refers to the total memory of all machines in the cluster or to the memory of each machine. In any case, I had to make do with only 8 GB of memory for my virtual machine, while the test data set was over 60 GB uncompressed.
The first difficulty I ran into was the recommendation to use Ubuntu and the absence of rpm packages (UPD: I later discovered this repository). I prefer rpm-based systems and decided to build the DBMS from source for CentOS 7.
The guys from Yandex have prepared build instructions. A few points are worth noting:
- GCC version 6 or later is required, while CentOS ships with only 4.8 out of the box. I am curious what this requirement comes from; I did not spend time studying the ClickHouse code, but perhaps it uses a newer C++ standard that the 4.x series does not support. One way or another, GCC 6 also had to be compiled from source. The ClickHouse build instructions describe the GCC build process, but again primarily for Ubuntu; the following article helped. Building the compiler took me 1.5 hours.
- The second interesting aspect of the build was ClickHouse's dependency on libmysqlclient. I wonder how and why ClickHouse uses MySQL client functions?
Building the DBMS itself took about 40 minutes. In my case, configuring ClickHouse came down to editing `/usr/local/etc/clickhouse-server/config.xml`, since I decided to store the database on a separate partition. I was able to start the server with `sudo /usr/local/bin/clickhouse-server --config /usr/local/etc/clickhouse-server/config.xml`.
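For reference, the data directory is set by the `path` element of `config.xml`, so moving the database to a separate partition is a one-line change. The mount point `/data/clickhouse/` below is my own placeholder; adjust it to your partition:

```xml
<!-- config.xml: store ClickHouse data on a separate partition -->
<!-- /data/clickhouse/ is a placeholder mount point; the trailing slash is required -->
<path>/data/clickhouse/</path>
```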
I checked the DBMS using the example from this post, loading data on US air travel from 1987 to 2015. The loading process and sample queries are described there. Loading took me 24 minutes, and the database occupied 14 GB on disk for 61.6 GB of input data.
The execution times of the test queries from the article above on my machine were:

| Query | Time, s |
|---|---|
| which destinations were the most popular in 2015 | 2.067 |
| which cities have the most departing flights | 3.744 |
| from which cities you can fly to the largest number of destinations | 7.012 |
| how departure delay depends on the day of the week | 3.195 |
| from which cities flights are most often delayed by more than an hour | 3.392 |
| which flights are the longest | 12.466 |
| arrival delay distribution by airline | 4.596 |
| which airlines stopped flying | 1.653 |
| which cities gained more flights in 2015 | 0.380 |
| flights to which cities depend most on seasonality | 8.806 |
Once again, I note that my machine had only 8 GB of memory, almost 8 times less than the developers recommend. The numbers above are therefore no more than a rough estimate of how ClickHouse behaves on a developer's laptop rather than in production. Overall, the DBMS worked stably and, in my opinion, quickly enough: the analyzed table contained 166 million records.
Making ClickHouse and Java friends
The next step in my study of ClickHouse was to try accessing it from Java. The guys from Yandex have published a JDBC driver. Adding it to a Maven project is very simple:
```xml
<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>${clickhouse-jdbc-version}</version>
</dependency>
```
There are only a few points that can cause difficulty if you are unaware of them. First, ClickHouse supports several protocols: by default the native binary protocol uses port 9000, and the HTTP protocol uses port 8123. The JDBC driver works only over HTTP, so we specify port 8123. The default database name is default; the default username and password do not need to be set. For ClickHouse to accept connections from remote machines, add the following line to the configuration file:

```xml
<listen_host>::</listen_host>
```

Further, everything is as in regular JDBC:
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ClickHouseJDBCDemo {

    private static final String DB_URL = "jdbc:clickhouse://localhost:8123/default";

    private final Connection conn;

    /**
     * Creates a new instance
     * @throws SQLException in case of a connection issue
     */
    public ClickHouseJDBCDemo() throws SQLException {
        conn = DriverManager.getConnection(DB_URL);
    }

    /**
     * Queries the db to get the most popular flight routes for the given year
     * @param year year to query
     * @throws SQLException in case of a query issue
     */
    public void popularYearRoutes(int year) throws SQLException {
        String query = "SELECT " +
                "    OriginCityName, " +
                "    DestCityName, " +
                "    count(*) AS flights, " +
                "    bar(flights, 0, 20000, 40) AS bar " +
                "FROM ontime WHERE Year = ? GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20";
        long time = System.currentTimeMillis();
        try (PreparedStatement statement = conn.prepareStatement(query)) {
            statement.setInt(1, year);
            try (ResultSet rs = statement.executeQuery()) {
                Util.printRs(rs); // Util.printRs is a helper from the full sample project
            }
        }
        System.out.println("Time: " + (System.currentTimeMillis() - time) + " ms");
    }
}
```
As you can see, nothing unusual. The full sample code is available here.
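To make the connection details above concrete, here is a small helper of my own (not part of the driver) that assembles a ClickHouse JDBC URL from a host, the HTTP port, and a database name:

```java
public class ClickHouseUrlBuilder {
    // Defaults discussed above: the JDBC driver talks to the HTTP interface
    // (port 8123), and the default database is named "default".
    public static final int HTTP_PORT = 8123;
    public static final String DEFAULT_DB = "default";

    /** Builds a JDBC URL of the form jdbc:clickhouse://host:port/database. */
    public static String buildUrl(String host, int port, String database) {
        return "jdbc:clickhouse://" + host + ":" + port + "/" + database;
    }

    /** Convenience overload using the HTTP port and default database. */
    public static String buildUrl(String host) {
        return buildUrl(host, HTTP_PORT, DEFAULT_DB);
    }
}
```

With the defaults, `buildUrl("localhost")` produces the same `jdbc:clickhouse://localhost:8123/default` URL used in the demo above.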
What about Hibernate?
My next step was to try connecting an ORM. Naively assuming that where there is JDBC there will be an ORM, I reached for Spring Data JPA. Unfortunately, I ran into a number of problems here. First of all, when connecting to the database you must specify an SQL dialect, and I could not find an existing dialect implementation for ClickHouse.
As a result, I simply created a descendant of the standard `org.hibernate.dialect.Dialect`:

```java
public class ClickHouseDialect extends Dialect {
}
```
Connecting to the database now becomes possible. The problems start when executing queries: the thing is, ClickHouse does not support aliases for table names. In other words, a query of the form:

```sql
SELECT alias.column FROM tablename alias
```

is not valid for ClickHouse.
The problem is that Hibernate apparently does not allow disabling alias generation for tables via the SQL dialect; at least, I could not find a way to achieve this.
Having dug around a bit in the Hibernate code, I came to the conclusion that this behavior is wired quite deep into its core. However, my experience with Hibernate is not that extensive, and I would be glad if someone shares a known way around this restriction.
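To make the mismatch concrete, here is a toy sketch of the rewrite Hibernate would have to perform. The class and its regex are entirely my own illustration, not a real Hibernate extension point, and they handle only the single simple query shape shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: drops a trailing table alias ("FROM t a") and the
// "a." prefixes that refer to it. Real generated SQL (joins, subqueries,
// quoting) would need a proper parser.
public class AliasStripper {

    private static final Pattern TRAILING_ALIAS =
            Pattern.compile("FROM\\s+(\\w+)\\s+(\\w+)\\s*$", Pattern.CASE_INSENSITIVE);

    public static String stripAlias(String sql) {
        Matcher m = TRAILING_ALIAS.matcher(sql);
        if (!m.find()) {
            return sql; // no trailing "FROM table alias" pair: leave untouched
        }
        String alias = m.group(2);
        String withoutAlias = sql.substring(0, m.start()) + "FROM " + m.group(1);
        return withoutAlias.replace(alias + ".", "");
    }
}
```

For the example query, `stripAlias` turns `SELECT alias.column FROM tablename alias` into the ClickHouse-friendly `SELECT column FROM tablename`.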
Summary
Exploring ClickHouse was an exciting experience, and the DBMS showed itself well. I managed to compile it from source and execute queries with both the console client and JDBC. Unfortunately, incomplete support for ANSI SQL constructs made it impossible to simply plug in Hibernate. However, the purpose of the DBMS should be kept in mind: ClickHouse's main niche is complex analytical queries, most of which will have to be written wholly or partly by hand anyway. Still, the ability to simply browse and select data "out of the box" via an ORM would not be out of place.
I believe it will be extremely interesting to watch the project develop, especially given the growing community attention to ClickHouse recently and Yandex's active support of the project.