How to use HDF5 files in Python

Original author: Aquiles Carattino

HDF5 allows you to efficiently store large amounts of data

When working with large amounts of data, whether experimental or simulated, storing them in several text files is not very efficient. Sometimes you need to access a particular subset of the data, and you want to do it quickly. In these situations, the HDF5 format solves both problems thanks to a highly optimized underlying library. HDF5 is widely used in scientific environments and has an excellent implementation in Python, designed to work with NumPy right out of the box.

The HDF5 format supports files of any size, and each file has an internal structure that lets you search for specific datasets. You can think of it as a single file with its own hierarchical structure, much like a collection of folders and subfolders. By default, data is stored in binary format, and the library is compatible with different data types. One of the most important features of the HDF5 format is that it allows you to attach metadata to every element of the structure, which makes it ideal for creating self-describing files.


In Python, the interface to the HDF5 format is provided by the h5py package. One of the most interesting features of this package is that data is read from the file only when it is needed. Imagine that you have a very large array that does not fit into your available RAM; for example, the array could have been generated on a computer with very different specifications from the one you use for data analysis. The HDF5 format allows you to choose which elements of the array to read, with a syntax equivalent to NumPy. You can then work with data stored on your hard drive rather than in RAM, without significant changes to your existing code.

In this article we will look at how you can use h5py to store and retrieve data from a hard disk. We will discuss different ways of storing data and how to optimize the reading process. All examples that appear in this article are also available in our Github repository.

Installation

The HDF5 format is maintained by the HDF Group, and it is based on open standards, which means that your data will always be accessible even if the group disappears. Python support is provided through the h5py package, which can be installed via pip. Remember to use a virtual environment for your tests:

pip install h5py

This command will also install NumPy if it is not already in your environment.

If you are looking for a graphical tool to examine the contents of your HDF5 files, you can install HDF5 Viewer. It is written in Java, so it should work on almost any computer.

Basic data storage and reading

Let's move on to using the HDF5 library. We will create a new file and save a random NumPy array into it.

import h5py
import numpy as np
arr = np.random.randn(1000)
with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", data=arr)

The first few lines are quite simple: we import the h5py and NumPy packages and create an array of random values. We open the file random.hdf5 with write permission, w, which means that if a file with that name already exists it will be overwritten. If you want to preserve an existing file and still be able to write to it, you can open it with the a attribute instead of w. We then create a dataset called default and set the data to the random array created earlier. Datasets are the holders of our data, the basic building blocks of the HDF5 format.

Note

If you are not familiar with the with statement, I should point out that it is a convenient way to open and close files. Even if an error occurs inside the with block, the file will be closed. If for some reason you do not use with, never forget to call f.close() at the end. The with statement works with any files, not only HDF files.
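
For illustration, here is a minimal sketch of what the with statement saves you from writing by hand; the try/finally guarantees the file is closed even if an error occurs:

import h5py
import numpy as np

arr = np.random.randn(1000)
f = h5py.File('random.hdf5', 'w')
try:
    dset = f.create_dataset("default", data=arr)
finally:
    f.close()  # runs even if an error occurred above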

We can read the data in almost the same way as we read the NumPy file:

with h5py.File('random.hdf5', 'r') as f:
   data = f['default']
   print(min(data))
   print(max(data))
   print(data[:15])

We open the file with the read attribute r and retrieve the data by directly accessing the dataset named default. If you open a file and do not know which datasets are available, you can list them:

for key in f.keys():
   print(key)

Once you have read the dataset you wanted, you can use it as if it were any NumPy array. For example, you can find the maximum and minimum values or select the first 15 values of the array. These simple examples, however, hide many things that happen under the hood, and they need to be discussed in order to understand the full potential of HDF5.

In the example above, you can use the data as an array. For example, you can refer to the third element by typing data[2], or you can take a range with data[1:3]. Note: data is not an array, it is a dataset. You can see this by typing print(type(data)). Datasets work quite differently from arrays, because their information is stored on the hard disk and is not loaded into RAM unless we use it. The following code, for example, will not work:

f = h5py.File('random.hdf5', 'r')
data = f['default']
f.close()
print(data[1])

The error that appears is a bit cumbersome, but the last line is very useful:

ValueError: Not a dataset (not a dataset)

The error means that we are trying to access a dataset that we no longer have access to. This is a bit confusing, but it happens because we closed the file and are therefore no longer allowed to access the second value in data. When we assign f['default'] to the variable data, we do not actually read the data from the file; instead we generate a pointer to where the data lives on the hard disk. On the other hand, this code will work:

f = h5py.File('random.hdf5', 'r')
data = f['default'][:]
f.close()
print(data[10])

Note that the only difference is the [:] added after reading the dataset. Many other guides stop at examples like these without ever demonstrating the full potential of the HDF5 format with the h5py package. With the examples we have covered so far, you might be wondering: why use HDF5 at all if saving NumPy files gives you the same functionality? Let's dive into the features of the HDF5 format.

Selective reading from HDF5 files

Until now, we have seen that when we read a dataset, we still do not read the data from the disk, instead we create a link to a specific location on the hard disk. We can see what happens if, for example, we explicitly read the first 10 elements of a data set:

with h5py.File('random.hdf5', 'r') as f:
   data_set = f['default']
   data = data_set[:10]
print(data[1])
print(data_set[1])

We split the code across several lines to make it more explicit, but you can be more concise in your own projects. In the lines above, we first open the file and then read the default dataset. We assign the first 10 elements of the dataset to the data variable. After the file is closed (when the with block ends), we can still access the values stored in data, but data_set will raise an error. Note that we only read from disk when we explicitly refer to the first 10 elements of the dataset. If you look at the types of data and data_set, you will see that they really are different: the first is a NumPy array, and the second is an h5py Dataset.
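
If you want to check the types yourself, a short sketch like the following (reusing the random.hdf5 file created earlier) makes the difference explicit:

import h5py

with h5py.File('random.hdf5', 'r') as f:
    data_set = f['default']
    data = data_set[:10]
    print(type(data))      # a NumPy ndarray, fully loaded into memory
    print(type(data_set))  # an h5py Dataset, still pointing to the file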

The same behavior applies in more complex scenarios. Let's create a new file, this time with two datasets, and select the elements of one of them based on the elements of the other. Start by creating a new file and storing the data; this part is the simplest:

import h5py
import numpy as np
arr1 = np.random.randn(10000)
arr2 = np.random.randn(10000)
with h5py.File('complex_read.hdf5', 'w') as f:
    f.create_dataset('array_1', data=arr1)
    f.create_dataset('array_2', data=arr2)

We have two datasets called array_1 and array_2, each of which contains a random NumPy array. We want to read the values of array_2 that correspond to the elements where the values of array_1 are positive. We can try to do something like this:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1>0]

but it won't work: d1 is a dataset and cannot be compared with an integer. The only way is to actually read the data from disk and then compare it. So we end up with something like this:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1[:]>0]

The first dataset, d1, is loaded fully into memory when we do d1[:], while from the second dataset, d2, we take only some elements. If d1 were too large to fit into memory, we could work inside a loop:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = []
    for i in range(len(d1)):
        if d1[i] > 0:
            data.append(d2[i])
print('The length of data with a for loop: {}'.format(len(data)))

Of course, reading element by element and appending to a list is not efficient, but it is a very good example of one of the biggest advantages of using HDF5 over text or NumPy files. Inside the loop, we load only one element into memory at a time. In our example each element is just a number, but it could be anything from text to an image or a video.

As always, depending on your application, you will have to decide whether you want to read the entire array into memory or not. Sometimes you run simulations on a machine with a large amount of memory, but your laptop does not have the same specifications, and you are forced to read chunks of your data. Remember that reading from a hard disk is relatively slow, especially if you use an HDD instead of an SSD, and even slower if you are reading from a network drive.
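
If reading element by element is too slow, a middle ground is to read the data in fixed-size blocks. The sketch below (the block size is an arbitrary choice) keeps memory usage bounded while avoiding one disk access per element:

import h5py
import numpy as np

block = 1000  # arbitrary block size; tune it to the memory you have available
with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = []
    for start in range(0, len(d1), block):
        stop = min(start + block, len(d1))
        chunk1 = d1[start:stop]  # only one block of each dataset is in memory at a time
        chunk2 = d2[start:stop]
        data.append(chunk2[chunk1 > 0])
    data = np.concatenate(data)
print('The length of data with blocked reads: {}'.format(len(data)))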

Selective writing to HDF5 files

In the examples above, we added data to the dataset the moment it was created. For many applications, however, you need to save data while it is being generated. HDF5 allows you to save data in much the same way as you read it. Let's see how to create an empty dataset and add some data to it.

arr = np.random.randn(100)
with h5py.File('random.hdf5', 'w') as f:
   dset = f.create_dataset("default", (1000,))
   dset[10:20] = arr[50:60]

The first two lines are the same as before, except for create_dataset. We do not add data when the dataset is created; we simply create an empty dataset that can hold 1000 elements. Following the same logic as before, when we read specific elements of a dataset, we actually write to disk only when we assign values to specific elements of the dset variable. In the example above, we assign values only to a subset of the array, with indices 10 to 19.

Warning

It is not entirely true that you write to disk the moment you assign values to a dataset. The exact timing depends on several factors, including the state of the operating system. If the program closes too early, it may happen that not everything gets written. It is very important to always use the close() method, and if you write in stages, you can also use flush() to force writing. Using with prevents many writing problems.

If you read the file back and print the first 20 values of the dataset, you will see that they are all zeros except for indices 10 to 19. There is a common mistake here that can cause a real headache. The following code will not save anything to disk:

arr = np.random.randn(1000)
with h5py.File('random.hdf5', 'w') as f:
   dset = f.create_dataset("default", (1000,))
   dset = arr

This mistake causes a lot of problems, because you will not realize that you have written nothing until you try to read the result. The problem here is that you do not specify where you want to store the data; you simply overwrite the dset variable with a NumPy array. Since the dataset and the array have the same length, you should use dset[:] = arr. This mistake happens more often than you might think, and since it is not technically incorrect, you will not see any error on the terminal, and your data will be all zeros.
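
For completeness, a minimal sketch of the corrected version: the slice on the left-hand side tells h5py where in the dataset the values should go.

arr = np.random.randn(1000)
with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset[:] = arr  # write into the dataset instead of rebinding the variable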

Until now, we have always worked with one-dimensional arrays, but we are not limited to them. For example, suppose we want to use a 2D array, we can simply do:

dset = f.create_dataset('default', (500, 1024))

which lets us store data in a 500x1024 array. To use the dataset, we can use the same syntax as before, now taking the second dimension into account:

dset[1,2] = 1
dset[200:500, 500:1024] = 123

Specifying data types to optimize space

So far, we have looked only at the tip of the iceberg of what HDF5 has to offer. In addition to the length of the data you want to save, you can specify the type of data to optimize the space. The h5py documentation contains a list of all supported types, here we show only a couple of them. At the same time, we will work with several data sets in one file.

with h5py.File('several_datasets.hdf5', 'w') as f:
   dset_int_1 = f.create_dataset('integers', (10, ), dtype='i1')
   dset_int_8 = f.create_dataset('integers8', (10, ), dtype='i8')
   dset_complex = f.create_dataset('complex', (10, ), dtype='c16')
   dset_int_1[0] = 1200
   dset_int_8[0] = 1200.1
   dset_complex[0] = 3 + 4j

In the example above, we created three different datasets, each with a different type: 1-byte integers, 8-byte integers and 16-byte complex numbers. We store only one number, even though our datasets can hold up to 10 elements each. You can read the values back and see what was actually saved. Note that the 1-byte integer is clipped to 127 (instead of 1200), and the 8-byte integer is truncated to 1200 (instead of 1200.1).
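
To see this for yourself, a short sketch that reads the file back (reusing the datasets created above) could look like this:

with h5py.File('several_datasets.hdf5', 'r') as f:
    print(f['integers'][0])   # 127: 1200 does not fit into a 1-byte integer
    print(f['integers8'][0])  # 1200: the fractional part of 1200.1 is dropped
    print(f['complex'][0])    # (3+4j)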

If you have ever programmed in languages like C or Fortran, you probably know what the different data types mean. However, if you have always worked with Python, you may never have had to explicitly declare the type of the data you are working with. It is important to remember that the number of bytes tells you how many different numbers you can store. If you use 1 byte, you have 8 bits, and therefore you can store 2^8 different numbers. In the example above, the integers are signed: they can be negative, zero or positive. With 1-byte integers you can store values from -128 to 127, i.e. 2^8 possible numbers in total. The same holds with 8 bytes, only with a much larger range of numbers.

The data type you select will affect the size. First, let's see how this works with a simple example. We create three files, each with one dataset of 100,000 elements but with a different data type. We store the same data in each of them and then compare their sizes. We create a random array and assign it to each dataset so the space is actually filled. Remember that the data will be converted to the format specified in the dataset.

arr = np.random.randn(100000)
f = h5py.File('integer_1.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i1')
d[:] = arr
f.close()
f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()
f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='f16')
d[:] = arr
f.close()

When you check the size of each file, you will get something like:

File          Size (b)
integer_1     102144
integer_8     802144
float         1602144

The relationship between size and data type is quite obvious. When you go from 1-byte to 8-byte integers, the file size grows roughly 8 times; similarly, with 16 bytes per element it takes about 16 times more space. But space is not the only important factor: you must also consider the time required to write the data to disk. The more you have to write, the longer it will take. Depending on your application, it may be crucial to optimize reading and writing data.
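
A quick way to reproduce the table, assuming the three files above were created in the current directory, is to ask the operating system for their sizes:

import os

for name in ['integer_1.hdf5', 'integer_8.hdf5', 'float.hdf5']:
    print('{}: {} bytes'.format(name, os.path.getsize(name)))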

Please note that if you use the wrong data type you may also lose information. For example, if you have 8-byte integers and you store them as 1-byte integers, their values will be truncated. In the laboratory, it is very common to have devices that produce different types of data: some DAQ cards have 16 bits, some cameras work with 8 bits, and some may work with 24. It is important to pay attention to data types, but it is also something Python developers may overlook, because you normally do not need to declare types explicitly.

It is also worth remembering that, by default, a NumPy array is created with 8-byte (64-bit) floats per element. This can be a problem if, for example, you initialize an array of zeros to hold data that only needs 2 bytes. The dtype of the array itself will not change, and if you pass the data when creating the dataset (by adding data=my_array), the default format will be 'f8', which matches the array but not the real data.
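
A short sketch of this point (the file name dtype_example.hdf5 is just for illustration): a freshly created NumPy array defaults to 64-bit floats, so if you pass it as data= without specifying dtype, the dataset inherits that 8-byte type.

import h5py
import numpy as np

my_array = np.zeros(100)
print(my_array.dtype)  # float64: 8 bytes per element by default

with h5py.File('dtype_example.hdf5', 'w') as f:
    # without dtype the dataset would default to 'f8'; here we force 2-byte integers
    d = f.create_dataset('default', data=my_array, dtype='i2')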

Thinking about data types is not something that happens regularly if you work with Python in simple applications. However, you should know that there are data types and how they can affect your results. You may have large hard drives, and you don’t particularly care about file storage, but when you care about the speed with which you save, there is no other way than to optimize every aspect of your code, including data types.

Data compression

When saving data, you can choose to compress it with different algorithms. The h5py package supports several compression filters, such as GZIP, LZF and SZIP. When one of the compression filters is used, the data is processed on its way to disk and decompressed when it is read. Therefore, nothing special changes in the way the code works. We can repeat the same experiment, storing different data types, but this time with a compression filter. Our code looks like this:

import h5py
import numpy as np
arr = np.random.randn(100000)
with h5py.File('integer_1_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i1', compression="gzip", compression_opts=9)
    d[:] = arr
with h5py.File('integer_8_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i8', compression="gzip", compression_opts=9)
    d[:] = arr
with h5py.File('float_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='f16', compression="gzip", compression_opts=9)
    d[:] = arr

We chose gzip because it is supported on all platforms. The compression_opts parameter sets the compression level. The higher the level, the less space the data occupies, but the more the processor has to work. The default compression level is 4. We can see the differences between our files based on the compression level:

Type        Without compression    Compression 9    Compression 4
integer_1   102144                 28016            30463
integer_8   802144                 43329            57971
float       1602144                1469580          1469868

The effect of compression is much more noticeable on the integer datasets than on the floating-point dataset. I leave it to you to figure out why compression worked so well in the first two cases but not in the last. As a hint: check what data you are actually saving.
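
If you want to follow the hint, one way (a small sketch, reusing the files created above) is to look at which distinct values actually ended up in the integer dataset:

import h5py
import numpy as np

with h5py.File('integer_1_compr.hdf5', 'r') as f:
    values = f['dataset'][:]
print(np.unique(values))  # only a handful of distinct small integers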

Reading compressed data does not change any of the code described above. The main library of HDF5 will take care of extracting data from compressed data sets using the appropriate algorithm. Therefore, if you implement compression to save, you do not need to change the code you use to read.

Data compression is an extra tool that you must weigh together with all the other aspects of your data handling. You need to balance the extra processor time against the compression achieved to evaluate the benefit of compressing data within your own application. The fact that it is transparent to downstream code makes it incredibly easy to test and find the optimal solution.

Resizing datasets

When you are running an experiment, it is sometimes impossible to know how big your data will be. Imagine that you are recording a movie: perhaps you stop it after one second, perhaps after an hour. Fortunately, HDF5 allows you to resize datasets on the fly with little computational overhead. A dataset can be grown up to a maximum size, which is specified when the dataset is created using the maxshape keyword:

import h5py
import numpy as np
with h5py.File('resize_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100, ),  maxshape=(500, ))
    d[:100] = np.random.randn(100)
    d.resize((200,))
    d[100:200] = np.random.randn(100)
with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])

First you create a dataset to store 100 values and set the maximum size to 500 values. After saving the first batch of values, you can expand the dataset to store the next 100. You can repeat the procedure until you have a dataset with 500 values. The same holds for arrays with more dimensions: any axis of an N-dimensional array can be resized. You can verify that the data was saved correctly by reading the file back and printing two of its elements.
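
In an acquisition loop, the same idea (a sketch; the file name and batch size are arbitrary) looks like this: grow the dataset just before writing each new batch.

import h5py
import numpy as np

batch = 100  # hypothetical batch size
with h5py.File('resize_loop.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (0,), maxshape=(500,))
    for i in range(5):
        new_data = np.random.randn(batch)
        start = d.shape[0]
        d.resize((start + batch,))         # grow the dataset by one batch
        d[start:start + batch] = new_data  # write into the newly added region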

You can also resize the dataset at a later stage; you do not need to do it in the same session in which the file was created. For example, you can do something like this (note that we open the file with the a attribute so as not to overwrite the existing file):

with h5py.File('resize_dataset.hdf5', 'a') as f:
    dset = f['dataset']
    dset.resize((300,))
    dset[:200] = 0
    dset[200:300] = np.random.randn(100)
with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])
    print(dset[299])

In the example above, you can see that we open the dataset, change its first 200 values, and add new values at positions 200 to 299. Reading the file back and printing a few values shows that everything worked as expected.

Imagine that you are recording a movie but do not know how long it will last. An image is a 2D array in which each element is a pixel, and a movie is nothing more than a stack of several 2D arrays. To store movies, we have to define a three-dimensional array in our HDF5 file, but we do not want to set a limit on the duration. To be able to expand the third axis of our dataset without a fixed maximum, we can do the following (we use random arrays as placeholder frames so the example runs on its own):

# placeholder frames standing in for real camera images
first_frame = np.random.randn(1024, 1024)
second_frame = np.random.randn(1024, 1024)

with h5py.File('movie_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (1024, 1024, 1), maxshape=(1024, 1024, None))
    d[:, :, 0] = first_frame
    d.resize((1024, 1024, 2))
    d[:, :, 1] = second_frame

The dataset holds square images of 1024x1024 pixels, while the third dimension stacks them in time. We assume that the images do not change shape and that we want to stack them one after another without setting a limit. That is why we set the third dimension of maxshape to None.

Saving data in chunks

To optimize data storage, you can store it in chunks. Each chunk is contiguous on the hard disk and is stored as a block, i.e. the entire chunk is written at once. The same happens when reading: a chunk is loaded in its entirety. To create a chunked dataset, use the command:

dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))

This command means that all the data in dset[0:100, 0:100] will be stored together. The same holds for dset[200:300, 200:300], dset[100:200, 400:500], and so on. According to h5py, chunking has some performance implications:

Chunking has performance implications. It is recommended to keep the total size of your chunks between 10 KiB and 1 MiB, or larger for big datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

There is also the possibility of enabling auto-chunking, which selects the chunk size automatically. Auto-chunking is enabled by default if you use compression or maxshape. You can enable it explicitly like this:

dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Organizing data with groups

We have seen many different ways of storing and reading data. Now we have to consider one of the last important topics of HDF5: how to organize the information inside a file. Datasets can be placed inside groups, which behave in much the same way as directories. First we create a group and then add a dataset to it:

import numpy as np
import h5py
arr = np.random.randn(1000)
with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    gg = g.create_group('Sub_Group')
    d = g.create_dataset('default', data=arr)
    dd = gg.create_dataset('default', data=arr)

We create the group Base_Group and, inside it, a second one called Sub_Group. In each group we create a dataset called default and store the random array in it. When you read the file back, you will notice how the data is structured:

with h5py.File('groups.hdf5', 'r') as f:
   d = f['Base_Group/default']
   dd = f['Base_Group/Sub_Group/default']
   print(d[1])
   print(dd[1])

As you can see, to access a dataset we address it like a path inside the file: Base_Group/default or Base_Group/Sub_Group/default. When you read a file, you may not know how the groups were named, and you need to list them. The easiest way is to use keys():

with h5py.File('groups.hdf5', 'r') as f:
    for k in f.keys():
        print(k)

However, when you have nested groups, you would also need nested for-loops. There is a better way to walk the tree, but it is a bit more involved. We need to use the visit() method, for example:

def get_all(name):
    print(name)
with h5py.File('groups.hdf5', 'r') as f:
   f.visit(get_all)

Note that we define a function get_all that takes one argument, name. The visit method takes a function like get_all as an argument and calls it for every element; as long as the function returns None, visit keeps iterating. For example, imagine that we are looking for an element called Sub_Group; we need to change get_all:

def get_all(name):
    if 'Sub_Group' in name:
        return name
with h5py.File('groups.hdf5', 'r') as f:
    g = f.visit(get_all)
    print(g)

As visit walks through each item, as soon as the function returns something other than None, it stops and returns that value. Because we are looking for Sub_Group, we make get_all return the name as soon as Sub_Group appears in the name being analyzed. Keep in mind that g is a string; if you want to actually get the group, you should do:

with h5py.File('groups.hdf5', 'r') as f:
   g_name = f.visit(get_all)
   group = f[g_name]

You can then work with the group as explained earlier. A second approach is to use a method called visititems, which takes a function with two arguments: name and object. We can do the following:

def get_objects(name, obj):
    if 'Sub_Group' in name:
        return obj
with h5py.File('groups.hdf5', 'r') as f:
   group = f.visititems(get_objects)
   data = group['default']
   print('First data element: {}'.format(data[0]))

The main difference when using visititems is that we have access not only to the name of the object being visited but also to the object itself. You can see that the function returns an object, not a name. This pattern allows for more sophisticated filtering; for example, you might be interested in groups that are empty or that contain a particular type of dataset.
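
As a sketch of such a filter (reusing the groups.hdf5 file from above; the function name is just for illustration), you could list only the objects that are actually datasets:

import h5py

def print_datasets(name, obj):
    # only print objects that are datasets, skipping the groups
    if isinstance(obj, h5py.Dataset):
        print(name)

with h5py.File('groups.hdf5', 'r') as f:
    f.visititems(print_datasets)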

Metadata storage in HDF5

One aspect of HDF5 that is often overlooked is the ability to store metadata attached to any group or dataset. Metadata is crucial for understanding, for example, where the data came from, what parameters were used to measure or simulate it, and so on. Metadata makes the file self-describing. Imagine that you open some old data and find a 200x300x250 matrix. You may know that it is a movie, but you do not know which dimension is time, nor the timing between frames.

Metadata can be saved to an HDF5 file in different ways. The official way is to add attributes to groups and datasets.

import time
import numpy as np
import h5py
import os
arr = np.random.randn(1000)
with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)
    g.attrs['Date'] = time.time()
    g.attrs['User'] = 'Me'
    d.attrs['OS'] = os.name
    for k in g.attrs.keys():
        print('{} => {}'.format(k, g.attrs[k]))
    for j in d.attrs.keys():
      print('{} => {}'.format(j, d.attrs[j]))

In the code above, you can see that attrs behaves like a dictionary. In principle, you should not use attributes to store data; keep them as small as possible. However, you are not limited to single values; you can also store arrays. If you have metadata stored in a dictionary and want to add it to the attributes automatically, you can use update:

with h5py.File('groups.hdf5', 'w') as f:
   g = f.create_group('Base_Group')
   d = g.create_dataset('default', data=arr)
   metadata = {'Date': time.time(),
      'User': 'Me',
      'OS': os.name,}
   f.attrs.update(metadata)
   for m in f.attrs.keys():
      print('{} => {}'.format(m, f.attrs[m]))

Remember that the data types supported by HDF5 are limited. For example, dictionaries are not supported. If you want to add a dictionary to an HDF5 file, you need to serialize it. In Python, you can serialize a dictionary in different ways. In the example below we do it with JSON, because it is very popular in many areas, but you can use whatever you want, including pickle.

import json
with h5py.File('groups_dict.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)
    metadata = {'Date': time.time(),
                'User': 'Me',
                'OS': os.name,}
    m = g.create_dataset('metadata', data=json.dumps(metadata))

The beginning is the same: we create a group and a dataset. To store the metadata, we define a new dataset, fittingly called metadata. When we define the data, we use json.dumps, which converts the dictionary into a long string. We are actually storing a string in HDF5, not a dictionary. To load it back, we need to read the dataset and convert it back to a dictionary with json.loads:

with h5py.File('groups_dict.hdf5', 'r') as f:
    metadata = json.loads(f['Base_Group/metadata'][()])
    for k in metadata:
        print('{} => {}'.format(k, metadata[k]))

When you use JSON to encode your data, you are committing to a specific format; you could equally use YAML, XML, etc. Since it may not be obvious how metadata stored this way should be loaded, you can add an attribute to the dataset recording which serialization format you used.
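
A minimal sketch of that idea, extending the metadata dataset created above (the attribute name is just a suggestion):

with h5py.File('groups_dict.hdf5', 'a') as f:
    m = f['Base_Group/metadata']
    m.attrs['serialization'] = 'json'  # record how the string should be decoded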

Final thoughts on HDF5

In many applications, text files are more than enough; they provide an easy way to store and share data with other researchers. However, as the amount of information grows, you need tools better suited than text files. One of the main advantages of the HDF5 format is that it is self-describing: the file itself contains all the information needed to read it, including the metadata that allows you to reproduce results. Moreover, the HDF5 format is supported across operating systems and programming languages.

HDF5 files are complex and allow you to store a lot of information in them. Their main advantage over databases is that they are self-contained files that can easily be shared. Databases need a whole system to manage them, they cannot easily be transferred, etc. If you are used to working with SQL, you can try the HDFql project, which allows you to use SQL to analyze data from an HDF5 file.

Storing large amounts of data in a single file potentially increases the chance of data corruption. If your file loses its integrity, for example, due to a faulty hard disk, it is difficult to predict how much data will be lost. If you save years of measurements in a single file, you expose yourself to unnecessary risks. Moreover, backups will be cumbersome because you cannot make incremental backups of a single binary file.

HDF5 is a format with a long history, used by many researchers. It takes a little time to get used to, and you will need to experiment for a while until you find a workflow in which it helps you store your data. HDF5 is a good choice if you need to establish lab-wide conventions for how data and metadata are stored.
