DataTable Memory Optimization

I think many people are familiar with the DataTable class. Pulling large tables from the database to the client via ADO.NET, you sometimes have to keep instances of this class alive for a long time — for example, when the retrieved data needs to be processed and analyzed without resorting to ORM materialization (yes, it would be better to do this in the database, but not everything can always be pushed there). When the amount of data is small, this is not much of a problem. However, with wide tables and a large number of rows you can end up with quite fat objects in memory. Comparing the amount of data coming from the database with the size of the resulting DataTable, we can draw two conclusions:

  • With large amounts of varchar data containing duplicates, you can try to organize a kind of string interning inside the DataTable;
  • DataTable carries a rather heavy internal infrastructure. By varying the data types and the number of rows, it was possible to establish that the overhead is 15–20% for tables of 100 thousand records and up (see the measurement sketch below). Most of this infrastructure supports editing and other table functionality. When all you need is a simple container for data received from the database, you can write a lightweight, gutted analogue of this class.
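
For reference, here is a minimal sketch of how such overhead can be estimated. The numbers depend on the runtime, bitness and data, so treat this as an illustration rather than a benchmark; the column names are hypothetical:

        using System;
        using System.Data;

        static class FootprintDemo
        {
            static void Main()
            {
                long before = GC.GetTotalMemory(true); // force a full collection first

                var table = new DataTable();
                table.Columns.Add("Id", typeof(int));
                table.Columns.Add("Name", typeof(string));
                for (int i = 0; i < 100000; ++i)
                    table.Rows.Add(i, "client #" + (i % 1000)); // lots of duplicate strings

                long after = GC.GetTotalMemory(true);
                Console.WriteLine("DataTable footprint: ~{0} KB", (after - before) / 1024);
                GC.KeepAlive(table);
            }
        }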


The question of replacing the DataTable with an internal structure will be considered in the next article, if there is interest. For now, let's consider the first item. As you know, string interning consists in eliminating duplicate strings in memory (more details can be found here). We will not use the built-in interning mechanism, so that the strings do not hang around in the process memory after we no longer need them.
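
A small sketch of why the built-in mechanism does not suit us (string.Intern is the real BCL method; the comments describe its documented behavior):

        // Strings passed to string.Intern end up in the process-wide intern pool.
        string s = new string(new[] { 'a', 'b', 'c' });
        string pooled = string.Intern(s); // "abc" is now in the pool

        // Even when s and pooled are no longer referenced, the pooled "abc"
        // effectively lives until the process exits - exactly what we want
        // to avoid for table data with a limited lifetime.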

The idea is to walk all the varchar columns of the table and, in each column, replace every duplicate with a reference from a temporary dictionary in which each distinct string is stored exactly once. The state of the heap before and after:

(figure: heap state before and after interning — duplicate strings collapse into single instances)
It is worth noting that the data in a DataTable is stored by column, not by row, as you might expect. This keeps memory costs lower: all the values in a column have the same type, so they can be stored in ordinary arrays with constant-time access by index. For speed, we will read directly from these arrays through reflection (FieldInfo).

        // Private fields used to reduce the table's memory consumption and to access its data quickly
        private static readonly FieldInfo StorageField = typeof (DataColumn).GetField("_storage",
            BindingFlags.Instance | BindingFlags.NonPublic);
        private static readonly FieldInfo ValuesField =
            typeof (DataTable).Assembly.GetType("System.Data.Common.StringStorage")
                .GetField("values", BindingFlags.Instance | BindingFlags.NonPublic);

As mentioned above, we will use a Dictionary to keep a single instance of each string in memory.

        var exclusiveStrings = new Dictionary<string, string>();

The key is needed only for the hash lookup; the value holds the reference to the single instance in memory. It is worth noting that Dictionary resolves hash collisions using chaining (buckets).
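
A minimal standalone sketch of the idea: the key is only used to find the entry by hash and equality, while the value keeps the one instance we preserve:

        var exclusiveStrings = new Dictionary<string, string>();
        string first = new string('a', 3);  // "aaa", instance #1
        string second = new string('a', 3); // "aaa", equal but a distinct instance

        exclusiveStrings[first] = first;    // remember the first instance seen

        // Looking up by the duplicate returns the original instance:
        string canonical = exclusiveStrings[second];
        Console.WriteLine(ReferenceEquals(canonical, first));  // True
        Console.WriteLine(ReferenceEquals(canonical, second)); // False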

Full method code:

        /// <summary>
        /// Optimizes the table's memory usage; essentially interns strings within the table.
        /// </summary>
        /// <param name="table">The table.</param>
        /// <returns>A reference to the table itself.</returns>
        public static DataTable Compact(this DataTable table)
        {
            if (table.Rows.Count == 0)
                return table;
            var exclusiveStrings = new Dictionary<string, string>();
            for (int column = 0; column < table.Columns.Count; ++column)
            {
                if (table.Columns[column].DataType == typeof(string))
                {
                    // Direct access to the array (columnar storage)
                    var values = (string[])ValuesField.GetValue(StorageField.GetValue(table.Columns[column]));
                    int rowCount = table.Rows.Count;
                    for (int row = 0; row < rowCount; ++row)
                    {
                        var value = values[row];
                        if (value != null)
                        {
                            string hashed;
                            if (!exclusiveStrings.TryGetValue(value, out hashed))
                                // the string is seen for the first time: remember it
                                exclusiveStrings.Add(value, value);
                            else
                                // a duplicate: point the row at the single instance
                                values[row] = hashed;
                        }
                    }
                    exclusiveStrings.Clear();
                }
            }
            return table;
        }
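
Usage is straightforward. A hedged example (the connection string and query are placeholders; SqlConnection and SqlDataAdapter come from System.Data.SqlClient):

        const string connectionString = "..."; // placeholder
        var table = new DataTable();
        using (var connection = new SqlConnection(connectionString))
        using (var adapter = new SqlDataAdapter("SELECT * FROM Clients", connection))
        {
            adapter.Fill(table); // the adapter opens and closes the connection itself
        }
        table.Compact(); // returns the same instance, so calls can be chained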

Let us estimate the complexity of the resulting algorithm for a single column. The complexity here consists of two main parts: traversing all n values and looking up the Dictionary by key. The first part is clear; the lookup is a little more interesting. In the fantastically bad case where all n values collide and land in one bucket, the search degrades to linear. But, given the implementation of the hash function for strings in .NET, this thought can cause nothing but a smile. The Dictionary developers evidently decided likewise: the documentation for TryGetValue claims O(1) complexity. Thus, within one column, the expected complexity comes to O(n); for the whole table it is O(mn), where m is the number of string columns.

Consider the speed on real data:

(figure: running time of Compact versus the number of rows — the dependence is linear)
As you can see, the running time grows linearly, as the analysis of the algorithm predicts. As for memory costs, at every step the algorithm either adds a new reference to the dictionary (first occurrence) or drops the row's reference to a duplicate, allowing that string to be collected.

It is easy to see that the efficiency of the algorithm depends on the input data: the more duplicates, the more memory is freed. In my tasks, memory consumption was reduced by 30–50%, but, I repeat, your situation may be different.
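
To check the effect on your own data, you can take a rough before/after measurement (a sketch; GC.GetTotalMemory gives only an approximate figure):

        long before = GC.GetTotalMemory(true);
        table.Compact();
        long after = GC.GetTotalMemory(true);
        Console.WriteLine("Freed: ~{0} MB", (before - after) / (1024 * 1024));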

I hope someone finds this solution as useful as I did.
