
Optimization of ToArray and ToList methods by providing the number of elements
- Transfer
- Tutorial
The ToArray and ToList extension methods are a convenient way to quickly convert an enumerated sequence (for example, a Linq query) into an array or list. However, there is something that bothers me: both of these methods are very inefficient if they don’t know the number of elements in the sequence (which almost always happens when you use them in a Linq query). Let's first look at the ToArray method ( ToList has a few differences, but the principle is almost the same). Basically
, ToArray takes a sequence and returns an array that contains all the elements of the original sequence. If the sequence implements the ICollection interface , ToArray uses the Count property to place an array of the correct size in memory and copies the elements of the sequence into it. For example, we convert a list of users into an array:
In this case, ToArray is pretty efficient. Let's change the code so that only their names are extracted from the list of users and converted to an array:
Now, the argument for ToArray is the IEnumerable type returned by the Select method . IEnumerable does not implement ICollection interface , therefore, ToArray does not know the number of elements and cannot place an array of a suitable size in memory. He does the following:
If there are only a few elements in the source, then this algorithm runs quite painlessly, but for a very large number of elements it becomes quite inefficient due to the large number of memory allocations and copying elements.
It is annoying that in most cases we know the initial number of elements! In the example above, we use Select , which does not change the number of elements in the sequence, and we know that the same number remains. But ToArray does not know about it, because for him this information is not. Would we have a way to give him this on our own ...
Well, actually, this is quite easy to do: you just need to write your extension method, which takes the number of elements as a parameter. It might look like this:
Note from the translator: PsyHaSTe suggested an alternative method without explicitly specifying count:
Now we can optimize our example in this way:
Note that if you pass fewer elements than you actually get, you will get an IndexOutOfRangeException . You must provide the method with the correct number.
So what performance do we get now? In my tests, this version of ToArray is about twice as fast as the standard version for large sequences (tests ran 1,000,000 items). Pretty good!
Note that we can improve the ToList method in a similar way using the constructor of the List class. , which allows you to set the initial capacity of the list:
In this case, the performance increase is not as large as in the case of ToArray (approximately 25% instead of 50%), probably due to the fact that the list does not need to be truncated.
Obviously, the same optimization can be applied to the ToDictionary method , since the Dictionary class also provides a constructor indicating the initial capacity of the dictionary.
Improved versions of the ToArray and ToList methods are available in my Linq.Extras library, which also provides many other useful methods for working with sequences and with collections.
, ToArray takes a sequence and returns an array that contains all the elements of the original sequence. If the sequence implements the ICollection interface
List users = GetUsers();
User[] array = users.ToArray();
In this case, ToArray is pretty efficient. Let's change the code so that only their names are extracted from the list of users and converted to an array:
List users = GetUsers();
string[] array = users.Select(u => u.Name).ToArray();
Now, the argument for ToArray is the IEnumerable type
- First, a small array is placed in memory (4 elements in the current implementation )
- Elements are copied to the array until it is full.
- If there are no more elements in the source, go to step 7
- Otherwise, a new array is placed in memory, two times larger than the previous one
- Elements from the previous array are copied to the new
- Go to step 2 and everything repeats
- If as a result the array is larger than the number of elements, it is truncated: a new array with the correct size is placed in memory and elements are copied into it
- The resulting array is returned by the method
If there are only a few elements in the source, then this algorithm runs quite painlessly, but for a very large number of elements it becomes quite inefficient due to the large number of memory allocations and copying elements.
It is annoying that in most cases we know the initial number of elements! In the example above, we use Select , which does not change the number of elements in the sequence, and we know that the same number remains. But ToArray does not know about it, because for him this information is not. Would we have a way to give him this on our own ...
Well, actually, this is quite easy to do: you just need to write your extension method, which takes the number of elements as a parameter. It might look like this:
public static TSource[] ToArray(this IEnumerable source, int count)
{
if (source == null) throw new ArgumentNullException("source");
if (count < 0) throw new ArgumentOutOfRangeException("count");
var array = new TSource[count];
int i = 0;
foreach (var item in source)
{
array[i++] = item;
}
//в комментариях хабра RouR предложил добавить проверку на случай, если source.Count < count, чтобы
//не получить массив бОльшего размера
if(i != count) throw new ArgumentOutOfRangeException();
return array;
}
Note from the translator: PsyHaSTe suggested an alternative method without explicitly specifying count:
public static TResult[] SelectToArray(this ICollection source, Func func)
{
if (source == null)
throw new ArgumentNullException("source");
TResult[] result = new TResult[source.Count];
int i = 0;
foreach (var value in source)
{
result[i++] = func(value);
}
return result;
}
Now we can optimize our example in this way:
List users = GetUsers();
string[] array = users.Select(u => u.Name).ToArray(users.Count);
Note that if you pass fewer elements than you actually get, you will get an IndexOutOfRangeException . You must provide the method with the correct number.
So what performance do we get now? In my tests, this version of ToArray is about twice as fast as the standard version for large sequences (tests ran 1,000,000 items). Pretty good!
Note that we can improve the ToList method in a similar way using the constructor of the List class.
public static List ToList(this IEnumerable source, int count)
{
if (source == null) throw new ArgumentNullException("source");
if (count < 0) throw new ArgumentOutOfRangeException("count");
var list = new List(count);
foreach (var item in source)
{
list.Add(item);
}
return list;
}
In this case, the performance increase is not as large as in the case of ToArray (approximately 25% instead of 50%), probably due to the fact that the list does not need to be truncated.
Obviously, the same optimization can be applied to the ToDictionary method , since the Dictionary class
Improved versions of the ToArray and ToList methods are available in my Linq.Extras library, which also provides many other useful methods for working with sequences and with collections.