Sorting large amounts of data, implementation in Java
Recently on Habré there was an article Sorting a million 32-bit int'ov in 2 megabytes of memory on Python . Interested, tried to implement in Java.
Of course, 32 lines did not work. It turned out 235.
But it seemed to me that the result could be used as the first post - do not judge strictly;)
First, we got a trial version that worked with arrays of integers and stored them in a binary file. After it worked (by the way, 2.3 seconds on p4-3.2), I thought that tight binding to architecture and types is not the right way. After all, it was not for nothing that collections were created in Java that work with any type of data.
After spending 2 hours, I got quite a working class digesting any Comparable types.
Test - a class for checking health. I didn’t bother with unit tests.
FileSort - in fact, the main employee, cook and carpenter.
FileSortStorageObject - an assistant that saves a list of objects in a temporary file.
FileSortStorage - helper interface. In fact, serialization (ObjectOutputStream and ObjectInputStream) slows down the operation of the algorithm, so those who wish can implement a different saving method.
Since there is a lot of data, it makes no sense to present them as a full array. It seemed to me that you could work with them using an iterator.
1. Using the iterator, a part of the data is read, which fits into the memory (the number of records is set).
2. The read block is sorted by the same quicksort (its memory overhead is not large).
3. Next, the sorted block is added to its temporary file, where it lies for the time being.
4. As long as there is data, steps 1-3 are repeated
5. Here we have N files, each of which contains a sorted portion of the data.
6. An iterator is created that simply merges the existing files.
When merging, iterators are also used (what to do, I like them), which select the next record from each individual file. From the selected records, the minimum is selected, which returns out. In place of the record issued outside, the next object from the corresponding file is immediately calculated.
Code ... Yes, you could write better.
The algorithm ... Yes, for sure it is not correct. But it works :)
my original
Of course, 32 lines did not work. It turned out 235.
But it seemed to me that the result could be used as the first post - do not judge strictly;)
First, we got a trial version that worked with arrays of integers and stored them in a binary file. After it worked (by the way, 2.3 seconds on p4-3.2), I thought that tight binding to architecture and types is not the right way. After all, it was not for nothing that collections were created in Java that work with any type of data.
After spending 2 hours, I got quite a working class digesting any Comparable types.
What “it” consists of:
Test - a class for checking health. I didn’t bother with unit tests.
FileSort - in fact, the main employee, cook and carpenter.
FileSortStorageObject - an assistant that saves a list of objects in a temporary file.
FileSortStorage - helper interface. In fact, serialization (ObjectOutputStream and ObjectInputStream) slows down the operation of the algorithm, so those who wish can implement a different saving method.
How does it work:
Since there is a lot of data, it makes no sense to present them as a full array. It seemed to me that you could work with them using an iterator.
1. Using the iterator, a part of the data is read, which fits into the memory (the number of records is set).
2. The read block is sorted by the same quicksort (its memory overhead is not large).
3. Next, the sorted block is added to its temporary file, where it lies for the time being.
4. As long as there is data, steps 1-3 are repeated
5. Here we have N files, each of which contains a sorted portion of the data.
6. An iterator is created that simply merges the existing files.
When merging, iterators are also used (what to do, I like them), which select the next record from each individual file. From the selected records, the minimum is selected, which returns out. In place of the record issued outside, the next object from the corresponding file is immediately calculated.
So, the source. Suddenly someone wants to play around
Main class:
package ru.dchekmarev.test.FileSort;
import java.io.*;
import java.util.*;
/**
* Класс для сортировки больших объёмов данных с использованием файловой подкачки
*/
public class FileSort> implements Iterable {
private int bufferSize = 10000;
private List partFiles = new LinkedList();
private Iterator source;
private List part = new LinkedList();
/**
* Конструктор по умолчанию, ничего не делает
*/
public FileSort() {
}
/**
* Конструктор с параметром - источником
*/
public FileSort(Iterator newSource) {
setSource(newSource);
}
/**
* Конструктор с двумя параметрами - источником и количеством объектов на файл
*/
public FileSort(Iterator newSource, Integer newSize) {
this(newSource);
setBufferSize(newSize);
}
/**
* Установка количества объектов на один файл
*/
public void setBufferSize(int newSize) {
bufferSize = newSize;
}
/**
* Установка источника данных, используется итератор
*/
public void setSource(Iterator newSource) {
source = newSource;
try {
sortParts();
} catch (IOException e) {
throw new RuntimeException(e);
}
Collections.sort(part);
}
/**
* Получение результата в виде итератора
*/
public Iterator iterator() {
if (partFiles.size() == 0) {
// маленькая оптимизация, если всё уместилось в память
return part.iterator();
}
return new Iterator() {
Long t = 0L;
List items = new ArrayList();
List> iterators = new ArrayList>();
Integer minIdx = null;
// динамическая инициализация итератора, вместо конструктора
// делаем список итераторов по одному на файл, из которых будем извлекать элементы при слиянии
{
iterators.add(part.iterator());
for (FileSortStorage f : partFiles) {
iterators.add(f.iterator());
}
for (Iterator item : iterators) {
if (item.hasNext()) {
items.add(item.next());
} else {
throw new RuntimeException("failed to get first for iterator");
}
}
}
/**
* Находит среди объектов минимальный, возвращает доступность очередного объекта
*/
public boolean hasNext() {
if (minIdx == null) {
for (int i = 0; i < items.size(); i++) {
if (items.get(i) != null &&
(minIdx == null ||
items.get(i).compareTo(items.get(minIdx)) < 0)) {
minIdx = i;
}
}
}
return minIdx != null;
}
/**
* Если есть доступный объект - возвращает,
* замещая его в доступных на очередной из файла
*/
public T next() {
T res = null;
if (hasNext()) {
res = items.get(minIdx);
if (iterators.get(minIdx).hasNext()) {
items.set(minIdx, iterators.get(minIdx).next());
} else {
items.set(minIdx, null);
}
}
minIdx = null;
return res;
}
public void remove() {
throw new UnsupportedOperationException();
}
};
}
/**
* Производит чтение исходных данных с сохранением блоков во временные файлы
*/
void sortParts() throws IOException {
while (source.hasNext()) {
part.add((T)source.next());
if (part.size() >= bufferSize && source.hasNext()) {
Collections.sort(part);
partFiles.add(new FileSortStorageObject(part));
part.clear();
}
}
}
}
Interface for saving to disk:
package ru.dchekmarev.test.FileSort;
import java.util.List;
import java.io.IOException;
public interface FileSortStorage extends Iterable {
public void setObjects(List objects) throws IOException;
}
Implementation of saving serialized objects to disk:
package ru.dchekmarev.test.FileSort;
import java.io.*;
import java.util.*;
/**
* Класс сохраняет список в хранилище и предоставляет к нему доступ через итератор
*/
public class FileSortStorageObject implements FileSortStorage {
private final File file;
/**
* Конструктор, создаёт временный файл и сохраняет в него объекты
*/
public FileSortStorageObject(List objects) throws IOException {
file = File.createTempFile("FileSort", "dat");
setObjects(objects);
}
/**
* Сохраняем объекты в файл
*/
public void setObjects(List objects) throws IOException {
ObjectOutputStream wr = new ObjectOutputStream(new FileOutputStream(file));
for (T item : objects) {
wr.writeObject(item);
}
wr.close();
}
/**
* Итератор по файлу-хранилищу объектов
*/
public Iterator iterator() {
try {
return new Iterator() {
private ObjectInputStream fr =
new ObjectInputStream(new FileInputStream(file));
T obj;
public boolean hasNext() {
if (obj == null) {
try {
obj = (T)fr.readObject();
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
} catch (IOException e) {
obj = null;
}
}
return obj != null;
}
public T next() {
hasNext();
T res = obj;
obj = null;
return res;
}
public void remove() {
throw new UnsupportedOperationException();
}
};
} catch (IOException e) {
throw new RuntimeException(e);
}
}
/**
* Зачищаем
*/
protected void finalize() {
file.delete();
}
}
And the test class, you always want to see a visual result:
package ru.dchekmarev.test.FileSort;
import java.util.*;
public class Test {
public static void main(String[] args) {
System.out.println("Test start");
// создаём класс-сортировщик
FileSort sort = new FileSort(
// в конструктор передаём итератор - источник данных
// у нас он просто генерирует случайные числа
new Iterator() {
private int i = 0;
private Random rand = new Random();
public boolean hasNext() {
if (i >= 1000) {
System.out.println("generator finish");
}
return i < 1000;
}
public Double next() {
i++;
return rand.nextDouble();
}
public void remove() {
throw new UnsupportedOperationException();
}
});
int i = 0;
// выводим отсортированный результат
for (Double res : sort) {
if (++i % 10000 == 0) {
// когда результатов много имеет смысл их вывод отключить
// и просто считать количество
System.out.println(i);
}
System.out.println(" == " + res);
}
}
}
Code ... Yes, you could write better.
The algorithm ... Yes, for sure it is not correct. But it works :)
my original