Optimal file sorting for a solid (continuous) archive
This is an implementation of a simple idea: arrange the files so that the resulting archive is as small as possible.
The program checks the compressibility of the files in pairs and then sorts the list before it is handed to the archiver for compression.
sourceforge.net/projects/saro-vks
If anyone needs it, feel free to take it.
The idea itself is not new: WinRAR 5, for example, can find duplicate files and store them in the archive as links.
A similar principle is used here: the “similarity” of files is determined by a compression check.
The best results are obtained with large files, when the combined size of a pair of files exceeds the archiver's dictionary size.
For example, 1.5 GB of executable files of roughly 50 MB each shrank from 709 to 696 MB, and an archive of 25 MB of MIDI files, 30–300 kB each, went from 5.55 to 5.51 MB, compared with the default file order.
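To make the pairwise check concrete, here is a minimal sketch of the idea in Python (my own illustration, not the program's actual code; zlib merely stands in for whatever archiver is used): each file is compressed separately and then together with the other, and the bytes saved by compressing them together are taken as the measure of similarity.

    import zlib

    def compressed_size(data: bytes) -> int:
        # zlib is only a stand-in here for the real archiver's compressor
        return len(zlib.compress(data, 9))

    def pair_gain(a: bytes, b: bytes) -> int:
        """Bytes saved by compressing the two files together instead of separately.
        A large gain means the files share data and should sit next to each other
        in the solid archive; unrelated files give a gain near zero."""
        separate = compressed_size(a) + compressed_size(b)
        together = compressed_size(a + b)
        return separate - together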
A couple of pictures illustrating the algorithm, for clarity.
First, the compressibility of the pairs is checked and an N × N matrix of files is filled (in a rhombus pattern: if the last files are removed from the list, the results already obtained for the earlier files are kept).
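A hedged sketch of this step (again my own Python illustration, reusing compressed_size from the sketch above): the matrix simply stores the compressed size of every pair, and since an entry for files i and j never depends on files later in the list, truncating the list leaves the entries already computed for the earlier files valid.

    def pair_size_matrix(files: list[bytes]) -> list[list[int]]:
        """m[i][j] = compressed size of files i and j concatenated.
        No entry depends on files later in the list, so dropping the last
        files keeps the already computed part of the matrix reusable."""
        n = len(files)
        m = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                m[i][j] = compressed_size(files[i] + files[j])
        return m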
Then the sequence is sorted with a “window”: within the window, all possible orderings of the files are tried without repetition (N! of them), and the ordering that gives the minimum size is kept.
The window gradually shifts along the whole list while the size keeps decreasing.
When no smaller size can be found anywhere, the sorting is complete.
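Roughly, the window search could look like the sketch below (a simplified illustration under my own assumptions, not the program's actual code: the total archive size is approximated by the sum of compressed-pair sizes of adjacent files, and the window width is a small constant so that trying window! permutations stays cheap).

    from itertools import permutations

    def total_cost(order: list[int], m: list[list[int]]) -> int:
        # Approximate archive size: sum of compressed-pair sizes of adjacent files
        return sum(m[order[k]][order[k + 1]] for k in range(len(order) - 1))

    def window_sort(order: list[int], m: list[list[int]], window: int = 4) -> list[int]:
        """Slide a window over the current order; at each position try every
        permutation of the files inside the window and keep any ordering that
        lowers the total. Repeat full passes until no permutation anywhere helps."""
        best_total = total_cost(order, m)
        improved = True
        while improved:
            improved = False
            for start in range(len(order) - window + 1):
                head, mid, tail = order[:start], order[start:start + window], order[start + window:]
                for perm in permutations(mid):
                    candidate = head + list(perm) + tail
                    cost = total_cost(candidate, m)
                    if cost < best_total:   # accept only strictly better orderings
                        order, best_total = candidate, cost
                        improved = True
        return order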
P.S. I apologize for the rough state of this version, but polishing it is going slowly and it is unclear when the final version will be ready; it is already usable, though.
For now there is one significant drawback: the size of archives with file pairs is limited to 4 GB (very large files were not needed, and handling 64-bit sizes would take longer).
If something is missing, write to me and I will add it if I can.