The story of one optimization



    Abstract


    This article walks through high-level optimizations of computational algorithms in Java, using the cubic matrix multiplication algorithm as an example.

    Step 0. Set a reference point!


    Let's define the environment:
    • Hardware: 1 socket / 2 cores, Intel Core 2 Duo T7300 @ 2 GHz, 2 GB RAM;
    • Java: HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing).


    Consider the simplest cubic matrix multiplication algorithm; probably every student, and even many schoolchildren, know it. The principle of the algorithm is also nicely illustrated by the Wikipedia picture at the beginning of the article, so we will not dwell on its details.

    // A and B are the operand matrices and C is the result; X.rows()/X.columns()
    // give the dimensions, while X[i][j] stands for access to the backing double[][].
    for (int i = 0; i < A.rows(); i++) {
    	for (int j = 0; j < A.columns(); j++) {
    		double summand = 0.0;
    		for (int k = 0; k < B.columns(); k++) {
    			summand += A[i][k] * B[k][j];
    		}
    		C[i][j] = summand;
    	}
    }
    


    As a workload we will take two dense square double-precision matrices of size N × N, where N = 1000.

    In this setup the simple cubic algorithm runs in 18.921 s. We will take this number as the baseline for the subsequent optimizations.
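    The article does not show the measurement harness itself, so below is a minimal sketch of how such a baseline could be taken: a plain double[][] version of the cubic loop above, timed with System.nanoTime(). The class name and the random fill are illustrative assumptions, not the author's actual benchmark.

    import java.util.Random;

    public class Baseline {
    	static final int N = 1000;

    	public static void main(String[] args) {
    		// Fill two random N x N matrices (illustrative workload, as in the article).
    		Random rnd = new Random();
    		double[][] a = new double[N][N], b = new double[N][N], c = new double[N][N];
    		for (int i = 0; i < N; i++) {
    			for (int j = 0; j < N; j++) {
    				a[i][j] = rnd.nextDouble();
    				b[i][j] = rnd.nextDouble();
    			}
    		}

    		// Time the simple cubic multiplication and report wall-clock seconds.
    		long start = System.nanoTime();
    		for (int i = 0; i < N; i++) {
    			for (int j = 0; j < N; j++) {
    				double summand = 0.0;
    				for (int k = 0; k < N; k++) {
    					summand += a[i][k] * b[k][j];
    				}
    				c[i][j] = summand;
    			}
    		}
    		System.out.printf("multiply took %.3f s (c[0][0] = %f)%n",
    				(System.nanoTime() - start) / 1e9, c[0][0]);
    	}
    }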

    Step 1. Know the 20/80 rule!


    It is well known that 80% of the time is spent in 20% of the code. The developer's task is to identify those 20%. In our case everything is obvious: the hot 20% are concentrated in the inner computation loop. From the JVM's perspective, this code is the most likely candidate for just-in-time compilation. You can check whether the bytecode is compiled to native code using the options -XX:+PrintCompilation and -XX:+CITime.

    $ java -XX:+PrintCompilation la4j.Demo
      1       java.lang.String::hashCode (64 bytes)
      2       java.lang.String::charAt (33 bytes)
      3       java.lang.String::indexOf (151 bytes)
      4       java.lang.String::indexOf (166 bytes)
      5       java.util.Random::next (47 bytes)
      6       java.util.concurrent.atomic.AtomicLong::get (5 bytes)
      7       java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
      8       java.util.Random::nextDouble (24 bytes)
      9       la4j.matrix.DenseMatrix::set (10 bytes)
      1%      la4j.matrix.MatrixFactory::createRandomDenseMatrix @ 30 (68 bytes)
      10      la4j.matrix.MatrixFactory::createRandomDenseMatrix (68 bytes)
      2%      la4j.matrix.DenseMatrix::multiply @ 81 (152 bytes)
    


    The listing shows that the matrix multiplication method (la4j.matrix.DenseMatrix::multiply) was indeed compiled. This gives us some room for further action. First, let's try to take advantage of final local variables: when the method is compiled to native code, they can be treated as constants and substituted directly. In theory, replacing the matrix bound calls with final values should eliminate on the order of 1000 × 1000 × 1000 repeated look-ups. Let's check this theory.

    final int aColumns = A.columns();
    final int aRows = A.rows();
    final int bColumns = B.columns();
    final int bRows = B.rows();
    for (int i = 0; i < aRows; i++) {
    	for (int j = 0; j < aColumns; j++) {
    		double summand = 0.0;
    		for (int k = 0; k < bColumns; k++) {
    			summand += A[i][k] * B[k][j];
    		}
    		C[i][j] = summand;
    	}
    }
    

    Algorithm runtime: 16.996 s;
    Performance increase: ~11%.

    Step 2. Remember the cache!


    In fact, such an implementation will always be slow. The iteration over k walks the elements of matrix B by columns. Such reads are very expensive: on every access the processor has to fetch data from main memory instead of taking it ready-made from the cache.

    It is worth noting that Fortran does not have this problem: its multidimensional arrays are stored column by column, so the cache is used as intended and the straightforward implementation of the cubic algorithm runs many times faster.

    On this platform the first-level (L1) cache line is 64 bytes, and this is the cheapest memory the processor has. In our case those 64 bytes of almost free memory can hold as many as 8 matrix cells (sizeof(double) = 8), so a single fetch from main memory can bring in eight useful values instead of one.

    The problem with the algorithm is that it barely uses this property: the strided (column-wise) access pattern effectively throws away the rest of the cache line and triggers a fresh line load on almost every access.
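    To see this effect in isolation, one can time a row-wise and a column-wise sweep over the same array. The sketch below is not from the article, just an illustration of the cache-line argument; the exact slowdown depends on the hardware.

    // Illustrative micro-benchmark: sum the same N x N array row by row and
    // column by column. The column-wise pass touches a new cache line on
    // almost every access and is typically several times slower.
    public class TraversalDemo {
    	public static void main(String[] args) {
    		final int n = 1000;
    		double[][] m = new double[n][n];

    		long t0 = System.nanoTime();
    		double rowSum = 0.0;
    		for (int i = 0; i < n; i++) {
    			for (int j = 0; j < n; j++) {
    				rowSum += m[i][j];	// sequential access, cache-friendly
    			}
    		}

    		long t1 = System.nanoTime();
    		double colSum = 0.0;
    		for (int j = 0; j < n; j++) {
    			for (int i = 0; i < n; i++) {
    				colSum += m[i][j];	// strided access, poor cache reuse
    			}
    		}

    		long t2 = System.nanoTime();
    		System.out.printf("rows: %.2f ms, columns: %.2f ms (sums %.1f / %.1f)%n",
    				(t1 - t0) / 1e6, (t2 - t1) / 1e6, rowSum, colSum);
    	}
    }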

    Let's fix this and access the matrix elements sequentially in order to get the maximum benefit from the cache. To do that, we simply transpose matrix B and then read its elements row by row.

    final int aColumns = A.columns();
    final int aRows = A.rows();
    final int bColumns = B.columns();
    final int bRows = B.rows();
    double BT[][] = new double[bColumns][bRows];
    for (int i = 0; i < bRows; i++) {
    	for (int j = 0; j < bColumns; j++) {
    		BT[j][i] = B[i][j];
    	}
    }
    for (int i = 0; i < aRows; i++) {
    	for (int j = 0; j < aColumns; j++) {
    		double summand = 0.0;
    		for (int k = 0; k < bColumns; k++) {
    			summand += A[i][k] * BT[j][k];
    		}
    		C[i][j] = summand;
    	}
    }
    

    Algorithm runtime: 7.334 s;
    Performance increase: ~232%.

    Step 3. Think!


    The code got longer, but the running time dropped by almost 2.5 times. Not bad. Still, looking at the code, some flaws stand out, the main one being the extra loops. Let's try to reduce the amount of code and make it more elegant by merging the transposition and the multiplication into a single computational pass. The transposition will then be done not for the whole matrix at once but one column at a time, as needed.

    final int aColumns = A.columns();
    final int aRows = A.rows();
    final int bColumns = B.columns();
    final int bRows = B.rows();
    double thatColumn[] = new double[bRows];
    for (int j = 0; j < bColumns; j++) {
    	for (int k = 0; k < aColumns; k++) {
    		thatColumn[k] = B[k][j];
    	}
    	for (int i = 0; i < aRows; i++) {
    		double thisRow[] = A[i];
    		double summand = 0;
    		for (int k = 0; k < aColumns; k++) {
    			summand += thisRow[k] * thatColumn[k];
    		}
    		C[i][j] = summand;
    	}
    }
    

    Algorithm runtime: 3.976 s;
    Performance increase: ~190%.

    Step 4. Throw away!


    Let's take advantage of another Java feature: exceptions. We will try to replace the loop's bound check on the matrix with a try {} catch {} block. In our case this removes about a thousand comparisons from the outer loop: why evaluate a condition that returns true a thousand times and false only on the 1001st check?

    On the one hand we reduce the number of comparisons; on the other, there is extra overhead for handling the exception. One way or another, this approach gives a modest gain.

    final int aColumns = A.columns();
    final int aRows = A.rows();
    final int bColumns = B.columns();
    final int bRows = B.rows();
    double thatColumn[] = new double[bRows];
    try {
    	for (int j = 0; ; j++) {
    		for (int k = 0; k < aColumns; k++) {
    			thatColumn[k] = B[k][j];
    		}
    		for (int i = 0; i < aRows; i++) {
    			double thisRow[] = A[i];
    			double summand = 0;
    			for (int k = 0; k < aColumns; k++) {
    				summand += thisRow[k] * thatColumn[k];
    			}
    			C[i][j] = summand;
    		}
    	}
    } catch (IndexOutOfBoundsException ignored) { }
    

    Algorithm runtime: 3.594 s;
    Performance increase: ~10%.

    Step 5. Do not stop!


    At this stage I have stopped for now, because I have reached my goal. That goal was to overtake the most popular matrix library, JAMA (the first result in Google for "java matrix"). My implementation now runs about 30% faster than JAMA. Moreover, these relatively small optimizations sped up the code by almost 600% relative to the initial version.
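    For reference, the JAMA side of such a comparison could be timed roughly as in the sketch below. The author's actual benchmark is not shown in the article; this harness is only an assumption built on JAMA's public Matrix.random() and times() methods.

    // Rough sketch of timing JAMA's dense multiplication (requires Jama.jar
    // on the classpath); not the author's actual benchmark.
    import Jama.Matrix;

    public class JamaComparison {
    	public static void main(String[] args) {
    		final int n = 1000;
    		// Two random dense 1000 x 1000 matrices, as in the article's workload.
    		Matrix a = Matrix.random(n, n);
    		Matrix b = Matrix.random(n, n);

    		long start = System.nanoTime();
    		Matrix c = a.times(b);	// JAMA's dense matrix multiplication
    		System.out.printf("JAMA multiply: %.3f s (c[0][0] = %f)%n",
    				(System.nanoTime() - start) / 1e9, c.get(0, 0));
    	}
    }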

    Instead of a conclusion (this is not an advertisement)


    This code is part of the open-source la4j project, a library for solving linear algebra problems in Java. I started working on the project in my fourth year at university, while taking a course in computational mathematics. Today la4j shows excellent performance and is several times faster than its counterparts on many tasks.

    If anyone is interested and would like to spend their evenings this way, drop me a private message :)
