Speeding up Android code

    I will continue the work begun in my previous article on the optimization of the algorithm. I’ll briefly tell you what has been done. Ready java sources and a cascade model of one of the Viola-Jones algorithm implementations were taken. This algorithm is used to search for objects in a photograph, in particular to search for faces. Testing was carried out on my phone, according to the results, it was found that the original java code worked for 54 seconds in a 300 by 400 photo. It was too slow, the code I rewrote in C ++ showed a result of 14 seconds. In the comments, it was suggested to catch up with the java implementation to C ++ as follows: profile and find bottlenecks, and replace the two-dimensional array with a one-dimensional one. Also, I had plans to parallelize the algorithm in C ++. Everything has been done and researched, the results are below.

    We profile java release. We take statistics at the time of operation of the main part of the algorithm. We see such a picture (the result in the file is of interest to anyone):

    List methods consume a lot of time. In this situation, all the List's can be easily replaced with arrays, which is used in C ++. We replace and get the following result ( in the file ):

    pay attention to Math.sqrt, the percentage of its execution has risen from 6.7 to 13.5, which means that others have reduced the time consumption, and the launch on the phone showed 38 seconds. The rest of the time is spent on primitive methods: multiplying, dividing, and getting an array element and Math.sqrt. Accordingly, I no longer see where and what can be changed, because C ++ repeats the code exactly.

    One-dimensional arrays instead of two-dimensional ones . Change the following code:
    			int[][] grayImage=new int[width][height]; 
    			int[][] squares=new int[width][height]; 
    

    on the:
    			int[] grayImage=new int[width * height]; 
    			int[] squares=new int[width * height]; 
    

    and in appropriate places of use:
    		int total_x = grayImage[(i + w) * height + j + h] + grayImage[i * height + j] - grayImage[i * height + j + h] - grayImage[(i + w) * height + j];
    		int total_x2 = squares[(i + w) * height + j + h] + squares[i * height + j] - squares[i * height + j + h] - squares[(i + w) * height + j];
    ...
    			rect_sum += (int) ((grayImage[rx2 * height + ry2] - grayImage[rx1 * height + ry2] - grayImage[rx2 * height + ry1] + grayImage[rx1 * height + ry1]) * r.weight);
    		}
    

    the result is exactly the same. Let’s try to reduce the number of multiplication operations to calculate an array element, i.e. let's do this:
    		int iwh = (i + w) * height;
    		int ih = i * height;
    		int total_x = grayImage[iwh + j + h] + grayImage[ih + j]	- grayImage[ih + j + h] - grayImage[iwh + j];
    		int total_x2 = squares[iwh + j + h] + squares[ih + j] - squares[ih + j + h] - squares[iwh + j];
    ...
    			int rx2h = rx2 * height;
    			int rx1h = rx1 * height;
    			rect_sum += (int) ((grayImage[rx2h + ry2] - grayImage[rx1h + ry2] - grayImage[rx2h + ry1] + grayImage[rx1h + ry1]) * r.weight);
    

    the result has not changed. The one-dimensional array did not give pluses, only minuses in code readability.
    And so, with Java we ended up with 38 versus 14 seconds in C ++.

    Parallelization. Let's try to parallelize the algorithm written in C ++ on a GT-I9300I phone with a quad-core processor. On Android, posix threads can be used, as it is written in the documentation, it is slightly truncated, but for us it is enough to create threads, pass parameters to them and wait for their execution. All this is done simply:
    #include 
    ...
    extern "C" {
    	struct ThreadArgument {
    		void *array;
    		int threadNum;
    	};
    	void *worker_thread(void *arg) {
    		ThreadArgument *arg1 = (ThreadArgument*) arg;
    		((Detector*) (arg1->array))->workerTansient(arg1->threadNum);
    		pthread_exit(NULL);
    	}
    }
    ...
    	pthread_t m_pt[threadsNum - 1];
    	if (threadsNum > 1) {
    		res2 = new VectorRects*[threadsNum - 1];
    		for (int l = 0; l < threadsNum - 1; l++) {
    			ThreadArgument* args = new ThreadArgument;
    			args->array = (void*)this;
    			args->threadNum = l + 1;
    			int success = pthread_create(&m_pt[l], NULL, worker_thread, args);
    			if (success == 0) {
    				__android_log_print(ANDROID_LOG_INFO, "Detector", "thread %d started", l);
    			}
    		}
    	}
    	__android_log_print(ANDROID_LOG_INFO, "Detector", "gettFaces3 baseScale=%f maxScale=%f scale_inc=%f", baseScale, maxScale, scale_inc);
    	ret = getResult(0);
    	if (threadsNum > 1) {
    		for (int l = 0; l < threadsNum - 1; l++) {
    			int success = pthread_join(m_pt[l], NULL);
    			for (int b = 0; b < res2[l]->currIndex; b++) {
    				ret->addRect(res2[l]->rects[b]);
    			}
    		}
    	}
    ...
    VectorRects* Detector::workerTansient(int threadNum)
    {
    	__android_log_print(ANDROID_LOG_INFO, "Detector", "workerTansient thread running %d", threadNum);
    	res2[threadNum - 1] = getResult(threadNum);
    	return NULL;
    }
    ...
    

    We get the following result: 14 seconds for one thread, 8 seconds for two threads, 5 seconds for three, and 4 seconds for four or more threads. On a Nexus S virtual device with a single-core processor, parallelization did not lead to anything.

    Sources of the obtained algorithm can be downloaded here . Use as follows:
        Detector detector = Detector.create(inputHaas);
        List res = detector.getFaces(image, 1.2f, 1.1f, .05f, 2, true, useCpp, threadsNum);
    

    where inputHaas is the model stream, i.e. file haarcascade_frontalface_default.xml from the original algorithm, useCpp - use C ++ or not, threadsNum - the number of threads for C ++ code. Detector is assumed to be triggered once.

    Total Replacing the LinkedList used in the iteration with arrays as a whole gave a 1.5 times increase for the current algorithm, but it still lags 2.5 times behind the implementation in C ++. Translation of a two-dimensional array into a one-dimensional array showed no changes. Using parallelization in C ++, we managed to speed up the process by 3 times on a quad-core processor, which is already much better, but far from ideal. You can look towards RenderScript - a mechanism for complex calculations at the native level, maybe it will give any benefit.

    Also popular now: