
The eternal question of time measurement
It seemed that the long forum discussions about how to measure an algorithm's running time, which functions to use, and what accuracy to expect were finally over. Unfortunately, I have to return to this question once again. On the agenda today is how best to measure the speed of a parallel algorithm.
I should say right away that I'm not going to offer an exact recipe. I ran into the problem of measuring the speed of parallel algorithms only recently myself, and I'm no expert in the matter. This note is more of an exploration. I'll be glad to hear your opinions and recommendations; I think together we can work through the problem and arrive at a good solution.
The task is to measure the running time of a section of user code. Previously I used a class of the following form for this:
class Timing {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return double(m_userTime) / 10000000.0;
  }
private:
  __int64 GetUserTime() const;
  __int64 m_userTime;
};

__int64 Timing::GetUserTime() const {
  FILETIME creationTime;
  FILETIME exitTime;
  FILETIME kernelTime;
  FILETIME userTime;
  GetThreadTimes(GetCurrentThread(),
                 &creationTime, &exitTime,
                 &kernelTime, &userTime);
  __int64 curTime;
  curTime = userTime.dwHighDateTime;
  curTime <<= 32;
  curTime += userTime.dwLowDateTime;
  return curTime;
}

void Timing::StartTiming() {
  m_userTime = GetUserTime();
}

void Timing::StopTiming() {
  __int64 curUserTime = GetUserTime();
  m_userTime = curUserTime - m_userTime;
}
This class is built on the GetThreadTimes function, which makes it possible to separate the time spent executing user code from the time spent in system functions. The class is meant to estimate a thread's running time in user mode, so only the lpUserTime output parameter is used.
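If you ever also need the time spent in system calls, GetThreadTimes returns it through lpKernelTime, and the same bit-assembly applies. A minimal sketch (the helper name GetKernelTime is mine, not part of the class above):

__int64 GetKernelTime()
{
  FILETIME creationTime, exitTime, kernelTime, userTime;
  GetThreadTimes(GetCurrentThread(),
                 &creationTime, &exitTime,
                 &kernelTime, &userTime);
  // Like lpUserTime, the value is in 100-nanosecond intervals.
  __int64 curTime = kernelTime.dwHighDateTime;
  curTime <<= 32;
  curTime += kernelTime.dwLowDateTime;
  return curTime;
}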
You can measure running time with tools such as Intel Parallel Amplifier, but it is often convenient to measure time from inside the program itself; for example, this lets you write the values to a log. So let's still try to build a class for measuring speed.
Consider a code example that computes a certain number. We use the Timing class to measure the running time.
int _tmain(int, _TCHAR *[])
{
  Timing t;
  t.StartTiming();
  unsigned sum = 0;

  for (int i = 0; i < 1000000; i++)
  {
    char str[1000];
    for (int j = 0; j < 999; j++)
      str[j] = char(((i + j) % 254) + 1);
    str[999] = 0;
    for (char c = 'a'; c <= 'z'; c++)
      if (strchr(str, c) != NULL)
        sum += 1;
  }

  t.StopTiming();
  printf("sum = %u\n", sum);
  printf("%.3G seconds.\n", t.GetUserSeconds());
  return 0;
}
In this form, the time measurement mechanism behaves as expected and gives, say, 7 seconds on my work machine. The result is correct even on a multi-core machine, since it does not matter which cores are used while the algorithm runs (see Figure 1).

Figure 1 - Single thread operation on a multi-core machine.
Now imagine that we want to use the capabilities of multi-core processors in our program and evaluate what parallelizing the algorithm with OpenMP gives us. We parallelize our code by adding a single line:
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < 1000000; i++)
{
  char str[1000];
  for (int j = 0; j < 999; j++)
    str[j] = char(((i + j) % 254) + 1);
  str[999] = 0;
  for (char c = 'a'; c <= 'z'; c++)
    if (strchr(str, c) != NULL)
      sum += 1;
}
Now the program prints a running time of 1.6 seconds. Since the machine has 4 cores, one is tempted to say: "Hooray, we got a 4x speedup, and the time measurement confirms it."
Sadly, no. We are measuring not the running time of the algorithm but the running time of the main thread. The measurement only seems reliable here because the main thread happened to work for as long as the additional ones did. Let's run a simple experiment and explicitly request 10 threads instead of 4:
#pragma omp parallel for reduction(+: sum) num_threads(10)
Logic suggests that this code should run for about the same time as the version parallelized into 4 threads. We have a four-core processor, so increasing the number of threads can only be expected to cause a slowdown. Instead, we see a value on the order of 0.7 seconds on the screen.
This result is to be expected, even though we wanted something quite different. Ten threads were created, and the roughly 7 seconds of total work were spread across them, so each thread ran for about 0.7 seconds. That is exactly how long the main thread, whose time the Timing class measures, was busy. As you can see, this method cannot be used to measure the speed of programs with parallel sections of code. For clarity, Figure 2 illustrates this.

Figure 2 - How the work of 10 threads might look on a machine with four cores.
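By the way, if what you want is the total CPU time consumed by all threads together, a natural workaround is GetProcessTimes, which accumulates the times of every thread in the process. A minimal sketch (the class name ProcessTiming is mine, for illustration):

class ProcessTiming {
public:
  void StartTiming() { m_userTime = GetProcessUserTime(); }
  void StopTiming() { m_userTime = GetProcessUserTime() - m_userTime; }
  double GetUserSeconds() const { return double(m_userTime) / 10000000.0; }
private:
  // Sums the user-mode time of all threads of the current process.
  __int64 GetProcessUserTime() const {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(GetCurrentProcess(),
                    &creationTime, &exitTime,
                    &kernelTime, &userTime);
    __int64 t = userTime.dwHighDateTime;
    t <<= 32;
    t += userTime.dwLowDateTime;
    return t;
  }
  __int64 m_userTime;
};

Keep in mind that this measures work done, not elapsed time: in the example above it should report about 7 seconds whatever the number of threads, so by itself it says nothing about the speedup.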
Of course, you can always use the time() function, but its resolution is low, and it cannot separate the running time of user code from system code. In addition, other processes can affect the result, which may also greatly distort the measurements.
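For completeness, this is what such a coarse measurement looks like; with one-second resolution it is only suitable for long runs:

#include <time.h>

time_t start = time(NULL);
// ... the code being measured ...
time_t stop = time(NULL);
printf("about %ld seconds.\n", (long)(stop - start));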
A favorite function of many developers for measuring speed is QueryPerformanceCounter. Let's build the speed measurement around it. In a simple form, the measuring class looks like this:
class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  double value;
  LARGE_INTEGER time1;
};

void Timing2::StartTiming() {
  QueryPerformanceCounter(&time1);
}

void Timing2::StopTiming() {
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}
Unfortunately, on a multi-core machine it is no longer that simple. :) Here is what MSDN says about this function:
On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.
Let's improve the class by binding the main thread to a single core:
class Timing2 {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  DWORD_PTR oldmask;
  double value;
  LARGE_INTEGER time1;
};

void Timing2::StartTiming() {
  oldmask = SetThreadAffinityMask(::GetCurrentThread(), 1);
  QueryPerformanceCounter(&time1);
}

void Timing2::StopTiming() {
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);
  SetThreadAffinityMask(::GetCurrentThread(), oldmask);
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}
This measurement method still suffers from the drawback that it cannot separate the running time of system code from user code. And if other tasks are running on the same core, the result can also be quite inaccurate. Still, it seems to me that, unlike GetThreadTimes, this method is applicable to a parallel algorithm. If that is not so, please correct me!
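A thought on getting more out of these numbers: combine wall-clock time from QueryPerformanceCounter with total CPU time from GetProcessTimes; their ratio estimates how many cores were effectively busy. A minimal sketch (the function PrintParallelStats is my own illustration):

void PrintParallelStats(double wallSeconds, double cpuSeconds)
{
  // cpuSeconds: user time of all threads combined (GetProcessTimes).
  // wallSeconds: elapsed time (QueryPerformanceCounter).
  double avgParallelism = cpuSeconds / wallSeconds;
  printf("wall: %.3G s, cpu: %.3G s, average parallelism: %.2f\n",
         wallSeconds, cpuSeconds, avgParallelism);
}

On a four-core machine one would expect the average parallelism for the example above to stay close to 4, whatever the number of threads requested.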
Let's measure what the Timing and Timing2 classes show for different numbers of threads:
int _tmain(int, _TCHAR *[])
{
  Timing t;
  Timing2 t2;
  t.StartTiming();
  t2.StartTiming();
  unsigned sum = 0;

//#pragma omp parallel for reduction(+: sum) num_threads(2)
//#pragma omp parallel for reduction(+: sum) num_threads(4)
//#pragma omp parallel for reduction(+: sum) num_threads(6)
//#pragma omp parallel for reduction(+: sum) num_threads(10)
  for (int i = 0; i < 1000000; i++)
  {
    char str[1000];
    for (int j = 0; j < 999; j++)
      str[j] = char(((i + j) % 254) + 1);
    str[999] = 0;
    for (char c = 'a'; c <= 'z'; c++)
      if (strchr(str, c) != NULL)
        sum += 1;
  }

  t.StopTiming();
  t2.StopTiming();
  printf("sum = %u\n", sum);
  printf("GetThreadTimes: %.3G seconds.\n", t.GetUserSeconds());
  printf("QueryPerformanceCounter: %.3G seconds.\n", t2.GetUserSeconds());
  return 0;
}
The results are summarized in the table shown in Figure 3.

Figure 3 - The algorithm running time in seconds measured using the GetThreadTimes and QueryPerformanceCounter functions on a quad-core machine.
As the table shows, as long as the number of threads does not exceed the number of cores, GetThreadTimes gives results similar to QueryPerformanceCounter, which can create the impression that correct measurements are being made. With more threads, however, its result can no longer be relied on.
I look forward to your comments and descriptions of how to better measure the running time of parallel algorithms.