Cluster performance testing under Windows. Linpack, Lizard
Hello,
Today's post is about the delicate issue of cluster performance testing. Many will say (and be right) that, in general, the results of such tests are intended solely for press releases and TOP500 submissions and have no practical use. However, the same testing tools can also be used to identify system bottlenecks. So, in this first post we will talk about Linpack and Lizard.
Table of contents:
1) Linpack general information
2) Linpack main parameters
3) Lizard. Linpack implementation for Windows-systems
4) Lizard. Linpack optimization for Windows-based systems
5) Native cluster testing tools
Note: in some places we talk about the performance of the compute nodes, in others about the network. Together, these two indicators make up the overall cluster performance.
1) Linpack general information

Since the 1980s, the Linpack library, since expanded into the more functional LAPACK (Linear Algebra PACKage), has been considered the benchmark library for testing the performance of supercomputers (not only clusters). It has interfaces for Fortran and C.
LAPACK analogues:
* Intel MKL
* AMD ACML
* Sun Performance Library
* NAG's LAPACK
* HP's MLIB
Each manufacturer, in the best IT tradition, develops and ships its own library tuned for its architecture. Naturally, on Intel hardware the MKL library will give better performance than the reference LAPACK.
The main task of Linpack and its analogues/modifications is to solve a system of linear algebraic equations of the form Ax = f using LU factorization with partial pivoting (choosing the leading element of the column), where A is a dense matrix of order N. The original matrix is divided into logical blocks of size NB × NB, and these blocks in turn are distributed over a P × Q process grid, so that each block lands on a separate processor of the system.
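To make the method concrete, here is a toy sketch (my own illustration, not HPL itself and with none of the block distribution) of solving Ax = f by LU-style elimination with partial pivoting:

```python
# Toy illustration of Linpack's core method: Gaussian elimination with
# partial pivoting (the "leading element of the column"), followed by
# back substitution. Not HPL - no blocking, no P x Q grid.
def lu_solve(A, f):
    n = len(A)
    A = [row[:] for row in A]  # work on copies
    x = f[:]
    # Forward elimination: at each step k, pick the row with the
    # largest absolute value in column k as the pivot row.
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    # Back substitution on the resulting upper-triangular system.
    for k in range(n - 1, -1, -1):
        s = sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (x[k] - s) / A[k][k]
    return x

# 2x + y = 5, x + 3y = 10  ->  x = 1, y = 3
print(lu_solve([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0]))
```

HPL does exactly this, only on an NB × NB blocked matrix spread across the P × Q grid, which is where the parameters below come from.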
More information about the mathematical background of the test can be found on the Intuit website: www.intuit.ru/department/supercomputing/tbucs/4/2.html
Performance in the Linpack benchmark is measured as the number of floating-point operations performed per second; the unit of measurement is the flop/s (one such operation per second).
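For a back-of-the-envelope feel for the numbers: HPL computes the reported rate from the standard operation count for an order-N LU solve, 2/3·N³ + 2·N², divided by the wall-clock time. A small sketch (the N and timing below are hypothetical, just for illustration):

```python
# HPL reports performance as ops / time, where the operation count for
# an LU solve of an order-N dense system is taken as 2/3 * N^3 + 2 * N^2.
def hpl_gflops(n, seconds):
    ops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return ops / seconds / 1e9  # Gflop/s

# Hypothetical run: N = 26000 solved in 600 s -> about 19.5 Gflop/s.
print(round(hpl_gflops(26000, 600.0), 1))
```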
2) Linpack main parameters

• N is the order of the matrix. The higher it is, the more floating-point arithmetic operations will be executed. N is limited by the amount of memory the system can allocate to the HPL process. Lizard can choose what it considers the optimal parameters itself; for example, 26,000 suits four nodes with 2 GB of RAM each. But it is better to choose the value empirically, starting small: a performance drop appears once the system starts writing to the swap file, at which point you decrease N slightly to land on the optimum. N must be greater than or equal to P * Q.
• P and Q are the process-grid coefficients; their product must be matched to N, and P * Q = Number of Processes. Setting P to the number of processors and Q to the number of nodes is usually close to optimal. Before configuring, take Hyper-Threading into account (or better, turn it off altogether).
• NB is the block size: it reflects the number of pieces the task is divided into and determines how large a chunk of data each node receives. In practice, the smaller this coefficient, the more even the processor load, but you can tune it as you see fit and watch the resulting performance (based on the needs of your architecture). N must be divisible by NB with no remainder.
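The constraints above are easy to sanity-check before a run. A small sketch (function names and the 80%-of-RAM rule of thumb are my own assumptions, not HPL or Lizard syntax):

```python
# Hypothetical pre-flight checks for the parameters described above.
import math

def suggest_n(total_ram_bytes, fraction=0.8):
    # The matrix holds N * N doubles (8 bytes each); keeping it within
    # ~80% of total RAM is a common rule of thumb to avoid swapping.
    return int(math.sqrt(fraction * total_ram_bytes / 8))

def check_params(n, nb, p, q, n_processes):
    assert p * q == n_processes, "P * Q must equal the number of processes"
    assert n % nb == 0, "N must be divisible by NB with no remainder"
    assert n >= p * q, "N must be at least P * Q"

# Four nodes with 2 GB of RAM each, as in the example above:
print(suggest_n(4 * 2 * 1024 ** 3))  # prints 29308, same ballpark as N = 26000
check_params(n=26000, nb=104, p=8, q=4, n_processes=32)  # passes
```

That the rule of thumb lands slightly above the article's 26,000 fits the advice to back off from the theoretical maximum empirically.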
For convenience, you can use the Linpack Excel spreadsheet, which calculates the coefficient values itself once you fill in the corresponding cells.
HPL saves its results, with detailed comments, to an .hpl file in its working folder. Unfortunately, I was not able to bring such a file from our configuration into a presentable form.
3) Lizard. Linpack implementation for Windows-systems

It is only logical that Microsoft, having suddenly flown into the TOP500 with its new system, could not stay on the sidelines. Especially for lazy Windows system administrators, a shell for cluster performance testing was developed: Lizard (Linpack Wizard), the canonical library wrapped in a convenient visual wizard (supplied with the HPC Tool Pack 2008). The wizard offers both an express test (with standard parameters selected automatically) and an advanced mode for setting specific coefficients. Everything is accompanied by comments.
4) Lizard. Linpack optimization for Windows-based systems

For optimization, Microsoft recommends stopping and disabling all services the system's operation does not directly depend on. Script:
sc stop wuauserv
sc stop WinRM
sc stop WinHttpAutoProxySvc
sc stop WAS
sc stop W32Time
sc stop TrkWks
sc stop SstpSvc
sc stop Spooler
sc stop ShellHWDetection
sc stop RemoteRegistry
sc stop RasMan
sc stop NlaSvc
sc stop NetTcpActivator
sc stop NetTcpPortSharing
sc stop netprofm
sc stop NetPipeActivator
sc stop MSDTC
sc stop KtmRm
sc stop KeyIso
rem sc stop gpsvc
sc stop bfe
sc stop CryptSvc
sc stop BITS
sc stop AudioSrv
sc stop SharedAccess
sc stop SENS
sc stop EventSystem
sc stop PolicyAgent
sc stop AeLookupSvc
sc stop WerSvc
sc stop hkmsvc
sc stop UmRdpService
sc stop MpsSvc
sc config wuauserv start= disabled
sc config WinRM start= disabled
sc config WinHttpAutoProxySvc start= disabled
sc config WAS start= disabled
sc config W32Time start= disabled
sc config TrkWks start= disabled
sc config SstpSvc start= disabled
sc config Spooler start= disabled
sc config ShellHWDetection start= disabled
sc config RemoteRegistry start= disabled
sc config RasMan start= disabled
sc config NlaSvc start= disabled
sc config NetTcpActivator start= disabled
sc config NetTcpPortSharing start= disabled
sc config netprofm start= disabled
sc config NetPipeActivator start= disabled
sc config MSDTC start= disabled
sc config KtmRm start= disabled
sc config KeyIso start= disabled
rem sc config gpsvc start= disabled
sc config bfe start= disabled
sc config CryptSvc start= disabled
sc config BITS start= disabled
sc config AudioSrv start= disabled
sc config SharedAccess start= disabled
sc config SENS start= disabled
sc config EventSystem start= disabled
sc config PolicyAgent start= disabled
sc config AeLookupSvc start= disabled
sc config WerSvc start= disabled
sc config hkmsvc start= disabled
sc config UmRdpService start= disabled
sc config MpsSvc start= disabled
5) Native cluster testing tools

In addition to Linpack and Lizard, Windows HPC Server 2008 (specifically, HPC Pack 2008) includes standard cluster performance testing tools, such as:
MPI Ping-Pong: Lightweight Throughput (packet transfer between nodes)
MPI Ping-Pong: Quick Check (checks network latency, bandwidth, etc.)
Of course, the list of tests does not end there; there are more than ten of them, covering the full functionality of the cluster.
Thanks for your attention.