snk November 4, 2011 at 04:09

CUDA Overview of the NVIDIA Parallel Nsight 2.0 Debugger


Debugging parallel code is a tedious and costly process. Parallelization errors are hard to catch because parallel applications behave non-deterministically. Moreover, once an error is detected, it is often hard to reproduce. And after changing the code, it is hard to be sure that the error has been fixed rather than masked. Most errors in parallel programs are heisenbugs. Hence the pressing need for convenient, full-featured debugging tools for parallel programs.
So, a little over a year ago, NVIDIA released NVIDIA Parallel Nsight, a set of tools for debugging parallel programs written in CUDA that integrates with Microsoft Visual Studio 2008 SP1 and 2010. XaocCPS wrote about it on Habr at the time. Since then the product has matured and become completely free. The latest version to date is 2.0. Let's look at the possible configurations, the installation and setup, and the main features of NVIDIA Parallel Nsight.

Possible configurations

NVIDIA offers 4 hardware configuration options for installing Parallel Nsight, which differ in the ability to use certain tools:

Configuration	Single system, 1 GPU	Single system, 2 GPUs	Two systems, each with a GPU	Single system, 2 GPUs (NVIDIA Multi-OS)
CUDA C/C++ Parallel Debugger	No	Yes	Yes	Yes
Direct3D Shader Debugger	No	No	Yes	Yes
Direct3D Graphics Inspector	Yes	Yes	Yes	Yes
Analyzer	Yes	Yes	Yes	Yes

NVIDIA calls option 4 the “ULTIMATE” configuration. NVIDIA Multi-OS is a virtual machine setup with support for the developer video driver. I considered building a similar system with VMware, but ran into the inability to install the developer driver on the virtual machine's video adapter.
NVIDIA lists the following system requirements, depending on the configuration selected:

Hardware Requirements:

	Minimum	Recommended
Operating system	Windows® Vista SP3, Windows 7 or Windows HPC Server 2008 (32- or 64-bit)	Same as minimum
CPU	Intel Pentium Dual-core CPU or equivalent @ 1.6 GHz	Intel Pentium Dual-core CPU or equivalent @ 2.2 GHz or higher
RAM	Host: 2 GB. Target machine (where the computation runs): 2 GB	Host: 2 GB or more. Target machine: 4 GB or more
Free hard drive space	32-bit machine: 240 MB for Parallel Nsight. 64-bit machine: 330 MB for Parallel Nsight	32-bit machine with the Parallel Nsight host component: 240 MB plus space for your project. 64-bit machine with the Parallel Nsight host component: 330 MB plus space for your project. (If you run or debug the application on a remote machine, that machine needs 240 MB of free space plus space for the debug build of your application.)
Output devices	Separate monitor for the compute GPU	DVI connection recommended
Local debugging (host and target on the same machine)	Two CUDA-capable GPUs (see the list of supported devices)	Same as minimum
Remote debugging (host and target on different machines)	Target machine: 1 CUDA-capable GPU. Host machine (where Visual Studio is installed): any GPU	Same as minimum
Supported GPUs	developer.nvidia.com/parallel-nsight-supported-gpus	developer.nvidia.com/parallel-nsight-supported-gpus

Software Requirements:

Display driver	Any NVIDIA display driver that supports Parallel Nsight. If the computer has an NVIDIA graphics card, a driver is probably already installed, but Parallel Nsight needs an up-to-date version to work properly.	Same as minimum
Local debugging (host and target on the same machine)	.NET Framework 3.5 SP1. Visual Studio: Microsoft Visual Studio 2008 SP1 Standard Edition or higher, or Microsoft Visual Studio 2010	Same as minimum
Remote debugging (host and target on different machines)	Host machine: .NET Framework 3.5 SP1; Microsoft Visual Studio 2008 SP1 Standard Edition or higher, or Microsoft Visual Studio 2010. Target machine: .NET Framework 3.5 SP1	Same as minimum
Network	Internet connection for downloading the installer. For remote debugging: a TCP/IP connection between the host and target machines.	Same as minimum

Install Parallel Nsight

To debug parallel code, a configuration with two CUDA-capable GPUs in one machine is enough. (A two-machine configuration would of course be much more interesting to cover, but unfortunately I currently have no way to put one together.)
So, in addition to my working card, a GeForce GTX 460, I had to buy one of the cheapest CUDA-capable cards, a GeForce 210. The following hardware configuration was prepared for installing Parallel Nsight:

Host:

CPU type: QuadCore AMD Phenom II X4 965, 3918 MHz
Motherboard: Gigabyte GA-790FXTA-UD5 (3 PCI, 1 PCI-E x1, 3 PCI-E x16, 4 DDR3 DIMM, Audio, Dual Gigabit LAN, IEEE-1394)
Motherboard chipset: AMD 790FX, AMD K10
System memory: 4096 MB

Display:

Video card: NVIDIA GeForce 210 (512 MB)
Video card: NVIDIA GeForce GTX 460 (1024 MB)
Monitor: ENV LED2770h [NoDB] (AUBB1JA005271) (DVI)

I used Windows 7 Enterprise x64 as the operating system. Microsoft Visual Studio 2008 SP1 or later is also required.

The necessary packages are available on the NVIDIA website. We will need:

Developer Drivers for WinVista and Win7
CUDA Toolkit
CUDA Computing SDK
Parallel Nsight 2.0.

Install the packages in the order listed. After that, the Visual Studio new project wizard should show a new “NVIDIA” section (the template ships with the CUDA Toolkit) containing the project type “NVIDIA CUDA 4.0”. Select it and create a project. If everything was installed correctly, the resulting Hello World project should compile and run.
All OK? Then let's get to the Parallel Nsight debugger itself. Since our machine is both host and target, we must first start the monitor component, “Nsight Monitor”. Open the code, set a breakpoint somewhere in the compute kernel, and start the project with the dedicated button on the Nsight toolbar. Note a few points:

The project must be built beforehand (the Nsight start button does not build it).
All breakpoints set outside the compute kernel are ignored when the program is launched in Nsight debug mode. The reverse also holds: when the program is debugged in the normal mode, only breakpoints in regular (host) code are honored.
The first time you run the Nsight debugger on Windows 7, you are likely to hit at least two problems: incompatibility with WPF hardware acceleration and with Windows Aero. Both must be turned off. The first is disabled by adding the following to the registry:
Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\Avalon.Graphics]
"DisableHWAcceleration"=dword:00000001
The second is disabled from the Control Panel. Alternatively, you can disable the check in Nsight itself: in Visual Studio, set Nsight -> Options -> “Override local debugging checks” to “True”. This, however, is risky. For example, if the code selects the video card that draws the desktop as the compute device and you start Nsight debugging, the system freezes for good. It is not clear what exactly the incompatibility between Parallel Nsight and WPF/Aero consists of: while debugging with “Override local debugging checks” turned on, I saw no problems either in the debugger or in WPF and Aero themselves.

So, the debugger is at the breakpoint:

Now, as when debugging a regular application, you can inspect the values of the variables in scope. A full-featured watch window lets you view arrays. The number of elements of the “elements” array shown is set by the “Max array elements” parameter in the Parallel Nsight debugger settings. In the screenshot above, the variable “A” is of type Matrix3:

typedef struct {
    int x_size;        // extent along x
    int y_size;        // extent along y
    int z_size;        // extent along z
    float4* elements;  // pointer to the element array
} Matrix3;

As the values of blockIdx and threadIdx show, the debugger is in the first thread of the first block of the grid. The question arises: how do you move to the thread you need? Among the Nsight tools there is a window called “Nsight Cuda Device Summary” that lets you move between warps in the neighborhood of the thread where execution stopped. The size of the neighborhood depends on the hardware capabilities of the GPU. For example, when computing on the GeForce 210, two blocks of 4 warps each were available at the moment of the stop:

Similarly for the GeForce GTX 460:

31 blocks are available. To move to a specific thread inside a warp, use the “Cuda Debug Focus” window (its interface also lets you move between blocks).
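Which warp a given thread belongs to can be worked out from its threadIdx and the block dimensions: CUDA numbers the threads of a block linearly (x fastest, then y, then z) and groups consecutive runs of threads into warps. The sketch below is plain host-side C++ that only illustrates this arithmetic. The names Dim3, WarpPos and warpOf are hypothetical helpers, not part of CUDA or Nsight, and a warp size of 32 is assumed.

```cpp
#include <cassert>

// Assumed warp size; 32 on the GPUs discussed here.
constexpr unsigned kWarpSize = 32;

// Host-side stand-in for CUDA's three-component index types (hypothetical).
struct Dim3 {
    unsigned x, y, z;
};

// Position of a thread in terms of warps (hypothetical helper type).
struct WarpPos {
    unsigned warp;  // index of the warp inside the block
    unsigned lane;  // index of the thread inside that warp
};

// Computes the warp and lane of a thread from its index inside the block.
// tid plays the role of threadIdx, bdim the role of blockDim.
inline WarpPos warpOf(Dim3 tid, Dim3 bdim) {
    assert(tid.x < bdim.x && tid.y < bdim.y && tid.z < bdim.z);
    // Linear thread id inside the block: x varies fastest, then y, then z.
    unsigned linear = tid.x + tid.y * bdim.x + tid.z * bdim.x * bdim.y;
    return { linear / kWarpSize, linear % kWarpSize };
}
```

For a 16 x 16 block, for instance, the thread with threadIdx (5, 3, 0) has linear id 53, so it would be found in warp 1 of its block, at lane 21.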
Another question: how do you get to a thread that is not in the neighborhood of the first thread? This is what conditional breakpoints are for. The condition syntax is as follows:
@blockIdx(x,y,z) && @threadIdx(x,y,z)
The debugger stops at the specified thread, and the warp neighborhood around that thread becomes available.
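To write such a condition for a particular data element, you first need the block and thread that process it. Here is a minimal sketch, assuming a one-dimensional launch in which the kernel computes its element index as blockIdx.x * blockDim.x + threadIdx.x. ThreadCoord, locateElement and breakpointCondition are hypothetical helper names, not part of CUDA or Nsight. The code is ordinary host-side C++.

```cpp
#include <cassert>
#include <string>

// Block and thread coordinates for a 1D launch (hypothetical helper type).
struct ThreadCoord {
    unsigned block;   // value of blockIdx.x
    unsigned thread;  // value of threadIdx.x
};

// Finds the block/thread that handles a given element, assuming the kernel
// uses the mapping: element = blockIdx.x * blockDim.x + threadIdx.x.
inline ThreadCoord locateElement(unsigned elementIndex, unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return { elementIndex / threadsPerBlock, elementIndex % threadsPerBlock };
}

// Formats the coordinates in the @blockIdx(x,y,z) && @threadIdx(x,y,z) form.
// The y and z components are fixed at 0 because a 1D launch is assumed.
inline std::string breakpointCondition(const ThreadCoord& c) {
    return "@blockIdx(" + std::to_string(c.block) + ",0,0) && @threadIdx(" +
           std::to_string(c.thread) + ",0,0)";
}
```

With 256 threads per block, for example, element 1000 is handled by block 3, thread 232, so the condition would read @blockIdx(3,0,0) && @threadIdx(232,0,0).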
The NVIDIA Parallel Nsight package also includes a powerful tool called “Analysis Activity” for analyzing CUDA computations against a range of metrics, with graphing and more, but that is a topic for a separate article.
My impressions of Parallel Nsight are entirely positive. I think integration into the most popular development environment for Windows is a big plus. To repeat, the product recently became completely free, which is very nice. And finally: this is the only debugging tool for CUDA programs on Windows, not counting the NVIDIA Compute Visual Profiler.