The story of the first GPU: Rendition Vérité 1000
There is a lot of good literature on the Quake engine: books, countless articles on the Internet, blogs, and wikis. Among them, my favorites are the Graphics Programming Black Book by Michael Abrash, published in 1997, and Rocket Jump: Quake and the Golden Age of First-Person Shooters by David L. Craddock (2018).
Unfortunately, very little has been written about the hardware developed around 1996 that made it possible to improve 3D rendering and, in particular, the graphics of id Software's revolutionary game. Behind the architecture and design of these pieces of silicon lies the story of a technological duel between the Rendition V1000 and the 3dfx Interactive Voodoo.
After the release of vQuake in early December 1996, it seemed that Rendition had taken the lead. The V1000 was the first card capable of running Quake with hardware acceleration, and according to the manufacturer it delivered a fill rate of 25 megapixels/s [1]. Just before Christmas, Rendition took over the market, letting players run the game at high resolution, with a high frame rate and 16-bit color [2]. But, as history has shown, a flaw in the design of the Vérité 1000 turned out to be fatal for the innovative company.
The right timing and killer applications
The idea of specialized hardware for graphics acceleration did not appear out of nowhere. Back in 1954, United Airlines already had flight simulators for training pilots. The largest player in the field, Silicon Graphics, Inc. (SGI), was founded in 1982 and by then offered powerful workstations such as the Indy, O2 and Indigo². However, these machines were priced far out of reach of ordinary consumers (the 1993 SGI Infinite Reality could sell for $100,000, equivalent to $177,262 in 2019). What changed the situation in the late 90s was the combination of three factors.
First, the price of RAM dropped significantly. Even though there was a huge RAM shortage in 1995 (mainly because 8 MB of memory was recommended for Microsoft Windows 95), over the following year the price of RAM fell by almost 90%. This opened the way for cards with what were then enormous frame buffers (640x480 with 16-bit RGB color) and enough room left to store textures locally.
Second, RAM got faster. Fast Page RAM was a step forward compared to plain DRAM, but with the release of EDO RAM, latency dropped by 30% and RAM access time reached 50 ns [3].
The third and final piece of the puzzle was killer apps. PCs now had powerful CPUs, such as the 166 MHz Intel Pentium, which developers used to create high-quality 3D games. In 1996, everyone was talking about two games: Core Design's Tomb Raider and id Software's Quake.
Rendition and V1000
Rendition Inc. was founded in 1993. Two years later, in 1995, the company announced the V1000 architecture, which was quickly licensed by four OEMs. The Creative Labs 3D Blaster PCI, Sierra Screamin' 3D, Canopus Total 3D and Intergraph Reactor were the first to reach the market, and MiRO soon joined them.
Intergraph Reactor. Image from vgamuseum.ru.
Creative Labs 3D Blaster. Image from the “Retro Graphics Cards” club.
Note that the first chip, the V1000-E, was later replaced by the V1000L-P, which consumed less power and was 20% faster [4].
MiroCrystal VRX. Image from vgamuseum.info.
Canopus Total3D. Image from vgamuseum.ru.
The names of the cards differed, but the chips inside them were the same. The only parameter manufacturers could use to balance price against performance was the quality of the RAM installed on the card. Every card carried the same set of components:
- VGA port for connecting to a CRT monitor.
- A RAMDAC, usually from Brooktree (Bt), but sometimes an AT&T chip.
- The core of the card: a V1000-E, V1000-P or V1000-L chip.
- Eight 512 KiB DRAM/EDO chips (4 MiB total) for storing frame buffers and textures.
- 64 KB of EEPROM containing the BIOS.
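To get a feel for what 4 MiB meant in practice, here is a back-of-the-envelope sketch of the memory budget. The function names are hypothetical, and the assumption that all of VRAM is split only between color buffers and textures is a simplification (it ignores microcode and any z-buffer).

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one surface of the given dimensions. */
static size_t surface_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Bytes left for textures after reserving `buffers` identical surfaces.
   Assumption: the whole of VRAM is shared between these buffers and
   textures only, which is an illustrative simplification. */
static size_t vram_left_for_textures(size_t vram_bytes, size_t buffers,
                                     size_t width, size_t height,
                                     size_t bytes_per_pixel)
{
    size_t used = buffers * surface_bytes(width, height, bytes_per_pixel);
    assert(used <= vram_bytes);
    return vram_bytes - used;
}
```

With these assumptions, one 640x480 buffer at 16 bits per pixel takes 614,400 bytes, and a double-buffered display leaves 2,965,504 bytes of the 4 MiB for everything else.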
The V1000 had two inherent properties that are important to note because 3dfx Voodoo (which I will discuss later) used a radically different approach.
First, the card was meant to replace the graphics card the buyer already had. The chip supported both 2D and 3D rendering over VGA, and thanks to context switching it had an impressive “3D in a window” mode. That is why the card had a single VGA output port.
The second feature is the “big iron” architecture, based on a single MIPS CPU with access to all 4 MiB of memory. The 64-bit data bus between them had no special properties. This standardized design made the card easy to program through loadable microcode (which made it the first GPU for the PC, long before Nvidia coined the term).
V1000 Programming
The SDK [5] came with a set of C header files for the two APIs (RRedline on Windows and Speedy3D on DOS). Rendering a textured triangle resembled what Vulkan offers today, with manual VRAM management. The API, built around textured triangles, also supported alpha testing, alpha blending and fog.
#include <windows.h>
// The two Rendition SDK headers declaring the V_* and VL_* calls go here.
int WINAPI WinMain(HINSTANCE instance, HINSTANCE prevInstance, LPSTR cmdLine, int cmdShow){
    int WIDTH = 640, HEIGHT = 480;
    HWND hWndMain = ... ;

    // Setup Verite board and resolution/refresh rate
    v_handle verite;
    VL_OpenVerite(hWndMain, &verite);
    V_SetDisplayType(verite, V_FULLSCREEN_APP);
    V_SetDisplayMode(verite, WIDTH, HEIGHT, 16, 75);

    // Copy texture to VRAM
    bmp_info bmp = loadBMP("data\\rlogo.bmp");
    v_memory memObj = V_AllocLockedMem(verite, bmp.linebytes*bmp.height);
    memcpy(V_GetMemoryObjectAddress(memObj), bmp.addr, bmp.linebytes*bmp.height);

    // Create the double-buffered display surface and the texture surface
    v_surface *display, *texture;
    VL_CreateSurface(verite, &display, V_SURFACE_PRIMARY, 2, V_PIXFMT_565, WIDTH, HEIGHT);
    VL_CreateSurface(verite, &texture, 0, 1, V_PIXFMT_565, bmp.width, bmp.height);

    // Record commands: upload texture, bind render target and texture
    v_cmdbuffer cmdbuffer = V_CreateCmdBuffer(verite, 0, 0);
    VL_LoadBuffer(&cmdbuffer, texture, 0, bmp.linebytes, bmp.width, bmp.height, memObj, 0);
    VL_InstallDstBuffer(&cmdbuffer, display);
    VL_InstallTextureMap(&cmdbuffer, texture);
    VL_SetSrcFunc(&cmdbuffer, V_SRCFUNC_REPLACE);

    // Clear screen to black
    VL_FillBuffer(&cmdbuffer, display, 1, 0, 0, display->width, display->height, 0);

    // Populate cmd with triangle coordinates and texture coordinates
    v_kaxyzuvq vertex[3] = ... ;
    VL_Triangle(&cmdbuffer, V_FIFO_KAXYZUVQ, &vertex[0], &vertex[1], &vertex[2]);

    // Submit the command buffer and flip the display
    V_IssueCmdBuffer(verite, cmdbuffer);
    VL_SwapDisplaySurface(&cmdbuffer, display);
    return 0;
}
RRedline loaded 128 KB of microcode into the Vérité and translated C calls into calls to V1000 assembly routines.
An interesting fact: the API name "RRedline" is a play on the phrase "Rendition Ready" and was most likely chosen collectively. The name Speedy3D, however, was Walt Donovan's idea.
In fact, the V1000 was just a slow CPU (25 MHz) with a single-cycle 32x32 multiplier (occupying a substantial part of the chip!), a single-cycle instruction for computing an approximate reciprocal (i.e., a two-cycle approximate integer division), and an ordinary set of RISC instructions. Oh, and also a “bilinear load” instruction, which read a 2x2 block from linear memory and performed bilinear filtering based on the fractional values of u and v passed to the instruction. There was a tiny cache on the card, only 4 pixels, I think. So if a perfectly matching 2x2 block came up, we saved some memory bandwidth.
There was no hardware support for Z-buffers. The software running on the V1000 therefore had to read Z, perform the comparison, and then decide whether to write or not.
- Walt Donovan (Algorithm Architect)
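The “bilinear load” described in the quote can be sketched in plain C. This is only an illustration of the arithmetic, not the V1000's actual instruction: the function name, the single 8-bit channel, and the 8-bit fractional precision of u and v are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one channel of a 2x2 texel block:
       t00 t10
       t01 t11
   fu and fv are the fractional parts of u and v, in [0, 255]
   (8-bit fractions are an assumption made for this sketch). */
static uint8_t bilinear_2x2(uint8_t t00, uint8_t t10,
                            uint8_t t01, uint8_t t11,
                            uint32_t fu, uint32_t fv)
{
    assert(fu < 256 && fv < 256);
    /* Horizontal blends, results carry 8 fractional bits. */
    uint32_t top    = t00 * (256 - fu) + t10 * fu;
    uint32_t bottom = t01 * (256 - fu) + t11 * fu;
    /* Vertical blend, result carries 16 fractional bits. */
    uint32_t blended = top * (256 - fv) + bottom * fv;
    return (uint8_t)(blended >> 16);
}
```

A real implementation would apply this to each color channel of a 16-bit pixel; one channel is enough to show why a dedicated instruction saved so many cycles.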
To send textures and microcode to the card, the driver used DMA to transfer data over PCI without CPU intervention. In practice, many motherboards did not implement bus mastering correctly, so games had to fall back to PCI FIFO mode, which hurt performance [6]. Inside the card, all operations were performed on 32-bit fixed-point integers.
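For readers who have never worked without floating point, here is a minimal sketch of 32-bit fixed-point arithmetic. The 16.16 split and all the names are assumptions chosen for illustration; the source does not say which formats the V1000 microcode used.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed16;                 /* 16.16 fixed point (assumed format) */
#define FIXED_ONE ((fixed16)65536)       /* 1.0 in 16.16 */

/* Convert a small integer to 16.16. */
static fixed16 fixed_from_int(int32_t n)
{
    return (fixed16)(n * 65536);
}

/* Multiply: widen to 64 bits, then drop the extra 16 fractional bits. */
static fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    return (fixed16)(((int64_t)a * b) / 65536);
}

/* Divide: pre-scale the dividend so the quotient keeps 16 fractional bits. */
static fixed16 fixed_div(fixed16 a, fixed16 b)
{
    assert(b != 0);
    return (fixed16)(((int64_t)a * 65536) / b);
}
```

The division here is exact integer division on the host CPU; it does not model the approximate reciprocal instruction mentioned in the quote above.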
Rendition decided to be fully programmable, but without any smart pipelining or fast clocking. So if 25 instructions were needed to write a pixel, you got only 1 megapixel/s. With fixed-function hardware, you can build a pipeline equivalent to those 25 instructions and reach 25 megapixels/s. The 3dfx people came from SGI, so they chose the approach that turned out to be the right one: a fixed-function triangle processing engine with a subset of OpenGL functionality implemented in hardware. The V1000 designers had a completely different background, did not know OpenGL, and therefore decided that building a CPU was the right thing to do.
- Walt Donovan (Algorithm Architect)
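The arithmetic behind this quote is simple enough to write down. The sketch below assumes one instruction per clock cycle on the programmable side and a fully pipelined unit that retires one pixel per clock on the fixed-function side; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels per second for a processor that spends `instructions_per_pixel`
   cycles on each pixel (assumes one instruction per clock cycle). */
static uint32_t programmable_fill_rate(uint32_t clock_hz,
                                       uint32_t instructions_per_pixel)
{
    assert(instructions_per_pixel > 0);
    return clock_hz / instructions_per_pixel;
}

/* Pixels per second for a fully pipelined fixed-function unit: once the
   pipeline is full, one pixel completes per clock, whatever its depth. */
static uint32_t pipelined_fill_rate(uint32_t clock_hz)
{
    return clock_hz;
}
```

At 25 MHz and 25 instructions per pixel, the first function gives 1,000,000 pixels/s, while the second gives 25,000,000 pixels/s at the same clock.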
On top of this feature set, the card also had an innovative anti-aliasing system, which had a funny side effect.
The anti-aliasing algorithm used in vQuake was patented (patent number 6005580). There was a funny quirk in this algorithm. It worked only with triangles, not with spans. Quake used the concept of “perfect z-buffering,” in which the geometry was split into spans and sorted by visibility using BSP/PVS (binary space partitioning / potentially visible set). The engine therefore produced a set of spans that covered the screen perfectly, with no overlaps and no missing pixels, and rendering required a single write (without z-buffering!) into display memory. However, the source data for these spans were triangles. The anti-aliasing algorithm looked for silhouette edges and smoothed them. (For more on this idea, see humus.name, the Geometric Post-Process anti-aliasing entry from March 2011 - the author reinvented this technique!) But since anti-aliasing was performed after the screen had been rendered (all spans were already drawn), the algorithm had no idea whether an edge was visible or not. It drew it anyway. (If a z-buffer had been used, only visible edges would have been redrawn!) In practice, this was not a big problem, because the BSP usually culled invisible triangles very well.
But not with character models! As a result, vQuake let the player see people hiding behind doors and walls, as a small moving distortion in the textures!
- Walt Donovan (Algorithm Architect)
vQuake
When the cards were released, a number of good games already supported them. Yes, Descent II, Grand Prix Legends, IndyCar Racing II, Myst, Nascar Racing, EF2000 and Tomb Raider were good games, but Quake was the true jewel in the crown, the most demanding title and the one that drove sales. id Software's game received its own Vérité port called vQuake, released on December 2, 1996. It was written by Walt Donovan and Stefan Podell from Rendition in collaboration with Michael Abrash from id.
The work was painstaking, but the port worked. A 166 MHz Pentium, which could render Quake in software at 320x200 at 26 frames per second, could jump to 640x480 with bilinear filtering and still render at 22 frames per second [8]. In practice, players chose a resolution of 512x384, which looked beautiful and ran at 32 frames per second on the P166. For a short time, vQuake was undeniably the best way to play Quake.
Software rendering
Vérité V1000
Many thanks to @swaaye from the vogons.org forum for taking the V1000 screenshots, and to Fruitz of Dojo for its high-quality and easy-to-hack Quake port for Mac OS X [9].
Software rendering
Vérité V1000
Z-buffer flaw
What the V1000 lacked (and what indirectly hurt its successor, the V2200) was hardware acceleration of the z-buffer. As soon as a developer enabled the depth test, the fill rate dropped to 12.5 megapixels/s and the frame rate was halved. As Stefan Podell later explained [10], vQuake (and all other games) were ported to the V1000 in a way that minimized z-buffer reads.
The developers found that the only way to get the necessary speed was to move the bulk of the work to the CPU. In the case of vQuake, this meant using the card as an ultrafast horizontal span renderer that always writes to the z-buffer, but reads and compares z only when rendering enemies. And although the developers managed to ship good products, the consequences of this architectural choice lingered for a long time.
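The split described above can be sketched as two span routines. This is an illustration on the host CPU, not V1000 microcode: the names are hypothetical, each span uses a constant color and depth (a real renderer interpolates both), and smaller z is assumed to mean closer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* World span: the CPU has already sorted spans so that they never overlap,
   so color and z are written without ever reading the z-buffer. */
static void draw_world_span(uint16_t *color, uint16_t *zbuf,
                            size_t x0, size_t x1, uint16_t c, uint16_t z)
{
    assert(x0 <= x1);
    for (size_t x = x0; x < x1; x++) {
        color[x] = c;
        zbuf[x] = z;
    }
}

/* Model span (enemies): read z, compare, and write only where the new
   pixel is closer. Returns the number of pixels actually written. */
static size_t draw_model_span(uint16_t *color, uint16_t *zbuf,
                              size_t x0, size_t x1, uint16_t c, uint16_t z)
{
    size_t written = 0;
    assert(x0 <= x1);
    for (size_t x = x0; x < x1; x++) {
        if (z < zbuf[x]) {   /* the read-and-compare step the V1000 did in software */
            color[x] = c;
            zbuf[x] = z;
            written++;
        }
    }
    return written;
}
```

The world routine is a pure write loop, which is where the card was fast; only the comparatively few model pixels pay for the extra z read and compare.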
3dfx and the fall of Rendition
id Software released GLQuake on January 22, 1997. It was built on miniGL (a subset of the OpenGL 1.0 standard that, among other things, lacked GL_LIGHT and GL_FOG). This binary opened the door to every hardware-accelerated PC card. 3dfx Interactive's Voodoo cards stood out in particular: their stunning performance (41 fps at 512x384 with 16-bit color on a P166 [11]) made them the de facto standard for 3D accelerators. The V1000's fill rate of 25 megapixels/s, which had once compared favorably with the Pentium's software rendering, now looked mediocre next to the 50 megapixels/s of the Voodoo, which was not even affected by z-tests.
Rendition's response was the more powerful V2x00, which paradoxically made things worse. It was advertised as twice as fast thanks to its hardware z-buffer, yet it could not improve the frame rate in vQuake at all. This anomaly undermined customer confidence and reflected badly on vQuake developer Stefan Podell, who felt he had to explain why vQuake's performance was limited by the CPU rather than the GPU [12].
... my reputation ended up tainted by the fact that VQuake and VHexen2 did not run faster on the V2x00, so I have to explain why this happened.
[...]
Walt and Michael decided that since the Verite 1000 did not perform very well on Z-buffered pixels, letting the Pentium do the span sorting could reduce the number of pixels the Verite had to draw. Moreover, we could turn off the Z comparison function on the Verite.
[...]
... however fast the Verite chip was, the CPU still had a lot of work to do.
- Stefan Podell
Moreover, there were significant problems in the hardware architecture, which led to the initial failure of the V2x00 [13]. It took several months to fix the problem, and even then the board still ran at 50 MHz, while the Nvidia NV3 and Voodoo2 had already reached 100 MHz.
The third generation, based on the V3300, could have changed the course of history, but it came too late. The project was canceled in 1998, after Rendition was acquired by Micron Technology.
We made a lot of mistakes at Rendition. We could have shipped the V1000 a few months earlier (and had no competitors during those months) if we had done the layout ourselves instead of handing it to the fab. The quality control of the chip also raised questions. One guy at our company spent several months implementing MPEG decompression in V1000 assembly language, but could not get it to work because of unpredictable chip bugs.
vQuake worked well only because the V1000 did not do much work. “Render this list of spans”, “smooth this edge” - that was almost all it did. Mike Abrash and I spent too much time making Quake work with the V1000, so this model was not sustainable in the long run.
- Walt Donovan (Algorithm Architect)
After the collapse of Rendition, 3dfx redoubled its efforts to promote the Voodoo2, whose outstanding performance allowed it to sweep away all competitors. The king of 3D graphics on the PC ruled the market for a while. Then the game went on, new competitors appeared on the scene, and among them were the Canadian ATI and a company almost unknown at the time called Nvidia.
Reference materials
[1] Source: VGA Museum, V1000 Texel Fillrate (MTexel / s) reported as 25
[2] Source: John Carmack .plan Aug 22, 1996 “at 512 * 384 it is almost twice as fast”
[3] Source: 3dfx VOODOO1 Reference Rev. 1.0
[4] Source: Review of the V1000
[5] Source: Rendition Verite V1000 SDK
[6] Source: The immaturity of the PCI bus [...] caused DMA bugs to surface
[7] Source: RRedline Programming Guide
[8] Source: Benchmarks to compare the Rendition Vérité V1000-E and V1000L-P
[9] Source: Mac OS X Quake port source code on github.com
[10] Source: Stephan Podell BSS post
[11] Source: Comparison of Frame-rates in GLQuake Using Voodoo1
[12] Source: Stephan Podell BSS post
[13] Source: wikipedia.com, Downfall section