PatientZero April 5, 2019 at 16:59

History of 3dfx Voodoo1

Translation

This is the second article in the series “Late-90s 3D Cards That Quake Ran On”. In the first part, we examined the Rendition Vérité 1000 from late 1996 and a special port of the game for it called vQuake. Rendition beat everyone to the Quake market. For a short period of time, it remained the only board capable of running the id Software blockbuster with hardware acceleration.

But that all changed in January 1997, when id Software released a new version of Quake called GLQuake. Since the port was built on MiniGL (a subset of the OpenGL 1.1 standard), any hardware accelerator manufacturer could write MiniGL drivers and join the 3D card race. From that moment on, the competition was open to everyone. The goal was to generate as many frames per second as possible. The reward was fame and customers' money. A quick look at history shows that two authorities of the time had no doubt about who was king of the hill.

So far, there is no doubt about it: the world of Quake is ruled by Voodoo. And since Quake rules the world of games, buying 3Dfx Voodoo is almost inevitable for gamers.

- Tom's Hardware, November 30, 1997

3DFX Voodoo 1
- The standard by which all other cards are measured.

- John Carmack, .plan file, February 12, 1998 ^[2]

Just looking at the specifications ^[3], which stated a fill rate of 50 megapixels/s, made me want to study this card and understand what 3dfx did to create such a powerful product.

3dfx Interactive

Ross Smith, Scott Sellers and Gary Tarolli met while working together at SGI ^[4]. After a short stint at Pellucid, where they tried to sell IrisVision boards for the PC (in 1994 such boards cost $4,000 apiece), the colleagues founded their own company with the backing of Gordie Campbell's TechFarm. 3dfx Interactive, headquartered in San Jose, California, was founded in 1994.

Initially, the company intended to create powerful hardware for arcade machines, but it changed course and developed PC boards instead. There were three reasons for this.

RAM prices were fairly low.
The move from Fast Page Mode RAM to EDO RAM had reduced memory latency by 30%. Memory could now run at up to 50 MHz.
3D (or pseudo-3D) games were becoming more and more popular. The success of games such as DOOM, Descent and Wing Commander III showed that a market for 3D accelerators was about to emerge.

The founders of the company realized that they needed to create something powerful, designed for games, and with a retail price in the range of $300-400. In 1996, the company announced the SST1 architecture (named after the founders: Sellers-Smith-Tarolli-1), which was soon licensed by several OEMs such as Diamond, Canopus, Innovision and ColorMAX. The product was given the marketing name "Voodoo1", emphasizing its magical performance.

As with the V1000, board manufacturers could change only the type of RAM (EDO or DRAM), the color of the board and the physical arrangement of the chips. Almost everything else was standardized.

Diamond Monster 3D, image taken from vgamuseum.info.

Canopus Pure3D, image taken from vgamuseum.info.

BIOSTAR Venus 3D, image taken from vgamuseum.info.

ORCHID Righteous 3D, image taken from vgamuseum.info.

Looking at an SST1 board, it is striking how different it was from its competitors, the Rendition Vérité 1000 and the NVidia NV1.

Firstly, 3dfx took a bold step by dropping support for 2D rendering. The Voodoo1 had two VGA ports, one used as an output and the other as an input. The card was designed as an add-on: it took as input the output of a 2D VGA card already installed in the computer. While the user worked in the operating system (DOS or Windows), the Voodoo1 simply passed the signal from its VGA input through to its VGA output. When switching to 3D mode, the Voodoo1 took control of the VGA output and ignored the signal on its VGA input. Some boards had a mechanical switch that audibly clicked when changing between 2D and 3D modes. This decision meant that the card could only be used for full-screen rendering; there was no windowed mode.

The second noteworthy aspect of the SST1 was that it was built not around a single processor but around two non-programmable ASICs (Application-Specific Integrated Circuits). If you follow the bus traces on the board, you can see that each of the chips, labeled “TMU” and “FBI”, has its own RAM. On a card with 4 mebibytes of RAM, the memory was divided equally: 2 mebibytes for the TMU to store textures and 2 mebibytes for the FBI to store the color buffers and the z-buffer, with colors stored as 16-bit RGBA and depth as a 16-bit integer or half-float. A 4-mebibyte card supported resolutions up to 640x480 (two color buffers for double buffering, 2 x 640x480x2 bytes, plus one depth buffer, 640x480x2 bytes, for a total of 1,843,200 bytes). Later models with 4 mebibytes of FBI RAM allowed resolutions up to 800x600 (2 x 800x600x2 + 800x600x2 = 2,880,000 bytes).
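The buffer arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation in the text; the function and constant names are made up for illustration and do not come from the SST1 specification.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per stored value: 16-bit color and 16-bit depth, as described in the text. */
enum { BYTES_PER_COLOR = 2, BYTES_PER_DEPTH = 2 };

/* Memory needed for two color buffers (double buffering) plus one depth buffer.
   Hypothetical helper, named for this example only. */
uint32_t fbi_bytes_needed(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint32_t pixels = width * height;
    return 2u * pixels * BYTES_PER_COLOR + pixels * BYTES_PER_DEPTH;
}

/* Returns 1 if the resolution fits in the given amount of FBI RAM (in mebibytes). */
int fits_in_fbi_ram(uint32_t width, uint32_t height, uint32_t fbi_mib)
{
    return fbi_bytes_needed(width, height) <= fbi_mib * 1024u * 1024u;
}
```

With 2 mebibytes of FBI RAM, fits_in_fbi_ram(640, 480, 2) holds while fits_in_fbi_ram(800, 600, 2) does not, which matches the limits quoted above.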

SST1 rendering pipeline

The pipeline is not described in detail in the specification. According to my interpretation, the life of a triangle consisted of five stages.

A triangle is created and transformed on the computer's main processor (usually a Pentium). This includes multiplication by the model and projection matrices, clipping, per-vertex perspective division, culling in homogeneous coordinates, and the viewport transform. At the end of this process, only visible screen-space triangles remain (because of clipping, one triangle may become two).
Using the triangleCMD command, the triangles are sent over the PCI bus to the Frame Buffer Interface (FBI). There they are converted into scanline requests for the Texture Mapping Unit (TMU). For each element of a scanline (called a fragment), the TMU performs up to four texture lookups if the developer requested bilinear filtering. Per-fragment perspective division is also performed in the TMU.
The TMU sends fragments to the FBI as a textured 16-bit RGBA color value + a 16-bit z-value.
The FBI depth-tests each fragment against the z-buffer held in its dedicated RAM, which stores the RGBA values and z-values of the frame buffer.
Finally, lighting is applied to the fragment based on its color attribute and a lookup in a 64-entry fog table. If blending is required, the FBI combines the resulting fragment with what is already in the color buffer.

Interesting fact: if you are a 3D enthusiast, you probably know about the fast inverse square root code, which became famous thanks to the Quake 3 source code:

float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5f;
    x2 = number * 0.5f;
    y  = number;
    i  = * (long*) &y;    // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );                // what the fuck? 
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );     // 1st iteration
    return y;
}

While searching for the origin of Q_rsqrt ^[5], Rys Sommefeldt contacted Gary Tarolli, who said that he had used this code back when he worked at SGI. So it is fair to assume that it was also used in the SST1 pipeline.

Something doesn't add up

Having studied the pipeline, and knowing that each component (TMU, FBI, EDO RAM) runs at 50 MHz, the numbers do not seem to add up: at first glance the card should not be able to reach 50 megapixels/s. Two problems had to be solved here.

First, the TMU had to read four texels to perform bilinear texture filtering. This means four RAM access cycles, which would starve the TMU and cap the fill rate at 50/4 = 12.5 megapixels/s.

There is another bottleneck at the FBI level. If z-buffer testing is enabled, the incoming z-value of a fragment must be compared with what is already in the z-buffer before the fragment is written or discarded. If the test passes, the new value must be written. That makes two RAM operations, which would halve the fill rate: 50/2 = 25 megapixels/s.
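The two estimates above follow from dividing the memory clock by the number of sequential RAM accesses each pixel would need. A minimal sketch of that arithmetic, with hypothetical function names; the rule that the slowest stage limits the whole pipeline is an assumption added here, not a statement from the specification.

```c
#include <assert.h>

/* Fill rate in megapixels/s if every pixel costs `accesses_per_pixel`
   sequential RAM accesses on memory clocked at `clock_mhz`. */
double naive_fill_rate(double clock_mhz, int accesses_per_pixel)
{
    assert(accesses_per_pixel > 0);
    return clock_mhz / accesses_per_pixel;
}

/* Assumption: the TMU and the FBI work in sequence, so the slower of the
   two stages bounds the throughput of the whole pipeline. */
double pipeline_fill_rate(double tmu_rate, double fbi_rate)
{
    return tmu_rate < fbi_rate ? tmu_rate : fbi_rate;
}
```

Under these assumptions, a 50 MHz card with four texel reads per pixel and two z-buffer accesses per pixel would be limited by the TMU, far below the advertised figure.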

Four-way interleaving TMU

The solution to the four-sample problem at the TMU stage is mentioned in the SST1 specification.

Full interleaving is implemented in the texture memory data path, which allows a single bank to access data regardless of the address used to access data in other banks.

- SST1 specification

The specification does not say whether the bus used address multiplexing or whether the data and address lines were shared. It is easier to understand if you picture it with neither multiplexing nor sharing.

Regardless of the details, the TMU architecture made it possible to fetch four 16-bit texels per cycle. As long as the input data arrives at the right rate, the TMU can perform the per-fragment division by w and then generate the fragment's z-value (16-bit) and color (16-bit), which are passed on to the FBI.
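To make the role of the four texels concrete, here is a software sketch of bilinear filtering. It assumes texels in RGB565 format and 8-bit fixed-point weights; both are illustrative choices rather than details taken from the SST1 specification, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Linear interpolation between a and b with a fixed-point weight w in 0..256. */
static uint32_t lerp_u(uint32_t a, uint32_t b, uint32_t w)
{
    return (a * (256u - w) + b * w) >> 8;
}

/* Blend one channel of four texels: t00 and t10 on the top row, t01 and t11 below. */
static uint32_t bilerp(uint32_t t00, uint32_t t10, uint32_t t01, uint32_t t11,
                       uint32_t fx, uint32_t fy)
{
    uint32_t top = lerp_u(t00, t10, fx);
    uint32_t bottom = lerp_u(t01, t11, fx);
    return lerp_u(top, bottom, fy);
}

/* texel[0..3] = top-left, top-right, bottom-left, bottom-right (RGB565 assumed).
   fx and fy are the fractional position inside the 2x2 block, in 0..256. */
uint16_t bilinear_rgb565(const uint16_t texel[4], uint32_t fx, uint32_t fy)
{
    assert(fx <= 256 && fy <= 256);
    uint32_t r = bilerp(texel[0] >> 11, texel[1] >> 11,
                        texel[2] >> 11, texel[3] >> 11, fx, fy);
    uint32_t g = bilerp((texel[0] >> 5) & 0x3F, (texel[1] >> 5) & 0x3F,
                        (texel[2] >> 5) & 0x3F, (texel[3] >> 5) & 0x3F, fx, fy);
    uint32_t b = bilerp(texel[0] & 0x1F, texel[1] & 0x1F,
                        texel[2] & 0x1F, texel[3] & 0x1F, fx, fy);
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

Every output pixel needs all four inputs, which is why fetching them in a single cycle matters so much for the fill rate.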

Two-way interleaving FBI

The solution to the problem of two RAM operations at the FBI stage is not described in the specification. However, the document mentions a fill rate of 100 megapixels/s achieved with glClear thanks to the ability to write two pixels per cycle, which suggests that two-way interleaving was used here.

The FBI read and wrote two pixels at a time (2 x 1 pixels, each a 16-bit color plus a 16-bit z = 64 bits). To do this, the 21-bit address generates two 20-bit addresses in which the least significant bit is discarded, so that two consecutive pixels are read or written together. Since the scanline algorithm reads and writes along horizontal lines from left to right, accessing two consecutive pixels at a time worked very well.

64-bit TMU -> FBI bus

The final piece of the puzzle is the 64-bit bus between the TMU and the FBI. Almost nothing is written about it in the specification, but its behavior can be inferred from the data the FBI consumes. Since the FBI processes two pixels at a time, it is reasonable to assume that the TMU does not send each texel as soon as it is ready, but batches them in pairs, as 2 x (16-bit color + 16-bit z-value).
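One way to read this addressing scheme is that dropping the least significant bit selects a pair of neighboring pixels, while the dropped bit says which pixel of the pair is meant. The sketch below illustrates that reading only; the struct and function names are hypothetical and the real address generation is not documented in the article.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pair_address; /* 20-bit address shared by both pixels of a pair */
    uint32_t lane;         /* 0 = even pixel, 1 = odd pixel within the pair */
} PixelSlot;

/* Split a 21-bit pixel address into a 20-bit pair address and a lane. */
PixelSlot split_pixel_address(uint32_t pixel_address)
{
    assert(pixel_address < (1u << 21));
    PixelSlot slot;
    slot.pair_address = pixel_address >> 1; /* discard the least significant bit */
    slot.lane = pixel_address & 1u;         /* the discarded bit picks the pixel */
    return slot;
}
```

Pixels 10 and 11 on a scanline map to the same pair address (5) with lanes 0 and 1, so a left-to-right scanline walk touches each pair exactly once.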
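If two fragments really do travel together, each 64-bit transfer would carry two color values and two z-values. The sketch below packs such a pair; the bit layout (even pixel in the low 32 bits, color below z) is purely an assumption made for illustration, since the article says the specification barely describes this bus.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two fragments, each a 16-bit color and a 16-bit z, into one 64-bit word.
   Assumed layout: [odd z | odd color | even z | even color], high to low bits. */
uint64_t pack_fragment_pair(uint16_t even_color, uint16_t even_z,
                            uint16_t odd_color, uint16_t odd_z)
{
    uint64_t lo = ((uint64_t)even_z << 16) | even_color;
    uint64_t hi = ((uint64_t)odd_z << 16) | odd_color;
    return (hi << 32) | lo;
}

/* Extract the color of the fragment in the given lane (0 = even, 1 = odd). */
uint16_t fragment_color(uint64_t word, int lane)
{
    assert(lane == 0 || lane == 1);
    return (uint16_t)((word >> (32 * lane)) & 0xFFFFu);
}

/* Extract the z-value of the fragment in the given lane (0 = even, 1 = odd). */
uint16_t fragment_z(uint64_t word, int lane)
{
    assert(lane == 0 || lane == 1);
    return (uint16_t)((word >> (32 * lane + 16)) & 0xFFFFu);
}
```

At 50 MHz, one such word per cycle delivers two fragments per cycle, which is consistent with the FBI consuming pixels in pairs.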

Programming Voodoo1

At the lowest level, the Voodoo1 was programmed through memory-mapped registers. The API consists of a surprisingly small number of commands. There are only five of them: TRIANGLECMD (fixed point), FTRIANGLECMD (floating point), NOPCMD (no-op), FASTFILLCMD (buffer clearing) and SWAPBUFFERCMD. Everything else is a matter of loading data registers to configure blending, the z-test, the fog color and more. Textures were uploaded to VRAM through an 8-mebibyte write-only, memory-mapped PCI region.

(Real) Voodoo1 Programming

Developers programmed the Voodoo1 through the Glide API ^[6]. The API design was inspired by IRIS GL / OpenGL: it used a state machine and a prefix on everything (“gr” instead of “gl”), and programmers had to manage VRAM themselves, much as is done in Vulkan today.

#include <glide.h>

int main( void ) {
   GrHwConfiguration hwconfig;
   GrVertex A, B, C;

   grGlideInit();                    // Initialize the Glide library
   grSstQueryHardware(&hwconfig);    // Detect the installed Voodoo boards
   grSstSelect(0);                   // Make the first board current
   grSstWinOpen(0, GR_RESOLUTION_640x480, GR_REFRESH_60Hz,
     GR_COLORFORMAT_RGBA, GR_ORIGIN_LOWER_LEFT, 2, 0);  // 2 color buffers, no aux buffer
   grBufferClear(0, 0, 0);           // Clear color, alpha and depth
   ... // Init A, B, and C.
   guColorCombineFunction( GR_COLORCOMBINE_ITRGB );     // Use iterated (Gouraud) RGB
   grDrawTriangle(&A, &B, &C);
   grBufferSwap( 1 );                // Swap after one vertical retrace
   grGlideShutdown();
   return 0;
}

"Standard" MiniGL

Although MiniGL was a subset of the OpenGL 1.1 standard, no specification was ever released for it. MiniGL was "just those features that Quake uses." By running objdump on the glquake.exe binary, it is easy to build an “official” list.

$ objdump -p glquake.exe | grep "gl"
glAlphaFunc glDepthMask glLoadIdentity glShadeModel
glBegin glDepthRange glLoadMatrixf glTexCoord2f
glBlendFunc glDisable glMatrixMode glTexEnvf
glClear glDrawBuffer glOrtho glTexImage2D
glClearColor glEnable glPolygonMode glTexParameterf
glColor3f glEnd glPopMatrix glTexSubImage2D
glColor3ubv glFinish glPushMatrix glTranslatef
glColor4f glFrustum glReadBuffer glVertex2f
glColor4fv glGetFloatv glReadPixels glVertex3f
glCullFace glGetString glRotatef glVertex3fv
glDepthFunc glHint glScalef glViewport

If you started learning OpenGL recently, you may be puzzled by function names such as glColor3f, glTexCoord2f, glVertex3f, glTranslatef, glBegin and glEnd. They belong to what was called “immediate mode”, in which vertex coordinates, texture coordinates, matrix manipulations and colors were specified one function call at a time.

This is how a single textured, Gouraud-shaded triangle was drawn “in those days”.

#include <GL/gl.h>

void Render(void) {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnable(GL_TEXTURE_2D);
    glShadeModel(GL_SMOOTH);          // Gouraud shading: interpolate vertex colors
    glBindTexture(GL_TEXTURE_2D, 1);  // Assume a texture was loaded in textureId=1
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-1.0, 1.0, -1.0, 1.0, -1.0, 1.0);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glBegin(GL_TRIANGLES);            // One color, texcoord and position per vertex
      glColor3f(1.0f, 1.0f, 1.0f);
      glTexCoord2f(0.0f, 0.0f);
      glVertex3f(-1.0f,-0.25f,0.0f);
      glColor3f(0.0f, 0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f);
      glVertex3f(-0.5f,-0.25f,0.0f);
      glColor3f(0.5f, 0.5f, 0.5f);
      glTexCoord2f(0.0f, 1.0f);
      glVertex3f(-0.75f,0.25f,0.0f);
    glEnd();
}

GLQuake

The theoretical maximum fill rate of 50 megapixels/s was supposed to provide almost 50 frames per second at a resolution of 640x480. However, since Quake combined two texture layers per surface (one for color, the other for the lightmap), the SST1 had to draw each frame twice, with additional blending in the second pass. As a result, Quake ran at 26 fps on a 166 MHz Pentium.

By reducing the resolution to 512x384 on the same machine, it was possible to achieve a smooth 41 fps ^[7], which no competitor could match at the time.

Software rendering

GLQUAKE VOODOO1

Interesting fact: SST1 was not for everyone. Some people liked the pixels and found the bilinear filtering “blurry.” Others were annoyed by the loss of gamma correction.

Glquake looks like crap. I think some may argue with this, but let's admit it: it looks awful, especially on NVidia cards. On 3dfx boards, things are not so bad ... but the colors are still blurry. On the TNT2, the picture is disgusting; it is too dark and gloomy.

- @Frib, Unofficial Glquake & QW Guide ^[8]

3dfx Voodoo2

To say that 3dfx ruled the market from 1996 to 1998 would be an understatement. After the SST1, the Voodoo2 raised the bar even higher thanks to 100 MHz EDO RAM, ASICs clocked at 90 MHz, and not one but two TMUs, which made it possible to render a multi-textured Quake frame (color + lighting) in a single pass ^[9]. This technology was a real monster, and even the graphics cards themselves looked luxurious.

The Voodoo2 fill rate almost doubled, reaching 90 megapixels/s. Quake benchmarks skyrocketed to a stunning 80 fps on a Pentium II 266 MMX (compared to 56 fps with the Voodoo1), essentially reaching the limits of the game logic and of monitor capabilities.

Super Voodoo 2 12MB, image taken from vgamuseum.info.

Unfortunately, after the release of the Voodoo3 in 1999, the 3dfx story took a sharp turn. Faced with increasing competition, the company set out to build its own cards and stopped selling its technology to OEMs.

This transition did not go as expected, and the Voodoo3's performance was disappointing compared to NVidia’s GeForce 256, which offered hardware transform and lighting (the part of the pipeline that the Pentium had handled until then).

In response to NVidia, 3dfx canceled the development of Voodoo4 to begin building Voodoo5 with VSA-100 (Voodoo Scalable Architecture) technology. The result was not what was expected: once “Napalm” (the card's code name) was released, it ran into the more powerful NVidia GeForce 2 and ATI Radeon cards. In the end, on March 28, 2000, 3dfx filed for bankruptcy and was bought by NVidia.

For those who lived through the late 90s and had the pleasure of playing on a Voodoo1 or Voodoo2, 3dfx remains a landmark company symbolizing excellence. Its story is an ode to deserved success achieved through courage, outstanding talent and hard work. Thank you guys!

Reference materials

[1] Source: The story of the Rendition Vérité 1000

[2] Source: John Carmack .plan. Feb 12, 1998

[3] Source: SST-1, HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION

[4] Source: 3dfx Oral History Panel

[5] Source: Origin of Quake3's Fast InvSqrt ()

[6] Source: Glide Programming Guide

[7] Source: Comparison of Frame-rates in GLQuake Using Voodoo & Voodoo 2 3D Cards

[8] Source: Frib, Unofficial Glquake & QW Guide

[9] Source: VOODOO2 GRAPHICS HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION
