Optimizing OpenGL drivers for Quake3
This is intended mostly for people working on 3D drivers for Linux, and is basically the same information we have provided to the Windows and Mac driver coders.
First off, if your driver is communicating over any standard communication pipe (like X), you are pretty much SOL. The data traffic is so high that a good framerate is going to be almost impossible to achieve. A direct rendering model is needed to get reasonable performance.
Next, if your driver is directly writing to a small command FIFO on the chip, you will be limited to about 2/3 or less of the framerate you could get with a fully decoupled DMA buffer approach. It is possible to get a playable game with a directly writing driver, but it won’t be running with the best of them.
If the hardware is capable of it, supporting ARB_multitexture gives a significant performance boost.
Quake3’s rendering architecture has been defined with the primary goal of minimizing API calls and focusing as much work as possible in a single place to make optimization more productive.
During gameplay, 99.9% of all primitives go through a single API point:
glDrawElements( GL_TRIANGLES, numIndexes, GL_UNSIGNED_INT, indexes );
GL_VERTEX_ARRAY is always enabled, and each vertex will be four floats. The fourth float is just for padding purposes so that each vertex will exactly fill an aligned 16 byte block suitable for SIMD optimizations.
GL_TEXTURE_COORD_ARRAY is always enabled for the base texture unit, and points at pairs of floats.
If ARB_multitexture is available, GL_TEXTURE_COORD_ARRAY may or may not be enabled for the second texture unit.
GL_COLOR_ARRAY is always enabled and pointing at four unsigned chars in the current release, but we can expose a path where the color is constant for all vertexes and the color array is disabled. We removed this at the last minute because of some driver problems, but we may be putting it back in later. The push for this option is that a multitexture vertex is an odd 36 bytes if color is included, but a very comfortable 32 bytes if a constant color was set ahead of time. On some older cards that require manual setup, knowing that the color gradients don’t need to be calculated is also a speed win.
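The 36 versus 32 byte figures can be checked with a small sketch. The struct names below are hypothetical, and the interleaved layout is only an assumption about how a driver might pack a multitexture vertex for the card; Quake3 itself hands over separate arrays. It also assumes 4-byte floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packed multitexture vertex with a per-vertex color. */
typedef struct {
    float         xyz[4];   /* x, y, z plus one float of padding: 16 bytes */
    float         st0[2];   /* texcoords for the base texture unit: 8 bytes */
    float         st1[2];   /* texcoords for the second texture unit: 8 bytes */
    unsigned char rgba[4];  /* four unsigned chars of color: 4 bytes */
} packedVertexColor;

/* The same vertex when the color is constant and set ahead of time. */
typedef struct {
    float xyz[4];
    float st0[2];
    float st1[2];
} packedVertexConstColor;

/* The sizes below only hold under this assumption. */
static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

/* Bytes per vertex for the two cases discussed in the text. */
size_t packedVertexBytes(int hasColorArray)
{
    return hasColorArray ? sizeof(packedVertexColor)
                         : sizeof(packedVertexConstColor);
}
```

With the color array the vertex comes to 36 bytes; without it the vertex is 32 bytes, which is an exact multiple of the 16 byte block size.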
If EXT_compiled_vertex_array is not present, we set up the same vertex arrays, but we do strip finding ourselves and issue glBegin() / glArrayElement() / … / glEnd(). This is faster than the discrete triangle path for most drivers that don’t have compiled vertex arrays (because they don’t retransform every vertex), but results in a lot more API overhead and limits batch processing. You can change between this behavior and the single draw elements call with the variable "r_drawstrips 0/1". The optimal path is to have compiled vertex arrays and take it as one big glDrawElements call.
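As a rough illustration of what strip finding over a discrete triangle list involves, here is a minimal sketch. It is not the engine's code: countStrips and stripStats are hypothetical names, and instead of issuing GL calls it only counts how many glBegin / glEnd pairs and glArrayElement calls a simple strip walker would produce. It assumes consecutive triangles in strip order look like (a,b,c), (c,b,d), (c,d,e), and so on.

```c
#include <assert.h>

typedef struct {
    int strips;    /* glBegin/glEnd pairs that would be issued */
    int elements;  /* glArrayElement calls that would be issued */
} stripStats;

/* Walk a discrete triangle index list and count the strips it breaks into. */
stripStats countStrips(const unsigned int *indexes, int numIndexes)
{
    stripStats s = { 0, 0 };
    assert(numIndexes >= 0 && numIndexes % 3 == 0);
    if (numIndexes == 0) {
        return s;
    }

    const unsigned int *last = indexes;  /* previous triangle */
    int even = 0;                        /* parity of position within the strip */
    s.strips = 1;
    s.elements = 3;                      /* first triangle primes the strip */

    for (int i = 3; i < numIndexes; i += 3) {
        const unsigned int *tri = &indexes[i];
        int continues = even
            ? (tri[0] == last[0] && tri[1] == last[2])
            : (tri[0] == last[2] && tri[1] == last[1]);

        if (continues) {
            s.elements += 1;             /* only the new vertex is emitted */
            even = !even;
        } else {
            s.strips += 1;               /* end this strip, begin a new one */
            s.elements += 3;
            even = 0;
        }
        last = tri;
    }
    return s;
}
```

For the index list 0,1,2, 2,1,3, 2,3,4, 7,8,9 this gives two strips and eight element calls, compared with twelve indexes for the discrete triangle path.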
So, for a single texturing card with the current (1.03) Quake3 release, there is one single set of conditions to optimize: completely full featured (vertex, color, texture coord) discrete triangles going through the DrawElements path.
A multitexture driver will also see the case where two texture units are active, which requires a different code path.
Note that the 2D overlay graphics, including the console text, currently go through standard glBegin / glTexCoord / glVertex / glEnd paths, but they don’t amount to many triangles during gameplay. If you are profiling the startup or connection process, this might confuse your data.
While the array primitives are discrete triangles, they are arranged so that the triangles actually neighbor each other in tristrip order when possible. You can just send them all to the card as completely separate triangles, but to optimize the bus bandwidth utilization you can compare the indexes of the current triangle with the previous triangle to see if there are shared vertexes. Exactly how this needs to be done is hardware dependent. The easiest case is hardware that just has three or more vertex registers, where you can change any given one of them and the others stay the same. Hardware that requires separate begin_tri_strip type commands will require a bit more work to take advantage of. This type of optimization work will only matter after all the other stuff is done.
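For the easy case of hardware with three independently writable vertex registers, the index comparison can be sketched as follows. The function names are hypothetical and the model is deliberately simplified: it only counts how many vertexes would need to cross the bus, and ignores which register holds which vertex and any winding fixups real hardware may need.

```c
#include <assert.h>

/* How many indexes of triangle cur also appear in triangle prev. */
static int sharedVertexCount(const unsigned int *prev, const unsigned int *cur)
{
    int shared = 0;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            if (cur[i] == prev[j]) {
                shared++;
                break;
            }
        }
    }
    return shared;
}

/* Vertexes sent to the card if only the changed registers are rewritten. */
int countVertexUploads(const unsigned int *indexes, int numIndexes)
{
    int uploads = 0;
    assert(numIndexes >= 0 && numIndexes % 3 == 0);
    for (int i = 0; i < numIndexes; i += 3) {
        if (i == 0) {
            uploads += 3;  /* nothing to share with yet */
        } else {
            uploads += 3 - sharedVertexCount(&indexes[i - 3], &indexes[i]);
        }
    }
    return uploads;
}
```

Sending every triangle separately costs numIndexes uploads, so the difference between that and the returned count is the bandwidth saved under this model.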
Ideally, drivers should support EXT_compiled_vertex_array, which allows us to explicitly tell you that we aren’t going to change the vertex values after we have specified them, so you can batch process the entire load. There are two levels of benefit from this: shared vertexes in a single DrawElements call and shared vertexes across multiple rendering passes on the same geometry. Some drivers get the first benefit even without the compiled vertex arrays by scanning the indexes before processing the triangles, but to save the work across multiple rendering passes the extension is necessary.
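The index scan that gets the first benefit can be as simple as marking which vertexes a DrawElements call references, so each one is transformed once no matter how many triangles share it. This is only a sketch: markReferencedVertexes is a hypothetical name, and the caller is assumed to supply a scratch array with one byte per vertex.

```c
#include <assert.h>
#include <string.h>

/* Mark every vertex referenced by the index list and return how many
   distinct vertexes there are, i.e. how many transforms are needed. */
int markReferencedVertexes(const unsigned int *indexes, int numIndexes,
                           unsigned char *used, int numVertexes)
{
    int unique = 0;
    assert(numIndexes >= 0 && numVertexes >= 0);
    memset(used, 0, (size_t)numVertexes);
    for (int i = 0; i < numIndexes; i++) {
        unsigned int v = indexes[i];
        assert(v < (unsigned int)numVertexes);
        if (!used[v]) {
            used[v] = 1;   /* first reference: this vertex needs a transform */
            unique++;
        }
    }
    return unique;
}
```

With locked arrays the scan is unnecessary, since the lock range already says which vertexes are in play and the transformed results can be kept for later passes.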
If a given set of triangles is only going to need a single pass of rendering, we will set up all the vertex arrays before issuing the lock arrays call. This allows the driver to munge the color and texcoord data as well if necessary, but the performance benefit of that is minor compared to the work saved on the vertex array.
glColorPointer( 4, GL_UNSIGNED_BYTE, 0, tess.svars[0].colors );
glTexCoordPointer( 2, GL_FLOAT, 0, tess.textureSt );
glVertexPointer (3, GL_FLOAT, 16, input->xyz);
glLockArraysEXT(0, input->numVertexes);
<set some rasterization state>
glDrawElements(GL_TRIANGLES, input->numIndexes, GL_UNSIGNED_INT, input->indexes);
glUnlockArraysEXT();
If the triangles are going to need to be rendered in multiple passes we only lock the vertex array, then change the color and texcoord arrays on each pass. This allows you to cache the vertex data, but not the color or texcoord data.
glVertexPointer (3, GL_FLOAT, 16, input->xyz);
glLockArraysEXT(0, input->numVertexes);
<set some rasterization state>
glColorPointer( 4, GL_UNSIGNED_BYTE, 0, tess.svars[0].colors );
glTexCoordPointer( 2, GL_FLOAT, 0, tess.textureSt );
glDrawElements(GL_TRIANGLES, input->numIndexes, GL_UNSIGNED_INT, input->indexes);
<set some rasterization state>
glColorPointer( 4, GL_UNSIGNED_BYTE, 0, tess.svars[0].colors );
glTexCoordPointer( 2, GL_FLOAT, 0, tess.textureSt );
glDrawElements(GL_TRIANGLES, input->numIndexes, GL_UNSIGNED_INT, input->indexes);
glUnlockArraysEXT();
The only weird thing we do with geometry is enabling a single user clip plane when looking through a portal in the game. This usually punts drivers to an unoptimized path, so we don’t use it very often.
There are a couple of common optimizations that I would recommend avoiding, due to the artifacts they introduce.
In many cases triangle back face culling can be done more efficiently by a fast CPU than by the graphics card, especially if the card is taking discrete triangles instead of strips. The problem is that the CPU and card will have slightly different computations, and triangles that are very near edge-on may be considered culled by one and not the other. The result is a brief crack between polygons when a polygon goes edge on.
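To make the failure mode concrete, here is a sketch of the usual CPU-side test: the sign of the screen-space triangle area. The function names are hypothetical, and treating counter-clockwise as front facing is an assumption for illustration only.

```c
#include <assert.h>

/* Twice the signed area of a screen-space triangle.
   Positive means the vertexes run counter-clockwise. */
float signedArea2(const float a[2], const float b[2], const float c[2])
{
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]);
}

/* CPU-side cull decision, assuming counter-clockwise is front facing. */
int isBackFacing(const float a[2], const float b[2], const float c[2])
{
    return signedArea2(a, b, c) <= 0.0f;
}
```

For a triangle that is nearly edge-on the area is close to zero, so the sign depends on rounding. A card that computes the same quantity with different precision or operation order can reach the opposite answer, which is where the cracks come from.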
Guard band clipping is another optimization that usually leads to tiny cracks between polygons. The idea behind guard band clipping is that triangles that poke some distance off the screen are more efficiently handled by letting the hardware scissor them instead of manually clipping them. Only triangles that extend far off the screen or cross the near clip plane are actually clipped by the CPU. The problem is that when two triangles share an edge that hits the screen bounds and one of them stays within the guard band and the other doesn’t, the clipped triangle will get a slightly different edge slope if it is clipped to the screen bounds while the other triangle scissors off the edge. This can be solved by clipping to the guard band edge instead of the screen edge, but on current hardware that can exact a fairly high pixel cost, blunting the benefit of the saved clipping. Plus, there are all sorts of other common bugs with drivers that try guard band clipping.
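The per-triangle decision behind guard band clipping can be sketched as below. All names are hypothetical, the guard band is assumed to be a uniform border around the screen, and the near clip plane test is left out. The point is that two triangles sharing an edge can land in different categories.

```c
#include <assert.h>

typedef enum {
    TRI_NO_CLIP,   /* entirely on screen */
    TRI_SCISSOR,   /* pokes off screen but stays inside the guard band */
    TRI_CPU_CLIP   /* extends past the guard band, clipped by the CPU */
} clipPath;

static int insideRect(const float *p, float minX, float minY,
                      float maxX, float maxY)
{
    return p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY;
}

/* xy holds three screen-space points as x0,y0,x1,y1,x2,y2. */
clipPath chooseClipPath(const float xy[6], float width, float height,
                        float guard)
{
    int onScreen = 1;
    int inGuard = 1;
    assert(guard >= 0.0f);
    for (int i = 0; i < 3; i++) {
        const float *p = &xy[i * 2];
        onScreen &= insideRect(p, 0.0f, 0.0f, width, height);
        inGuard &= insideRect(p, -guard, -guard,
                              width + guard, height + guard);
    }
    if (onScreen) {
        return TRI_NO_CLIP;
    }
    return inGuard ? TRI_SCISSOR : TRI_CPU_CLIP;
}
```

When one neighbor returns TRI_SCISSOR and the other TRI_CPU_CLIP, the shared edge is rasterized from two different sets of endpoints, which is the slope mismatch described above.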