Unified nVidia TNT/GeForce driver for Haiku

Personal 3D 'weblog' news:

17 February 2008: 3D.. testing testing..


Just so you know: every now and then I'm fiddling with 3D support improvement. The high-level interface I wrote about two years ago is 'nothing more' than a new 3D command in the engines. This command is called TCL_PRIMITIVE_3D, and I'm testing it on a NV15.

I already noted some time ago that the command NV11_TCL_PRIMITIVE_3D is responding on my NV15: I can set the colorspace for the back colorbuffer. I can also set the Z buffer range on the card and apparently most other stuff without hanging the engine. The old 3D command keeps rendering nicely as always, so that's good I think.

Just one thing is hanging the engine: setting the 2D window position. Well, I'll leave that disabled for now since it's working anyway on this card.

Tonight I find myself at the spot where I need to actually try to render using the new command. Which means I'll now shut down the old 3D render command and test the new one. The coming time that is..

I have to admit I'm feeling excited about this. Will I actually be able to get this thing going or not? And: will it improve speed for Q2 and the teapot? And after it works on NV15: could NV20/NV30 be made to work?

Interesting stuff. Anyhow: little time to work in, lots to do. (as usual ;-)



15 April 2006 (expanded 17 April 2006): 3D driver Alpha 4.1 / 2D driver 0.80 benchmarks.

As promised, here's a new table with current 3D driver speeds. These speeds were taken with 3D driver Alpha4.1 combined with 2D driver 0.80: the latest BeOS release. The speeds are compared to Windows98 second edition speeds using a detonator driver. The detonator driver was set up to match the BeOS setup as closely as possible: this way a speed comparison should be more or less 'fair'.

If you look below at the table you'll find that more cards were tested than the BeOS driver currently supports: I was curious to know a bit more about the relative speeds between 'newer' architecture cards.
Maybe you'll also notice that TNT1 cards are slower than in older driver versions: this is because they now work correctly. Earlier, many texturing errors showed up because the driver did not synchronize the acceleration engine to the 'live' texture swapping that takes place on cards with 16Mb RAM. Granted, 16Mb should (probably) be enough to run Quake 2 without swapping in the mode tested, but the current BeOS driver setup is not optimal yet concerning use of memory. On TNT1 cards you'd best run Quake 2 in 16 bit colordepth for 1024x768 resolution for now, as that mode doesn't need to use texture swapping with 16Mb RAM.
Another interesting thing is the Geforce 4 MX4000. This is the second tested card with a 64bit bus to its RAM (the other one is the TNT2-M64). I tested a 'noname' card, which was visible because of different things: the card's GPU is clocked at 275Mhz, which seems to be the official speed: my Geforce 4 MX440 runs at the same GPU clock at least. Well, because of the RAM speed related findings, I did a few extra speed tests with this card, of which you'll find the results in the table below. It's interesting to see that the card was fully stable when I clocked its RAM at 320Mhz: I tested for some 15 minutes or so. Please note that I do not advise you to overclock cards: that might be dangerous to your card and whole system. Anyhow, I did not want to keep these interesting results from you.

Further below, there's a second table showing some speeds for Quake2 timedemo 1 on other (much slower) systems.

Table: alpha 4.1 Q2 speed on P4@2.8Ghz, FSB@533Mhz; at 1024x768x32@75Hz.

Card under test:                        BeOS Alpha 4.1:               Windows 98SE:       BeOS/Windows:

TNT1 PCI, 16Mb (NV04)                   9.3 fps (tex swapping)        -- (not supported)
                                        23.2 fps @16bit color         -- (not supported)
TNT2-M64, 32Mb (NV05M64)                10.2 fps                      21.9 fps            47 %
TNT2-pro, 32Mb (NV05)                   17.0 fps                      41.5 fps            41 %
GeForce2 MX400, 32Mb (NV11)             27.1 fps                      86 fps              32 %
GeForce2 Ti, 64Mb (NV15)                45.6 fps                      165 fps             28 %
GeForce4 MX440 AGP8x-type, 64Mb (NV18)  37.0 fps                      119 fps             31 %
GeForce4 MX4000, 128Mb (NV18)           20.8 fps                      76 fps              27 %
                                        22.3 fps coldstarted          -- (not tested)
                                        28.0 fps RAM overclocked 20%  -- (not tested)
GeForce4 Ti 4200, 128Mb (NV25)          -- (not supported)            250 fps             --
GeForceFX 5200, 128Mb (NV34)            -- (not supported)            130 fps             --

The second table (below) shows some interesting things as well. But first please note that the speeds on the dual P3 are just a tiny bit above speeds you'd note for a single P3 of the same speed: openGL is executed single threaded, which means the second CPU doesn't do any work on it. It just handles other OS tasks, which relieves the first processor from that (relatively small) burden.
Interesting things are: Also, personally, I was very pleased with the relatively good speeds I saw on that old P2 system I sometimes have access to. All systems' results combined, I'd say Alpha 4.1 is a nice release.. :-)
Well, that's it for now. Have fun!

Table: alpha 4.1 Q2 speed on other systems.

System under test:                                                    640x480, 16bit @60Hz:   1024x768, 32bit @60Hz:

P2-350Mhz, fsb@100Mhz, TNT2-ultra, 32Mb (NV05), dano                  29.0 fps                21.3 fps
dual P3-500Mhz, fsb@100Mhz, TNT1, 16Mb (NV04), R5.0.1 pro             33.0 fps                6.2 fps (tex swapping)
                                                                                              21.4 fps @ 16bit color
dual P3-500Mhz, fsb@100Mhz, Geforce 2MX400, 32Mb (NV11), R5.0.1 pro   35.0 fps                24.8 fps

7 April 2006: 3D rendering speed now up to twice as fast!

As some of you already know, 3D rendering speed went up by a factor of 1.4-2.0 on all supported GeForce style cards. TNT2(M64) cards render 1-4% faster, which is not very impressive compared to the GeForce speedup, but nevertheless noticeable on some occasions.

The 3D driver renders at approx. 40% of the Windows driver speed for TNTx style cards, if the Windows driver uses the same setup as our BeOS driver: blits for swapbuffers, 16-bit texturing and disabled retrace-sync. For GeForce style cards, the driver now runs at approx. 30% of the Windows driver speed.

Needless to say, I am very happy with the big improvement in speed we now see. If you compare different cards for speed using Windows, and you put the results of that next to a comparison in speed using BeOS: you'll find that the relative scores are starting to match! In other words, if a certain card performs fastest in Windows, it will now also perform fastest in BeOS..

Hey, what happened?
Well, after finding the results of the delta-speed test I did, I (once again) did a sweep of the engine init code in the 2D driver. Only this time, much more detailed than ever before. And then I found what I needed: just a few tiny bits needed toggling to get the new speeds! Don't really know what they do though.. We had sub-optimal settings, especially for GeForce cards. When you think about it, it could have been expected, as the UtahGLX driver my work is based on was very old: first created probably in the pre-GeForce era.

Anyhow, I did the sweep on all cards I have here, so TNT1, TNT2, TNT2-M64, GeForce2MX, GeForce2Ti, GeForce 4MX440, and GeForce 4MX4000. I am very sure the init setup code is now optimal speedwise. Of course I also looked at degrading quality: but that I didn't see. This combined with the speed comparison against Windows (for all those cards) leads me to believe this should be perfectly OK now.

So, thinking about how the engine's inner workings are: those ROPs in there (parallel raster outputs) are all working already. So two on most cards, and four on the GeForce 2Ti cards. The new speed comparisons on BeOS confirm this (more or less).

I'll release a new 2D driver asap: probably current SVN version 0.79. I'll also recompile the Alpha4 3D add-on against it and publish the combination as well as Alpha 4.1. Note: the 3D driver has not changed one bit, it's just that the shared_info in the 2D driver was expanded with a new (unrelated) nv.setting: that's why the recompile is needed.

Furthermore I'll publish a new benchmark table with all those results on both Windows and BeOS. Still done with Quake2 timedemo 1 (with sound running), as I always did: you can easily compare it to those old Alpha 2 benchmarking results I published.

The new benchmark table will be on this page (3D news page).

So, why are we not yet at 100% of 'Windows' speed?
Well, there are a number of things to account for that.

1. The acceleration engines are still idle at times during rendering, even on the faster CPU systems. This has to do with the bursting nature of those vertices being sent, and can only be fixed by using a higher-level interface to Mesa. Luckily Mesa3.4.2 turns out to have this interface, so I'll try to interface to it for a later Alpha release. This interface sends entire GLbegin()/GLend() sequences in one call to the driver, instead of the current one-triangle-per-call system. Furthermore this higher-level interface is needed for current Mesa as well, so interfacing to that would be a 'translation step' for me: which is good. :-)

Having this higher-level interface in place might very well increase rendering speed by a factor of 1.5 or so: even (much) more so on slower CPU systems. Of course I'm guessing a bit about the possible speed gain at this point.

2. On TNT style cards, no more gain is possible: this high-level Mesa interface needs to be translated down to the current low-level hardware interface: sending separate triangles.

On GeForce style cards however, this higher-level interface Mesa has is also supported in the hardware! This means that in theory it's possible to tell the engine what type of primitives we want to render (at GLbegin() time), like separate triangles, tri-strips, tri-fans, and all other styles GLbegin() supports. The nice thing about this is of course that we don't need to send separate triangles to the engine, but just the vertices needed to describe the primitive we want to render (so literally like the app does that sends the commands).

Seems that this would save the driver sending tremendous amounts of redundant vertices: after all, in a real scene a lot of triangles have sides in common.

(Example: a quad. Sending separate triangles means we need to send 6 vertices, while a quad by itself only needs 4 vertices to describe it.)

(Example 2: a cube. There are 8 vertices needed to describe it. Now break it down into separate triangles, and we'd need 36. Right?)
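
The two examples can be checked with a few lines of Python. This is only a counting sketch, not driver code: the function names are made up for the illustration, and the strip count assumes the usual rule that every triangle after the first reuses the previous two vertices.

```python
# Illustrative only: counts how many vertices must be sent to the engine
# for the same geometry, depending on the primitive type used.

def separate_triangles(num_triangles):
    """Every triangle is sent on its own: 3 vertices each."""
    return 3 * num_triangles

def triangle_strip(num_triangles):
    """In a strip each new triangle reuses the previous two vertices."""
    return num_triangles + 2 if num_triangles > 0 else 0

# A quad is two triangles.
quad_separate = separate_triangles(2)
quad_as_strip = triangle_strip(2)

# A cube: 6 faces, 2 triangles per face, but only 8 distinct corners.
cube_separate = separate_triangles(6 * 2)
cube_corners = len({(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)})

print(quad_separate, quad_as_strip, cube_separate, cube_corners)  # 6 4 36 8
```

For the cube that is 36 vertices sent versus 8 needed to describe it, which is where the redundancy mentioned above comes from.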

Well, this would increase rendering speed: can't miss. And the (very) good news is, that we might even be able to get this hardware interface up and running! Thanks to a very new open nVidia DRI driver attempt that is...

How about maximizing effective use of the RAM-datapath-width?
We talked about this a bit: about how fetching a single point (pixel) would waste valuable bandwidth, and how a crossbar memory controller could improve effectiveness by using smaller 'lanes'. Remember? Well, that crossbar controller was invented for GeForce3 and later: so we can't be suffering from that.
No. It turns out that it's like this:
- the hardware already maximises effective use of bandwidth! After all, we are not drawing single pixels, we are drawing 'entire' triangles! These consist of a large number of pixels (most of the time), so the engine can 'auto-optimize' access by doing parallel pixel rendering!
- a crossbar memory controller only comes in handy when you render lots of very small triangles: here the internal engine parallelisation fails (we deal with just one, or a few pixels). So, this crossbar controller is needed for next-generation games, where many more (smaller) triangles make up a scene. Makes all sense now, no?

So, we don't have more bottlenecks than described just now (those two), apart from probably some extra Mesa software overhead caused by the fact that we will 'never' be able to utilize all hardware tweaks and features that could exist in the cards: lack of docs (and time), as usual.
Personally, I can live with that. I mean, if those two bottlenecks could be fixed, I'd be very satisfied. Ah, I'm glad with what we have already now as well... ;-)

Have fun!

3 April 2006: RAM access bandwidth test for 3D in nVidia driver

I promised to tell you about that 'delta speed test' I did with the 3D driver, which I did to get a better idea about how fast the RAM could be actually accessed by the 3D part of the acceleration engine. I considered this interesting because the 3D driver was rather slow compared to its Windows counterpart.

So how did I test?
It was rather simple, really. In the previous post I already 'calculated' some numbers for RAM bandwidth, and for the bandwidth needed by the CRTC to fetch the data to show us the memory content on the monitor. So I thought, if I can ascertain how much fps is gained by stopping CRTC accesses to the memory, I also know how much fps is theoretically feasible if the engine could use the complete bandwidth.

In a formula
(total RAM bandwidth / needed bandwidth by CRTC access) * (fps without CRTC accessing memory - fps with CRTC accessing memory) = nominal fps rate possible.

The setup
Just modify the 2D driver to enter DPMS sleep mode as soon as the cursor is turned off, and enter DPMS on mode when the cursor is turned back on. DPMS sleep mode in fact sets the CRTC in a 'reset' state since it's not required to fetch any data anymore: we won't be looking at it anyway (monitor is shut off). So this DPMS sleep state gives the memory bandwidth otherwise used by the CRTC back to the 3D engine.

- you can start Quake2 from a terminal 'command line', including instructing it to do a timedemo test,
- you see the fps results back on the command line after game quitting and,
- starting Quake2 turns off the driver's hardware cursor, while stopping it turns it back on,
This setup will work nicely.

Result for the Geforce4MX, NV18
(6000 / 225) * (29.9 - 26.3) = 96 fps. (Windows measured value = 119 fps)
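
The formula and the NV18 result can be reproduced with a short Python sketch. The function name and parameter names are invented for this illustration; the numbers are the ones given above (bandwidths in Mb/sec).

```python
def nominal_fps(total_bw, crtc_bw, fps_crtc_off, fps_crtc_on):
    """Extrapolate the fps feasible if the engine had the full RAM bandwidth.

    total_bw and crtc_bw must use the same unit (Mb/sec here).
    """
    return (total_bw / crtc_bw) * (fps_crtc_off - fps_crtc_on)

# Numbers from the NV18 test above.
estimate = nominal_fps(6000, 225, 29.9, 26.3)
print(round(estimate))  # 96
```

The estimate scales the fps gained by freeing the CRTC's share of the bandwidth up to the whole bandwidth, so it is only as good as the assumption that fps grows linearly with available bandwidth.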

Well, for my taste this proves enough already that RAM access is actually OK, and the fault would not be low clocked RAM or something like that. So the reason for slow fps on BeOS should be found somewhere else.

Why is the delta speed with the CRTC test actually OK, while the total rendering speed is much too low? Well, it's interesting to realize that CRTC accesses are spread evenly through time, while 3D engine memory access requests are of a bursting nature.

The conclusion would be that there's some bottleneck in the GPU somehow after all... And with this new knowledge I went to sleep, not knowing yet what to make of this new information.

25 March 2006: 3D add-on Alpha 4 released..


Well, it's done: Alpha 4 is out there. I had to do a lot more work than I anticipated, hence the delay. But it's well worth it I think: a lot more bugs were solved. In other words, Quake 1 and 2 showed more rendering errors after all, just less easy to see for someone not running games much (yet :).

The driver entry on bebits contains a list of all errors solved, so just have a look there for the nifty details, or just download Alpha 4 and read the included HTML file (which contains the same list plus updated application running status info).

What can I add to that info?
Well, rendering speed is higher than ever before, although still slow compared to windows and linux closed source drivers. But I mentioned that already as well. I'll try to do a new benchmark using Alpha 4 to give you the current detailed results: that way you can (finally) compare it to old Alpha 2 speeds I once gave.

GPU and RAM speed (overclocking and bottlenecks)
Let's talk a bit about rendering speed and bottlenecks. I spent a lot of time to find out why the speed is so much slower than on Windows and Linux closed-source drivers. I also tried to get NV20 and higher going once more. I did not find the solution to either problem, but I learned more about the cards and windows drivers in the meantime: who knows, it might help one day.

Anyway, one of the things I did was add tweaking options for GPU and RAM clocking speed in the 2D driver. Of course the 3D driver also benefits from this, and that was the intended result.
I did a test on my P4 at 2.8Ghz/533Mhz fsb, using the GeForce 4MX 440. This card has BIOS settings 275Mhz clock for GPU and 400Mhz for RAM. Coldstarting the card revealed that these speeds are actually programmed.
Here's the result of testing GPU speed with RAM speed at default (400Mhz):

Table: GPU speed versus Q2 timedemo1 fps (Alpha 4, RAM @ 400Mhz).

GPU speed:      800x600, 16bit @ 60Hz:

50 Mhz          38.8 fps
100 Mhz         66.2 fps
150 Mhz         78.8 fps
200 Mhz         82.8 fps
275 Mhz         86.0 fps

I find this interesting: doubling the GPU speed did NOT double the rendering speed. Now look at RAM testing with GPU at default (275Mhz):

Table: RAM speed versus Q2 timedemo1 fps (Alpha 4, GPU @ 275Mhz).

RAM speed:      800x600, 16bit @ 60Hz:

100 Mhz         engine hang
150 Mhz         engine hang
200 Mhz         engine hang
250 Mhz         55.7 fps
300 Mhz         66.4 fps
350 Mhz         76.4 fps
400 Mhz         86.1 fps
450 Mhz         92.9 fps (overclocking 12%!)

Interesting here is the fact that increasing the RAM speed by a certain percentage increases fps by roughly the same percentage! When we combine both tables, we can conclude that the RAM access speed is the bottleneck, not the GPU speed. When you benchmark Q2 some more using different settings for texture filtering this conclusion remains intact: fps is not influenced one bit depending on filtering. At least, the GPU doesn't care.
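
To put numbers on that comparison, here is a small Python sketch that works out the relative gains from the two tables. The fps figures are copied from the tables above; the `gain` helper is just a name chosen for this illustration.

```python
# fps figures copied from the two tables above (800x600, 16 bit).
gpu_fps = {50: 38.8, 100: 66.2, 150: 78.8, 200: 82.8, 275: 86.0}
ram_fps = {250: 55.7, 300: 66.4, 350: 76.4, 400: 86.1, 450: 92.9}

def gain(table, clock_from, clock_to):
    """Return (clock increase, fps increase) as percentages."""
    clock_gain = (clock_to / clock_from - 1) * 100
    fps_gain = (table[clock_to] / table[clock_from] - 1) * 100
    return clock_gain, fps_gain

for name, table, start, end in (("GPU", gpu_fps, 100, 200),
                                ("RAM", ram_fps, 300, 400)):
    clock_gain, fps_gain = gain(table, start, end)
    print(f"{name} {start}->{end} Mhz: clock +{clock_gain:.0f}%, fps +{fps_gain:.0f}%")
```

Doubling the GPU clock from 100 to 200 Mhz gives about 25% more fps, while 33% more RAM clock (300 to 400 Mhz) gives about 30% more fps: the RAM figures track the clock much more closely.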

About the engine hangs at low speeds: most RAM used on graphics cards is of the dynamic type: it must be refreshed within a certain amount of time to keep its content. When that's done too slowly the content gets damaged: hence trouble.

So, why is RAM access the bottleneck?

background: RAM bandwidth considerations
So how much data can be transferred with RAM anyway? Well, raw speeds look like about this.
The NV18 (most cards still, as a matter of fact) has a 128bit wide path between GPU and RAM. This means that per clock-cycle (SD-RAM) 128/8=16 bytes of data are transferred. If we have a clock of 400Mhz, that means 400,000,000 * 16 bytes = 6.4Gbytes/second can be transferred.
We need to deduct some room for refresh cycles, so let's say for argument's sake we keep about 6Gbytes/sec bandwidth for our card's functions.

So which functions are running? Well, we need to send RAM content to the screen (monitor). In 1024x768x32 mode at 75Hz refresh that means 1024*768*4*75 = 225Mb/sec are transferred.
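
Both figures are simple arithmetic and can be reproduced in Python. This is a sketch under the same assumptions as the text (one 128bit transfer per clock cycle at 400Mhz, 4 bytes per pixel); the constant names are invented for the example.

```python
BUS_WIDTH_BITS = 128          # width of the GPU<->RAM path on the NV18
RAM_CLOCK_HZ = 400_000_000    # 400 Mhz, one transfer per clock assumed

bytes_per_clock = BUS_WIDTH_BITS // 8
peak_bytes_per_sec = RAM_CLOCK_HZ * bytes_per_clock

# CRTC scan-out for 1024x768, 32 bit (4 bytes per pixel), 75 Hz refresh.
crtc_bytes_per_sec = 1024 * 768 * 4 * 75

print(peak_bytes_per_sec / 1e9)            # 6.4   (Gbytes/sec)
print(crtc_bytes_per_sec / (1024 * 1024))  # 225.0 (Mb/sec)
```

Note that the 225 figure comes out when a Mb is counted as 1024 * 1024 bytes, while the 6.4 figure counts a Gbyte as 10^9 bytes.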

This leaves some 5.8Gbytes/sec bandwidth for the GPU accesses and CPU accesses combined. Note that when you run for instance Quake2 timedemo, at first the textures are loaded into the cardRAM, and then the demo starts running. Running the demo no longer transfers data between the host system (CPU) and cardRAM: everything needed is already there. Apart from the actual rendering commands that is, but these are a relatively small amount of data, which resides in main system RAM and is fetched by the GPU directly (AGP DMA accesses). So these commands don't load the RAM bandwidth. Just the GPU.

Bottleneck identification?
One serious 'problem' with these calculations is the fact that the GPU does not always need chunks of 16 bytes (128bits, the width of the datapath, data transferred in one clock-cycle). If you render using a 16-bit Z-buffer, and some serious hardware access optimisation doesn't exist, those two bytes will cost 16 bytes worth of bandwidth. In other words: these accesses run at 2/16 = 12.5% of maximum speed. For a 32bit colorbuffer, this would be 4/16 = 25% speed.
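
The same reasoning as a small Python sketch. It models the worst case described above, in which every access fetches a full bus width no matter how few bytes are needed; the function name is made up for the illustration and says nothing about what the hardware really does.

```python
BUS_BYTES = 16   # 128 bit datapath: 16 bytes per clock cycle

def bus_efficiency(bytes_needed):
    """Useful fraction of the bus when each access costs whole transfers."""
    transfers = -(-bytes_needed // BUS_BYTES)   # ceiling division
    return bytes_needed / (transfers * BUS_BYTES)

print(bus_efficiency(2))    # 16 bit Z-buffer value: 0.125
print(bus_efficiency(4))    # 32 bit color value:    0.25
print(bus_efficiency(16))   # a full-width access:   1.0
```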

This could be what we are looking at. Unfortunately, I don't have a clue how to engage optimisation in the GPU for this kind of stuff: the crossbar memory controller (if it exists in those cards, this I should check in the coarse specs from nVidia). This piece of hardware is capable of splitting up those 16 bytes into separate smaller lanes so to speak.

On the other hand, this same problem should exist on TNT class cards: but we are running at relatively high speeds there already compared to GeForce class cards. If you look at the windows driver results. But then again, there might be completely other reasons for that. It remains sort of guessing.

So, how are those speeds again? I'll sum a bit up once more (for the P4 2.8Ghz system at 1024x768x32 mode):

Table: Windows speed (blit-swapbuffer function forced, 16-bit textures) versus BeOS speed.

Card:                   Windows:        BeOS Alpha 4:

TNT2 (ASUS V3800)       41.3 fps        15.6 fps
GF2MX400                86.0 fps        not tested
GF2Ti                   165 fps         not tested
GF4MX440                119 fps         26.3 fps

So, the TNT2 works at 15.6/41.3 = 38% of max. speed. The GF4MX440 at 26.3/119 = 22% of max speed. Both cards have 128 bit wide buses by the way.
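
The two percentages follow directly from the table; a short Python sketch (the helper name is invented for this example):

```python
# (Windows fps, BeOS Alpha 4 fps) from the table above.
results = {
    "TNT2 (ASUS V3800)": (41.3, 15.6),
    "GF4MX440": (119.0, 26.3),
}

def relative_speed(windows_fps, beos_fps):
    """BeOS speed as a percentage of the Windows driver speed."""
    return 100.0 * beos_fps / windows_fps

for card, (win, beos) in results.items():
    print(f"{card}: {relative_speed(win, beos):.0f}%")
```

This prints 38% for the TNT2 and 22% for the GF4MX440.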
Note to self: so geforce class cards seem to be running at relatively 50% speed of the TNT class cards. Do we need to enable DDR (double data rate) explicitly?? At least more cards should be compared for this. I seem to remember the GF2Ti running at some 23fps on BeOS, which would be relatively much slower than the GF4MX440.

Indications for actual RAM access bandwidth on BeOS
I had in mind to give you the result of an interesting delta-speed test I did on BeOS, but that will have to wait until another day. Time is up for now. But I'll post it as next item, I promise. It's very much related to the above story after all!

In the meantime: have fun with 3D! It seems we can run Quake1,2 and 3 all on BeOS now. With acceleration...

Signing off. Good night :-)

25 February 2006: nVidia 3D status update

High time for a status update I'd say. A lot has happened.

Resizing output windows and another BeOS bug
This function is implemented and working reliably now. However, that took some time: there turns out to be another BeOS (R5 and dano) bug here: when you enable Window resizing events in the constructor of the BGLview, the corresponding routines are called: but the new size given to the resize routine is NOT the new one! Instead the one before the latest one is given.. Needless to say it took a lot of testing and trying to recognize this bug. By the time I knew what was going wrong, I saw a workaround in Be's teapot code: not using the given size, but asking for it itself.
After I added this workaround to the driver as well, resizing worked correctly. But not before I added LockLooper()/UnlockLooper() calls to the routines LockGL() and UnlockGL(): this was the trouble I saw about resizing being asynchronous to rendering. So not a Mesa problem, but 'my' problem: syncing threads. Fortunately this solution was already there in current Mesa, so I just copied it over to my driver.

Resizing sometimes still temporarily distorts the rendered output, but that's not very important. Although it could look better of course. The driver is not calling Mesa GLviewport or resize_buffer functions: that's a task for the application programmer (as can be seen in various code examples out there). The driver only takes care of programming the card right. It's nice to see a very large Teapot spin around, and to recognize the speed effects of resizing it. :-)

Trilinear filtering: 'random' engine crashes
As usual (so I can say these days :-) BeOS Mr.X is 'betatesting' the driver for me. As he has a lot of knowledge about the Quake series of games and the console commands you can give them, he finds bugs rather easily compared to myself. One of the problems he encountered was that using GL_LINEAR_MIPMAP_LINEAR texture rendering in Quake2 hung the driver. After two days of searching I discovered what was wrong: the original UtahGLX driver I based my work on contained an error: the rendering quality setting was increased 'by one step' when rendering switched from textured rendering to non-textured rendering. Of course the driver meant to set a basic 'level 1' as textures aren't actually used then.

The driver crashed the acceleration engine because it was fed with an illegal setup: a non-existing filtering mode.
So: alpha 4 will have this problem fixed. And the really funny part is that I didn't even know the driver supported filtering in this area... But it does: you can choose no filtering, bilinear filtering, and trilinear filtering: all just a setting fed into the engine. Check out the gl_texturemode console command for Quake1 and Quake2 if you are interested.

Quake1 touching keyboard hangs game.
Another interesting thing to have a look at. Since the sources of GLquake are out there, I made it compile on R5 and dano. That turned out to be a relatively easy thing to do. After having the game run, I could now go on a bughunt for this error. It turns out that the keyboard routine is protected by a semaphore, which is also grabbed when a new frame is rendered. This is a bad situation apparently, as rendering stops completely when the keyboard was touched. I could fix it anyway by modifying the game executable a bit: acquiring/releasing the semaphore as close to actual rendering as can be (inside the LockGL()/UnlockGL() part instead of outside of it). Still you see the keyboard lagging on high-res modes however. Anyway: it has to do. Unless someone else is going to have a better look of course. (Enable the networking support someone?)

With Be's software GL the game did not hang, but then, this renders rather slowly. I could prevent hanging in the accelerated driver by adding a snooze(100000)...
If someone knows of a driver-way to overcome this problem I am interested in knowing: after all R4.5's accelerated GL did not suffer from this symptom.

Quake1 rendering faults.
While I was playing around with Q1, I decided to try to find the reason for the wrongly rendered game scores at the mid-bottom of Quake1's screen. I found it after another day or two: yet another 'original' UtahGLX driver bug. It forgot to send the active texture's colorspace to the engine when a new one was activated...
Oh, doing a timerefresh in Q1 console was strange to see as well: it rendered in the foreground buffer. As the nVidia driver doesn't support that yet, I modified this command to use the backbuffer so it's accelerated: it's nasty to have to wait for 128 frames when rendering a single one costs about a second or so.

Anyway: it makes more sense as well to have it render in the backbuffer as otherwise we would always see the 'distortions' of the engine building up the frames in plain sight.

Next up..
So: all in all all rendering faults for both Quake1 and Quake2 have been solved. I guess I should release an updated version of Quake1 to bebits (including source) so this game can be played using openGL once more on BeOS. It would be nice if someone could take it from there to update it for networking and such I'd say.

OK, back to work. I want to have a look at switching resolutions inside Quake2 (which only works partly for some reason), and then comes the 'real' swapbuffer thing. Apart from these items the driver seems ready for a new release.

Talk to you later!

18 February 2006: back on Mesa 3.4.2

While searching for a good solution for that delayed swap I mentioned to solve drawing errors, I once again compared Mesa 3.2.1 and 3.4.2. One of the differences between them turns out to be the added Mesa feature to complete pending drawing commands right before a driver issues the swapbuffer() command.
In other words, I switched back to Mesa 3.4.2, as that solves the drawing problem neatly if I add executing that Mesa internal command in the driver's swapbuffer function.

Mesa speed
The reason for me to fall back to the older Mesa before (alpha 3.5) was the apparent lower rendering speed the newer version had. Luckily it turns out I made a mistake myself: I forgot to enable hardware accelerated Z-buffer clearing! Once I enabled that, speed came a lot closer to Mesa 3.2.1's. Mesa 3.4.2 is indeed a bit slower, but just a tiny bit now.
Of course, I wanted to see if I could at least come up with the old speed, so I started looking once more at the hardware rendering commands hoping to find something I could optimize a bit more. Well, I found something indeed. Instead of issuing the vertexes and drawing commands separately, I now issue them in one single burst of writes into the DMA cmd buffer. Also I use other vertex offsets in the engine, so the last vertex written automatically points me at the first drawing command entry. This saves overhead in explicitly setting engine register pointers in there, increasing rendering speed a bit (5-10% fewer words written into the DMA cmd buffer). If only Mesa could send 4 points, 4 lines and 5 triangles in one call... then I could increase the burst to its max, increasing speed by another few percent, and save a lot of software routine calling overhead.

Vertical retrace sync
In the meantime I added a new nv.settings option in the 2D driver called 'force_sync'. This option is now taken by the 3D accelerant to enforce retrace syncing for the swapbuffers command. With a small addition to the 2D driver's acc engine init code, we can now instruct the acceleration engine to wait for a retrace to occur before continuing execution of commands. This is a bit nicer than explicitly waiting for a retrace (by CPU), as this enables us to keep sending rendering commands to the engine while that engine waits for the retrace. One of the important things for optimum fps rates is that we try to keep the engine's DMA command buffer 'filled' at all times... The app should stay ahead: filling the buffer faster than the engine empties and executes it.

This engine wait exists in NV11 and newer cards, so the driver autoselects the retrace sync method depending on card architecture.

Swapbuffers: swapping instead of blitting (copying)
Adding real swapping turns out to be a challenge! The reason for wanting to add this function is to remove some of the acc engine load (used for blitting), so that space-in-time becomes available for 3D acceleration.
Unfortunately, swapping requires a sync to retrace: the CRTC (cathode ray tube controller) part of the GPU cannot do a swap at all times, as it's very busy with data fetching while the screen is drawn. You need to issue such a swap during retraces, otherwise the point in time the actual switch occurs cannot be guaranteed (over here it typically delays some 100 'lines': this register holding the pointer is doublebuffered in hardware apparently).
Well, you guessed it: if we have to wait for a retrace, then we waste valuable GPU time we could otherwise use for 3D rendering! This contradicts our goal of course... In effect we lose speed instead of gaining it.

So: is there a solution to this problem? Yes, there is I think (I still need to check this out though). On BeOS, all 3D rendering uses double buffering. We have the 'frontbuffer' and the 'backbuffer': two buffers. You could switch between those once rendering to the backbuffer is done. The backbuffer then becomes the frontbuffer and vice versa. The next frame will be rendered to the old frontbuffer, now being the backbuffer. You see the problem: we need to wait until the CRTC actually displays the new frontbuffer, before we can delete and render in the old frontbuffer. The solution presents itself: setup triple buffering. When we use that we don't really care when exactly the CRTC switches to the new buffer, we leave the old one alone anyway! Instead we use the third buffer to render into... Of course we need to wait anyway if the CRTC needs more time to switch than we need to render one frame: I don't know yet how to sync rendering to this limitation.
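
The buffer rotation described above can be modelled in a few lines of Python. This is a toy model only: the class and its method names are invented for the illustration, real hardware flips are asynchronous, and the 'wait' is simply simulated by forcing the retrace.

```python
class TripleBuffer:
    """Toy model: three buffers, the CRTC only flips during a retrace."""

    def __init__(self):
        self.front = 0      # buffer the CRTC is displaying
        self.queued = None  # finished frame waiting for the next retrace
        self.back = 1       # buffer currently being rendered into

    def retrace(self):
        """Vertical retrace: the CRTC switches to the queued frame, if any."""
        if self.queued is not None:
            self.front, self.queued = self.queued, None

    def swap(self):
        """Called when rendering of the back buffer is done.

        Returns True if the renderer could continue at once, False if it
        had to wait for a retrace because a frame was still queued.
        """
        no_wait = self.queued is None
        if not no_wait:
            self.retrace()  # stands in for blocking until the retrace
        self.queued = self.back
        # The free buffer is the one that is neither displayed nor queued.
        self.back = ({0, 1, 2} - {self.front, self.queued}).pop()
        return no_wait


buffers = TripleBuffer()
print(buffers.swap())   # True: the third buffer is free, no waiting
print(buffers.swap())   # False: rendered faster than the CRTC could flip
```

The second call shows the remaining limitation mentioned above: if a frame is finished before the CRTC has switched to the previous one, the renderer still has to wait.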

All in all I don't know yet if swapping will make it into alpha 4: I want to see the thing actually work OK before I promise that. Oh, this solution has a downside as well (of course): we need extra graphics memory to hold the third buffer. Though that is no real problem in practice.

Resizing buffers (viewports, the output window)
This is another subject I am once again putting time into. The last action I took (late yesterday evening) was determining that we have a hardware (sync) problem here! This is good news for me, as I should be able to fix that. Once I do I can retest the Mesa internal function for resizing buffers (and resizing viewports). Looks like Mesa deserves (much) more credit than I initially gave it: it's well thought out, internal sync wise. I was under the impression that hangs and render errors could occur with out-of-sync events like resizing output windows, but it might well be that this is not the case at all. I'm a happy camper.

If I can find a solution to the hardware problem, and Mesa's function for resizing works well enough (under heavy acc engine load): I'll enable driver support for resizing. This means the teapot can be stretched etc via resizing the window it spins in (unlike having the repeating pots pattern you see now). Also this means that apps that initially create very small buffers and resize later will work (without modification) now.

Well, it looks like this will be an important update: Alpha4. Although in effect the speed won't differ that much (up to 10% speed gain depending on CPU and GPU power: P4-2800, NV18 +10%, dualP3-500 NV11 +3%), there are a lot of visible bugfixes concerning rendering.

I hope this will stimulate people to do some more 3D apps... ;-)

12 February 2006: Mesa 3.2.1 and accelerated 3D on nVidia (again)

While working on updating the nVidia 2D driver for a new BeBits release, I decided to clean up for 3D support. I tested a lot of register configuration settings with help of two cards in a system: I could never test as speedily as these days (no longer any reboots required).

Some 'nonsense' 3D setup items were removed, and I could also find one point where more speed could be gained for 3D: I modified the rendering output colorspace from some 'special' type with different input/output spaces, to 'standard' ARGB32. This apparently means less drawing overhead, which led to an 11% rendering speedup in B_RGB32 space on my P4-2.8Ghz using the NV18 card. On my dualP3-500 with NV11 a 7% speedup was still gained. 15 and 16 bit spaces are unmodified speed-wise.

While I was very nearby the 3D subject again, I wandered off more into the 3D accelerant and Mesa3.2.1. I ended up trying to setup a real swapbuffer command (instead of using blitting), which could give us another 5% speedgain for all spaces in fullscreen modes. It still doesn't work correctly (I am thinking another Mesa bug..), but it gave me an interesting view on the rendering behind-the-scenes (seeing a scene being constructed in the backbuffer).

It became apparent to me that, although we have several drawing errors with Quake2 (missing texts, missing parts of text, missing bitmaps, intermittently missing crosshair and scores), these items were drawn nonetheless! As it turns out, the normally visible rendered buffer (with the just mentioned errors) is rendered in the background, then swapped on-screen, and then the missing pieces are rendered in the now obsolete background! Of course they are never shown, as after this final rendering part, the erase buffers command comes up...

So, I tried a delayed swap, and YES!! Q2 renders without any drawing fault (32bit mode atm)!
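The effect can be mimicked with a toy model. The sketch below (Python, purely illustrative: the command names and the function are invented, and the real driver of course works on hardware buffers rather than lists) replays a command stream in which some items are drawn after the swap was requested. With an immediate swap those late items never reach the screen; when the swap is postponed until the next erase they do.

```python
def run_frames(commands, delay_swap):
    """Replay a command stream and return the contents of each shown frame."""
    back = []             # what has been drawn into the backbuffer so far
    shown = []            # one entry per frame that reached the screen
    swap_pending = False

    for cmd in commands:
        if cmd == "swap":
            if delay_swap:
                swap_pending = True        # remember it, but keep drawing
            else:
                shown.append(list(back))   # put the buffer on screen now
        elif cmd == "clear":
            if swap_pending:
                shown.append(list(back))   # late items are in by now
                swap_pending = False
            back = []                      # erase buffers for the next frame
        else:
            back.append(cmd)               # any drawing command
    if swap_pending:
        shown.append(list(back))           # flush the last frame
    return shown


if __name__ == "__main__":
    stream = ["clear", "world", "swap", "hud"] * 2
    print(run_frames(stream, delay_swap=False))
    print(run_frames(stream, delay_swap=True))
```

Here "hud" stands in for the texts, crosshair and scores that went missing.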

This sudden success means I'll put some more time in the alpha3.5 3D add-on, and see what I can do to modify it for general improvement here: I am hoping I can get this incorporated so that it's still useable with other 3D apps as well. I will update the BeBits entry to be Alpha4 after I release the 'current' 2D driver, hopefully with both the 'perfect draw' and the fullscreen swap function in place: all in all for instance 1024x768x32 mode in Q2 would go up from 22 to 28fps then, and without drawing errors anymore.

Well, 'back to work'. Talk to you later!

12 February 2006: Temporary revival of development on nVidia Mesa 3.2.1 based 3D accelerant.

Currently I am finding myself back at 3D development for nVidia. This came about because I wanted to clean up the 2D driver a bit: since adding a new 2D accelerated function (scaled_filtered_blit) there I was forced to look in that cleanup direction. All in all 3D rendering speed in 32bit colorspace gained up to some 11%, and I am putting some time in a real 'swapbuffers' function to gain up to another 5% speed in higher res fullscreen modes. While working on that I stumbled on the reason for the Quake2 drawing errors, which I even have a workaround for now.

All this new stuff means I'll set up an Alpha 4 version of the driver with as many optimisations as I can get going. The 'current' version alpha 3.5 was just a recompile of alpha 3, only using Mesa 3.2.1 again instead of Mesa 3.4.2 because of speed issues.

I guess it's prudent once again to update this page every once in a while, and I also have a new link for you which points at a 'real' blog page now. I'll try to keep that updated as well. Have a look here for the blog. Talk to you later!

20 September 2005: The nVidia 3D driver development is on hold for now. Let me explain why.

As I already informed you, the Mesa interface to an (accelerated) graphics driver was changed drastically between Mesa versions 3.4.2 and 3.5. Since then, the newly introduced method is still in place (AFAIK), changing only relatively 'marginally'. So, I have to let the driver run in 6.2.1: there's no point in trying older versions anymore. While I already have the driver 'compiling' in 6.2.1, and clearbuffers / swapbuffers work accelerated, there is no actual rendering acceleration.

The past two months I have spent much time on trying to understand the new way of plugging in a HW driver. Slowly I am starting to see parts of it, but it's all so different that I fear I cannot make two steps at once: understanding the interface AND writing a new driver 'from scratch'. Unfortunately (AFAIK) there's no good documentation out there about this interface that would make my life much easier.

So, I have a new 'roadmap' for you: As I was already working on a VIA graphics driver, and a 3D add-on exists for it too (that is 'up to date')(AFAIK), it makes sense to me to take this 'detour'. I need to understand the current DRI drivers fully before I can complete the nVidia driver update. And DRI uses current Mesa (of course ;-).

Mesa 6.2.1 drivers

Well, I should probably say something about what changed in the driver interface. All stuff mentioned is of course my (limited) knowledge about it, and may be incorrect/incomplete.

In Mesa 3.4.2 and earlier, there was one way of interfacing to it from a driver's viewpoint. Mesa used some datastructures to store information, which a driver needed to 'translate' into a format which its hardware understood. Also (as far as I have seen) every primitive needed a separate 'call' to the driver: you could render points, lines, triangles and quads. If no hardware driver was available, then Mesa could render itself 'in software'. The software functions used the same interface that a hardware function would have.

Since Mesa 3.5, things seem to be much more nifty. Mesa still has internal software rendering (fallback) functions using (more or less) the same sort of interface to Mesa's core, but the interface to HW drivers is completely new. An important change is the fact that the datastructures used to hold information now exactly match the format of the hardware it is going to run on: the driver no longer needs to translate for rendering, it merely literally copies the data to the hardware engine (into its DMA command buffer (via 'DRM' if I am right)).
Mesa even provides (larger) parts of core HW driver functions now which a driver can just include. Because those core functions are HW dependent, a 'trick' is used: If in certain cases a function cannot be accelerated, the HW datastructures (HWvertex) used by the driver need to be 'translated' to SW datastructures (SWvertex) which the Mesa internal software rendering functions understand. So: translate and call SW function(s).

On top of all this, multiple primitives can be relayed to the driver now (instead of only single primitives in the early days). Linestrips for instance. This should massively reduce software overhead for calling the hardware rendering functions, and enable a driver to instruct for instance the nVidia engine to do a complete linestrip at once (can be placed in a single command in theory). Apart from making higher framerates possible on slower CPU systems, I can even see the HW rendering itself speed up because of this.
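A small counting exercise shows where the saving comes from. The sketch below is Python and purely illustrative: the function names are invented and do not correspond to the Mesa driver interface. It compares handing a connected line to the driver one segment at a time with handing it over as a single strip.

```python
def single_line_calls(points):
    """Old style: one driver call per segment, two vertices each."""
    return [(points[i], points[i + 1]) for i in range(len(points) - 1)]


def line_strip_calls(points):
    """New style: the whole strip goes to the driver in one call."""
    return [tuple(points)] if len(points) >= 2 else []


def vertices_sent(calls):
    """Total number of vertices handed to the driver."""
    return sum(len(call) for call in calls)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
    for name, calls in (("single lines", single_line_calls(pts)),
                        ("line strip", line_strip_calls(pts))):
        print(name, "calls:", len(calls), "vertices:", vertices_sent(calls))
```

For a strip of n points the call count drops from n-1 to 1, and the vertex count from 2(n-1) to n, since shared endpoints are no longer sent twice.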

Well, all in all I would find it quite interesting to see what results we would see using Mesa 6.2.1. But not just yet though, as I mentioned. Unless someone else creates a Mesa 6.2.1 nVidia driver of course. ;-)

VIA unichrome driver(s)

So VIA first now: or at least an attempt on it. As I promised Yellowtab anyway to do work on a VIA 2D driver, it all-in-all makes perfect sense to concentrate my efforts more on this. The 2D driver already exists (you can download the most recent versions from the Haiku build-factory), although it doesn't have 2D acceleration yet. It does set modes though, and it's gaining hardware overlay capability now (starting with V0.11). When overlay is complete I'll try to setup DMA 2D acceleration. As usual we'll just have to wait and see what I can pull off. Be warned though: I am taking my time. After all, it should remain something nice to work on...

This completes this news-entry here. I'll talk to you again I guess, but I can't yet say where that will be. I don't think I'll set up a VIA graphics driver entry on this homepage (unless 3D is going to work there): it just costs time to keep it up to date, while I even get spammed via bug report forms these days. I guess you can at least follow my commits to SVN via CIA.. :-)

12 August 2005: While I am running a Mesa 6.2.1 compile on my system (several times ;-), I am writing this short update. Since yesterday the 3D driver sits in the current stable Mesa version. Well, a big part of it at least: While the actual rendering functions file doesn't compile yet (riva_prim.cpp), and I also don't know exactly how to update it, the Alpha 4 driver to-be now does swapbuffers and clearbuffers accelerated by hardware (clearbuffers since two minutes :).

What I have to do now is find out how to plug in the 'T&L' render functions to Mesa, and plug in the texture stuff in Mesa. Both subsystems have changed considerably, so I can't say yet how much time this will cost, or even if I will be successful. From the looks of it, my best shot for the render functions might well be to NOT plug it into the HW T&L interface, but use the software interface instead (looks much more like it was in 'the old days'). The software part is called _swrast BTW. I need to do more research to see what my options are here.

While the texture files inside the driver compile OK, I can't activate them because the interface here has changed as well. Apparently less so though, seeing the successful compile. But anyway: I can't say what its current usability is/will be. From a pretest I did, it looks like I can test both functions (more or less) independently though, as keeping texturing disabled while inserting the render functions already accelerates and shows you some figures in Quake 2 (opponent-character-forms but not rooms for instance).

Interesting as well, was plugging the Alpha 3 driver in Mesa 3.2.1 to see what it would do to speed. Quake2 renders at 125fps in 640x480x16 mode then, and the Teapot at 750fps, topping at 800 (exactly like with Mesa 3.4.2 on my system: P4@2.8 with NV18). 1024x768x32 remains at 23fps though. All in all, I kind of hope that Mesa 6.2.1 will combine 'the best of both worlds' and give me the best speeds out of Mesa 3.4.2 and 3.2.1 or better. Even on slow-CPU systems (my P3-500 gave a few fps more with Alpha3 in Mesa3.2.1: I call it Alpha2.5 for now :-). We'll see what Mesa 6.2.1 is capable of. I hope.

Well, that's about it for now. Talk to you later...

8 August 2005: Today I've got Alpha 3 of the 3D nVidia driver for you, still accompanied by 2D driver 0.53. This new version is based on Mesa 3.4.2, which is still OpenGL 1.2. I am making it available anyway because since Mesa 3.2.1 numerous conformity fixes were done in Mesa. Also the 3D driver had some improvements in the meantime which make it slightly faster. All in all, this release will make apps run faster especially on fast CPU's while on slow CPU's especially Quake 2 runs slower. I'll leave it up to you to choose your version for now: Alpha2-final remains online for the time being: you might prefer it for quake2 on slow systems.

Some measured speeds:
Anyway: get it from the downloads page and have fun! In the meantime I'm working on directly upgrading to Mesa 6.2.1: but that's much harder to do due to the large revamp that was done inside Mesa between version 3.4.2 and 3.5. So it will probably be a while before I report back on that...

8 July 2005 (completed 9 July): Benchmarks for alpha 2-final, status update, and personal thoughts.

In the past two weeks I've been focussing on testing for the hardware boundaries of what we can do with the information available. While doing that, I collected some new benchmarks for you as well. Let's start with the latter: Here are a few tables showing the speed that alpha 2 final reaches on the systems available to me for testing.

Table: accelerated alpha 2f DMA openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

Card under test:       TNT2-M64, 32Mb (NV05M64) 64-bit bus
teapot @16bit:         100 fps
teapot @32bit:         100 fps
Q2 640x480 @16bit:     27.3 fps
Q2 640x480 @32bit:     22.3 fps
Q2 800x600 @16bit:     23.9 fps
Q2 800x600 @32bit:     15.8 fps
Q2 1024x768 @16bit:    16.7 fps
Q2 1024x768 @32bit:    9.7 fps

Table: accelerated alpha 2f DMA openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

Card under test:       TNT1, 16Mb (NV04)
teapot @16bit:         125-130 fps
teapot @32bit:         125-130 fps
Q2 640x480 @16bit:     33.8 fps
Q2 640x480 @32bit:     30.5 fps
Q2 800x600 @16bit:     29.6 fps
Q2 800x600 @32bit:     22.6 fps
Q2 1024x768 @16bit:    21.4 fps
Q2 1024x768 @32bit:    8.7 fps

Table: accelerated alpha 2f DMA openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

Card under test:       TNT1 PCI, 16Mb (NV04)    GeForce2 Ti, 64Mb (NV15)    GeForce4 MX440, 64Mb (NV18)
teapot @16bit:         450-500 fps              600-630 fps                 600-630 fps
teapot @32bit:         350 fps                  600-630 fps                 600-630 fps
Q2 640x480 @16bit:     53.1 fps                 90.0 fps                    118.8 fps
Q2 640x480 @32bit:     37.2 fps                 54.5 fps                    66.3 fps
Q2 800x600 @16bit:     36.2 fps                 65.4 fps                    84.8 fps
Q2 800x600 @32bit:     23.7 fps                 35.8 fps                    40.5 fps
Q2 1024x768 @16bit:    22.8 fps                 41.3 fps                    51.6 fps
Q2 1024x768 @32bit:    11.0 fps (draw errs)     20.2 fps                    23.2 fps

You can see that these speeds top all previous speeds measured, indicating this is the fastest version of the driver yet. As usual, looking at GLteapot's speeds on all (but one) setups, you can determine that the software overhead limits the speed: the hardware in these setups could go faster if software overhead were to be minimized further. Although of course, we are getting closer and closer to the engine's limits. For the Quake speeds I think we reach those limits on the Pentium 4 system, except maybe for 640x480x16 mode using the GeForce4 MX440. My feeling however is that we are almost there even in that mode.

OK, now have a look below at the Linux benchmarks I did on the P4 system using the closed source official nVidia driver (latest available on Suse 9.1, running KDE along with Quake2). You'll easily recognize that the hardware is capable of much more than we get...

Table: accelerated 3D speeds using Linux on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz.

Card under test:       GeForce4 MX440, 64Mb (NV18)    GeForce4 Ti4200, 128Mb (NV28)    GeForceFX 5200, 128Mb (NV34)
Q2 640x480 @32bit:     280 fps                        456 fps                          400 fps
Q2 800x600 @32bit:     250 fps                        420 fps                          370 fps

... so, we are definitely missing important stuff. Probably things like: compressed Z-buffer, fast Z-clear, parallel use of (pixel) pipelines, and a better optimized engine setup (use of hardware (user) 'context' switching and the context cache). I can imagine that on pre-GeForce cards, our driver is getting close to the Linux driver in speed, because those 'extra' features might well not exist on these old cards. Anyway, let's face it: we'll have to make do with what we've got. I myself at least am certainly not about to start reverse engineering closed source Windows or Linux drivers: I simply don't want to put that much time in it.

Let's think in another dimension for a moment: supporting more cards at the current level, so adding support for newer cards (NV2x, NV3x, and NV4x types). I've been trying to find out more for this as well in the past weeks. Thanks to the Poke utility written by Oscar Lesta aka Bipolar (for both Windows 9.x and Haiku/BeOS) I was able to peek at the Windows setups for NV18, NV28 and NV34, and try something in BeOS. While gaining a little bit more 'general' insight, I was not able to actually improve speed and card support however. Besides, these newer cards no doubt have a different setup being used with the official drivers, as they support interesting new programmable features, not available in the cards currently supported.

All in all, we won't be able to add more cards to the support list. Unless (preferably, as far as I am concerned) information comes up on how to set the NV2x, NV3x and NV4x into pre-GF (or NV1x?) compatibility mode. If such a thing even exists, that is. You see, if this were possible, the current simple driver setup would suddenly support all nVidia cards there are. And, while not being superfast, you would still see the speed go up if you use newer cards (compared to older cards). The only thing needed for this would be an update of the 2D driver's 3D init code, which would be feasible to do from a development-time-needed perspective. And, as we now know, the speeds gained are already interesting enough to enable a whole 'new' breed of software on our platform...

Roadmap considerations / update

Let's peek back at what I originally wrote:

During development it's very important to take the smallest steps possible for the largest chance on success. This dictates that I should take the following roadmap to get to the goal: unquote.

Steps 1. and 2. are done. I've also looked at improving speed and card support. We will remain at the current maximum speed level (as it now is on faster CPU's). We might still improve speeds on slower CPU's. The cards supported are NV04 (TNT1) up to and including GeForce4 MX (NV18). The driver will block attempts to use newer cards from now on.

It's time to start work on step 3: switching to current Mesa. After that we'll switch to a real renderer add-on (step 4.), as described on the Haiku homepage. Philippe Houdoin will come up with an example software driver that I'll use to help me create the add-on.

Most of the remaining work will be doing step 3. As the current Mesa is at openGL 1.5 level, while the driver is now at openGL 1.2, switching will require the driver to be expanded and rewritten concerning the Mesa interface. This interface changes for every new version of openGL, as new functions get added all the time. It might be a good idea to do the upgrade in steps, which means I would first have to make it work with Mesa 3.4.2, which is the newest openGL 1.3 compatible library. The change in openGL for this update (concerning the driver interface) is added multiple render/pixelbuffer support (if I remember correctly). Also the Z-buffer interface has to be revised. This was to be expected, as the current used Mesa version (3.2.x) shows kind of inconsistencies here as I previously 'discussed' (I needed to literally copy code from Mesa into the driver to make it work).

Somewhere during updating to current Mesa, I'll probably add support for AA rendered triangles. Also multitextured triangles seem to be possible: I already did a pre-test using the DX6 command; I was able to use it to clear the Z-buffer successfully. Which indicates to me that this hardware function is up and running in the engine so we should be able to make use of it. Interesting is that this function has hardware support for stencil buffers, which might mean we can get that working as well. But I won't make promises, we'll just have to wait and see. I also benchmarked this DX6 function for speed already: and no, by itself it's not faster than the currently used rendering function (DX5, single textured).

Well, the last thing I can talk about today is the time I think I need to switch to current Mesa. Let's put it this way: I plan to do it before the year is over. First up is holiday season, so I'll probably have to do the actual work in the coming fall. And I'll be a bit slower as usual, as my body seems to indicate to me I should (RSI-like trouble, which every computer user has to face at some point it seems). But, as Philippe nicely pointed out: the journey *is* all the fun, right?

General driver performance considerations and personal thoughts

So. Are you disappointed? Well, don't be. We are lucky we got this far. It's a general problem every driver writer has these days: 'no-one' is giving out specs. While in 'the old days' this was not so (our nVidia driver is based on work done by nVidia themselves; Matrox used to hand out specs: up to and including G400), these days even Matrox no longer responds to your mails.
I think it might be valid for us to try to support hardware manufacturers who do give out specs, either via maintained opensource drivers, or register-level specs. Even hardware considered slow would perform much better than the fast hardware nVidia or Ati makes. Look at the Linux speeds with the closed source nVidia driver. Even if a slow card performs at one-fourth the speed of a top-notch brand when both use full drivers, we would be able to use that slow card at vastly higher speeds with a 'supported' driver. And of course, with more hardware features in it.
Anyway, we still need to support hardware from less 'friendly' manufacturers, as in practice this hardware is most commonly used. So, hence my entire work for nVidia cards. I have to tell you however, I am looking more and more at 'cheap' and 'slow' hardware, as it would give me lots more pleasure working on that instead if for once I get my hands on 'full specs'.

OK, that might sound a bit negative, but it's certainly not meant that way. I understand that manufacturers need to protect their intellectual property in this world. And I really already had a lot of fun doing my development. For nVidia hardware I worked on almost every aspect available in the hardware (remember BeTVOut? :-). It's just that it sounds like much more fun, doing the same for hardware that I'd have much better specs on. Which does not mean I will even start on that though, mind you: I've learned that it costs enormous amounts of time and energy doing a full blown graphics driver..

Anyway, back to performance. The way I see it, it's very nice having a simple driver that just supports (more or less) what we already have. As the supported hardware function is very general, it doesn't really matter how sophisticated your app setup is: it will probably get accelerated anyway. Mesa very nicely takes care of doing software emulation of every other aspect you use, and the final rendering is then done by the accelerated driver. While the total setup is not superfast, it nicely accelerates anyway. While the amount of work needed to be done for the actual driver is very minimal. And, with the way Mesa works, you can add more functions over time: making it doable for just one person if it needs to be. You see, even if we had full specs, we wouldn't have the manpower to put it all to use. Basic specs covering enough to do the primary setup would suffice in practice...

I, as an 'alternate OS' user, gladly pay more money for a system performing a bit less, speed and feature wise, if that means I get a system that's very stable and that's easily maintained. That's what I want out of my computers. Otherwise, I'd rather just dump them in the bin. More BeOS (style), anyone? Cheers!

23 June 2005: OK, here's 3D-addon version Alpha 2-final. Get it from the downloads page and have fun! While you do that I'll do some more benchmarking of which I will post the results here. A small preview: Hope you like it. Meanwhile: if you test this driver, please provide feedback as usual!

Thanks in advance. Talk to you later..

18 June 2005: I'm finally readying the alpha2 driver for release: it will be accompanied by nVidia 2D driver 0.53. Below is a screenshot of GLteapot as it's running on my P4 2.8Ghz now :-)
Well, actually, the speed is a bit higher: it's difficult to grab a correct screenshot, as taking the shot slows things down. The mean speed is about 470fps, topping at 500fps for short moments with the default settings (in both 16 and 32 bit color).

GLteapot spinning on a P4 @ 2.8Ghz, with GeForce4MX440 @ AGP4x.

Since the benchmarks as shown before the driver has had a few more updates, further speeding up rendering on all systems. Anyway, the speed is 'set' for alpha 2 now, so I'll show you a few new benchmarks here. Shown also (for reference) are the older DMA speeds that were reached. On top of that, I also benchmarked for PCI mode explicitly, to show the gain AGP mode now gives us (on a number of systems).

Table: accelerated DMA openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

Card under test:       TNT1, 16Mb (NV04).      TNT1, 16Mb (NV04).
                       old speed (AGP mode)    new speed (PCI and AGP mode)
teapot @16bit:         110 fps                 120 fps
teapot @32bit:         110 fps                 120 fps
Q2 640x480 @16bit:     27.4 fps                31.8 fps
Q2 640x480 @32bit:     24.0 fps                29.3 fps
Q2 800x600 @16bit:     23.4 fps                28.4 fps
Q2 800x600 @32bit:     17.9 fps                22.3 fps
Q2 1024x768 @16bit:    17.2 fps                21.1 fps
Q2 1024x768 @32bit:    8.2 fps                 8.7 fps

Table: accelerated DMA openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

Card under test:       GeForce4 MX440, 64Mb (NV18).    GeForce4 MX440, 64Mb (NV18).    GeForce4 MX440, 64Mb (NV18).
                       old speed (AGP mode)            new speed (PCI mode)            new speed (AGP mode)
teapot @16bit:         400 fps                         320 fps                         470 fps
teapot @32bit:         400 fps                         320 fps                         470 fps
Q2 640x480 @16bit:     91.5 fps                        84.2 fps                        104.5 fps
Q2 640x480 @32bit:     59.3 fps                        56.7 fps                        64.4 fps
Q2 800x600 @16bit:     72.6 fps                        68.9 fps                        81.3 fps
Q2 800x600 @32bit:     38.0 fps                        37.1 fps                        40.2 fps
Q2 1024x768 @16bit:    47.2 fps                        46.1 fps                        50.8 fps
Q2 1024x768 @32bit:    22.4 fps                        22.3 fps                        23.2 fps

So: Quake 2 now runs timedemo 1 at 105fps in 640x480 @ 16bit on the GeForce4 MX440: I broke the 100fps barrier! Of course, the driver is still very slow compared to the official Windows drivers... But I love it anyway. ;-)

Speaking of this: it's interesting to benchmark Quake2 on Windows using the same card in the same system. It tells you how much speed could be gained 'in an ideal world'. For the GeForce4 MX440 it turns out that even in 1024x768 @ 32bit 70 fps is reached, and probably even more (Windows syncs to retrace, so the screen's refreshrate determines the maximum fps you'll get). You can clearly see (by now) that with the hardware functions we have in use now, we will never get that speed. The card's power for rendering triangles with the function in use is simply too low. So how do they reach that speed? They must be using another triangle function, and/or use several 'triangle rendering engines' in parallel. Quake 2 after all only draws with triangles and is already completely accelerated on BeOS now. And we know we are sending the commands fast enough to the engine or our rendering speed would never reach those 100fps in low-res mode (The number of 3D commands does not depend on the screen's resolution!).
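One rough way to look at this is to convert the measured frame rates into displayed pixels per second. The Python sketch below does just that for the GeForce4 MX440 'new speed (AGP mode)' Quake 2 figures from the table above. It is only back-of-the-envelope arithmetic: overdraw, texture fetches and Z-buffer traffic are ignored, and the helper names are made up.

```python
# Quake 2 timedemo speeds for the GeForce4 MX440 (NV18), new AGP mode,
# copied from the P4 2.8Ghz table above. Keys: colour depth, then resolution.
FPS = {
    16: {(640, 480): 104.5, (800, 600): 81.3, (1024, 768): 50.8},
    32: {(640, 480): 64.4, (800, 600): 40.2, (1024, 768): 23.2},
}


def mpixels_per_second(width, height, fps):
    """Displayed pixels per second, in millions (overdraw is ignored)."""
    return width * height * fps / 1e6


def throughput_table(fps_by_depth):
    """Same layout as FPS, but holding Mpixel/s rounded to one decimal."""
    return {
        depth: {res: round(mpixels_per_second(res[0], res[1], fps), 1)
                for res, fps in modes.items()}
        for depth, modes in fps_by_depth.items()
    }


if __name__ == "__main__":
    for depth, modes in throughput_table(FPS).items():
        for (w, h), mpix in modes.items():
            print(f"{w}x{h} @{depth}bit: {mpix} Mpixel/s")
```

The 32 bit modes all come out close to each other, which fits the idea that the card itself is the limit there, while 640x480 @ 16bit stays clearly below the two higher 16 bit resolutions, so something other than pixel output seems to hold that mode back.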

This 'info' gives me thoughts about where to look next for more speed: is the 2D blitting function becoming a big bottleneck in high-res modes? I'll try flipping buffers (only possible for fullscreen apps though). Can I use 'parallel processing' in the engine for the currently used rendering functions? I'll try to 'rotate' entering commands at different offsets in the engine's hardware function. Another approach is sending more than one triangle at once to the engine: but does Mesa support that? All in all, I am not done with this yet. But these experiments will have to wait until a later date: first up we have the alpha2 release. Which BTW is using Mesa 3.2.1: as someone pointed out to me, that version contains numerous bugfixes for openGL conformity compared to Mesa 3.2. And the driver could simply be 'dropped in place'. Oh, no resizing of buffers yet: Mesa 3.2.x seems to contain an internal error preventing that from working correctly. Mesa 6.2.1 is able to do this though...

OK, back to work: Do some final tidbits and release driver. :-)

(Completed) 18 June 2005: I have some benchmarks I'd like to share with you. These benchmarks are taken with nVidia 2D driver 0.49 as it is in SVN right now, combined with the DMA version of the alpha1 3D add-on: which right now has exactly the same functionality and behaviour as alpha1 final as I released it some time ago. Well, I have to be honest: the DMA 3D add-on now nicely supports the NV17/NV18 so you can finally really use them. Cards with NV18 engine are for instance:
I expect the mobile (laptop) versions to work as well, and hopefully also the Quadro types. As usual feedback will be needed to determine the final status of the driver on the cards though.

Along with the 'normal' benchmarks, I also have some comparing benchmarks for you that show differences in speed because of certain new 'items' that are now in the driver set. Indeed, there are several 'big' changes in engine setup in the 2D driver now, all needed to optimize speed for 3D. Of course, this modified setup also speeds up 2D a bit more generally. For instance, it's funny to see that BeRoMeter 1.2.6 no longer correctly measures speed for 'Graphics Rectangles Unflushed Filled' on most systems: it seems to suffer from a variable rollover in the calculations somehow.

Anyway, here are the results as they are now! By the way, please bear with me a bit more before I do a new release: I am still looking into a few possible additions for the driver to behave a bit better than alpha1 did. Won't be very long though; I'd say I might as well just do separate releases for these sorts of things..

Current benchmarks, using all new setup 'items' in the driver set.

If you look at the tables below, you can observe some interesting things. For instance, on the dual P3-500 system, you can see that I'm not able to fully load the cards in lower resolution modes (Quake2). It doesn't matter how fast the card is, you won't get more than 30fps out of it. Compare the NV15 with its speeds on the P4 2.8Ghz, and you'll see that this card can do much more if the system CPU can get it commands fast enough. When you select a high-res mode however, you can see that the card is becoming the bottleneck and the speed difference between the two CPU's is mostly gone. Look at 1024x768 @ 32 bit on both systems for the NV15 for proof.

For GLteapot the same sort of thing applies. If the card is fed fast enough, you should encounter speed differences for a card between 16 and 32 bit mode. In most occasions, on most systems, you do not see this however. Another hint: If you enable 'perspective' in the teapot app's menu, you'll see the teapot speed up from 400 to 500fps on the P4 2.8Ghz for example.

These things tell us that if we are able to minimize the software overhead for sending commands to the cards, we can further speed up a lot of hardware combinations. Luckily, Mesa 6.2 is optimized much more than Mesa 3.2 is (6.2's software rendering runs at 150-180% compared to 3.2). So I am hoping that the planned Mesa version switch will further improve our rendering speeds.
Also, the driver itself could probably be faster. Take point and line rendering for example: these are done by treating these items as a set of triangles. This means we have to feed more information into the engine than would be strictly needed for these functions. In theory, the engine also has specialized commands for such functions, minimizing the overhead and probably being executed faster as well. The downside is that these hardware commands are less universal, making it necessary to be able to fall back to the current scheme if the driver determines that a certain point or line size is outside the engine's hardware capabilities. This fact certainly explains why the point and line functions currently work the way they do: the UtahGLX driver was in the 'beginning phase' of its life-cycle.
If you instruct the teapot app to not use filled polygons, you see the rendering speed drop instead of speed up: this is proof of the relatively large extra overhead the line function is suffering from. Note however, that its speed lies a bit above software rendering speed using DMA mode, while with PIO mode it was much slower.
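The extra overhead of drawing a line as triangles can be shown with a bit of geometry. The sketch below is not the driver's actual code: it is a minimal illustration, with a hypothetical function name, of how one line segment of a given width turns into a quad made of two triangles, so six vertices get sent instead of two.

```python
import math


def line_to_triangles(x0, y0, x1, y1, width=1.0):
    """Expand a 2D line segment into two triangles forming a quad."""
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("degenerate line: both end points are identical")
    # Unit normal to the line, scaled to half the requested line width.
    nx = -dy / length * width / 2.0
    ny = dx / length * width / 2.0
    a = (x0 + nx, y0 + ny)
    b = (x0 - nx, y0 - ny)
    c = (x1 - nx, y1 - ny)
    d = (x1 + nx, y1 + ny)
    # Two triangles sharing the diagonal a-c cover the whole quad.
    return [(a, b, c), (a, c, d)]


triangles = line_to_triangles(0.0, 0.0, 10.0, 0.0, width=2.0)
vertex_count = sum(len(tri) for tri in triangles)
print(f"1 line (2 end points) became {len(triangles)} triangles, {vertex_count} vertices")
```

A dedicated line command would only need the two end points, which is where the potential speed gain sits.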

OK, below you'll find the test results for DMA mode. Even further down, you'll encounter comparisons for several driver-setup aspects, and a few comparisons between DMA and PIO mode. Enjoy.

Table: accelerated DMA openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

| Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit |
|---|---|---|---|---|---|---|---|---|
| TNT2-M64, 32Mb (NV05M64) 64-bit bus | 85 fps | 85 fps | 23.5 fps | 19.3 fps | 20.8 fps | 14.8 fps | 15.3 fps | 9.9 fps |

Table: accelerated DMA openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

| Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit |
|---|---|---|---|---|---|---|---|---|
| TNT1, 16Mb (NV04) | 110 fps | 110 fps | 27.4 fps | 24.0 fps | 23.4 fps | 17.9 fps | 17.2 fps | 8.2 fps |
| GeForce2 MX400, 32Mb (NV11) | 110 fps | 110 fps | 29.2 fps | 25.7 fps | 28.0 fps | 20.2 fps | 21.5 fps | 11.7 fps |
| GeForce2 Ti, 64Mb (NV15) | 110 fps | 110 fps | 30.0 fps | 29.0 fps | 29.5 fps | 26.5 fps | 27.4 fps | 18.7 fps |

Table: accelerated DMA openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

| Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit |
|---|---|---|---|---|---|---|---|---|
| TNT1 PCI, 16Mb (NV04) | 250 fps | 200 fps | 35.0 fps | 26.5 fps | 25.8 fps | 18.3 fps | 17.7 fps | 9.7 fps (draw errs) |
| TNT2-M64, 32Mb (NV05M64) 64-bit bus | 300 fps | 190 fps | 37.3 fps | 23.0 fps | 25.8 fps | 15.5 fps | 16.1 fps | 9.5 fps |
| TNT2, 32Mb (NV05) | 400 fps | 300 fps | 54.8 fps | 38.7 fps | 39.6 fps | 25.9 fps | 26.2 fps | 15.7 fps |
| GeForce2 MX400, 32Mb (NV11) | 400 fps | 400 fps | 55.2 fps | 31.6 fps | 41.4 fps | 21.7 fps | 23.5 fps | 11.7 fps |
| GeForce2 Ti, 64Mb (NV15) | 400 fps | 400 fps | 77.3 fps | 50.7 fps | 59.2 fps | 34.1 fps | 39.1 fps | 19.6 fps |
| GeForce4 MX4000, 128Mb (NV18) 64-bit bus | 400 fps | 400 fps | 71.0 fps | 37.4 fps | 49.0 fps | 22.6 fps | 28.8 fps | 12.5 fps |
| GeForce4 MX440, 64Mb (NV18) | 400 fps | 400 fps | 91.5 fps | 59.3 fps | 72.6 fps | 38.0 fps | 47.2 fps | 22.4 fps |

Note please (for all three tables above):
Note also:
NV20 and later cards currently don't work: tested NV28 (GeForce4 Ti4200), NV34 (GeForce FX5200) and NV34Go (GeForce FX5200 in a laptop). It remains to be seen if I can get these up and running.

Comparisons for several driver-setup aspects: DMA versus PIO.

I compared both the 'alpha 1' and 'alpha 2' 3D drivers on two slower systems to find out if DMA mode still gains us speed there even though there's a bit more CPU programming overhead needed for DMA mode. As you can see from the numbers below, 25-30% gain can still be reached in DMA mode (Quake 2), while relatively simple things (GLteapot) render a bit slower. Overall DMA mode is faster though, and also has more promise for speed in the future.

Table: accelerated DMA versus PIO mode openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

| Card under test | Q2 800x600 @16bit | Q2 800x600 @32bit |
|---|---|---|
| TNT2-M64, 32Mb (NV05M64) 64-bit bus in PIO mode | 16.1 fps | 11.8 fps |
| TNT2-M64, 32Mb (NV05M64) 64-bit bus in DMA mode | 20.8 fps | 14.8 fps |

Table: accelerated DMA versus PIO mode openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

| Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit |
|---|---|---|---|---|---|---|---|---|
| GeForce2 MX400, 32Mb (NV11) in PIO mode | 120 fps | 120 fps | 26.5 fps | 20.5 fps | 23.5 fps | 16.2 fps | 17.1 fps | 10.2 fps |
| GeForce2 MX400, 32Mb (NV11) in DMA mode | 110 fps | 110 fps | 29.2 fps | 25.7 fps | 28.0 fps | 20.2 fps | 21.5 fps | 11.7 fps |
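For those who want to check the DMA gain themselves, here is a short sketch that computes the relative gain from the PIO and DMA numbers in the two tables above. The helper name and the choice of rows are mine; the frame rates are copied from the tables.

```python
def gain_percent(pio_fps, dma_fps):
    """Relative speed gain of DMA over PIO mode, in percent."""
    return (dma_fps / pio_fps - 1.0) * 100.0


# (PIO fps, DMA fps) pairs copied from the two tables above.
RESULTS = {
    "P2-350, TNT2-M64, Q2 800x600 @16bit": (16.1, 20.8),
    "P2-350, TNT2-M64, Q2 800x600 @32bit": (11.8, 14.8),
    "dual P3-500, NV11, Q2 800x600 @16bit": (23.5, 28.0),
    "dual P3-500, NV11, Q2 800x600 @32bit": (16.2, 20.2),
    "dual P3-500, NV11, teapot @16bit": (120.0, 110.0),
}

for name, (pio, dma) in RESULTS.items():
    # A negative number means DMA mode was slower than PIO mode.
    print(f"{name}: {gain_percent(pio, dma):+.1f}%")
```

The Quake2 rows come out positive, while the teapot row comes out negative, matching the remark that relatively simple things render a bit slower in DMA mode.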

Comparisons for several driver-setup aspects: enabled/disabled AGP transfers, MTRR-WC and 1Mb cmd buffer.

I tested speed differences for each of the aspects AGP transfers, MTRR-WC'd cmd buffer and 1Mb sized cmd buffer separately to see if what I set up really works as it should. The tests were done on the P4 2.8GHz with an original TNT2 AGP. It should be noted that the results even for the fully enabled driver are slower than mentioned in the extensive card-type comparison benchmarks above, as these aspect tests were done on the first stable intermediate version of the 3D/2D driver set. You should also keep in mind that the speed differences will probably become bigger over time, as the driver gets more and more optimized. It's probably best to view the results as hints that the theoretical setup is correct.

Table: speed differences for different aspects of the DMA driver on a Pentium 4 @ 2.8GHz, FSB at 533Mhz, Dano (gcc 2.95.3).

| Aspect combination | teapot @16bit | Q2 640x480 @16bit |
|---|---|---|
| Full driver (AGP transfers, cmd buffer is MTRR-WC @ 1Mb) | 360 fps | 53.2 fps |
| Full driver minus MTRR-WC | 230 fps | 49.2 fps |
| Full driver minus AGP transfers | 290 fps | 50.3 fps |
| Full driver minus MTRR-WC and AGP transfers | 210 fps | 47.3 fps |
| Full driver in PCI mode | 290 fps | 50.3 fps |
| Full driver minus MTRR-WC in PCI mode | 210 fps | 47.3 fps |
| Full driver minus 1Mb cmd buffer (now 32Kb) (not stable!) | 360 fps | 51.9 fps |
| Full driver minus AGP transfers and 1Mb cmd buffer (now 32Kb) (not stable!) | 290 fps | 49.0 fps |
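To see how much each aspect contributes, the numbers above can be expressed relative to the full driver. This is only a small arithmetic sketch over a few rows of the table; the names are made up for illustration.

```python
# (teapot @16bit fps, Q2 640x480 @16bit fps), copied from the table above.
FULL_DRIVER = (360.0, 53.2)
VARIANTS = {
    "minus MTRR-WC": (230.0, 49.2),
    "minus AGP transfers": (290.0, 50.3),
    "minus MTRR-WC and AGP transfers": (210.0, 47.3),
}


def relative_percent(full_fps, variant_fps):
    """Speed of a reduced driver setup as a percentage of the full driver."""
    return variant_fps / full_fps * 100.0


for name, (teapot, quake2) in VARIANTS.items():
    print(
        f"{name}: teapot {relative_percent(FULL_DRIVER[0], teapot):.1f}%, "
        f"Quake2 {relative_percent(FULL_DRIVER[1], quake2):.1f}%"
    )
```

Dropping the write-combined command buffer costs the teapot the most here, while Quake2 loses less than 10% in each single-aspect case.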

Note please that although the use of AGP transfers indeed speeds up rendering on the above tested machine, it did not do so for the old P2 system @ 350Mhz. In this case PCI and AGP transfers were both running at the same effective speed. The same applies for the dual P3@500Mhz machine over here. Of course, you should keep in mind that these two systems have an old AGP interface: Version 1.0. The maximum speed those ran at was AGP 2x mode, while the P4 @ 2.8Ghz runs at AGP 4x mode. On top of that, the CPUs are simply not fast enough to really fill up the command buffer so that we could notice the AGP 2x working: the 3D driver version tested does not yet benefit enough from using AGP transfers for it to show here. Of course, that might change in the future though..
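For reference, here is a sketch of the theoretical peak transfer rates involved. It assumes the standard figures: a 32-bit bus, about 33.33Mhz for PCI and about 66.66Mhz for AGP, with 1, 2 or 4 transfers per clock. These are peak numbers only; the effective speeds measured above depend on how fast the CPU fills the command buffer.

```python
BUS_WIDTH_BYTES = 4  # PCI and AGP (1.0/2.0) both use a 32-bit data bus


def peak_mb_per_s(clock_mhz, transfers_per_clock):
    """Theoretical peak bandwidth in MB/s for a 32-bit bus."""
    return clock_mhz * transfers_per_clock * BUS_WIDTH_BYTES


BUSES = {
    "PCI": (33.33, 1),
    "AGP 1x": (66.66, 1),
    "AGP 2x": (66.66, 2),
    "AGP 4x": (66.66, 4),
}

for name, (clock, transfers) in BUSES.items():
    print(f"{name}: about {peak_mb_per_s(clock, transfers):.0f} MB/s peak")
```

So on paper AGP 2x offers four times the PCI peak rate, which only shows up in benchmarks if the CPU can keep the command buffer filled.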

When you look at the results, you can observe some interesting things:

29 May 2005: Hi there. Well, today I have some splendid news for you! I was very quiet these last weeks, because I was very busy doing something I feared might not be possible: making the change to DMA mode! Well, you guessed it: it's official! I have DMA mode up and running smoothly...

I'll update this page asap with more stuff, for now just some facts:
Talk to you later! Wow!!

9 May 2005: Today I am releasing nVidia 2D driver 0.45. This version finally fixes some trouble with older cards out there: the typical trouble that kept some people using the Be nVidia driver before. We are talking about the trouble I named 'bandwidth trouble': All in all this driver version is the best version to use alongside the 3D add-on 'alpha 1-final' which I released not too long ago. Get the driver(s) from the Downloads page and have fun!
Oh, by the way: here are the GeForce 2 Ti benchmarks I did with the new 2D driver. These are new 'high scores' for me.. :-)

Table: accelerated openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

| Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit |
|---|---|---|---|---|
| Mesa 3.2 software, no AGP FW | 190-210 fps / 1.0x | 150-160 fps / 0.78x | 2.8 fps / 1.0x | 2.8 fps / 1.0x |
| Mesa 3.2 software, AGP4x + FW | 190-210 fps / 1.0x | 190-210 fps / 1.0x | 2.8 fps / 1.0x | 2.8 fps / 1.0x |
| GeForce 2Ti, 64Mb (NV15) | 200-220 fps / 1.1x | 200-220 fps / 1.1x | 45.3 fps / 16.2x | 35.0 fps / 12.5x |

5 May 2005: 3D add-on alpha 1-final released!

Today I was able to upload the source and binary files for BeOS R5, Dano, Zeta and Max edition. Get them from the Downloads page and have fun! Oh, and don't forget to provide feedback if possible: that would be appreciated ;-)

3 May 2005: Well, finally I am able to release a first alpha version of the 3D nVidia add-on driver including Mesa 3.2 library (named alpha 1-final). This driver requires 2D driver 0.43 in order to work (otherwise your system will hang). On top of that, you need to instruct the 2D driver to use PIO mode for acceleration: do so in nv.settings. Make sure you have the 0.43 2D driver running in PIO mode before you run any 3D application with this driver/library (reboot!).

Installing the 3D add-on is nothing more than moving or copying the two libraries libGL.so and libGLU.so into the ~/config/lib folder. For convenience, I added the 2D driver 0.43 including a preset nv.settings file in the downloads. Also included are precompiled demo applications.

If, while testing, you hang your system, you should hit ALT, CTRL and DEL simultaneously (and keep that pressed down for a few seconds) to reboot. If it turns out you cannot run the driver/library just delete the files libGL.so and libGLU.so from your ~/config/lib folder and you should be OK. Without even rebooting: as those libraries are only loaded while you run an app using it.

If you test this 3D add-on you are encouraged to provide feedback. Feedback will tell us best what the actual usability of this attempt is in the end, and where additional fixes are needed. Feedback can be sent by Email or by talkback on the BeBits entry that I'll create ASAP.

OK, below you'll find a table containing the status of the driver, followed by a table indicating the status of that driver for several applications. I hope it's all of use to you! Have fun...

Table: Driver version 'alpha 1-final' status.

Accelerated libraries:
  • libGL.so is accelerated and contains the 3D add-on driver.
  • libGLU.so is a utility library that runs on top of libGL.so; it contains no internal acceleration. If libGLU accelerates, it does so by using libGL.

Engine access type: PIO mode. I'll try to get DMA mode up in the future.

Supported colordepths: 16 and 32 bit modes are fully working, 15 bit mode is partially working. 8 bit mode isn't implemented yet. At least 15 bit mode will be completed, but not before switching to Mesa 6.2.

Supported cards: NV04 (TNT 1) up to and including (the slow performing) NV18 (GeForce 4MX). I'll try to get more modern cards up in the future.

3D Rendering functions: Straight and AA (anti-aliased) point and line functions. Straight triangle function: Mesa 3.2 doesn't support AA triangles. Support for AA triangles will be set up if possible (engine command is known), after we switch to Mesa 6.2. The Depth (Z) buffer is fixed at 16 bit depth.

3D Texturing: Single texturing is supported. Multiple texturing will be set up later if possible (engine command is known).

3D States:
  • Line and Polygon stippling is not supported (driver falls back to software rendering mode so it renders OK but is slow);
  • Stencil buffering is not supported (driver falls back to software rendering mode);
  • Drawing into the frontbuffer is not yet supported (driver falls back to software rendering state);
  • Single buffering is not yet supported (driver will probably not work at all).

3D driver's 2D Functions:
  • Swapbuffers() is accelerated for both BWindow (non-direct) and BDirectWindow (direct) modes;
  • Triple buffering still needs to be set up (Be's libraries do that!). Triple buffering is an easy fix for 'single buffered' contexts (which will actually be double buffered then), and it helps out for slow rendering apps when you move their windows off- and then onscreen again.
    I even suspect the non-direct BWindow mode might work fully correctly (position and clipping validity during drawing): will use BGLView's Invalidate() and Draw() functions for 'back' to front rendering instead of doing that in Swapbuffers(). Note that non-direct mode would still be a hack though!;
  • Swapbuffers() uses 2D blits, even for fullscreen. Literal swapping for fullscreen apps will be set up later (engine commands are known);
  • In direct BDirectWindow mode the DirectConnected() clipping_info turns out to be working after all! (apart from the BMenuBar error for which I have a workaround in place). This means you can drag direct windows like a madman, without errors appearing onscreen ;-)
  • In non-direct BWindow mode drawing errors will be made if you move the application window too fast, or if you move other windows over the application window too fast. Hopefully the triple buffering scheme mentioned above will fix these errors...
  • Scaling is not yet supported: if you resize output windows you will see repeating patterns or rubbish in the extra 'room' (scaling up), or you will see only part of the application's output (scaling down). Will be fixed for a future release (engine command should be known).

BView's function activation: Be's flags B_WILL_DRAW and B_PULSE_NEEDED are required for BGLView: some applications rely on them without explicitly setting them. These flags are added (hardcoded) by the driver when an application creates a BGLView.

Table: Status for tested applications.

App name:





Driver or library failsafe:

3Dlife Optional sample code Working as is BWindow/non-direct mode Doesn't call glViewport() Driver calls glViewport() while servicing a backbuffer clear command when it detects no backbuffer was created before. glViewport() takes care of creation among other things.
GLteapot Optional sample code Working after relinking BDirectWindow/direct mode None Relinking against both libGL.so and libGLU.so is required because some items used are in libGL.so these days, while being in libGLU.so at Mesa 3.2 'time'. We are 'going back in time'.
Demo Mesa 3.2 Working after fix BWindow/non-direct mode Doesn't call glPopMatrix() Added glPopMatrix() to the application. A library failsafe can be done in theory, but it requires additional code. It would need to check if the stack still contains a Matrix when Swapbuffers() is called (or so). Checking would be state-dependent.
Sample Mesa 3.2,
BeBook: openGL kit, BGLView
Working as is BWindow/non-direct mode None None
GLQuake for BeOS R4.5 running on R5/dano http://www.aixplosive.de/projects.html Partially working as is BDirectWindow/direct mode Yet unknown This application doesn't work correctly for a number of reasons:
  • It sometimes renders directly to the frontbuffer which the driver doesn't really support yet;
  • It crashes sometimes: this seems a Mesa 3.2 fault: Be's lib is working OK, while the full software Mesa 3.2 fails as well;
  • Some bitmaps are rendered in wrong colors: This seems a driver fault: rendering triangles requires more state-checking (for getting color-info, among others probably);
  • The picture 'flickers': half the time Swapbuffers() is executed while still rendering. The effect is that sometimes only partially-rendered pictures are shown. Seems a Mesa 3.2 fault as well: Be's lib is OK, while the full software Mesa 3.2 fails as well.
Quake II V3.20 http://www.bebits.com/app/1712 Working as is, with some displaying errors caused by Mesa 3.2. (Full software Mesa 3.2 has the same errors.) BDirectWindow/direct mode Doesn't call LockGL() Driver calls LockGL() in the BGLView constructor if gl_get_current_context() returns NULL. Probably needs to be done elsewhere ('later') in the driver so it's actually possible that a context was indeed made current by calling LockGL() at some point in time.

As you can see from all info listed above, the general idea for a 3D library apparently is to make it work even if the application developer forgets to do some things officially needed. At least this is the road Be took. The downside of that is of course, that certain mistakes otherwise easily located might never be found now...

Having said that, I'll try to be as compatible as can be with the Be (software) libraries (BGLView). Oh, and I'll post some more technical info here about things I encountered during this last phase for developing alpha 1-final. As promised earlier. Talk to you later!

24 April 2005: While trying to finish up for a first alpha release, I decided to test two other demo apps for BeOS: 3Dlife (optional sample code) and the app described in the BGLView section of the BeBook as a sample for openGL use. Up to now I only tried Quake2 and the GLteapot which both work just fine. (Quake2 and the GLteapot are both 'direct mode' apps, while these other two demo apps aren't, it turns out.) Well, I bumped into something I did not expect: again, I am having trouble with BGLView. Of course, I got some warnings in the past from an openGL user on the BeOS relating to this trouble: but back then, I did not recognize it for what it was. As before, I can tell you there's no better way (for me) to find out about some stuff than by experiencing it first hand...

BGLView and BDirectWindow/BWindow (or: direct mode, yes or no)

You can use BGLView in a normal BWindow or in a BDirectWindow as you might know. Using it in a BDirectWindow is named: direct mode. When you want to use this mode, you have to manually tell BGLView that it is used in a BDirectWindow by calling its member function EnableDirectMode(bool enabled). From my point of view the theoretical difference between those two modes is:
Well, from an application's perspective, these two modes are nice to have to choose from I guess. But, here's the problem I am having (as a driver-writer): There's no such thing as a non-direct mode from the driver's point of view! The driver always needs to do hardware blitting into the 'destination' window (openGL: the front colorbuffer) itself. It makes absolutely no sense to not let the driver do that. After all, we want to accelerate. Besides, I have learned that if you feed the app_server a faked BBitmap (faked in that it's a bitmap class derived from BBitmap which provides the app_server with an address in the graphicsRAM instead of in main memory where normal BBitmaps reside): the app_server won't draw anything at all. I guess I can understand that: you could see it as a failsafe precaution from the app_server's point of view. (BTW: Getting such a 'faked' bitmap to work wouldn't help in single buffered mode of course.)

Anyway: The conclusion is that the way BGLView is setup, acceleration cannot be done officially for non-direct modes. I have a way around this though: the BGLView workaround code I mentioned earlier, can help us out here too. I already confirmed the 3Dlife demo running OK, although I have to fix one other problem I am having with it. Of course, even if the BGLView clipping_rects error (in DirectConnected()) were to be solved, we would still need this workaround hack for accelerated non-direct mode. Hence: I have some recommendations to make:
Well, back to work I guess: I need to finalize the driver's code to work OK in non-direct mode. And there are a few other things left to fix for the above mentioned non-direct demo apps: both problems are probably non-class related. You might already have guessed BTW that I have to postpone the first release a bit maybe: as soon as I have a final fix on those last issues I'll inform you. Oh, the nVidia 2D driver needs one more flag as well to account for the missing info for 3D acceleration in BGLView's non-direct mode. Good I did not yet release that driver again ;-)

20 April 2005: I just added some more benchmarks in the 12 april post (below) today. This time I tested a Pentium 2 running at 350 Mhz. It's nice to see what a card's acceleration engine can do for you ;-)

19 April 2005: Sorry it's been so long since I posted here. I have to confess Email responses are a bit slow as well.. It's for a good cause, so I hope you'll bear with me for say a week more... :-)

That's right: I'll be releasing the first alpha release of the 3D driver (with)in a week now. I completed the remaining HW funcs (which BTW doesn't speed up things more, just lines and points are accelerated as well now). Also I am perfecting the driver's behaviour for things like modeswitches from the screen prefs panel or within an app (Quake2), or workspace switches. Works good now. The GLView workaround code can be making minor errors after all if an app renders quickly, but for normal use it's more or less working very good. I'd have to say I am very pleased with what I got going in this short amount of time!

OK, more info will follow, including more writing/testing/technical info. For now time's up again. I'll finish by saying something about the first alpha release coming up: Talk to you later...

14 April 2005: Still I have not done more development on the driver, but I have been testing a lot more. In the 12 April post below you'll find more extensive benchmark results now, for both a Dano and R5 system. All tested setups are rock solid, and both R5 and Dano behave the same with my 'workaround' setup for the BGLView clipping trouble I encountered. It's interesting to see how a relatively slow system speeds up relatively much with hardware accelerated openGL BTW...

Well, some people were (more or less) shocked apparently by the low framerates on GeForce4 MX cards. Personally, I was wondering too what could cause this behaviour. Anyway, it's nice to get feedback: it got me searching a bit more for a cause. I coldstarted the MX4000 card to see what CORE and RAM clocks it gets: that seems to be OK (core = 275Mhz, memory = 265Mhz: not top-notch, but high enough). After looking at nVidia's site for more precise specs I think maybe it's the LMA2 (Lightspeed Memory Architecture) messing up here. I am assuming that feature isn't enabled 'by default', possibly effectively killing the largest part of the memory-bandwidth these cards have via this LMA (LMA's Z-buffer compression feature seems interesting for instance). Of course, we have no specs unfortunately: so we have to live with it for now. Sorry about that. :-/

OK, back to (coding) work now I guess: I should finish up on the remaining acceleration commands that exist in the driver. Until next time ;-)

12 April 2005 (extended 14 april and 20 april): Quake2 and GLTeapot run accelerated!!!!

Hi there! After I did some more testing and benchmarking since yesterday I thought I'd give you a much more precise 'forecast' of what we have here. I can now tell you for instance that the preliminary results I gave you were in fact from a NV18 (GeForce4MX440). Well, let's just say: Don't buy that card! Anyway, read the new results and buy another 'new' card.. ;-)


The benchmarks were done on three systems. Cards tested are AGP unless otherwise noted. The 3D driver is still not completed (or further developed), we only have 'base' line and triangle HW rendering in place. Preliminary benchmarking indicated no other functions being used (much) for the GLTeapot and Quake2.

OK, before I give you the benchmarks, I want to share some observations I find interesting:
Table: accelerated openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

| Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit |
|---|---|---|---|---|
| Mesa 3.2 software, no AGP FW | 35-40 fps / 1.0x | 30-35 fps / 1.0x | 0.3 fps / 1.0x | 0.3 fps / 1.0x |
| TNT2-M64, 32Mb (NV05M64) | 85-90 fps / 2.3x | 70-75 fps / 2.2x | 19.2 fps / 64.0x | 15.1 fps / 50.3x |
| TNT2 Ultra, 32Mb (NV05) | 85-90 fps / 2.3x | 85-90 fps / 2.7x | 23.2 fps / 77.3x | 21.9 fps / 73.0x |

Table: accelerated openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

| Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit |
|---|---|---|---|---|
| Mesa 3.2 software, no AGP FW | 45 fps / 1.0x | 50 fps / 1.0x | 0.6 fps / 1.0x | 0.6 fps / 1.0x |
| TNT1 PCI, 16Mb (NV04) | 110-115 fps / 2.5x | 100-105 fps / 2.1x | 23.1 fps / 38.5x | 19.5 fps / 32.5x |
| TNT1, 16Mb (NV04) | 115-125 fps / 2.7x | 105-115 fps / 2.2x | 23.5 fps / 39.2x | 19.8 fps / 33.0x |
| TNT2 Ultra, 32Mb (NV05) | 115-125 fps / 2.7x | 115-125 fps / 2.4x | 29.6 fps / 49.3x | 26.8 fps / 44.7x |

Note: Results will not be much slower in a single CPU system of same setup, as openGL is currently single-threaded AFAIK.

Table: accelerated openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

| Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit |
|---|---|---|---|---|
| Mesa 3.2 software, no AGP FW | 190-210 fps / 1.0x | 150-160 fps / 0.78x | 2.8 fps / 1.0x | 2.8 fps / 1.0x |
| Mesa 3.2 software, AGP4x + FW | 190-210 fps / 1.0x | 190-210 fps / 1.0x | 2.8 fps / 1.0x | 2.8 fps / 1.0x |
| TNT1 PCI, 16Mb (NV04) | 145-160 fps / 0.8x | 130-145 fps / 0.7x | 28.5 fps / 10.2x | 22.8 fps / 8.1x |
| TNT2, original, 32Mb (NV05) | 195-205 fps / 1.0x | 175-185 fps / 0.9x | 40.4 fps / 14.4x | 31.4 fps / 11.2x |
| TNT2 M64, 32Mb (NV05M64) | 175-185 fps / 0.9x | 135-145 fps / 0.7x | 30.9 fps / 11.0x | 20.7 fps / 7.4x |
| GeForce2 MX400, 32Mb (NV11) | 200-220 fps / 1.1x | 190-210 fps / 1.0x | 38.2 fps / 13.6x | 25.5 fps / 9.1x |
| GeForce4 MX440, 64Mb (NV18) | 90-100 fps / 0.5x | 90-100 fps / 0.5x | 10.5 fps / 3.8x | 10.4 fps / 3.7x |
| GeForce4 MX4000, 128Mb (NV18) | 90-100 fps / 0.5x | 90-95 fps / 0.5x | 10.3 fps / 3.7x | 10.2 fps / 3.6x |

Note please (for all three tables):
Note also:
NV20 and later cards currently don't work: tested NV28 (GeForce4 Ti4200), NV34 (GeForce FX5200) and NV34Go (GeForce FX5200 in a laptop). It remains to be seen if I can get these up and running.

Well, I have to say that personally I am very pleased with these results (don't use a NV18 for 3D ;-). And on top of these, it also turns out the system remains rock-solid as usual: I did not encounter any problems yet... :-)

OK, that's it for now. Talk to you later!

11 April 2005: Relocating the texture memory onto the graphics card turned out to be a breeze: the code worked instantaneously. Which is not too surprising, because this code does not depend on any BeOS specific feature.
Anyway: I just set up some logging stuff again, told the driver to go to active rendering state, and disabled the three actual acceleration hooks: points, lines, and triangles. So, still the same amount of hardware rendering is used as before. Only now, the textures are being placed and used on the graphics card's memory.

I benchmarked GLteapot and Quake2 again: this time there's no further speed decay. For GLTeapot one could expect that, as it doesn't use textures. For Quake2, which does use textures, it's interesting to see the framerate remaining as it is. Apparently the size of those textures is relatively small, so it doesn't have noticeable influence on rendering speed. One other interesting thing I saw, is that sometimes textures are requested from the driver which were never allocated before: this must be the Mesa 3.2 problem with regard to the missing texturing on the Quake2 room floors (I mentioned that problem earlier).

Texturing details:

So what more can I tell you about texturing? Well, as it worked instantaneously, not that much, really. Let's just list what I did see:

Further steps to take:

From the looks of it all, it seems setting up a 3D driver can be done nicely by working step-by-step indeed. And, very important: each step can be tested on its own. Because of that, the 'variables' that can mess up each step are kept to a limited number: which makes it all doable. Adding texture support turns out to be one of those steps: it's no problem to map them to the graphicsmemory while keeping their use pure software-based inside Mesa.

The nVidia driver is now in hardware rendering state: which means that in theory the hardware rendering functions and the card-based textures are in use. Luckily, not setting the hardware rendering functions lets Mesa use its software fallbacks perfectly instead: making the 'pure' texture stuff a separate step to take. As I already mentioned before, the GLteapot uses lines to draw its FPS display, while it uses triangles to draw the teapot: The logical next step to take is setting up hardware rendering for lines only. After that the points and triangles functions can be set up one by one: completing the entire basic driver as I planned to do it before trying to switch to other Mesa versions and add DMA support (and such). I'd advise first doing the line function, as its rendering results are relatively easy to interpret: they are (sort of) 2D after all ;-)
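The step-by-step idea (leave a hook unset and the software path is used, set it and the hardware path takes over) can be sketched like this. This is a toy model only: the class and function names are hypothetical and have nothing to do with the real Mesa or UtahGLX interfaces.

```python
class Rasterizer:
    """Toy dispatch table: per-primitive hardware hooks with software fallback."""

    def __init__(self):
        # No hook installed yet: everything is rendered in software.
        self.hw_hooks = {"points": None, "lines": None, "triangles": None}
        self.log = []

    def _software(self, kind, prims):
        self.log.append(("software", kind, len(prims)))

    def render(self, kind, prims):
        hook = self.hw_hooks[kind]
        if hook is None:
            self._software(kind, prims)  # fallback path
        else:
            hook(prims)  # accelerated path


def make_hw_hook(rasterizer, kind):
    """Build a stand-in 'hardware' function that just records the call."""
    def hook(prims):
        rasterizer.log.append(("hardware", kind, len(prims)))
    return hook


r = Rasterizer()
r.render("lines", ["fps digit 1", "fps digit 2"])  # still software
r.hw_hooks["lines"] = make_hw_hook(r, "lines")     # step: enable lines only
r.render("lines", ["fps digit 1"])                 # now hardware
r.render("triangles", ["t1", "t2", "t3"])          # untouched, still software
print(r.log)
```

Because each hook is switched on separately, a rendering error after enabling one hook can only come from that one function.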

One detail that's interesting to know about is probably the fact that (in the nVidia driver) points and lines have actually two different rendering functions: there are AA (Anti-Aliased) and normal (base) versions (triangles only have a base version in the nVidia driver apparently). The teapot uses the base version, so that will be my next target. Or better yet: it was my next target, because I already completed it.

The line-rendering (base) function:

Next up was doing the first real hardware rendering function. As I already told you, it uses the same engine command as for instance the clear buffer hardware command, so this should not be very hard to do. Well, this turns out to be indeed true. Initially I hit two problems. I am listing them here along with their solutions:

Well, after fixing these two errors the teapot is finally rendering its FPS readout using a hardware command! Of course, the rendering speed didn't go up: the real work is done with all those triangles. Oh, and quake2 also still renders at 0.4fps over here.

nVidia hardware and coordinates:

As I already mentioned just now, nVidia's Z-buffer apparently has inverted depth coordinates compared to Mesa's internal one: 'zero' is closeby with nVidia, as opposed to far-away with Mesa.
Interesting to know is probably also that the nVidia color-buffer has its 2D reference (so 0,0) coordinate at the left-top of the screen (or window), while inside the Z-buffer the 2D reference is at the left-bottom edge: this means that for Z-buffer access the Y-coordinate is inverted inside the driver, just like the Z-coordinate...
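The two inversions can be sketched in a few lines. This is an assumption-laden illustration, not the driver's code: it assumes a 16 bit Z-buffer where both sides use the same integer range and a plain linear flip is enough, and the function names are made up.

```python
Z_MAX = 0xFFFF  # assumption: 16 bit depth values on both sides


def to_hw_depth(mesa_z):
    """Flip a depth value so that the near/far ends swap places."""
    if not 0 <= mesa_z <= Z_MAX:
        raise ValueError("depth value outside the 16 bit range")
    return Z_MAX - mesa_z


def to_zbuffer_row(y, height):
    """Convert a top-left based row number to a bottom-left based one."""
    if not 0 <= y < height:
        raise ValueError("row outside the buffer")
    return height - 1 - y


print(to_hw_depth(0), to_hw_depth(Z_MAX))              # ends swap places
print(to_zbuffer_row(0, 480), to_zbuffer_row(479, 480))  # top row <-> bottom row
```

Applying either conversion twice gives back the original value, which is a handy sanity check when debugging this kind of thing.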

Next up:

Now 'all' that's left to do is implement the rest of the 3D rendering functions (points, lines and triangles) to complete the entire UtahGLX driver more or less. That will complete 'step 1' of the roadmap as described in the 3 March 2005 post below. Apart from a major code cleanup that is...

Well, that's it for now. Talk to you later. Looks like we will have acceleration going real soon now!

7 April 2005: OK, here's the (final) update I promised about the subjects backbuffer and clipping. For the purpose of creating a 3D driver that only supports double-buffered contexts, I seem to be done with both subjects now. The results: a perfectly rendering function for blitting the backbuffer to the frontbuffer (so inside BGLView), and much better personal insight into clipping details as needed inside of the driver. Before I fill you in on the details, let's look at the current framerates:

So why does the framerate keep on dropping depending on buffersizes/complexity of frames to be rendered? Well, that's just logical, as both the Z-buffer and backbuffer accesses have become lots slower due to the bottleneck now sitting in between the CPU and the memory used: the 'graphics' bus. Rest assured this will be 'over' once we let the GPU (the card's acceleration engine) do the accesses instead of using the CPU. The GPU after all no longer has this bottleneck, as these buffers are now sitting at its end of the 'graphics' bus...

Clipping trouble:

Unfortunately, I hit some errors sitting in Be's implementation of BGLView. Because I now need to do accelerated back-to-front blits, I needed to set up manual clipping for that so I won't overwrite for instance the Teapot's menubar and its dropdown lists. Or overwrite windows that happen to be (partly) on top of 'my' outputwindow. Well: BGLView has a function called DirectConnected() which should contain a list of the so called 'clipping rectangles'. Unfortunately, it does not work correctly. There are two errors:

In order to overcome these problems, I created workarounds for both errors. The first problem is overcome by using BView's GetClippingRegion() function, and the second problem is overcome by comparing BGLView's initial given clipping rect to the View's size to find out if a menu exists or not. You see, while the menu's offset is missing, the clipping rect's size is actually correct!

Well, suffice it to say that the resulting code is actually working surprisingly well: I can't get it to malfunction currently (although in theory it could). The good news is by the way that this error should be easy to fix inside Haiku (without losing compatibility) should they decide to use BGLView in the end. Remember: this is just a 'personal' attempt...

If you want more details on the workarounds, check out Haiku's app_server mailing list: I posted my findings there just now.

Clipping impact:

You should recognize two different setups for a 3D driver here: single- and doublebuffered contexts. Let's first consider doublebuffered contexts, as this is simpler than singlebuffered contexts.

Double buffering
For doublebuffering, the actual 3D rendering takes place in the backbuffer. This backbuffer is never shown onscreen (for windowed apps), so we don't have to think about clipping around menus and such there. Only the part of the driver that copies the backbuffer into the window onscreen has to deal with it: this is the only function accessing the frontbuffer, where (system) menus (etc) might be shown.
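
As an illustration of that single clipping-aware copy, here is a minimal software sketch. The real driver uses accelerated blits and BeOS clipping data; the `ClipRect` type, the 32-bit buffers of equal size and the function name below are assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one clipping rectangle (inclusive edges, pixels).
struct ClipRect {
    int left, top, right, bottom;
};

// Copy the backbuffer into the frontbuffer, touching only the pixels that
// lie inside the visible clipping rectangles. Both buffers are assumed to be
// 32-bit and exactly width * height pixels in size.
inline void BlitBackToFront(const std::vector<uint32_t>& back,
                            std::vector<uint32_t>& front,
                            int width, int height,
                            const std::vector<ClipRect>& clips)
{
    assert(width > 0 && height > 0);
    assert(back.size() == front.size());
    assert(back.size() == static_cast<std::size_t>(width) * height);

    for (const ClipRect& clip : clips) {
        // Clamp each rectangle to the buffer so a stale rect cannot overrun it.
        const int x0 = std::max(clip.left, 0);
        const int y0 = std::max(clip.top, 0);
        const int x1 = std::min(clip.right, width - 1);
        const int y1 = std::min(clip.bottom, height - 1);
        for (int y = y0; y <= y1; ++y) {
            for (int x = x0; x <= x1; ++x) {
                const std::size_t i = static_cast<std::size_t>(y) * width + x;
                front[i] = back[i];
            }
        }
    }
}
```

Everything outside the rectangles (a menubar, an overlapping window) is left untouched, which is the whole point of doing the clipping in this one place.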

If we have a fullscreen app, we can (later on) simply flip buffers, rather than doing a copy. This is only possible here, as no other items are shown onscreen: the app has total control over the visible buffer. No clipping has to be done for this setup either: not even for the frontbuffer. The use of flipping buffers will of course further speed up framerates as flipping costs way less time than doing a copy: even if that copy is accelerated. As flipping should be simple to implement, I will add that to the driver later on.

Having flipping in place means that the backbuffer in fact becomes the frontbuffer, and vice versa. A side-effect of that property is that the 'frontbuffer' now needs to be setup using 3D granularity (see below for more info about granularity).

Single buffering
Ah: now we are in trouble! Remember hardware (back)buffer clearing? Well, now we want to do that on the frontbuffer! Which means, suddenly the clearing function has to take clipping into account. And that's not all: every 3D rendering function needs to take clipping into account as well!

Granularities and speeds:

While we are on the subject of single buffered rendering: let's talk about buffer granularities. Here's the thing: 2D acceleration functions have a certain granularity by which we have to abide. This granularity is taken into account by the 2D driver, and relayed to the app_server by use of the frame_buffer_config struct. That way we can use resolutions, that are not natively supported by the engine.

Well, here's the interesting part: 3D functions (might) need larger granularity! On nVidia cards I already confirmed this. So, if we are doing doublebuffering, I can leave the 2D driver setup as it is: after all, copying back-to-frontbuffer is a 2D function! But, if we are going to do singlebuffering, the engine will simply crash: now 3D functions have to directly access the frontbuffer, which is set up using 2D constraints only (the backbuffer and such are already set up using 3D constraints).

Of course, it's quite easy to patch the 2D driver to set up its buffer using 3D constraints instead of using 2D constraints: you can look forward to that in one of the upcoming versions of the 2D driver later on.

So: what's the use of granularities anyway? Well, this has to do with buswidths in the GPU, and to/from the graphics memory. By using a large width, the GPU (and RAM) is (are) able to process commands (much) faster than if the buswidths would be lower. Generally speaking: the newer the card, the larger the buswidths it has.

OK, that wraps it about up for today. As you might have guessed, I will now implement the move of the textures to graphics RAM. Talk to you again when I have news about that! Bye! :-)

4 April 2005: Well, BG has been fun, as usual. Only, it gets better each time around! It's also very nice to see the improvements for both Zeta and Haiku each time.. Furthermore two people donated their (older) graphicscards to me so I can finally see why these are still not cooperating as they should: RAM related bandwidth trouble on a TNT2-M64 and a GeForce2 Ti. Now I have those cards to test with I can hopefully finally solve or minimize this problem.

Anyway, back to 3D related news. First, I'd like to point out that I updated the Mesa 3.2 source and dano 'executable' downloads to run Quake2 now. I found myself looking for a HW rendering problem I thought I introduced, so I had to look at my starting point again to see if that was indeed the case. Luckily for me, it was not. The 'numbers' in the middle-lower border of the screen are missing here as well: which means I don't have to look into that as this will be solved automatically when we switch Mesa versions later on. Pfeww! In order to test for this problem, I had to further 'update' Mesa 3.2 which is why I uploaded the source to this site as well. Download it here if you want (although it is still just software rendering of course):

The backbuffer is up and running!

I have the backbuffer up and running nicely. This buffer is now in the same space as the frontbuffer, as opposed to 'normal' software rendering which always uses 32bit space. I have set up a sparse version of accelerated blits for back to frontbuffer rendering: works nicely. This is synced to the (Direct)window via the DirectConnected() function. I need to set up manual clipping to let you see the menus, and I need to rewrite the frontbuffer write/read bits stuff. I already have rewritten the write/read bits functions for backbuffer access: they work in 16 and 32bit space only atm. The rewrite is needed because while the buffer's space is in 'frontbuffer' depth, the colors handed to those access functions always remain in 32bit.
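
The depth mismatch can be sketched as follows. This is only an illustration: it assumes the 16bit space is RGB565 and the 32bit colors are ARGB, and both function names are made up for the example rather than taken from the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Pack a 32-bit ARGB color into 16 bits. RGB565 is an assumption here:
// the post only says '16bit space'.
inline uint16_t PackRgb565(uint32_t argb)
{
    const uint32_t r = (argb >> 16) & 0xFF;
    const uint32_t g = (argb >> 8) & 0xFF;
    const uint32_t b = argb & 0xFF;
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Write one pixel into a buffer of the given depth. The color always
// arrives in 32-bit form and is converted when the buffer is 16 bits deep.
inline void WritePixel(uint8_t* buffer, int pitchBytes, int bitsPerPixel,
                       int x, int y, uint32_t argb)
{
    assert(buffer != nullptr);
    assert(bitsPerPixel == 16 || bitsPerPixel == 32);
    uint8_t* row = buffer + static_cast<std::size_t>(y) * pitchBytes;
    if (bitsPerPixel == 32) {
        std::memcpy(row + x * 4, &argb, sizeof(argb));
    } else {
        const uint16_t packed = PackRgb565(argb);
        std::memcpy(row + x * 2, &packed, sizeof(packed));
    }
}
```

A read function would do the reverse: fetch the 16-bit value and expand it back to 32 bits before handing it to the caller.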

All in all, getting the backbuffer fully up and running is a lot of work, as this is a BDirectWindow like setup. And I still have to fix a few things:
That's it for now. An extra update will be here soon explaining more about frontbuffer clipping and how far this influences 3D rendering. But first: Back to work! 8-)

25 March 2005: I have some very good news for you today: The hardware Z-buffer clear command is now actually working! This means that both the NV10_CONTEXT_SURFACES_ARGB_ZS and NV10_DX5_TEXTURE_TRIANGLE command are up and running. Which in turn means I will actually be able to get this driver going, unless I am very much mistaken! Of course it remains to be seen which cards will work with it, and if I can get DMA up and running...

OK, here's the story. After I wrote the previous 'mixed emotions' message here, I started thinking. I realized I had developed the original 2D driver (PIO mode) using XFree 4.3.0 as a reference for specs. Of course the UtahGLX driver never(?) worked with that! So I crosschecked the XFree 4.3.0 driver with the 4.2.0 driver which did work with UtahGLX. After doing this, I saw I had everything in place, and I realized something else: The UtahGLX nVidia driver is still being worked on after all. It seems someone is trying to make it less dependent on XFree, and to make it work alongside newer XFree versions. The 3D init code I have in the 2D driver came from the UtahGLX CVS checkout I did late last year. This stuff wasn't in XFree 4.3.0, but it was in 4.2.0. Which means that someone 'moved' it in the UtahGLX driver after the XFree 4.2.0 release. Interesting to see the past 'unravel' :-)

Well, so there was no problem here, and my 2D driver should do enough initing for the 3D driver. The problem had to be somewhere else. Of course, when I looked at my 3D testcode again, I immediately saw a fault: I initialized a command pointer to the wrong command. Don't know how I could have missed that, I was plain tired or so I guess. This fault wasn't the only thing causing trouble though: the card didn't work as expected either. After I switched back to a NV11 (GeForce2MX400) and removed the pointer fault it all worked at once. Or, to be more exact, I already switched back the day before in an effort to rule out the card as a troublemaker. So I just had to remove the pointer fault this time around. And it works with an unmodified copy of the nVidia 2D driver: V0.41.

HW clearing explained:

So, now that this clear command using 3D vertices and triangles worked, I wanted to know how that could be. I mean, I still thought in terms of doing 2D blit commands to fill a rect (previous attempt ;-). But, when you think about it, using a 'fixed' set of triangles is much smarter! Let me tell you how I think it works. So, how does this clear the Z-buffer? It's simple, actually. By rendering anything, you not only write to the visible buffer (called 'colorbuffer' in openGL terms), but you also write to the Z-buffer. After all, it somehow has to be determined if the next thing you will render lies in front or behind the previous thing you did. Well, by drawing a rectangle with the total screen's size in the utter background: anything rendered next will be closer by and needs to actually update the colorbuffer. Hence, we cleared the Z-buffer.

And how does this clear the colorbuffer? Well, the rectangle we drew lies in exact parallel to our 'viewing position' (the monitor's screen) as we specified the same Z-coordinate for all vertexes, hence resulting in a rectangle on-screen. On top of that, the function gets a 'clearcolor' specified that is used to fill that rectangle. For the Teapot, that's blackness. For Quake2, it's bright-red like. Want proof? Well, just specify to draw to Z position 'zero': so in the utter front of the 'world' we look at. The application won't render anything after that: nothing can be in front of our rectangle this time...
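
The idea of a 'fixed' set of triangles can be sketched like this. The vertex layout, the names and the choice to pass the far Z value as a parameter are assumptions for illustration; the actual data the engine command expects is not documented here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical vertex layout: screen position, depth and a flat color.
struct ClearVertex {
    float x, y, z;
    uint32_t argb;
};

// Build a screen-sized rectangle out of two triangles. All six vertices
// share the same depth (the far end of the Z range) and the same color, so
// rendering them resets the Z-buffer and fills the colorbuffer in one go.
inline std::array<ClearVertex, 6> BuildClearQuad(int width, int height,
                                                 float farZ, uint32_t clearColor)
{
    assert(width > 0 && height > 0);
    const float w = static_cast<float>(width);
    const float h = static_cast<float>(height);

    const ClearVertex topLeft = { 0.0f, 0.0f, farZ, clearColor };
    const ClearVertex topRight = { w, 0.0f, farZ, clearColor };
    const ClearVertex bottomLeft = { 0.0f, h, farZ, clearColor };
    const ClearVertex bottomRight = { w, h, farZ, clearColor };

    const std::array<ClearVertex, 6> quad = {{
        topLeft, topRight, bottomLeft,      // first triangle
        topRight, bottomRight, bottomLeft   // second triangle
    }};
    return quad;
}
```

Passing the nearest Z value instead of the farthest one reproduces the 'proof' described above: nothing rendered afterwards can end up in front of the rectangle.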

Benchmark results and remaining problems:

I benchmarked a bit with HW clearing of the Z-buffer in place. After all, the speed should go up again, right? Well: right! Moving the Z-buffer to gfxRAM alone dropped GLteapot's framerate to about 35-40fps. Adding 'correct' HW clearing of the Z-buffer increased speed to about 40-45fps again.

Hmm, why do you say 'correct'?, you might ask. Well, the driver currently clears a buffer the size of the total screen, instead of a buffer the size of the Teapot's window. Rendering then is at about 17fps. This I will correct asap: and it did not exist in the UtahGLX driver before. Of course, for fullscreen apps the rendering speed will not be influenced by this. In order to force the driver to do just the correct size (more or less now), I 'enhanced' the NV10_CONTEXT_SURFACES_ARGB_ZS command compared to the UtahGLX/XFree 4.2.0 driver's version: I am explicitly setting the pitches for the color and Z-buffer (it used to be 'just' pre-configured by the 2D driver's acc init code). Testing explicit setting turned out to be rather interesting: it turns out that the engine's granularity for 3D activities is larger than it is for 'just' 2D operations! For the NV11 and NV18 this turns out to be 64 bytes (not pixels, mind you!), while I think (or rather, I hope) it's even larger for NV28 and some other architectures. I say this because of two things:

Anyway, this granularity thing is one of the first things I'll test now, as I'd like to be able to work on my laptop as well :).
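
As a worked example of what such a granularity means for a buffer pitch, here is a small sketch. The function name is made up, and it assumes the granularity is a power of two (as the 64 bytes found for NV11 and NV18 is).

```cpp
#include <cassert>
#include <cstdint>

// Round a buffer pitch (bytes per line) up to the engine's granularity.
// The granularity is assumed to be a power of two.
inline uint32_t AlignedPitchBytes(uint32_t widthPixels, uint32_t bytesPerPixel,
                                  uint32_t granularityBytes)
{
    assert(granularityBytes != 0);
    assert((granularityBytes & (granularityBytes - 1)) == 0);
    const uint32_t raw = widthPixels * bytesPerPixel;
    return (raw + granularityBytes - 1) & ~(granularityBytes - 1);
}
```

A 1000 pixel wide window in 16bit space needs 2000 bytes per line, which gets rounded up to 2048 bytes with a 64 byte granularity; a 1024 pixel wide 32bit buffer already fits at 4096 bytes.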

The next steps:

Well, after reading the previous stuff, you might have guessed the next 'bigger' steps already:

Interesting side-effects of (testing) HW 3D rendering:

The question of synchronisation between app_server/2D driver and 3D driver came up once or twice. I can now tell a bit on this, with evidence. As you might know, 'clients' using the 2D driver are required to ACQUIRE_ENGINE before they want to do something accelerated, and RELEASE_ENGINE asap after that. This is the way for instance the app_server and BWindowScreen are serialized when they both want to accelerate (2D) drawing. Well, nothing makes more sense than keeping that system up for 3D as well. Hence, as far as the 2D driver is concerned (more or less), the 3D driver is 'just' another clone of the accelerant (just like BWindowScreen uses one). This means sync between 3D and 2D drawing (i.e. moving the GLTeapot window (== 2D) while it's spinning the teapot (== 3D)) is also done via ACQUIRE_ENGINE/RELEASE_ENGINE (a benaphore actually).

Well, I tested this and was punished right at the start. It turned out I had split up something that should be one 3D engine command into three different parts, each doing the ACQUIRE/RELEASE stuff. It turns out the engine really gets confused if you insert another (2D) command in between those parts. This is something I can understand however: I was just on the wrong path because in the UtahGLX code these three subparts exist just as that: but they are in fact just one engine command.
So, I combined them to be as one, and then it worked flawlessly. Of course, dragging the window is less fluent when the engine is working on 3D as well, but it works perfectly. I'll optimize the process along the way BTW.
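
For readers who have not met a benaphore before, here is a minimal sketch built from standard C++ parts instead of BeOS semaphores. The `Engine` struct, the FIFO modelled as a vector and the function names are assumptions for illustration; the sketch only shows why all parts of one engine command belong under a single acquire/release pair.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from standard C++11 parts.
class Semaphore {
public:
    void Acquire() {
        std::unique_lock<std::mutex> lock(fMutex);
        fCondition.wait(lock, [this] { return fCount > 0; });
        --fCount;
    }
    void Release() {
        { std::lock_guard<std::mutex> lock(fMutex); ++fCount; }
        fCondition.notify_one();
    }
private:
    std::mutex fMutex;
    std::condition_variable fCondition;
    int fCount = 0;
};

// Benaphore: an atomic counter in front of a semaphore. The semaphore is
// only touched when two clients actually compete for the lock.
class Benaphore {
public:
    void Lock() { if (fCount.fetch_add(1) > 0) fSemaphore.Acquire(); }
    void Unlock() { if (fCount.fetch_sub(1) > 1) fSemaphore.Release(); }
private:
    std::atomic<int> fCount{0};
    Semaphore fSemaphore;
};

// Hypothetical engine: the FIFO records which client wrote each word.
struct Engine {
    Benaphore lock;
    std::vector<int> fifo;
};

// Issue ALL parts of one engine command under a single lock/unlock pair,
// so no other client can slip a command in between the parts.
inline void IssueCommand(Engine& engine, int clientId, int parts) {
    assert(parts > 0);
    engine.lock.Lock();
    for (int i = 0; i < parts; ++i)
        engine.fifo.push_back(clientId);
    engine.lock.Unlock();
}

// Let two clients (think: 2D and 3D) hammer the engine concurrently.
inline void RunTwoClients(Engine& engine, int commandsEach, int parts) {
    std::thread a([&engine, commandsEach, parts] {
        for (int i = 0; i < commandsEach; ++i) IssueCommand(engine, 1, parts);
    });
    std::thread b([&engine, commandsEach, parts] {
        for (int i = 0; i < commandsEach; ++i) IssueCommand(engine, 2, parts);
    });
    a.join();
    b.join();
}
```

With the lock held across the whole command, the FIFO only ever contains complete runs from one client. Locking around each part separately would allow exactly the interleaving that confused the engine.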

OK: say you are having such difficulty, but you don't recognize it. How can you test for it? That's quite simple: just modify the 2D driver to not export the acceleration hooks, but keep initing the engine. If you now run a 3D app, that will use the engine: but the 2D driver won't. Problem solved? There you go. Worked for me... ;-)

I'll finish for today with an interesting side-effect I encountered while testing all this: I can hear (yes indeed: hear) the engine using power. If I drag the Teapot window (accelerated), I hear the system's power supply (350W I think, testing a passively cooled AGP NV18) complaining (a bit) about the strong fluctuations in power drain: You can hear some noises coming out of it regulating the output voltages, in sync with drawing :-). I heard it with just 2D acc as well (scrolling in sourcecode), but it's definitely louder now. While this doesn't matter (it's its job to regulate after all :), it proves the card actually has to work for us now ;-).

22 March 2005: Well, I got both good news and bad news I guess.

The good news is that I have set up the lowlevel engine commands again, being NV10_CONTEXT_SURFACES_ARGB_ZS and NV10_DX5_TEXTURE_TRIANGLE. I confirmed having access to the engine and its FIFO, and I confirmed I can issue a 2D related command that apparently gets executed.

The bad news, however, is that issuing one of those 3D related commands locks up the acceleration engine. I can see the FIFO receiving the commands, but, as they won't execute: the FIFO fills up and never gets emptied again.

I've now got to the point I have to figure out what's wrong by trial and error, combined with painstaking bitwise comparing of my 2D and 3D drivers to the Linux 2D and 3D drivers. This is sort of a point I've been at twice before with nVidia cards: When I setup PIO mode acceleration in the early phase of development on the 2D driver, and just now, when adding DMA mode acceleration to that driver. This, combined with the fact the 3D commands mentioned actually worked on Linux, means that I should be able to find the problem in theory.

I have to admit it tires me a lot though, having to do this: it costs a lot of energy. Sometimes during this I get urges to throw out my computers and never look at them again. Luckily, up to now, after a day or two of doing nothing I also get the urge to find out what's wrong. It's in my nature to dig deeper and deeper.. :-) On the other hand, it gets more difficult every time around to overcome the natural resistance I feel on such things.

Anyway: no promises (as usual). I'll do my best and hope I can nail this one. I've come to 'hate' not having documentation however!

20 March 2005: Quake2 is now running (unmodified) on Mesa 3.2. While trying to stabilize the GLteapot I:

Well, the Teapot is more stable now (although it's still not perfect, but we'll switch to Mesa 6.2 later on anyway). And quake2 works: which is nice to have for more extensive testing with my current efforts. It's interesting to see that Mesa 3.2 apparently offers a lot less options than version 6.2: the floors in the quake rooms are perfectly clean for example (no textures there). I benchmarked quake2 again just for fun:

As a final note for today I have to say the Be debugger (bdb) is proving itself to be invaluable! It's a very handy tool: Can't live without it in user-space! Although I never needed it for my 2D drivers: Yet.

18 March 2005: Yesterday was spent trying to get the Z (so depth) buffer relocated to the space already reserved on the graphicsRAM. In order to do that, I had to hook the driver into the Mesa driver-interface function 'AllocDepthBuffer'. Well, that didn't go as easily as I hoped. After a lot of searching, it turned out you also have to hook in two other functions: 'DepthTestSpan' and 'DepthTestPixels'. Mesa has its own internal 'fallback' versions of them (like it has for all others as well), but in the case of these two it only uses them as long as AllocDepthBuffer is managed by Mesa internally as well. It thinks that if a driver decides to manage AllocDepthBuffer, this driver will manage the other two functions as well.
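
The 'all or nothing' behaviour can be pictured with a simplified table of function pointers. To be clear: only the three hook names come from the text above. The struct, the empty signatures and the helper functions are invented for this sketch and are not Mesa's real driver interface.

```cpp
#include <cassert>

// Simplified, hypothetical stand-in for the depth-related part of a driver
// function table. A null pointer means 'let the library use its fallback'.
struct DepthHooks {
    void (*AllocDepthBuffer)();
    void (*DepthTestSpan)();
    void (*DepthTestPixels)();
};

// The three hooks belong together: either the driver supplies all of them,
// or it supplies none and the fallbacks handle everything.
inline bool DepthHooksAreConsistent(const DepthHooks& hooks)
{
    const int provided = (hooks.AllocDepthBuffer != nullptr)
        + (hooks.DepthTestSpan != nullptr)
        + (hooks.DepthTestPixels != nullptr);
    return provided == 0 || provided == 3;
}

// Hypothetical placeholder for a hook whose real implementation would
// access the Z-buffer in graphics RAM.
inline void PlaceholderHook() {}

// Build a table for a driver that manages the depth buffer itself.
inline DepthHooks MakeDriverDepthHooks()
{
    const DepthHooks hooks = { &PlaceholderHook, &PlaceholderHook, &PlaceholderHook };
    assert(DepthHooksAreConsistent(hooks));
    return hooks;
}
```

A check like this would have flagged the situation described above, where only AllocDepthBuffer was hooked and the two depth test functions were still missing.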

While it probably makes sense (they are all depth-buffer related), as far as I can see it should be possible to use Mesa's fallbacks just the same. In fact, I tested that successfully by copying some Mesa code into the driver. Anyway, I don't want to introduce possible new variables in this puzzle, so I decided to hook in those other functions as well. Which in turn, required me to copy some more core code from UtahGLX into the driver (from xsmesa2.c). The reason for this is that the 'upper level' of those two functions is sort of 'common driver code' in UtahGLX, while the two low-level functions sitting under them (that actually access the Z-buffer) are core-driver functions: these two were in GLXProcs. Still, for the current nVidia driver it seems to bring no special stuff: the Mesa fallbacks seem to do just the same.

Well, all in all I got the new Z-buffer up and running (no HW clearing yet, as 'explained' in the previous post here). The simple proof of the Z-buffer working is that the Teapot's framerate immediately dropped to about one quarter of the previous speed, as I indicated would happen some time ago ;-) Other details I had to understand were:

OK. That's it about the Z-buffer for now. I should probably tell you that it's mighty convenient that Mesa by default has its software fallbacks (which were always used on BeOS until now, with Mesa). Not only can you implement 'function by function' in your driver, but you can also look at 'the reference implementation' for a good idea of what it does/should do. And of course, sometimes using the fallback code in your driver, and then replacing that piece by piece with your own code, can help you hunt down bugs (I speak from experience now ;-).

Dano's hardware acceleration I have to talk about briefly as well it seems. Dano(Zeta?) has (have) both the 'old' openGL and the 'new' openGL aboard. The old openGL is the libs as we know them, which use BGLView. This openGL is software rendered only. The 'new' openGL uses BDirectGLWindow (if I remember correctly), and a new lib (libGL2) that has hardware rendering in place. It seems that the 'Be Dano' G400 and ATI drivers use the 'default.so' lib, while the 'old' Voodoo cards use the other libs in the 'libGL2' subfolder(?). This means that if you'd want to see HW acceleration with them, you'll have to rewrite/recompile/relink your app against that lib, or it will use the 'standard' R5 lib. OK, this is just what I expect by now. I may be wrong, but I frankly don't care that much -- unless we get access via Zeta I guess.

Anyway, what I want to point out is that what I am working on is HW acceleration for the 'old' openGL (although it will be openGL1.5 as Mesa 6.2.1 is that far). This 'old' openGL is Mesa compatible I'd say, while the 'new' openGL is undocumented and differs, if only in the BeOS specifics. Creating HW acceleration for the old openGL 'method' means that it should work on R5, dano, Zeta, and (most) other R5 derivatives as well. I expect.

Talk to you later, first I'll do more puzzling.

16 March 2005: Time for another update I guess. I've hooked some utahGLX driver functions into Mesa 3.2 using its software BeOS driver. Also I've added my previous 3D add-on driver that did nothing more besides hardware Z-buffer clearing. I removed the Z-buffer clear function though as that should be done in another way. What this driver does add, is shared_info access from the 2D driver. I used it to rewrite some bits and pieces in the UtahGLX driver so that it now correctly reserves graphics memory space for Z-buffer (depth), back buffer (double buffering) and textures. I'm still in the beginning phase though: so no hardware acceleration yet.
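
Reserving that space comes down to laying out a few regions one after the other in graphics RAM. Here is a minimal sketch of such a layout calculation. In the driver the sizes would come from the 2D driver's shared_info; here they are plain parameters, and the struct, the function names and the power-of-two alignment are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of the layout: byte offsets into graphics RAM.
struct BufferLayout {
    uint64_t zBufferOffset;
    uint64_t backBufferOffset;
    uint64_t textureOffset;
    uint64_t textureSize;
};

// Round a value up to a power-of-two alignment.
inline uint64_t AlignUp(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Place the Z-buffer and the backbuffer right behind the memory the 2D
// driver already uses, and hand whatever remains to the textures.
// Returns false when the buffers do not fit in the available RAM.
inline bool PlanGraphicsMemory(uint64_t ramSize, uint64_t frontBufferEnd,
                               uint64_t zBufferSize, uint64_t backBufferSize,
                               uint64_t alignment, BufferLayout& layout)
{
    const uint64_t zOffset = AlignUp(frontBufferEnd, alignment);
    const uint64_t backOffset = AlignUp(zOffset + zBufferSize, alignment);
    const uint64_t textureOffset = AlignUp(backOffset + backBufferSize, alignment);
    if (textureOffset > ramSize)
        return false;

    layout.zBufferOffset = zOffset;
    layout.backBufferOffset = backOffset;
    layout.textureOffset = textureOffset;
    layout.textureSize = ramSize - textureOffset;
    return true;
}
```

The order of the regions is arbitrary in this sketch; what matters is that every region starts on an aligned address and that the plan fails cleanly when the card has too little memory.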

While poking around in the driver(s), a path becomes more clear through all the functions that need to be made to work. I will at first run the UtahGLX driver in 'software rendering' mode. This is what that driver falls back to if the current 3D context isn't supported. This is currently the case for: or a combination of those 'items'. In software rendering mode clearing the buffers remains being done with hardware acceleration however. This gives me the opportunity to test just one accelerated function to begin with. While doing that I need to setup lowlevel engine access, which probably 'requires' some more updates to the 2D driver (headers) as well. On top of that, this forces me to find out how in the world some of those Linux-specific structs for X11 need to be filled out: as I need to instruct this function about clipping (called 'window clipping' and 'scissor clipping'). Clearing a buffer is done after first applying clipping information.

Well, if I am to set that function up, I also need to map the buffers to graphicsram first: which means the first step now needs to be doing just that. The UtahGLX driver only works on 'local' graphics memory: it doesn't use AGP or PCI transfers to get stuff from main mem. If such a thing needs to be done, Mesa will do it for the driver instead: using software 'copying' of data. Maybe we can add hardware functionality for that later...

After the first 3D driver function works, it becomes time to instruct the driver to go to 'hardware rendering mode'. Then I have to re-orient myself to find out in what order I should do which stuff. Something interesting is for example, that the clear function does its thing via vertexes and drawing triangles: so HW 3D instructions.
I already played around a tiny bit, and I can easily switch on (or off) rendering of lines and triangles by hooking in NULL functions in Mesa's driver interface. Switching off lines alone for example kills the FPS output GLteapot can show you, and switching off triangles alone will keep the FPS display, but kill the teapot(s). Yes indeed, I can still see possible routes through this maze..
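
The trick of switching primitives off can be sketched with a small table of function pointers. Everything in this example (the table, the counters standing in for real drawing, the do-nothing function) is hypothetical and only mimics the idea; it is not Mesa's driver interface.

```cpp
#include <cassert>

// Counters standing in for what actually ends up on screen.
struct RenderStats {
    int lines = 0;
    int triangles = 0;
};

using PrimitiveFunc = void (*)(RenderStats&);

inline void DrawLine(RenderStats& stats) { ++stats.lines; }
inline void DrawTriangle(RenderStats& stats) { ++stats.triangles; }

// Hooking this in for a primitive type switches that type off.
inline void DrawNothing(RenderStats&) {}

// Hypothetical driver table: one hook per primitive type.
struct PrimitiveHooks {
    PrimitiveFunc line = &DrawLine;
    PrimitiveFunc triangle = &DrawTriangle;
};

// Render one frame through whatever functions are currently hooked in.
inline void RenderFrame(const PrimitiveHooks& hooks, RenderStats& stats,
                        int lineCount, int triangleCount)
{
    assert(hooks.line != nullptr && hooks.triangle != nullptr);
    for (int i = 0; i < lineCount; ++i)
        hooks.line(stats);
    for (int i = 0; i < triangleCount; ++i)
        hooks.triangle(stats);
}
```

Setting `hooks.line = &DrawNothing` makes the lines (the FPS display) vanish while the triangles (the teapot) keep rendering, and the other way around for `hooks.triangle`.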

Anyway: as you can see (well.. read), I'm having fun here! ;-)

12 March 2005: Hi, I've got a new update for you:
Well, after also changing the driver files' extensions from .c to .cpp the Mesa 3.2 libraries once again compile and build. They remain working as usual, which was expected of course: the driver doesn't get called yet. This officially ends the first part of trying to do 'step 1' as mentioned in the first post on this page. Now I can proceed with starting to (re)write code. Here the real fun should begin!

9 March 2005: Apart from what seems to be one remaining issue, I've got the core of the UtahGLX nVidia driver compiling on Dano in Mesa 3.2 now. Here's what I did: I added the driver in the BeOS software driver folder for now. That folder now contains these files: The riva_xxx files will be renamed to nv_xxx (as we don't support the riva chips), and probably another new file will be added that contains some 2D driver code for doing the 'cloning' as mentioned before. But that's about it concerning files.

Once I get that last remaining file compiling (riva_vb.c), I'll start work on integrating the pieces we then have to get something probably looking like this schematically:

Mesa 3.2
3DA nVidia driver -- GLH -- BeOS 3D related classes
| |
| graphicscard
2D nVidia kerneldriver -- 2D nVidia accelerant 'shared info'

Well, that's about it for now. I have to tell you that the more I look at it, the more I think it should actually work. Still, don't hold me to it though. And: You know that we are talking about a very basic 3D driver, right? Right! (Anyway: accelerated Quake2 would be nice to have I think. Do we actually have an opensource port of that BTW? ;-)

Talk to you later..

5 March 2005 (updated 5 May 2005): Since yesterday I've got MESA 3.4.2 working on dano/R5, and since today MESA 3.2 also works OK. I decided to put some more effort in getting them to run on dano, because R4.5 had a few downsides for me:
I am using Oliver Tappe's gcc version 2.95.3-beos-041111: compared to the BeOS 'default' version 2.9, MESA gains about 10% in speed.

So: why two versions of 'old' MESA? Well, version 3.2 is the 'official' version to use with utahGLX, and version 3.4.2 is most likely the newest version that will actually support a (more or less) '1:1' port of the utahGLX driver (just drop in place from 3.2). Also version 3.4.2 contains (some of) the new driver interface functions that are still used in current MESA 6.2: which should ease testing for switching versions later on. Oh, and version 3.4.2 is much, much easier to compile than 3.2 on dano (with that gcc version I mentioned). MESA 3.2 remains first choice for now however, as using that simply eliminates possible extra (unknown) variables otherwise maybe introduced.

If you'd want to actually use these old versions of MESA, you would (probably) need to re-link the apps to these old libraries (link to both libGL.so and libGLU.so). I got the GLteapot working nicely this way. The most important reason for having to relink the apps is that in the old days some functions needed/used resided in libGLU.so, while these days they are moved to libGL.so. Probably a natural 'development' of increasing openGL version numbers with the functions they support. Quake2 (closed source) works anyway over here (more or less): a lot of textures are missing during rendering. Anyway: it will all do nicely for now. :-)

For completeness sake I have setup a few archives you could download if you'd want (although it's not very interesting at this point I guess, as it's still software-only rendering): Stay tuned: I'll keep you posted on the work I do.

3 March 2005: Slowly a plan materializes to bring accelerated 3D support for nVidia to BeOS and derivatives. Let me fill you in on the current status :-)

In December 2004 the status of things was nicely covered by a number of BeOS related news sites. I want to point you at IsComputerOn though, as they had the most extensive coverage in place: you can read their article here. Well, that status regarding the actual nVidia 3D add-on itself is still current today (as I have been working hard on implementing 2D acceleration using DMA as you might know: driver 0.41). What did evolve a bit, is the thing I wrote about in the last paragraph of the quoted text in the article called: "The main challenge". This challenge was finding out how the utahGLX driver interacts with MESA.

It turns out that the (latest) context in which that utahGLX nVidia driver worked was this:

During development it's very important to take the smallest steps possible for the largest chance on success. This dictates that I should take the following roadmap to get to the goal:

In order to execute the first step, I need to switch back to use BeOS R4.5. Mesa 3.2 does not compile on R5 or later, but it does on R4.5. I already suspected, but confirmed just a few days ago, that the 2D drivers I code for nicely work from R4.5 and up: so dev can be done there. I should update the BeBits entries to reflect that I guess ;-) Did you know by the way that R4.5 does seem to support video overlay after all? The corresponding driverhooks are exported at least...

Step 2. will nicely speed up the 3D acceleration compared to what I mentioned over at IsComputerOn (up to a factor of 5): if I can pull it off that is. Otherwise I'll at least optimize PIO mode for both 2D and 3D. If DMA is going to work I'll remove PIO mode support from the 2D driver, otherwise it will stay for now.

Step 3. will be a challenge as well, as this requires me to gain more insight into the inner workings of openGL and MESA. MESA has a driver interface that's defined in a file called dd.h. This interface has changed considerably between version 3.2 and the current 6.2, though there are still many similarities. For instance (some) accelerated functions in the utahGLX driver have been moved into the tnl (Transform and lighting) subpart of that interface, while previously that subpart did not exist: the interface was sitting in the general interface. On top of that the routines take one less variable compared to the old days: the colorindex for things to be rendered is no longer passed. Apparently the 'current' (state) color is used instead now.

Creating an interface for MESA that we will use later on so we can interface multiple 3D accelerants with it is something I haven't thought about much yet. It seems pretty straightforward though: make sure we have all functions available that are in dd.h. On top of that we need the 'GLX' extensions for BeOS (just like GLX is the GL interface to X windows on linux, and there's a Windows variant of this as well). These GLX things are platform specific, and will translate to BGLView or so on BeOS: it deals with how rendering should take place so we can actually see it (and such?).
Anyhow: this interface is also used in the utahGLX driver, and works via a struct named 'GLXProcs'. As far as I understand now GLXProcs is the driver's link to the 'infrastructure' utahGLX provides to X-windows. This is the only thing utahGLX does which means I can forget about looking at that altogether (the functions are nicely described in the openGL reference manuals). Hopefully Philippe Houdoin (BeOS MESA 'port' maintainer) will be doing work on setting up this interface when we need it.

Setting up a BeOS GLXProcs interface and making the utahGLX driver interface correctly with the nVidia 2D driver is all that needs to be done to complete step 1. The interface to the 2D driver has already been setup in that driver (for both PIO and DMA mode, version 0.41) and actually works. It might need some more finetuning though: we'll see.

OK, that's the update for now. Can't wait to start on the 'remaining' work for step 1!

2 March 2005: Welcome to the page that will keep you up to date with my accelerated 3D attempts for nVidia cards on BeOS. You should look upon it as being a blog for now I guess. Not much here yet, but that will probably change in the (near) future ;-)


(Page last updated on May 6, 2016)