The Quest for VSync

It is well known to anyone who has programmed under Windows that VSync pretty much does not exist. A simple thing that is taken for granted on other devices is near impossible on Windows, and when VSync is finally coaxed out of this broken OS it is always unreliable and inefficient.


It is possible to obtain VSync quite trivially using IDirectDraw::WaitForVerticalBlank. In a correct world, that would be the end of the discussion. HOWEVER, the abomination that is Windows thinks that the correct way to implement this function is to poll the retrace bit. Whilst this method was perfectly acceptable under DOS, it most certainly is not on a multi-tasking system. Not only does it burn 100% CPU, it is also likely to miss the retrace, since the program can be preempted at any time. And what if multiple programs wait on VSync?


A much improved method is to combine IDirectDraw::GetScanLine with Sleep. By calculating an appropriate number of milliseconds to sleep and then polling for the remainder, CPU usage can be reduced dramatically, to less than 10%. I have been able to achieve near-solid VSync with this method while leaving some CPU to actually do something with. It is, however, very fragile and can easily suffer timing issues. IDirectDraw does not expose the total number of scan lines in a frame, so it must be approximated on program startup. Raising the thread priority to RealTime can reduce the likelihood of incorrect calibration, but it can still fail. In that case calibration must be redone at the user's request, as VSync will be crappy and broken when the number of scan lines is wrong.
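As a rough sketch of the arithmetic behind the sleep-then-poll approach: given the current scan line, work out how many whole milliseconds can safely be slept before polling has to take over. The function name, the parameters and the safety margin are all assumptions made for illustration, and the total line count is taken to come from the startup calibration mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Whole milliseconds that can be slept before polling must take over.
 * current_line : value read from the scan-line query
 * target_line  : line to wake up at (e.g. the first line of vertical blank)
 * total_lines  : calibrated lines per frame, blanking included (assumed known)
 * refresh_hz   : display refresh rate
 * margin_ms    : hypothetical safety margin left for polling
 */
uint32_t sleep_ms_before_line(uint32_t current_line, uint32_t target_line,
                              uint32_t total_lines, uint32_t refresh_hz,
                              uint32_t margin_ms)
{
    assert(total_lines > 0 && refresh_hz > 0);
    assert(current_line < total_lines && target_line < total_lines);

    /* Lines left until the beam reaches target_line, wrapping at frame end. */
    uint32_t lines_left = (target_line + total_lines - current_line) % total_lines;

    /* One line lasts 1e6 / (refresh_hz * total_lines) microseconds. */
    uint64_t us_left = (uint64_t)lines_left * 1000000u
                     / ((uint64_t)refresh_hz * total_lines);

    uint32_t ms_left = (uint32_t)(us_left / 1000u);
    return ms_left > margin_ms ? ms_left - margin_ms : 0;
}
```

The caller would pass the result to Sleep and then poll GetScanLine until the target line is reached. A larger margin costs more CPU in the polling phase but tolerates more Sleep overshoot.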


The reason VSync is shit on Windows is that there is no access to a VSync interrupt. Most graphics hardware has a VSync interrupt that is used internally by the driver, but there is no interface to access this functionality in Windows. At first I attempted to gain access to this sacred resource by hacking the display driver; this was fruitless. I then thought of another method: THE PARALLEL PORT. The parallel port has an interrupt line, and this line can inform the CPU of a realtime event within microseconds. And what realtime event would coincide with the retrace? The VGA VSYNC line, of course.

A VGA male-to-female adapter splits out the VSync line, which then runs up to the parallel port, entering pin 10 (the ACK/INT line). The adapter was constructed from standard 15-pin male and female VGA sockets. My idiot self thought it would be a good idea to solder them together without any gap between; the center pins, being the longest, were joined directly without wires. This left a very cramped work area for soldering the short link wires that join the upper and lower rows. After several shorts it looked like I might have to scrap the piece, but in the end it was completed. It cannot be considered a neat job, however.


It's all well and good making the physical connection, but without a driver to go with it, it's useless. Driver development is not easy: the documentation is poor and in some cases completely wrong, and experimentation leads to crashes and misery. I attempted several times over several months (or longer), following different examples (even the example in the driver SDK), to connect to the parallel port interrupt. Each attempt failed and led to multiple crashes. The examples just did not work. I then made a breakthrough when I found a working example as part of a working driver, the OpenCBM driver. After butchering the parallel port interface code out of the OpenCBM driver I was able to get a working VSync interrupt. I made this signal accessible from user mode via a global named pulsed event. Getting perfect VSync was now a simple matter of opening the named event "VsyncEvent" and then calling WaitForSingleObject on it.

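// Open the global VSync event (TEXT() keeps this valid in Unicode builds)
HANDLE vsyncEvent = OpenEvent(SYNCHRONIZE, FALSE, TEXT("VsyncEvent"));
if (vsyncEvent != NULL) {
    // Block until the next VSync interrupt pulses the event
    WaitForSingleObject(vsyncEvent, INFINITE);
}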

Trace shows VGA VSync pulse superimposed with parallel port data output bit. The parallel output bit is pulsed from user mode after waiting on vsync event. Total interrupt latency into user mode < 13 µs.


Due to potential jitter between the interrupt and user-mode code, and the fact that some frames may be missed (when rendering at 30fps, for example), it can be difficult to count frames and maintain steady timing. To solve this the driver keeps a counter that is incremented each frame, and also a time stamp for when the last VSync interrupt occurred. This information is written into spare bytes in the SharedUserData page, which is mapped into the address space of all processes at a fixed address. Writing to SharedUserData is somewhat ill-advised, as this memory area belongs to the operating system, and whilst these bytes are spare, if more than one driver tried this there would be potential for conflicts. The method is, however, nice and fast, allowing direct access to this information from all processes without having to open a handle to the VSync driver and call into kernel mode.
// Shared memory structure
struct vsync_status_t {
    DWORD frameCountA;
    DWORD frameCountB;
    INT64 frameTime;
};
The default base address for this structure is 0x7ffe0FF0. frameCountA is set before frameTime, and frameCountB is set after. Having two frame counters allows an update tear to be detected. To read the structure without tearing, first copy frameCountA, then copy frameTime, and finally compare frameCountB to the copy of frameCountA. If they differ, perform the entire operation again. If they are equal, the copies of frameCountA and frameTime are a correct reading.
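A minimal sketch of a retrying read, written as portable C so it can be tried against an ordinary struct rather than the real shared page. Note that it is a variation on the order given above: it copies frameCountB first and frameCountA last, the reverse of the write order, which also rejects a read where an update starts just after the first counter has been copied. The type name, function name and retry limit are hypothetical, and the sketch assumes that volatile accesses are enough to keep the loads in order on the target CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the shared structure, using fixed-width types. */
typedef struct {
    uint32_t frameCountA;  /* written first by the driver  */
    uint32_t frameCountB;  /* written last by the driver   */
    int64_t  frameTime;    /* written between the counters */
} vsync_status;

/* Copies a consistent (count, time) pair out of *shared.
 * Returns 1 on success, 0 if every attempt saw a torn update.
 * max_tries is a hypothetical safety limit, not part of the original scheme. */
int vsync_read(const volatile vsync_status *shared, int max_tries,
               uint32_t *count_out, int64_t *time_out)
{
    assert(shared && count_out && time_out);
    for (int i = 0; i < max_tries; ++i) {
        uint32_t last  = shared->frameCountB;  /* written last, read first */
        int64_t  t     = shared->frameTime;
        uint32_t first = shared->frameCountA;  /* written first, read last */
        if (first == last) {                   /* no update overlapped us  */
            *count_out = last;
            *time_out  = t;
            return 1;
        }
    }
    return 0;
}
```

Against the real page, the pointer would be the base address cast to `const volatile vsync_status *`. An update takes only a few instructions, so a retry should almost always succeed on the second pass.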


To make use of the VSync interrupt, software must be modified. The first piece of software I modified was Kega Fusion 3.64 (a Sega MegaDrive emulator). The first attempt was done via DLL injection and runtime patching of the code. The patch was trivial and resulted in perfectly solid VSync at low CPU usage, where previously enabling VSync would cost 100% CPU. I have since rewritten the patch using my new exe modifier, so the patch is now internal to the application.

I also modified DukeNukem3D to use my VSync method. To get perfect frame lock I modified the game timer code, replacing the calls to QueryPerformanceCounter with an artificial time derived from the frame counter. With the polygon software renderer, windowed at 800x600@60, the CPU usage with perfect VSync was ~10%. Full screen at 800x600@120 it was around 30%.
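To illustrate the idea of an artificial time derived from the frame counter, here is a small sketch that converts a frame count into timer ticks. It is not the actual DukeNukem3D patch; the function name, the parameters and the choice of a base frame captured at startup are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Artificial replacement for a high-resolution timer reading.
 * base_frame       : frame counter value captured when the game timer started
 * current_frame    : latest frame counter value
 * refresh_hz       : display refresh rate
 * ticks_per_second : resolution the game expects from its timer
 */
int64_t frame_time_ticks(uint32_t base_frame, uint32_t current_frame,
                         uint32_t refresh_hz, int64_t ticks_per_second)
{
    assert(refresh_hz > 0 && ticks_per_second > 0);
    /* Unsigned subtraction stays correct when the 32-bit counter wraps. */
    uint32_t frames = current_frame - base_frame;
    return (int64_t)frames * ticks_per_second / refresh_hz;
}
```

Because the result only ever advances in whole frames, game logic driven by it stays locked to the display even when the wake-up after each VSync jitters slightly.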


Kernel mode driver
Vsync test program
Fusion 3.64 Vsync Mod
DukeNukem3D Vsync Mod