This is part 3 of a series on input latency. Check the first post for background information about input latency, and part 2 for general techniques for reducing input latency in applications.
This post discusses platform-specific ways to minimize input latency in native apps. Right now I have a lot of information about Windows and a little about Android and the Web. I hope to fill this out a little more in the future.
DWM (the Desktop Window Manager) is the compositing window manager for Windows. For every application running in a window, DWM adds one whole frame of input latency, as described in the tearing prevention section of part 2. This is bad! DWM has some features that can remove this extra frame of latency, but only in a couple of situations:
- If you make a full screen window, and there are no overlay windows on top of yours (e.g. on-screen volume display), DWM will skip compositing and remove that frame of latency for you. There is no need to use the old full-screen exclusive mode for latency reasons anymore. (You also don't need to use full-screen exclusive mode to change the screen resolution, because you can use hardware scaling via DXGI instead. And if the display supports VRR then you don't need to change the refresh rate either, making full-screen exclusive mode entirely obsolete.)
- For non-fullscreen windows, if the GPU driver supports a feature called Multiplane Overlay, then DWM can promote your window to a hardware overlay, which removes that frame of latency. Multiplane Overlay is currently (in 2022) only supported on Intel GPUs and on Nvidia RTX 20 series or newer GPUs. It is also disabled when certain features, such as display rotation or 10-bit color, are in use.
So Multiplane Overlay is great and you should support it if possible! To know if Multiplane Overlay is being used, and for exact input latency numbers, it is highly recommended to use PresentMon or one of the debugging tools that integrate it such as PIX. Multiplane Overlay usage is shown in PresentMon as "Hardware Composed: Independent Flip". PresentMon also shows an exact latency number which is calculated as the time from the Present call on the CPU to the VSync when the pixels start being sent to the monitor. (This is slightly less than the application's total input latency because it doesn't include the time from input event processing to the Present call, but it's still a very valuable number to have.)
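To make that check repeatable, here is a minimal sketch of summarizing a PresentMon capture once it has been loaded into memory. It assumes the capture has a present-mode column and a Present-to-display latency column (called `PresentMode` and `msUntilDisplayed` in the PresentMon 1.x CSV output, as far as I know; newer versions may name them differently). The `Frame` and `Summary` types and both function names are hypothetical, and CSV parsing is left out.

```cpp
#include <string>
#include <vector>

// One captured frame. The field names are hypothetical; they mirror the
// PresentMode and msUntilDisplayed columns assumed in the paragraph above.
struct Frame {
    std::string presentMode;
    double msUntilDisplayed;  // Present call -> pixels start being sent out
};

// True when the frame was shown through a Multiplane Overlay, using the
// PresentMon label "Hardware Composed: Independent Flip".
bool usedMultiplaneOverlay(const Frame& f) {
    return f.presentMode == "Hardware Composed: Independent Flip";
}

struct Summary {
    double overlayFraction;  // share of frames promoted to an overlay
    double meanLatencyMs;    // mean Present-to-display latency
};

// Summarize a capture. Returns zeros for an empty capture.
Summary summarize(const std::vector<Frame>& frames) {
    if (frames.empty()) return {0.0, 0.0};
    double overlay = 0.0;
    double total = 0.0;
    for (const Frame& f : frames) {
        if (usedMultiplaneOverlay(f)) overlay += 1.0;
        total += f.msUntilDisplayed;
    }
    const double n = static_cast<double>(frames.size());
    return {overlay / n, total / n};
}
```

An overlay fraction well below 1.0 for a windowed app suggests that DWM is dropping back to composition for some frames, for example while another window overlaps yours.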
As an interesting bit of trivia, the software rendered mouse cursor hack is built into the Windows operating system. Whenever you drag a window, DWM switches to a software rendered mouse cursor so that you don't notice DWM's 1 frame of latency. But this only works when dragging top-level windows managed by DWM.
DirectX 9 in a window has about 3 frames of latency by default. DirectX 9 cannot take advantage of Multiplane Overlay and has poor control over queue depth. It is best to avoid DirectX 9 if you care about input latency for windowed applications.
For newer DirectX you can use DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT and SetMaximumFrameLatency(1) to limit the queue depth. When running in a window on a system that doesn't support Multiplane Overlay, this will give you about 2 frames of latency by default. You can reduce that using the delayed rendering technique described in part 2, but DWM makes it impossible to go lower than 1 frame of latency.
With Multiplane Overlay, or a fullscreen window, latency is only limited by how fast you can render and how precisely you can schedule that rendering to hit VSync. It is also possible to disable tearing prevention with DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING. This is actually required to take advantage of VRR displays. It also enables you to implement beam racing, if you are able to take advantage of it.
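The scheduling part can be sketched in portable C++. This is only an illustration of the arithmetic, not a complete frame pacer: it assumes a fixed refresh rate (no VRR), that you already have one recent VSync timestamp and the refresh period from somewhere, and that you keep your own estimate of render time. The names `nextVsync`, `renderStart`, `estimatedRenderTime` and `safetyMargin` are all hypothetical.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;
using Duration = Clock::duration;

// Predict the first VSync strictly after `now`, given one observed VSync
// time and the refresh period. Assumes a fixed refresh rate and
// now >= lastVsync.
Clock::time_point nextVsync(Clock::time_point now,
                            Clock::time_point lastVsync,
                            Duration period) {
    const auto periodsPassed = (now - lastVsync) / period;  // integer division
    return lastVsync + (periodsPassed + 1) * period;
}

// The time to start rendering so that the frame is finished just before
// `vsync`. The safety margin absorbs timer jitter and render time spikes.
Clock::time_point renderStart(Clock::time_point vsync,
                              Duration estimatedRenderTime,
                              Duration safetyMargin) {
    return vsync - (estimatedRenderTime + safetyMargin);
}
```

In a render loop you would sleep until the time returned by `renderStart`, for example with `std::this_thread::sleep_until`, then sample input, render and present. A smaller safety margin means lower latency but a higher risk of missing the VSync, which costs a whole frame.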
OpenGL (on Windows)
OpenGL does not support using the Multiplane Overlay feature, so non-fullscreen OpenGL windows will always have more than one frame of latency, even if VSync is disabled.
Vulkan (on Windows)
I haven't tried this, but it should be possible to use Vulkan to render into a DXGI swap chain and take advantage of Multiplane Overlay that way. I don't think Vulkan has a built-in way to take advantage of it otherwise.
The Android Game Development Kit provides a library called Swappy that is almost certainly what you want to use to control your frame rate and queue depth. It has good documentation available here.
Unfortunately Android's compositing window manager SurfaceFlinger adds a frame of latency, just as other compositing window managers do. I am not sure if SurfaceFlinger is able to remove that frame of latency using hardware overlays the way DWM does. If you know, drop a comment below! I'd also like to hear about any Android equivalent of PresentMon.
Unfortunately WebGL and requestAnimationFrame give you no explicit control over the queue depth. I have done some experimentation with the "desynchronized=true" WebGL context attribute, pointerrawupdate, and OffscreenCanvas in Chrome, and I was unable to wring any latency benefit out of these features, at least on my Windows machine. Even if there is some benefit to be had, it is likely to be Chrome-only and limited to some Chrome platforms, a lot of work to implement, and fragile as browsers update and change their behavior.
Unless the web platform adds explicit control over queue depth, input latency of WebGL and WebGPU content in browsers is going to remain poor. Many of the other techniques discussed here are not possible on the Web either.
I'd like to investigate more about Apple platforms and Linux in the future. If and when I do, I'll add more information here. If you have pointers to useful information, please add a comment below, or contact me directly using the links at the top of the page.