The New Era of Video Backends: The Unification of VideoCommon

It's not common for a rewrite to warrant an article, but this is one of the exceptions. Over the past few years, parts of Dolphin's video core have seen renovations to make way for new features, but a fundamental problem remained. Dolphin's video backends each had too many unique features while also duplicating tons of code from the other backends, making it difficult to add new features and maintain old ones.

Those that have followed Dolphin from the very beginning may remember that its video backends were once video plugins. While these plugins were eventually brought into Dolphin as video backends, their initial existence as plugins meant they were designed and laid out in certain ways that were extremely inefficient for something integrated with the emulator. After dealing with these limitations and issues as much as anyone, stenzek finally took the leap and renovated Dolphin's video core.

This massive rewrite, dubbed The Great VideoCommon Unification Project, unites most of the graphics emulation logic in a shared part of Dolphin, greatly reducing the size and complexity of each video backend! Along with the rewrite, stenzek has brought with it a slew of small improvements, optimizations, and a big surprise...

A History of Plugins

In the early days of emulation, emulators were mostly hobby projects developed by single developers. Even though open source licenses had existed for some time, good infrastructure wasn't available yet, open source hadn't been legally tested, and it just wasn't necessary. One person could create a basic (inaccurate) 3rd or 4th gen console emulator fairly quickly, and with the complexity of accurate emulation out of their reach due to hardware limitations of the time, they had little reason to work together. Outside of a few exceptions, everyone happily worked on their own proprietary closed source projects.

The 5th generation of consoles would change the game for emulation. Due to their increased complexity that required a much wider range of knowledge to emulate, collaboration suddenly became essential. But without the protection open source licensing provides, sharing code was dangerous since it exposed developers to theft. And so, the plugin system quickly took hold among the emulators of 3D consoles. The plugin system allowed a single person (or tight core of trusted people) to write a closed source core of an emulator, then allow outsiders to access the emulator core through an API. Each plugin could contain full emulation logic for a device or piece of hardware and would communicate back and forth with the emulator core as needed. With plugin developers only having to worry about their part of the puzzle, the emulation scene thrived. There are literally hundreds of plugins from that era with their own strengths, weaknesses, and goals.

Plugin selection in Dolphin 2.0.

By the time Dolphin turned up in 2003, closed source with plugins was the de facto standard. Given its success in Nintendo 64 and PlayStation emulators, it was almost a given that plugins would be the answer for GameCube emulation. The 6th generation of consoles quickly broke that idea. With tremendously high complexity for even rudimentary emulation, the 6th generation required tight collaboration among dozens of people in a scope that the awkward plugin system was just not good enough to handle. Only a handful of GameCube emulators ever appeared, with only Dolphin reaching maturity. The GameCube would never have a plugin scene.

The plugin system gave Dolphin nothing but redundant, harder to maintain code. Naturally, Dolphin's ties to the plugin model quickly evaporated as developers cheated it by sharing common code between video plugins in a folder known as "VideoCommon." Once Dolphin was rereleased as an open source project, more and more code was moved from the plugins into VideoCommon, further distancing Dolphin from the plugin model. In the 2.0 era, the video plugins had become so tightly integrated into Dolphin that they could barely be called plugins anymore, and a decision was made to accept it. r6996 (2.0-1612) removed the plugin interface, and r7041 (2.0-1657) renamed video plugins to video backends.

However, due to extreme differences between APIs of the time, complete integration into VideoCommon was impossible without major regressions. Many features, such as EFB (Embedded Framebuffer) copy handling, EFB peeks, video dumping, and texture decoding, were unique to each backend. This naturally led to new features being added on a per-backend basis, worsening the situation. As such, those initial design decisions would continue to shape Dolphin for years and years to come.

Unification

Unification efforts have actually been going on for years now. During the development of Ubershaders, stenzek realized that Dolphin's video backends were actually a huge problem for the development and maintenance of Dolphin. A significant portion of the work involved in creating Ubershaders was spent renovating Dolphin's shader logic and moving it into VideoCommon. Imagine trying to renovate a house then realizing that you have to replace the foundation before you can even get started!

Though it was only a cleanup, moving the shader logic to VideoCommon gave immediate benefits for users, like the shared shaderUID cache that allowed the shader cache to be shared between all backends even if a user was only playing in one. stenzek has finally completed the large-scale reorganization of video logic to VideoCommon, which not only reduces Dolphin by over 7000 lines of code, but also provides considerable real world benefits to users.

Unified EFB Peeks/Pokes/Copies

How many times has there been an Embedded Framebuffer (EFB) issue unique to a particular backend? Way too many! Each backend has had its own implementations and optimizations for emulating EFB and providing EFB Access From CPU, so there were lots of edge cases.

Monster Hunter Tri

In order to make their games look more vibrant, a lot of developers employed a sort of fake HDR bloom effect on the Wii. You can find examples of this in games like Metroid Prime 3 and Xenoblade Chronicles, which use stacks of blurred EFB copies to mimic the "bloom" that cameras exhibit when overwhelmed by light. While you can find this fake bloom in tons of games, Monster Hunter Tri's developers went another route.

Monster Hunter Tri uses a full screen glow effect to mimic the atmosphere of a tropical island, trying to create the sense of space and depth that modern games achieve with volumetrics today. The Wii is certainly not capable of volumetrics, so they use a clever little trick to create this atmosphere. The game takes a very low resolution EFB copy of the screen, reads that copy with the CPU using EFB peeks, and then uses EFB pokes to write a luminosity map to brighten and darken parts of the screen. The result is surprisingly convincing for what is effectively a fancy post-processing shader!

Using RenderDoc, we can see that the game looks a bit dreary and lifeless before the atmosphere effect is applied.
Later in the pipeline, Monster Hunter Tri is much more vibrant and appealing.

While this effect may not seem all that special, emulating it in Dolphin was extremely performance intensive. In Dolphin 5.0, an i7-6700K and GTX 1070 at 1x Native Internal Resolution would be stuck at single digit framerates when trying to correctly emulate scenes with this effect. By looking at how the backends handled this challenge, we can see exactly why it was so slow.



  • D3D11 had the fewest optimizations for EFB peeks and pokes. Its way of emulating this was to synchronize the CPU and GPU after each of its 4880 EFB peeks per frame. The pokes were also problematic, as each of the game's 4880 EFB pokes was rendered in a separate draw call. This constant stalling meant that the CPU was always waiting on the GPU. At higher resolutions, it was easier to measure the game's performance in frames per minute than frames per second.

  • OpenGL handled things slightly better. EFB peeks were faster thanks to an optimization known as the Tile Cache. When Dolphin did an EFB peek, rather than immediately synchronizing, it would instead take the data for a 64x64 block and then synchronize. If the game needed another of the pixels in the block, another synchronization wouldn't be needed as the CPU already had that data! While EFB peeks were faster, EFB pokes suffered from the same issue of each one being a separate draw call. Because excessive draw calls tend to cost more on OpenGL, it was even slower than D3D11.

  • Vulkan actually ran the game at full speed. By the Dolphin 5.0 era, developers were well aware of clever games like this. stenzek had an optimization ready for this exact case when writing the Vulkan backend. Unlike D3D11 and OpenGL, Vulkan could batch all of the EFB pokes into a single draw call! Mixed with the tile cache, Vulkan shrugged off the seemingly impossible to emulate feature. Unfortunately, this fast path suffered from an off-by-one glitch that caused the bloom to be offset.
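To make the difference between those three approaches concrete, here is a minimal sketch that only counts GPU synchronizations and draw calls. It is not Dolphin's code: the class, the method names, and the 80x56 grid of peek and poke positions are all hypothetical, chosen purely for illustration.

```python
class EfbAccessSketch:
    """Toy model of EFB access that only counts GPU syncs and draw calls."""

    def __init__(self, tile_size, batch_pokes):
        self.tile_size = tile_size    # 1 means "no tile cache"
        self.batch_pokes = batch_pokes
        self.cached_tiles = set()     # tiles the CPU has already read back
        self.pending_pokes = []       # pokes queued for one combined draw
        self.gpu_syncs = 0
        self.draw_calls = 0

    def peek(self, x, y):
        tile = (x // self.tile_size, y // self.tile_size)
        if tile not in self.cached_tiles:
            self.gpu_syncs += 1       # CPU waits for the GPU to hand over the tile
            self.cached_tiles.add(tile)

    def poke(self, x, y, value):
        if self.batch_pokes:
            self.pending_pokes.append((x, y, value))
        else:
            self._draw()              # one draw call per poke

    def end_frame(self):
        if self.pending_pokes:
            self._draw()              # every queued poke goes out in a single draw call
            self.pending_pokes.clear()

    def _draw(self):
        self.draw_calls += 1
        self.cached_tiles.clear()     # the EFB changed, so cached tiles are stale


def run_frame(efb):
    # Hypothetical access pattern: an 80x56 grid of peeks, then pokes at the same spots.
    points = [(x * 8, y * 8) for y in range(56) for x in range(80)]
    for x, y in points:
        efb.peek(x, y)
    for x, y in points:
        efb.poke(x, y, 0)
    efb.end_frame()
    return efb.gpu_syncs, efb.draw_calls


print(run_frame(EfbAccessSketch(tile_size=1, batch_pokes=False)))   # (4480, 4480)
print(run_frame(EfbAccessSketch(tile_size=64, batch_pokes=False)))  # (70, 4480)
print(run_frame(EfbAccessSketch(tile_size=64, batch_pokes=True)))   # (70, 1)
```

The three configurations loosely correspond to the D3D11, OpenGL, and Vulkan behaviors described above. The absolute numbers depend entirely on the assumed access pattern, but the shape of the result is the point: the tile cache cuts down on synchronizations, and batching cuts down on draw calls.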

stenzek's Vulkan backend featured an optimization to speed up Monster Hunter Tri. Unfortunately, it was wasted due to an issue causing EFB pokes to be offset slightly.

This brings us to the core problem: all three backends emulated these features in different ways. After moving these things to VideoCommon, there is still a fast path and a slow path, but they have been greatly simplified. Backends that support point rendering can use the fast path, and backends that don't will use the slower path. Despite this, all three backends now perform within 1% of one another and share the same logic. A game that was once so slow to emulate that developers waved the white flag and disabled the feature by default can finally be emulated efficiently and correctly.

Finally being able to emulate this effect in Dolphin comes with a bittersweet feeling. While Dolphin can now accurately and performantly emulate the effect, doing so also exposes its limitations. The fact of the matter is that the game is taking an EFB copy that is 1/10th of native resolution and then using it to create the bloom. This wasn't especially noticeable when playing the game on an analog television over a composite signal, but in Dolphin the limitation of its 4480 pixels (~80x56 resolution) is very apparent. It is especially rough at high internal resolutions and during motion. Because the CPU is poking the luminosity map into the EFB, it is impossible for us to increase its resolution and improve the effect. However, users can disable the effect after the character creator and use a custom post processing shader to accomplish a similar look.

With high internal resolutions, the atmosphere effect's low resolution is very visible. The blockiness is less noticeable at lower resolutions, but you can still see the blocks once you know they're there.

Note: The developers of Monster Hunter Tri had an obsession with EFB Peeks and Pokes. Of all the ways a game could let you choose colors for various parts of your character, Monster Hunter Tri presents you with a color wheel and lets you use the Wii Remote to pick a color. It then uses an EFB peek to read the pixel you selected to see what color you used and applies that to your character. While users will probably disable EFB Access From CPU in game, at least the character creator works properly now.

F-Zero GX

F-Zero GX is one of the fastest racing games ever made. While most of the game is fairly lightweight to emulate, its Sand Ocean tracks are home to a heat blur that utilizes EFB peeks in a way that drove Dolphin right into a wall. This particular effect was so demanding that most users would outright remove it by enabling the Skip EFB Access From CPU option present in Dolphin. This takes away some of the ambience of the track, but also makes the game much easier to run.

The Sand Ocean heat effect scales beautifully in Dolphin. It's a shame that so few have seen it.

While doing this rewrite, stenzek had the opportunity to re-examine some of these worst-case performance scenarios and Sand Ocean was brought up. Remember that Tile Cache optimization used in OpenGL and Vulkan from earlier in the article? This feature bundles a 64x64 tile of values when an EFB peek is done. The assumption is that the game is usually going to be doing more than one EFB peek, so by grabbing a 64x64 chunk, we can greatly reduce the number of synchronizations per frame. The problem is that when stenzek moved this optimization to VideoCommon, D3D11's performance dropped dramatically!

In order to figure out why this was happening, stenzek took a look at what F-Zero GX was doing to produce its heat blur. Examination revealed that the game was doing depth checks across the screen at 64 pixel intervals, each check ending up in its own tile. For this particular effect, the tile cache was causing Dolphin to use 4096 times more bandwidth per GPU sync!

This was a bit of an issue, as a 64x64 pixel tile cache is optimal in other games. Disabling the tile cache restored performance to normal, but now that stenzek knew what was going wrong, he could optimize it further. By making the size of the tile cache customizable, he made it so that Dolphin could account for this worst case scenario. Instead of doing a lot of small or big synchronizations, for F-Zero GX we now set the tile cache to the whole screen! With this, all backends could see dramatic performance improvements.
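The arithmetic behind that tradeoff is easy to check with a rough sketch. The function below is hypothetical and is not how Dolphin measures anything; it assumes a 640x528 EFB at 1x internal resolution and one peek every 64 pixels, then counts how many GPU syncs and how many pixels of readback each tile size would cost.

```python
EFB_WIDTH, EFB_HEIGHT = 640, 528  # assumed EFB size at 1x internal resolution


def readback_cost(peeks, tile_w, tile_h):
    """Return (gpu_syncs, pixels_transferred) for one frame of peeks."""
    tiles = {(x // tile_w, y // tile_h) for x, y in peeks}
    pixels = 0
    for tx, ty in tiles:
        # Tiles on the right and bottom edges are clipped to the EFB.
        w = min(tile_w, EFB_WIDTH - tx * tile_w)
        h = min(tile_h, EFB_HEIGHT - ty * tile_h)
        pixels += w * h
    return len(tiles), pixels


# Hypothetical pattern: one depth check every 64 pixels across the screen.
strided_peeks = [(x, y) for y in range(0, EFB_HEIGHT, 64)
                 for x in range(0, EFB_WIDTH, 64)]

print(readback_cost(strided_peeks, 1, 1))                    # (90, 90)
print(readback_cost(strided_peeks, 64, 64))                  # (90, 337920)
print(readback_cost(strided_peeks, EFB_WIDTH, EFB_HEIGHT))   # (1, 337920)
```

With this access pattern, 64x64 tiles end up reading back the entire screen while still paying for a sync on every single peek, since a full tile carries 64 x 64 = 4096 pixels for the one the game asked for. A single whole-screen tile moves the same amount of data in one sync.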



Super Mario Galaxy

While F-Zero GX may be a case where one bigger transfer ends up faster, the same isn't true for Super Mario Galaxy 1 and 2. By using EFB Peeks and Pokes for various features like detecting the depth of what the IR pointer is hovering over, drawing a fancy lens flare, and much, much more, these titles ended up being a great testing ground for potential EFB optimizations.

Before the rewrite, all three of our video backends had different optimizations for EFB Peeks and Pokes, so each backend was faster or slower depending on the task at hand. D3D11 was typically the fastest at EFB operations, but struggled in areas with effects that covered the screen. OpenGL was generally a little bit slower, but wouldn't get bogged down on certain skybox effects that bothered D3D11. Vulkan did well with EFB operations, but absolutely tanked whenever a lens flare shone across the skies. With none of the backends having the perfect implementation, stenzek worked on a set of optimizations to address the worst case scenarios of each backend.

When testing the VideoCommon rewrite, Super Mario Galaxy's performance suffered tremendously in dual core, which managed to slip by testers. After reports came in that the game was slower, it quickly became a priority to optimize it. For a short time, Super Mario Galaxy was using a Tile Cache size of the full frame buffer. While that made F-Zero GX faster, it actually hurt in this title. While it was easy enough to revert that change and restore performance, stenzek thought up a new optimization while looking at the slowdown. Deferred EFB Peek Cache Invalidation is an experimental option that allows Dolphin to hold off on invalidating the EFB peek cache until the game actually sends a drawdone token. The theory behind this is that the game won't do anything with incomplete values.

This can greatly improve performance in games that use a lot of EFB peeks, and improves the performance in certain areas of Super Mario Galaxy far beyond what they were before. With this optimization, every backend is faster.
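As a rough illustration of why deferring helps, the sketch below replays a made-up stream of events and counts how often the CPU would have to sync with the GPU. The event names, the single whole-cache flag, and the assumption that every draw would otherwise invalidate the cache are simplifications for this example, not a description of Dolphin's actual implementation.

```python
def count_syncs(events, deferred):
    """Count GPU syncs for a stream of 'draw', 'peek', and 'draw_done' events."""
    cache_valid = False
    syncs = 0
    for event in events:
        if event == "peek":
            if not cache_valid:
                syncs += 1            # cache miss: read the EFB back from the GPU
                cache_valid = True
        elif event == "draw":
            if not deferred:
                cache_valid = False   # eager mode: any draw makes the cache stale
        elif event == "draw_done":
            cache_valid = False       # both modes invalidate once the game signals completion
    return syncs


# Hypothetical frame: the game interleaves draws and peeks, then signals draw done.
frame = ["draw", "peek", "draw", "peek", "draw", "peek", "draw_done"]
two_frames = frame * 2

print(count_syncs(two_frames, deferred=False))  # 6
print(count_syncs(two_frames, deferred=True))   # 2
```

In the deferred case, peeks between two draw done tokens may return slightly older values, which is exactly the bet the option makes: the game shouldn't be relying on values from a frame that isn't finished yet.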



The results are staggering, with an average gain of ~65% across all backends in EFB Peek/Poke heavy areas. But the raw numbers don't tell the whole story - the games are in general smoother and more stable, without those random drops when a previously demanding effect would pan across the screen.

Other Games

These types of extreme slowdowns also happened in other games, including some that are considered very lightweight. Players that dive into the GameCube version of Panel De Pon (Tetris Attack/Puzzle League) present in Nintendo Puzzle Collection will notice the gigantic, sudden freezes during menu transitions are gone. For a more popular title, but less intrusive slowdown, the Wi-Fi menus in Super Smash Bros. Brawl also do some strange EFB peek behaviors for currently unknown reasons. While these are just two examples users noticed, there are likely many more out there.

This isn't going to magically raise performance in every game, though. For example, while The Legend of Zelda: The Wind Waker uses EFB peeks for its lens flare effect, it's usually bottlenecked elsewhere, so the optimizations never end up mattering.

Bounding Box

Bounding Box is a rather difficult feature to emulate. Unique to the GameCube/Wii, bounding box can tell a game how large and where a 3D model is on screen. While this doesn't sound all that special, Paper Mario: The Thousand-Year Door and Super Paper Mario put this feature to work for their famous folding and crowd duplication effects. After doing the main VideoCommon rewrite, stenzek stumbled upon a pixel centering error causing certain special effects to become offset and corrupt at higher resolutions. It turns out that Dolphin was truncating bounding box values, potentially causing coordinates to be off by one. You heard that right folks, it's our favorite kind of bug - rounding errors!

Fixing this isn't as simple as it would appear at a glance because the GameCube GPU uses the rather unorthodox value of 7/12 for the pixel center instead of 1/2 like modern GPUs. Funnily enough, you can read about this kind of issue in Dolphin's very first progress report! Paper Mario: The Thousand-Year Door renders objects at precise enough positions that 1/12th of a pixel is enough to cause bounding box coordinates to round in the wrong direction. By running the game at a higher resolution, you can actually increase the range of error high enough to get consistent glitching in certain areas.
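To see how a twelfth of a pixel can matter, consider a simplified rasterization rule where a pixel counts as covered when its sample point lies at or past an object's left edge. This is only an illustrative sketch with a hypothetical edge position; it ignores fill rules and is not the bounding box code in Dolphin.

```python
import math
from fractions import Fraction


def first_covered_pixel(left_edge, pixel_center):
    """Index of the leftmost pixel whose sample point is at or right of the edge."""
    return math.ceil(left_edge - pixel_center)


edge = Fraction(211, 20)  # hypothetical left edge of a model at x = 10.55

print(first_covered_pixel(edge, Fraction(1, 2)))   # 11
print(first_covered_pixel(edge, Fraction(7, 12)))  # 10
```

The same geometry produces a bounding box edge that differs by a whole pixel depending on which pixel center is assumed, which is why emulating the console's 7/12 on a GPU that samples at 1/2 takes extra care.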

While developers finally fixed this issue at 1x internal resolution, no one knew why it happened at higher resolutions.
Thanks to a little bit of luck, the rounding error is gone along with the corruption.

On top of fixing the rounding error, stenzek also implemented an optimization to help performance. Explaining it in detail is beyond the scope of this article, but here is a very quick summary. Emulating bounding box is especially slow because it relies on Atomic Operations. Atomics lock a set of threads (Warp on Nvidia, Wave on AMD) into a "single threaded mode" where each thread is isolated with its own registers (memory) so that they can complete a task without interference from other threads. This is contrary to the asynchronous parallelization that makes GPUs fast, so Atomic Operations are incredibly slow. However, that isolation allows Atomics to do unusual things like bounding box calculation without triggering race conditions.

stenzek implemented Shuffling to address this. Instead of doing a memory operation for each individual thread, the threads pass the bounding box calculation values from one thread to the other, and only do a single memory operation at the final thread of the warp/wave. On a GPU with a warp/wave size of 32, this can reduce the overhead of bounding box operations by a factor of 32!
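The idea can be simulated on the CPU with a short sketch. Everything here is hypothetical: a list stands in for the registers of a 32-thread warp/wave, and a counter stands in for memory operations. Real shaders would use the GPU's subgroup shuffle instructions instead.

```python
def merge(a, b):
    """Combine two (left, top, right, bottom) boxes into one that covers both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def per_thread_atomics(boxes):
    """Every thread merges its own box into memory: one memory operation each."""
    memory_box, memory_ops = boxes[0], 0
    for box in boxes:
        memory_box = merge(memory_box, box)
        memory_ops += 1
    return memory_box, memory_ops


def shuffle_reduce(boxes):
    """Threads trade boxes register-to-register, then one lane writes the result."""
    lanes = list(boxes)
    offset = len(lanes) // 2          # assumes a power-of-two warp/wave size
    while offset > 0:
        # Each lane merges in the box held by the lane `offset` places above it.
        lanes = [merge(lanes[i], lanes[i + offset]) if i + offset < len(lanes)
                 else lanes[i] for i in range(len(lanes))]
        offset //= 2
    return lanes[0], 1                # lane 0 holds the full box: one memory operation


# Hypothetical warp of 32 threads, each covering a single pixel.
warp = [(x, y, x, y) for x, y in ((3 * i % 17, 5 * i % 11) for i in range(32))]

print(per_thread_atomics(warp))  # ((0, 0, 16, 10), 32)
print(shuffle_reduce(warp))      # ((0, 0, 16, 10), 1)
```

Both approaches arrive at the same bounding box, but the shuffled version touches memory once instead of 32 times. In this sketch the result lands in lane 0 rather than the last lane; which lane does the final write is an arbitrary choice.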

AMD GPUs have shuffling as a driver optimization, so they have benefited from this for years. However, for Nvidia users this will bring a nice bounding box performance improvement to OpenGL and Vulkan 1.1. As for Mobile GPUs... shuffling is an optional feature, so they are probably never going to support it.



Backend Specific Crashes and Bugs

Nothing destroys the experience of replaying your favorite game in an emulator more than a sudden emulator crash. It's even worse when the only solution is to just use another backend, especially if that backend is less efficient for your particular hardware.

This was an unfortunate side effect of each backend having its own implementation of various features. No matter how careful a developer is, having to juggle up to four different versions of something at once is just more prone to cause mistakes. In general developers are pretty good at catching small errors in the review process, but from all of the typos we've mentioned in the past, quite a few have still slipped in. And when backends are involved, small errors mean instability.

Going to Merluvlee to see your future on D3D11 would crash in older builds.

D3D11 was probably the one hit the worst by backend specific instability, as developers on Linux couldn't test the code as easily. And this wasn't limited to obscure games: The Legend of Zelda: Skyward Sword and Paper Mario: The Thousand-Year Door were both reported to crash during very specific actions. Throughout these cleanups, a lot of these crashes have vanished without anyone directly examining them.

A unified VideoCommon means that each backend is sharing the same logic, and less duplicated code means less room for errors. That isn't to say that it's now impossible for there to be backend unique bugs, though. Dolphin's OpenGL backend still has some unique behaviors to watch out for, and there is always the threat of driver differences, API features, and more that can trip up developers. Still, what we have now is a vast improvement over the old situation.

Fixing Feature Inequalities Between Backends

Because new features had to be written for each backend individually, developers would often only work on the backends that they were able to easily test. Given that most of Dolphin's developers use Linux, D3D11 often got the short end of the stick. Even many recently added features were outright missing from the Windows-only backend, like GPU Texture Decoding!

Even niche features like post-processing effects are now available in all backends! Also there's something here about an "other version joke" but, let's leave that out.

Unification not only brought all of the features to all of the backends, it has also made it easier to maintain and add new features in the process. While it was done a bit earlier, Abstract Shaders helped unify Dolphin's shader generation. That made adding a feature like imgui much more reasonable. And when a feature like the newly integrated netplay chat is made, all of the backends get the feature immediately without forcing extra work on developers.

The new imgui interface is incredibly useful. In 4:3 games, you can even resize it to fit in the black bars left on a widescreen monitor and chat all while maintaining exclusive fullscreen!

Positioned for the Future

Beyond just features, the goal of the VideoCommon Unification project was to make Dolphin's codebase easier to use and more flexible to work on, so it can continue to remain an actively developed emulator for years to come. There's still a lot of work that needs to be done!

On the graphics side of things, Dolphin is in a very comfortable position. Almost everything is in VideoCommon and the backends have been simplified to the point where they are mostly just an interface between Dolphin and the API. Developing a new video backend should be easier than ever! To push the new infrastructure and see just how much easier it truly was, stenzek decided to give it a trial run and try to write a brand new backend from the ground up.

We're pleased to announce his experiment was a success, and as a bit of a surprise, with this rewrite comes a brand new D3D12 backend!

Hello old friend.

This isn't Dolphin's first tango with D3D12 - back around the Dolphin 5.0 release, an experimental D3D12 backend was quickly merged in preparation for the release. The backend itself wasn't exactly in a finished state and implemented a lot of features that really belonged in VideoCommon. These design flaws, missing features, and lack of maintenance forced the decision to eventually remove the backend in spite of its obvious performance benefits, especially on integrated GPUs that don't support Vulkan well.

stenzek's rewrite of D3D12 brings the same performance benefits in a much more compact form. The new D3D12 backend is roughly 55% smaller and is actually feature complete this time around. How does it perform compared to the other backends? We've got performance numbers for you on both a gaming PC and an Intel Integrated GPU system.



The results fall in line with what one would expect on a gaming desktop. D3D12 performs somewhat like a mix of D3D11 and Vulkan. Much like Vulkan, it can absolutely chew through draw call heavy games like Twilight Princess with the Hyrule Field Speedhack disabled. As a fun fact, it's the minimap causing all of those draw calls, hence why some other areas like Faron Woods will work as a substitute for performance testing.

Monster Hunter Tri and Super Mario Galaxy acted as a test for EFB Access from CPU and Store EFB Copies to Texture and RAM. Thanks to optimizations, both of these games run pretty well on all backends in fairly high polygon areas with all of the effects active.

Mario Kart Wii acted as a more general use case with none of the special features required. It's fairly high polygon and has a few EFB effects, but doesn't require anything crazy to emulate. In this situation, D3D12 actually outpaces all of the other backends. Overall though, the most surprising thing about the results is how close all backends tested. The Legend of Zelda: Twilight Princess is a known worst case scenario that has an included game patch that cuts down on minimap calls, but other than that, each backend tested within a few percentage points of one another.

But this is on an NVIDIA GTX 1070, which is a fairly high-end graphics card. A lot of the users who wanted D3D12 the most didn't have gaming computers, and were stuck on integrated graphics. Thus, while it was fun throwing all of the backends at a strong card, we also had to do a test on integrated graphics.



The first thing we need to address about the results above is the lack of Vulkan. Despite improving Vulkan drivers on Windows, Dolphin's Vulkan backend still will not run on Intel HD's Windows drivers. Users wishing to use Vulkan on their Intel HD graphics chips have to use Linux and the Mesa drivers. Beyond that unfortunate omission of Vulkan, there's a lot to go through here as the results aren't so obvious at a glance. Thankfully, by looking at what these games are doing, we can actually make out the strengths and weaknesses of each backend.

The first example in the list is the ever popular Super Smash Bros. Melee. As a fairly lightweight title to emulate with no special features enabled by default, it comes down to which backend can most efficiently get through all the draw calls. On the Intel HD 630, D3D12 is far and away the most efficient, with D3D11 lagging behind. As would be expected based on past user reports and experiences, OpenGL is considerably slower on the Intel iGPU. Metroid Prime is a more demanding game with the same bottleneck. While the results aren't quite as dramatic, everything else holds true.

Sandwiched between these results are some more interesting ones. Just like on the GTX 1070 testing, D3D12's advantages are limited when a game requires Store EFB Copies to Texture and RAM or EFB Access from CPU. The Last Story uses EFB to RAM and cuts D3D12's advantage down quite a bit. When both are required like in Super Mario Galaxy, D3D11 ends up having a surprisingly big advantage. But that's a game that uses those features heavily. The Legend of Zelda: The Wind Waker requires both but to a much lesser degree, as the numbers show.

By disabling these features (which may leave the game unplayable), you can see clearly how much they affect each backend across different drivers on the same computer. There are a lot of results to go through, so feel free to disable any results you're not interested in by clicking the options present in each graph's legend.





Oddly enough, in lightweight games that require none of the demanding features, the Intel HD 630 can actually be slightly faster than the NVIDIA GTX 1060 at 1x internal resolution in D3D12. On the other hand, if you have the option to use a discrete GPU, you'll be able to get better performance at higher internal resolutions or when a game requires features that put more of a strain on driver efficiency.

While it may seem like a bit of a black mark that D3D12 struggles with features that require GPU stalls, it may not matter all that much to users on weaker computers. Users seeking maximum performance are going to disable them anyway whenever possible, especially in games that only use them for optional effects. In that sense, despite its weaknesses, the D3D12 backend is still an incredible asset to help weaker computers without dedicated graphics cards run Dolphin. This won't work for every title: Super Mario Galaxy is more or less unplayable with EFB Access from CPU disabled, but you can get away with it in a game like Wind Waker and just enable it when you want to use the pictobox.

Progressing Onward

Maintaining any piece of software for over 15 years, let alone actively improving it, is not an easy task. There have been countless rewrites over the years that stay mostly silent, unless there's some outcry caused by a user-facing regression. With the VideoCommon rewrite, the benefits are so great and so immediate for both developers and users that we've been given the rare opportunity to dive into the many facets of one of these massive cleanups. It has facilitated many of these optimizations and fixes, along with the return of D3D12 in a form that doesn't hamper the other backends and graphics development.

This is but a single example throughout the history of the project, and unlikely to be the last. As an emulator, Dolphin strives to give the best possible experience to its users, while providing a friendly environment for developers to try their hand at taming the beast that was the GameCube and Wii hardware. With still more exciting features on the horizon, the sun hasn't yet set on this ageless emulator.

You can continue the discussion in the forum thread of this article.
