Mesa 26.0 is big for RADV’s ray tracing.
In fact, it’s so big it single-handedly revived this blog.
There are a lot of improvements to talk about, and some of them have been
in the making for a little over two years at this point.
In this blog post I’ll focus on the things I myself worked on specifically,
most of which revolve around how ray tracing pipelines are compiled and
dispatched.
Of course, there’s more than just what I did myself: Konstantin Seurer worked on a
lot of very cool improvements to how we build BVHs, the data structure RT hardware uses
to organize the triangle soup that makes up the geometry in game scenes, so that it
can trace rays against it efficiently.
RT pipeline compilation
The rest of this blog post will assume some basic idea of how GPU ray tracing and
ray tracing pipelines work. I wrote about this in more detail one and a half years ago,
in my blog post about RT pipelines being enabled by default.
Let’s take a bit of a closer look at what I said about RT pipelines in RADV back then.
In a footnote, I said:
Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren’t that ludicrous anymore.
I spent a significant amount of time in that blogpost detailing how RT pipelines tend to contain a really large number of shaders, and how combining them into a single megashader is very slow because the shader size gets genuinely ridiculous at that point.
So clearly, it was only a matter of time until the any-hit/intersection shader combination would blow up spectacularly on a spectacular number of shaders, as well.
So there’s this thing called Unreal Engine
To illustrate the issues with inlined any-hit/intersection shaders, I’ll use Unreal Engine as an example, because I noticed the problem being particularly egregious there.
This definitely was an issue with other RT games/workloads as well, and function calls will provide improvements there too.
There’s a lot of people going around making fun of Unreal Engine these days, to the point of entire social media presences being built around mocking the ways in which UE is inefficient, slow, badly-designed bloatware and whatnot.
Unfortunately, the most popular critics often know the least about what they’re actually talking about. I feel compelled to point out here that while there certainly are reasonable complaints to be raised about UE and games made with it,
I explicitly don’t want this section (or anything else in this post, really) to be misconstrued as “UE doing a bad thing”. As you’ll see, Unreal is really just using the RT pipeline API as designed.
With the disclaimer aside, what does Unreal actually do here that made RADV fall over so hard?
Let’s talk a bit about how big game engines handle shading and materials. As you’ll probably know already, to calculate how lighting interacts with objects in a scene,
an application will usually run small programs called “shaders” on the GPU that, among other things, calculate the colors different pixels have according to the material at that pixel.
Different materials interact with light differently, and in a large world with tons of different materials, you might end up having a ton of different shaders.
In a traditional raster setup you draw each object separately, so you can compile a lot of graphics pipelines for all of your materials, and then bind the correct one whenever you draw
something with that material.
However, this approach falls apart in ray tracing. Rays can shoot through the scene randomly and they can hit pretty much any object that’s loaded in at the moment. You can only ever use
one ray tracing pipeline at once, so every single material that exists in your scene and may be hit by a ray needs to be present in the RT pipeline. The more materials a game has,
the more ludicrous the number of shaders gets.
Usually, this is most relevant for closest-hit shaders, because these are the shaders that get called for the object hit by the ray (where shading needs to be calculated). However, depending
on your material setup, you may have something like translucent materials - where parts of the material are “see-through”, and rays should go through these parts to reveal the scene behind them
instead of stopping.
This is where any-hit shaders come into play - any-hit shaders can instruct the driver to ignore a ray hitting a geometry, and instead keep searching for the next hit.
If you have a ton of (potentially) translucent materials, that would translate into a lot of any-hit shaders being compiled for these materials.
The design of RT pipelines is quite obviously written in a way that accounts for this. In the previous blogpost I already mentioned pipeline libraries - the idea is that a material could
just be contained in a “library”, and if RT pipelines want to use it, they just need to link to the library instead of compiling the shading code all over again. This also allows for easy
addition/removal of materials: Even though you have to re-create the RT pipeline, all you need to do is link to the already compiled libraries for the different materials.
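To make that a bit more concrete, here’s a rough sketch (not UE’s actual code - all names are made up, and error handling is omitted) of how an engine might use VK_KHR_pipeline_library for this: compile one library per material up front, then cheaply re-link the final RT pipeline whenever the material set changes.

#include <vulkan/vulkan.h>

/* Compile one pipeline library per material, once, ahead of time. */
VkPipeline create_material_library(VkDevice device, VkPipelineCache cache, VkPipelineLayout layout,
                                   const VkPipelineShaderStageCreateInfo *stages, uint32_t stage_count,
                                   const VkRayTracingShaderGroupCreateInfoKHR *groups, uint32_t group_count,
                                   const VkRayTracingPipelineInterfaceCreateInfoKHR *interface_info)
{
    VkRayTracingPipelineCreateInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_RAY_TRACING_PIPELINE_CREATE_INFO_KHR,
        .flags = VK_PIPELINE_CREATE_LIBRARY_BIT_KHR,  /* this pipeline is a library */
        .stageCount = stage_count,                    /* this material's closest-hit/any-hit stages */
        .pStages = stages,
        .groupCount = group_count,
        .pGroups = groups,
        .maxPipelineRayRecursionDepth = 1,
        .pLibraryInterface = interface_info,          /* payload/hit attribute sizes, shared by all libraries */
        .layout = layout,
    };
    VkPipeline lib;
    vkCreateRayTracingPipelinesKHR(device, VK_NULL_HANDLE, cache, 1, &info, NULL, &lib);
    return lib;
}

/* Called whenever the set of materials changes. This is the "fast" linking step -
 * no shader code is supposed to be recompiled here. */
VkPipeline link_rt_pipeline(VkDevice device, VkPipelineCache cache, VkPipelineLayout layout,
                            const VkPipelineShaderStageCreateInfo *raygen_miss_stages, uint32_t stage_count,
                            const VkRayTracingShaderGroupCreateInfoKHR *raygen_miss_groups, uint32_t group_count,
                            const VkPipeline *material_libraries, uint32_t library_count,
                            const VkRayTracingPipelineInterfaceCreateInfoKHR *interface_info)
{
    VkPipelineLibraryCreateInfoKHR libraries = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR,
        .libraryCount = library_count,
        .pLibraries = material_libraries,
    };
    VkRayTracingPipelineCreateInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_RAY_TRACING_PIPELINE_CREATE_INFO_KHR,
        .stageCount = stage_count,                    /* raygen/miss stages compiled directly */
        .pStages = raygen_miss_stages,
        .groupCount = group_count,
        .pGroups = raygen_miss_groups,
        .maxPipelineRayRecursionDepth = 1,
        .pLibraryInfo = &libraries,                   /* link the per-material libraries */
        .pLibraryInterface = interface_info,
        .layout = layout,
    };
    VkPipeline pipeline;
    vkCreateRayTracingPipelinesKHR(device, VK_NULL_HANDLE, cache, 1, &info, NULL, &pipeline);
    return pipeline;
}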
UE, particularly UE4, is a heavy user of libraries, which makes a lot of sense: It maps very well to what it’s trying to achieve. Everything’s good, as long as the driver doesn’t do silly things.
Silly things like, for example, combining any-hit shaders into one big traversal shader.
Doing something like that pretty much entirely side-steps the point of libraries. The traversal shader can only be compiled when all any-hit shaders are known,
which is only at the very final linking step, which is supposed to be very fast…
And when UE4, assuming the linking step is very fast, does that re-linking over and over again, what you end up with is horrible pipeline compilation stutter every few seconds.
And in this case, it’s not really UE’s fault, even! Sorry for that, Unreal.
Why can’t we just compile any-hit/intersection separately?
Clearly, inlining all the any-hit and intersection shaders won’t work. So why not just compile them separately?
To answer that, I’ll try to start with explaining some assumptions that lie at the base of RADV’s shader compilation.
When ACO (and NIR, too) were written, shaders were usually incredibly simple. They had some control flow, ifs, loops and whatnot,
but all the code that would ever execute was contained in one compact program executing top-to-bottom. This perfectly matched
what graphics/compute shaders looked like in the APIs, and what the API does is what you want to optimize for.
Unfortunately, this means RADV’s shader compilation stack got hit extra hard by the paradigm shift introduced by RT pipelines. Dynamic linking of different
programs, and calls across those dynamic link boundaries, are commonplace in CPU programming languages (C/C++, etc.), but
Mesa never really had to deal with anything like that before.
One specific core assumption that prevents us from compiling any-hit/intersection shaders separately just like that is that
every piece of code assumes it has exclusive and complete access to things like registers and other hardware resources. Comparing to the CPU again,
most program code is contained in functions, and those functions get called from somewhere else. Their callers will already have
used CPU registers and stack memory and so on, so code inside a function can’t write to just any CPU register, or any location on the stack.
Which registers are writable by a function and which ones must have their values preserved (so that the function callers can store values of their
own there without them being overwritten) are governed by little specifications called “calling conventions”.
In Mesa, the shader compiler generally used to have no concept of calling conventions, or a concept of “calling” something, for that matter.
There was no concept of a register having some value from a function caller and needing to be preserved - if a register exists, the shader might
end up writing its own value to it. In the case of graphics/compute shaders, this wasn’t a problem - at shader start, the registers only ever had random uninitialized values
in them anyway.
This has always been a problem for separately compiling shaders in RT pipelines, but we had a different solution: At every point a shader called another shader,
we’d split the shader in half: One half containing everything before the call, and the other half containing everything after. Of course, sometimes the second
half needed variables coming from the first half of the shader. All these variables would be stored to memory in the first half. Then the first half would end, and
execution would jump to the called shader. Once the end of the called shader was reached, execution would return to the second half.
This was good enough for things like calling into traceRay to trace a ray and execute all the associated closest hit/miss shaders. Usually, applications wouldn’t
have that many variables needing to be backed up to memory, and tracing a ray is supposed to be expensive.
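In rough pseudo-C, the splitting described above looks something like this (all helper names here are made up for illustration, none of them are real RADV functions):

/* Purely illustrative stand-ins for compiler-generated code. */
int  compute_something(void);
void use(int value);
void trace_ray(void);
void store_to_scratch(int value);
int  load_from_scratch(void);
void jump_to_shader(void (*shader)(void));

/* Original shader: calls another shader (traceRay) in the middle. */
void raygen(void)
{
    int x = compute_something();
    trace_ray();   /* runs separately compiled closest-hit/miss shaders */
    use(x);        /* still needs x from before the call */
}

/* After splitting: everything the second half needs is pushed through memory. */
void raygen_part1(void)
{
    int x = compute_something();
    store_to_scratch(x);          /* back up live variables to memory */
    jump_to_shader(trace_ray);    /* the called shaders eventually jump to part 2 */
}

void raygen_part2(void)
{
    int x = load_from_scratch();  /* reload the backed-up variables */
    use(x);
}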
But that concept completely breaks down when you apply it to any-hit shaders. At the point an any-hit shader is called, you’re right in the middle of ray traversal.
Ray traversal has lots of internal state variables that you really want to keep in registers at all times. If you call an any-hit shader with this approach, you’d
have to back up all of these state variables to memory and reload them back afterwards. Any-hit shaders are supposed to be relatively cheap and called potentially lots
of times during traversal. All these memory stores and reloads you’d need to insert would completely ruin performance.
So, separately compiling any-hit shaders was an absolute no-go. At least, unless someone were to go off the deep end and change the entire compiler stack to fix the assumptions at its heart.
“So, where have you been the last two years?”
I went and changed more or less the entire compiler stack to fix these assumptions and introduce proper function calls.
The biggest part of this work by far was the absolute basics. How do we best teach the compiler that certain registers need to be preserved and are best left
alone? How should the compiler figure out that something like a call instruction might randomly overwrite other registers? How do we represent a calling
convention/ABI specification in the driver? All of these problems can be tackled with different approaches and at different stages of compilation, and nailing
down a clean solution is pretty important in a rework as fundamental as this one.
I started out by applying function calls to the shaders that were already separately compiled - this meant that the function call work itself didn’t improve
performance by much, but in retrospect I think it was a very good idea to make sure the baseline functionality was rock-solid before moving on to separately compiling
any-hit shaders.
Indeed, once I finally got around to adding the code that splits out any-hit/intersection shaders and uses function calls for them, things worked nearly out of the box!
I opened the associated merge request a bit over two weeks ago and got everything merged within a week.
(Of course, I would never have gotten it in that fast without all the reviewers teaming up to get everything in ASAP! Big thank you to Daniel, Rhys and Konstantin)
In comparison, I started work on function calls in January of 2024 and got the initial code in a good enough shape to open a merge request in June that year, and the code only
got merged on the same day I opened the above merge request, two years after starting the initial drafting (although to be fair, that merge request also had periods of being
stalled due to personal reasons).
Shader compilation with function calls
Function calls make shader compilation work in an arguably much more straightforward way. For the most part, the shader just gets compiled like any other -
there’s no fancy splitting or anything going on. If a shader
calls another shader, like when executing traceRay, or when calling an any-hit shader, a call instruction is generated. When the called shader finishes,
execution resumes after the call instruction.
All the magic happens in ACO, the compiler backend. I’ve documented the more technical design of how calls and ABIs are represented in a docs article.
At first, call instructions in the NIR IR are translated to a p_call “pseudo” instruction. It’s not actually a hardware instruction, but serves as a
placeholder for the eventual jump to the callee. This instruction also carries information about which specific registers the parameters will be stored in, and
which registers may be overwritten by the call instruction.
ACO’s compiler passes have special handling for calls wherever necessary: For example, passes analyzing how many registers are required in the different
parts of the code take special care to account for the fact that across a call instruction, fewer registers are available to keep values in (because the call
may overwrite the rest). ACO also has a spilling pass for moving register values to memory whenever the number of registers used exceeds the number available.
Another fundamental change is that function calls also introduce a call stack. In CPUs, this is no big deal - you have one stack pointer register, and it points
to the stack region that your program uses. However, on GPUs, there isn’t just one stack - remember that GPUs are highly parallel, and every thread running
on the GPU needs its own stack!
Luckily, this sounds worse at first than it actually is. In fact, the hardware already has facilities to help manage stacks. AMD GPUs ever since Vega have
the concept of “scratch memory” - a pool of memory where the hardware ensures that each thread has its own private “scratch region”. There are special
scratch_* memory instructions that load and store from this scratch area. They don’t take any address, just an offset, and for each thread return the
value stored in that thread’s own scratch memory region.
In my blog post about RT pipelines being enabled by default, I claimed AMD GPUs don’t implement a call stack.
This is actually misleading - the scratch memory functionality is all you need to implement a stack yourself. The “stack pointer” here is just the offset
you pass to the scratch_* memory instructions. Pushing to the stack increases the stack offset, and popping from it decreases the offset.
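As a minimal sketch of the idea (scratch_store/scratch_load here are made-up stand-ins for the scratch_* instructions, not real intrinsics), the stack handling boils down to this:

/* Stand-ins for the scratch_* instructions: they take an offset rather than an
 * address and implicitly access the current thread's private scratch region. */
void     scratch_store(unsigned offset, unsigned value);
unsigned scratch_load(unsigned offset);

/* The "stack pointer" is nothing more than an offset into scratch memory. */
static unsigned stack_offset;

static void push(unsigned value)
{
    scratch_store(stack_offset, value); /* store into this thread's scratch region */
    stack_offset += 4;                  /* bump the offset past the stored value */
}

static unsigned pop(void)
{
    stack_offset -= 4;                  /* step back to the last stored value */
    return scratch_load(stack_offset);
}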
Eventually, when it comes to converting a call to hardware instructions, all that is needed is to execute the s_swappc instruction. This instruction
automatically writes the address of the next instruction to a register before jumping to the called shader. When the called shader wants to return, it
merely needs to jump to the address stored in that register, and execution resumes from right after the call instruction.
Finally, separately compiling any-hit shaders was a straightforward task as well - it was merely a matter of defining an ABI that makes sure a ton of registers
stay preserved so the caller can stash its values there. In practice, all of the traversal state gets stashed in these preserved registers. No expensive
spilling to memory needed, just a quick jump to the any-hit shader and back.
If you look at the merge request, the performance benefits seem pretty obvious.
Ghostwire Tokyo’s RT passes speed up by more than 2x, and of course pipeline compilation times improved massively.
The compilation time difference is quite easy to explain. Generally, compilers will perform a ton of analysis passes on shader code to find every opportunity
they can to optimize it to death. However, these analysis passes often require going over the same code more than once, e.g. after gathering more context
elsewhere in the shader. This also means that a shader that doubles in size will take more than twice as long to compile. When inlining hundreds or thousands
of shaders into one, that also means that shader’s compile time grows by a lot more than just a hundred or a thousand times.
Thus, if we reverse things and are suddenly able to stop inlining all the shaders into one, that scaling effect means all the shaders will take less total
time to compile than the one big megashader. In practice, all modern games also offload shader compilation to multiple threads. If you can compile the any-hit
shaders separately, the game can compile them all in parallel - this just isn’t possible with the single megashader which will always be compiled on a single
thread.
In the runtime performance department, moving to just having a single call instruction instead of hundreds of inlined shaders in one place means the traversal loop has a much
smaller code size. With inlining, even in a loop iteration where you don’t call any any-hit shaders, you would still need to jump over all of the code for those shaders, almost
certainly causing instruction cache misses, stalls and so on.
Forcing any-hit/intersection shaders to be separate also means that any-hit/intersection shaders that consume tons of registers despite nearly never getting
called won’t have any negative effects on ray traversal as a whole. ACO has heuristics on where to optimally insert memory stores in case something somewhere
needs more registers than are available. However, these heuristics may decide to insert memory stores inside the generic traversal loop, even if the problematic
register usage only comes from a few rarely-called inlined shaders. These stores in the generic loop would then slow down the whole shader in every case.
However, separate compilation doesn’t only have advantages, either. In an inlined shader, the compiler is able to use the context surrounding the
(now-inlined) shader to optimize its code. A separately-compiled shader needs to be able to get called from any imaginable context (as long as it conforms
to the ABI), and this inhibits optimization.
Another consideration is that the jump itself has a small cost (not as big as you’d think, but it does have a cost). RADV currently keeps inlining any-hit
shaders as long as you don’t have too many of them, and as long as doing so wouldn’t inhibit the ability to compile the shaders in parallel.
About that big UE5 Lumen perf improvement
I also opened a merge request right before the branchpoint that provided massive performance improvements to Lumen’s RT.
However, these improvements are completely unrelated to function calls. In fact, they’re a tiny bit embarrassing, because all that changed was that RADV
doesn’t make the hardware do ridiculously inefficient things anymore.
Let’s talk about dispatching RT shaders. The Vulkan API provides a vkCmdTraceRaysKHR command that takes in the number of rays to dispatch for X, Y and Z
dimensions. Usually, compute dispatches are described in terms of how many thread groups to dispatch, but RT is special because one ray corresponds to one
thread. So here, we really get the dispatch sizes in threads, not groups.
By itself, that’s not an issue. In fact, AMD hardware has always allowed specifying dispatch dimensions in threads instead of groups. In that case,
the hardware takes on the job of launching just enough groups to hold the specified number of threads. The issue here comes from how we describe that
group to the hardware. The workgroup size itself is also per-dimension, and the simplest case of 32x1x1 threads (i.e. a 1D workgroup) is actually not
always the best.
Let’s consider a very common ray tracing use case: You might want to trace a ray for each pixel in a 1920x1080 image. That’s pretty easy, you just call
vkCmdTraceRaysKHR to dispatch 1920 rays in the X dimension and 1080 in the Y dimension.
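In API terms, that dispatch looks like this (the command buffer and the shader binding table regions are assumed to be set up already; the variable names are made up):

/* Trace one ray per pixel of a 1920x1080 image. The dispatch size is given
 * in rays (threads), not in workgroups. */
vkCmdTraceRaysKHR(cmd_buffer,
                  &raygen_region,    /* VkStridedDeviceAddressRegionKHR */
                  &miss_region,
                  &hit_region,
                  &callable_region,
                  1920, 1080, 1);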
When you dispatch a 32x1x1 workgroup, the coordinates for each thread in a workgroup look like this:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 63 |
coord |(0,0)|(1,0)|(2,0)| ... |(16,0)|(17,0)|...|(63,0)|
Or, if you consider how the thread IDs are laid out in the image:
-------------------
0 | 1 | 2 | 3 | ..
-------------------
That’s a straight line in image space. That’s not the best, because it means that the pixels will most likely cover different objects which may have very
different trace characteristics. This means divergence during RT will be higher, which can make the overall process slower.
Let’s look instead at what happens when you make the workgroup 2D, with an 8x4 size:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 63 |
coord |(0,0)|(1,0)|(2,0)| ... |(0,2) |(1,2) |...|(7,3) |
In image space:
-------------------
0 | 1 | 2 | 3 | ..
------------------
8 | 9 | 10| 11| ..
------------------
16| 17| 18| 19| ..
-------------------
That’s much better. Threads are now arranged in a little square, and these squares are much more likely to all cover the same objects, have similar RT characteristics,
etc.
This is why RADV used 8x4 workgroups as well. Now let’s get to when this breaks down.
What if the RT dispatch doesn’t actually have 2 dimensions? What if there are 1920 rays in the X dimension, but the Y dimension is just 1?
It turns out that the hardware can only run 8 threads in a single wavefront in this case. This is because the rest of the workgroup is out-of-bounds of the dispatch -
it has a non-zero Y coordinate, but the size in the Y dimension is only 1, so it would exceed the dispatch bounds.
The hardware also can’t pull in threads from other workgroups, because one wavefront can only ever execute one workgroup. The end result is that the wave
runs with only 8 out of 32 threads active - at 1/4 theoretical performance. For no real reason.
I had actually noticed this issue years ago (with UE4, ironically). Back then I worked around it by rearranging the game’s dispatch sizes into a 2D dispatch behind
its back, and recalculating a 1-dimensional dispatch ID inside the RT shader so the game wouldn’t notice. That worked just fine… as long as we actually knew
the dispatch sizes.
UE5 doesn’t actually use vkCmdTraceRaysKHR. It uses vkCmdTraceRaysIndirectKHR, a variant of the command where the dispatch size is read from GPU memory, not
specified on the CPU. This command is really cool and allows for some nifty GPU-driven rendering setups where you only dispatch as many rays as
you’re definitely going to trace (as determined by previous GPU commands). It also rips a giant hole in the approach of rearranging dispatch sizes, because
we don’t even know the dispatch size until the dispatch is actually executed. That means the super simple workaround I built was never hit, and we had the same
embarrassingly inefficient RT performance as a few years ago all over again.
Obviously, if UE5 is too smart for your workaround, then the solution is to make an even smarter workaround. The ideal solution would work with a 1D thread ID (so that
we don’t run into any more issues when there is a 1D dispatch), but if a 2D dispatch is detected, turn that “line” of 1D IDs into a “square”.
The whole idea about turning a linear coordinate into a square reminded me a lot of how Z-order curves work.
In fact, GPUs already arrange things like image data on a Z-order curve by interleaving the bits of the X and Y coordinates into the address, because nearby pixels are often accessed
together and it’s better if they’re close to each other in memory.
However, instead of interleaving an X and Y coordinate pair to make a linear memory address, we want the opposite: We have a linear dispatch ID, and we want to recover
a 2D coordinate inside a square from it. That’s not too hard, you just do the opposite operation: Deinterleave the bits, so that the even bits of the dispatch ID form
the X coordinate and the odd bits form the Y coordinate. As it turned out, you can actually do this entirely from inside the shader with just a few bit twiddling tricks, so this approach works for both
indirect and direct (non-indirect) trace commands.
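A minimal sketch of that deinterleaving in C (RADV emits the equivalent bit operations inside the shader, so consider the function names here purely illustrative):

#include <stdint.h>

/* Compact the even-indexed bits of x into the low bits - the inverse of the
 * usual "interleave with zeros" step of Morton/Z-order encoding. */
static uint32_t compact_even_bits(uint32_t x)
{
    x &= 0x55555555u;                 /* keep only the even-indexed bits */
    x = (x | (x >> 1)) & 0x33333333u; /* pack bit pairs together */
    x = (x | (x >> 2)) & 0x0f0f0f0fu; /* pack nibbles */
    x = (x | (x >> 4)) & 0x00ff00ffu; /* pack bytes */
    x = (x | (x >> 8)) & 0x0000ffffu; /* pack 16-bit halves */
    return x;
}

/* Recover a 2D coordinate from a linear dispatch ID: the even bits form X,
 * the odd bits form Y. */
static void deinterleave_dispatch_id(uint32_t id, uint32_t *x, uint32_t *y)
{
    *x = compact_even_bits(id);
    *y = compact_even_bits(id >> 1);
}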
With that approach, dispatch IDs and coordinates look something like this:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 63 |
coord |(0,0)|(1,0)|(0,1)| ... |(4,0) |(5,0) |...|(7,7) |
In image space:
-------------------
0 | 1 | 4 | 5 | ..
------------------
2 | 3 | 6 | 7 | ..
------------------
8 | 9 | 12| 13| ..
-------------------
10| 11| 14| 15| ..
-------------------
Not only are the thread IDs now arranged in squares, the squares themselves get recursively subdivided into more squares! I think theoretically this should be a further
improvement w.r.t. divergence, but I don’t think it has resulted in a measurable speedup in practice anywhere.
The most important thing, though, is that now UE5 RT doesn’t run 4x slower than it should. Oops.
Bonus content: Function call bug bonanza
The second most fun thing about function calls is that you can just jump to literally any program anywhere, provided the program doesn’t completely thrash your preserved registers and stack space.
The most fun thing about function calls is what happens when the program does just that.
I’m going to use this section to scream into the void about two very real function call bugs that were reported after I had already merged the MR. This is
not an exhaustive list - you can trust I had much, much more fun of exactly this kind while I was testing and developing function calls.
Avowed gets stuck in an infinite loop
On the scale of function call bugs, this one was rather tame, even. An infinite loop isn’t the nicest thing to debug a hang with, but it does mean that you can use a tool like umr to sample
which wavefronts are active, and get some register dumps. The program counter will at least point to some instruction in the loop the shader is stuck in, and you can get yourself the disassembly
of the whole shader to try and figure out what’s going on in the loop and why the exit conditions aren’t met.
The loop in Avowed was rather simple: It traced a ray in a loop, and when the loop counter was equal to an exit value, control flow would break out of the loop. The register dumps also immediately
highlighted the loop exit counter being random garbage. So far so good.
During the traceRay call, the loop exit counter was backed up to the shader’s stack. Okay, so it’s pretty obvious that the stack got smashed somehow and that corrupted the loop exit counter.
What was not obvious, however, was what smashed the stack. Debugging this is generally a bit of an issue - GPUs are far, far away from tools like AddressSanitizer, especially at a compiler level.
There are no tools that would help me catch a faulty access at runtime.
All I could really do was look at all the shaders in that ray tracing pipeline (luckily that one didn’t have too many) and see if they somehow store to wrong stack locations.
All shaders in that pipeline were completely fine, though. I checked every single scratch instruction in every shader to see if the offsets were correct (luckily, the offsets are constants encoded in the disassembly, so this part was trivial).
I also verified that the stack pointer was incremented by the correct values - everything was completely fine. No shader was smashing its callers’ stack.
I found the bug more or less by complete chance. The shader code was indeed completely correct, there were no miscompilations happening.
Instead, the “scratch memory” area the HW allocated was smaller than what the threads actually used, because in one place I forgot to multiply the per-thread size by the number of threads in a wavefront.
The stack wasn’t smashed by the called function, it was smashed by a completely different thread. Whether your stack would get smashed was essentially complete luck,
depending on where the HW placed your scratch memory area and other wavefronts’ scratch, and how those wavefronts’ execution was timed relative to yours.
I don’t think I would ever have been able to deduce this from any debugger output, so I should probably count myself lucky I stumbled upon the fix regardless.
Silent Hill 2’s reflections sample the sky color
Did I talk about Unreal Engine yet? Let’s talk about Unreal Engine some more. Silent Hill 2 uses Lumen for its reflection/GI system, and somehow Lumen from UE 5.3 specifically
was the only thing that seemed to reproduce this particular bug.
In every way that the Avowed bug was tolerable to debug, this one was pure suffering. There were no GPU hangs, and all shaders ran completely fine. That meant getting
a rough idea of where the issue was with umr was off the table from the start. Unfortunately, the RT pipeline was also way too large to analyze - there were a few hundred hit shaders,
but there also were seven completely different ray generation shaders.
Having little other recourse, I started trying to at least narrow down the ray generation shader that triggered the fault. I used Mesa’s debugging environment variables to
dump the SPIR-V of all the shaders the driver encountered, and then used spirv-cross on all of them to turn them into editable GLSL. For each ray generation shader, I’d
comment out the imageStore instructions that stored the RT result to some image, recompile the modified GLSL to SPIR-V, and instruct Mesa to sneakily swap out the original
ray-gen SPIR-V with my modified one. Then I’d re-run the game to see if anything changed.
This indeed led me to find the correct ray generation shader, but the lead turned into a dead end - there was little insight to gain other than that the ray was indeed executing the
miss shader. Everything seemed correct so far, and if I hadn’t known that these rays weren’t missing about 3 commits earlier, I honestly wouldn’t even have suspected anything was wrong at all.
The next thing I tried was commenting out random things in ray traversal code. Skipping over all any-hit/intersection shaders yielded no change, and neither did replacing
the ray flags/culling masks with known good constants to rule out wrong values being passed as parameters. What did “fix” the result, however, was… commenting out the calls to
closest-hit shaders.
Now, if closest-hit shaders get called and that somehow makes miss shaders execute, you’d perhaps think we’d be calling the wrong function. Maybe we mix up entries in the shader binding table,
where we load the addresses of the shaders to call from? To verify that assumption, I also disabled calling any and all miss shaders. I zeroed out the addresses in the shader handles to
make extra sure there was no possible way that a miss shader could ever get called. To keep things working, I replaced the code that calls miss shaders with the relevant code fragment
from UE’s miss shader (essentially inlining the shader myself).
Nothing changed from that. That meant that a closest-hit shader being executed somehow resulted in ray traversal itself returning a miss, not in the wrong function being called.
Perhaps the closest-hit shaders were corrupting some caller values again? Since the RT pipeline was too big to analyze, I tried to narrow down the suspicious shaders by only disabling specific
closest-hit shaders. I also discovered that just making all closest-hit shaders no-ops “fixed” things as well, even if they did get called.
Sure enough, at some point I had a specific closest-hit shader where the issue went away once I deleted all code from it/made it a no-op. I even figured out a specific register that,
if explicitly preserved, would make the issue go away.
The only problem was that this register corresponded to one part of the return value of the closest-hit shader - that is, a register that the shader was supposed to overwrite.
From here on out it gets completely nonsensical. I will save you the multiple days of confusion, hair-pulling, desperation and agony over the complete and utter undebuggableness of Lumen’s RT setup
and skip to the solution:
It turned out the “faulty” closest-hit shader I found was nothing but a red herring. Lumen’s RT consists of 6+ RT dispatches, most of which I haven’t exactly figured out the purpose of, but what
I seemed to observe was that the faulty RT dispatch used the results of the previous RT dispatch to make decisions on whether to trace any rays or not. Making the closest-hit shaders a no-op
did nothing but disable the subsequent traceRays that actually exhibited the issue.
Since these RT dispatches used the same RT pipelines, that meant virtually any avenue I had of debugging this driver-side was completely meaningless. Any hacks inside the shader compiler might actually work
around the issue, or just affect a conceptually unrelated dispatch that happens to disable the actually problematic rays. Determining which was the case was nearly impossible, especially in a general case.
I never really figured out how to debug this issue. Once again, what saved me was a random epiphany out of the blue. In fact, now that I know what the bug was, I’m convinced I would’ve never found this
through a debugger either.
The issue turned out to be in an optimization for what’s commonly called tail-calls. If you have a function that calls another function at the very end just before returning, a common optimization is
to simply turn that call into a jump, and let the other function return directly to the caller.
Imagine ray traversal working a bit like this C code:
/* hitT is the t value of the ray at the hit point */
payload closestHit(float hitT);
/* tMax is the maximum range of the ray, if there is
* no hit with a t <= tMax, the ray misses instead */
payload traversal(float tMax) {
    /* ... do the actual traversal work ... */
    if (hit)
        return closestHit(hitT); // gets replaced with a jmp, closestHit returns directly to traversal's caller
}
More specifically, the bug was with how preserved parameters and tail-calls interact. Function callers are generally allowed to assume that preserved parameters do not change their value across the
function call. That means it’s safe to reuse that register after the call, assuming it still has the value the caller put in.
However, in the example above, let’s assume closestHit has the same calling convention as traversal. That means closestHit’s parameter needs to go into the same register as traversal’s parameter, and thus the register
gets overwritten.
If traversal’s caller was assuming that the parameter is preserved, that would mean the value of tMax has just been overwritten with the value of hitT without the caller knowing.
If traversal now gets called again from the same place, the value of tMax is not the intended value, but the hitT value from the previous iteration, which is definitely smaller than tMax.
Put shortly: If all these conditions are met, a smaller-than-intended tMax could cause rays to miss when they were intended to hit.
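To make that concrete, here’s a sketch of the caller’s side, continuing the pseudo-C example from above (shade() and the constants are made up, and the comments describe register behavior that plain C obviously doesn’t have - this is not Lumen’s actual shader code):

void shade(payload p); /* made-up consumer of the traversal result */

void raygen(void)
{
    /* tMax lives in a parameter register that traversal's ABI marks as preserved,
     * so this caller assumes the register still holds 10000.0f after each call. */
    float tMax = 10000.0f;
    for (int bounce = 0; bounce < 2; bounce++) {
        payload p = traversal(tMax); /* traversal tail-calls closestHit(hitT), which
                                      * writes hitT into that same parameter register */
        shade(p);
        /* On the next iteration, the register that was supposed to still hold
         * tMax actually holds the previous hit's (smaller) hitT, so rays that
         * should hit now report a miss. */
    }
}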
Once again, I got incredibly lucky and stumbled upon the bug by complete chance.
The GPU gods seem to be in good spirits for my endeavours. I pray it stays this way.