Introducing playcanvas-opti-pixel — a powerful library for optimizing graphics rendering in the PlayCanvas Engine. It is specifically designed to boost the performance of WebGL2 and WebGPU applications using advanced techniques such as occlusion queries, HZB, and efficient GPU buffer management.
Key Features
Occlusion Queries (WebGL2): Fast culling of invisible objects using occlusion queries, reducing GPU load and increasing FPS in complex scenes.
HZB Construction (WebGL2 | WebGPU): Hierarchical Z-Buffer for hierarchical depth testing, delivering accurate occlusion with minimal computation.
Quad Data Textures: Store structured data (e.g., positions or attributes) in compact square textures for fast shader access.
Index Upload to GPU Buffers: Optimized index transfer to storage buffers with support for dynamic updates and minimal copying.
BVH: Bounding volume hierarchy that accelerates frustum culling and LOD selection.
Instanced Static Mesh: Mesh instancing with fast frustum culling (BVH acceleration).
Hierarchical Instanced Static Mesh: Mesh instancing with fast frustum culling, LODs, and shadow LODs (BVH acceleration).
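To make the Quad Data Textures item a bit more concrete, here is a minimal sketch of how per-object records could be laid out in a square texture. The helper names (`squareTextureSize`, `texelCoord`, `packVec4`) are hypothetical and not the library's API, and the sketch assumes one RGBA float texel per record.

```typescript
// Hypothetical helpers for illustration; not the playcanvas-opti-pixel API.

/** Smallest square side length that can hold `texelCount` texels. */
function squareTextureSize(texelCount: number): number {
    let side = Math.max(1, Math.ceil(Math.sqrt(texelCount)));
    // Guard against floating-point rounding in sqrt for very large counts.
    while (side * side < texelCount) side++;
    return side;
}

/** Texel coordinate of record `index` in a row-major square texture. */
function texelCoord(index: number, side: number): [number, number] {
    return [index % side, Math.floor(index / side)];
}

/** Copy vec4 records into a zero-padded RGBA float pixel array. */
function packVec4(records: Float32Array): { side: number; pixels: Float32Array } {
    const count = Math.ceil(records.length / 4);
    const side = squareTextureSize(count);
    const pixels = new Float32Array(side * side * 4); // unused texels stay zero
    pixels.set(records);
    return { side, pixels };
}
```

In a shader, record `i` would then be fetched from texel `(i % side, i / side)`, which is the same mapping as `texelCoord`.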
The HZB construction algorithm was further simplified: unnecessary conditions were removed, and the resolution was adjusted to a power of two. The occlusion test formula was also revised. This reduced the time required to build the HZB, although it slightly increased the complexity of the test itself. As a result, a trade-off was achieved that improved overall performance.
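As a rough illustration of the power-of-two adjustment, the sketch below derives an HZB size and mip count from a viewport size. The post does not say whether the resolution is rounded up or down, so rounding down is an assumption here, and `floorPowerOfTwo` and `hzbLayout` are hypothetical names.

```typescript
// Sketch only: assumes the HZB is rounded DOWN to a power of two.

/** Largest power of two that is <= v (for 1 <= v < 2^31). */
function floorPowerOfTwo(v: number): number {
    return 1 << (31 - Math.clz32(Math.max(1, v)));
}

interface HzbLayout {
    width: number;
    height: number;
    mipCount: number; // full chain down to 1x1, including mip 0
}

function hzbLayout(viewportWidth: number, viewportHeight: number): HzbLayout {
    const width = floorPowerOfTwo(viewportWidth);
    const height = floorPowerOfTwo(viewportHeight);
    // For a power-of-two size s, log2(s) + 1 == 32 - clz32(s).
    const mipCount = 32 - Math.clz32(Math.max(width, height));
    return { width, height, mipCount };
}
```

With power-of-two dimensions every mip level is exactly half the previous one, so the downsampling pass never has to handle odd-sized edges.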
Additionally, as an experiment, buffer construction was moved to a microtask after requestAnimationFrame. This approach allows the frame to be released while building the HZB between frames. However, further testing is required, as this may affect input event handling. On the devices tested, this approach showed a performance improvement of up to 30%.
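The scheduling idea can be sketched roughly as follows. This is not the library's code: `scheduleFrame` and its parameters are hypothetical, the frame scheduler is injected so the sketch also runs outside a browser, and `Promise.resolve().then(...)` is used as a portable way to queue a microtask.

```typescript
// Sketch only; names are illustrative assumptions.
type FrameScheduler = (callback: () => void) => void;

function scheduleFrame(
    raf: FrameScheduler,
    render: () => void,
    buildHzb: () => void
): void {
    raf(() => {
        render();
        // Queue the HZB build as a microtask: it runs right after this
        // callback returns instead of inside the callback itself.
        Promise.resolve().then(buildHzb);
    });
}
```

In a browser this could be driven with `scheduleFrame(cb => requestAnimationFrame(cb), render, buildHzb)`. Microtasks run before the browser goes back to processing input events, which is presumably why the effect on input handling still needs testing.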
The WebGPU algorithm has been reworked using this approach. HZB generation now uses batching: up to 4 mip levels are processed in a single pass. The method is based on the HZB generation approach used in Unreal Engine.
As a result, the number of compute shader dispatches has been reduced by up to 4×.
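To show where the reduction comes from, here is a small sketch that plans the dispatches for a mip chain. It assumes mip 0 is filled from the scene depth and only mips 1 and up are generated; `planMipBatches` is a hypothetical name.

```typescript
// Sketch only: assumes mip 0 comes from the depth buffer and each
// dispatch can write up to `maxPerDispatch` consecutive mip levels.
interface MipBatch {
    firstMip: number;
    count: number;
}

function planMipBatches(mipCount: number, maxPerDispatch = 4): MipBatch[] {
    const batches: MipBatch[] = [];
    for (let mip = 1; mip < mipCount; mip += maxPerDispatch) {
        batches.push({
            firstMip: mip,
            count: Math.min(maxPerDispatch, mipCount - mip),
        });
    }
    return batches;
}
```

For an 11-level chain (1024×1024 down to 1×1) this gives 3 dispatches covering 4, 4 and 2 levels instead of 10 single-level dispatches, so the gain is close to 4× but depends on how evenly the levels divide.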
I will share performance benchmarks and comparisons later.
The computeMipBatch algorithm for HZB generation has been improved: multiple mip levels can now be generated within a single dispatch. It is now configurable, and you can choose whether to compute HZB using a compute shader or a pixel shader.
Nice one! I’ve built something similar myself (fully GPU driven via emulated mesh shaders) and I’d love to know how it compares. I got stuck on proving it improved anything in real world tests though. Massive improvements in benchmarks but not so much in complex examples.
Hi Adrian, thank you for visiting our library and welcome!
HZB is a fairly complex and expensive technique: building the depth pyramid is itself a costly operation, so it needs to be used carefully. On WebGPU, as long as there is no multi_draw_indirect support, this technique is unlikely to give a noticeable performance boost, except in scenes with really heavy geometry.
On WebGL, however, HZB is a good alternative to hardware occlusion queries: with a very large number of occlusion checks (500+), the cost of building the HZB pays for itself and the approach becomes cheaper overall, especially on iPhones and PCs. On some Android devices, though, tests show that this approach can be slower.
At the same time, when using an HZB tester on the GPU, you can reuse the depth pyramid not only for occlusion checks but also for other tasks, such as LOD calculation, which gives a tangible advantage over classic approaches. This is demonstrated in the tester here: GetFlags.glsl.ts
The conversion of an HZB level into a depth texture with a 2× scale does increase the number of pixels, but it scales well under parallel execution. In this approach, each thread performs only four texture reads, which is considered optimal for mobile GPUs. It avoids memory bandwidth bottlenecks and, more importantly, eliminates branching. This reduces unnecessary computations and minimizes thread divergence and idle time.
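As a CPU reference for the four-reads-per-thread step, the sketch below reduces one depth level to the next. It is not the library's shader: it assumes a conventional depth range where larger values are farther away, so each output texel keeps the maximum of its 2×2 footprint (a reversed-Z setup would use the minimum instead). `downsampleDepth` is a hypothetical name.

```typescript
// Sketch only: assumes depth 1.0 = far, so the conservative value is the max.
interface DepthLevel {
    width: number;
    height: number;
    data: Float32Array; // row-major depth values
}

function downsampleDepth(src: DepthLevel): DepthLevel {
    const width = Math.max(1, src.width >> 1);
    const height = Math.max(1, src.height >> 1);
    const data = new Float32Array(width * height);
    for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
            const x0 = 2 * x;
            const y0 = 2 * y;
            // Clamp so 1-texel-wide levels still read valid texels.
            const x1 = Math.min(x0 + 1, src.width - 1);
            const y1 = Math.min(y0 + 1, src.height - 1);
            // Exactly four reads per output texel, no data-dependent branching.
            const a = src.data[y0 * src.width + x0];
            const b = src.data[y0 * src.width + x1];
            const c = src.data[y1 * src.width + x0];
            const d = src.data[y1 * src.width + x1];
            data[y * width + x] = Math.max(a, b, c, d);
        }
    }
    return { width, height, data };
}
```

Calling this repeatedly until the level is 1×1 produces the full pyramid; on the GPU each output texel would map to one fragment or compute thread.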
Batching multiple mip levels into a single dispatch, on the other hand, performs poorly on mobile devices. The main issue is the large volume of data being processed, which, without compression, becomes a bottleneck and degrades performance. This approach works well primarily on desktop GPUs and high-end devices such as iPhone and iMac. For other cases, a pixel shader implementation was chosen.
I don’t think it’s the HZB, as presumably that’s a fixed cost, right? Yes, you’re probably right that the scene is unlikely to be complex enough. My synthetic scene with ~4000 Stanford bunnies shows huge gains, but it’s likely that once materials are in the mix it’s not quite right (it’s emulated bindless materials via texture arrays). For example, if I move out of the scene and everything is culled, my very low-performance Chromebook is fast, but if I’m in front of a wall and most things are culled, it’s no different from when I move away from the wall and not much is culled. Occlusion culling does improve performance most of the time, though, except in this scenario. I even have a wireframe debug mode to prove they’re not rendered.
It’s been ages since I worked on it; maybe it’s time for another review.
As I mentioned earlier, until there is proper support for multi-draw indirect, the engine will still issue draw calls even for objects that have been culled. This is essentially how HZB works in WebGPU right now.
When you turn the camera away, frustum culling kicks in and reduces the number of draw calls. However, when you’re facing a wall and the bunnies are occluded, draw calls for those bunnies are still being executed. Because of that, you’re likely hitting a draw call throughput bottleneck rather than a shading or geometry cost.
That’s why there’s little to no performance difference between “mostly occluded by a wall” and “fully visible” cases—despite your wireframe debug confirming they aren’t actually rendered.
If you’re interested in how indirect draw is currently handled in the engine, you can check the implementation here:
You could also try integrating my WebGL-based implementation, where draw calls for culled objects are completely skipped. The tradeoff is a 2–3 frame delay due to asynchronous GPU readback, which can cause objects to pop in noticeably. This approach is typically described as Async GPU Readback or a round-robin occlusion strategy.
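The delayed-visibility behaviour can be sketched like this. It only models the latency, not the actual GPU readback, and `DelayedVisibility` with its methods is a hypothetical class rather than the library's API. Objects are treated as visible until the first result arrives.

```typescript
// Sketch only: models results that become readable `latency` frames
// after they were requested. Not the library's implementation.
class DelayedVisibility {
    private pending: { frame: number; visible: Uint8Array }[] = [];
    private latest: Uint8Array;
    private latency: number;

    constructor(objectCount: number, latency = 2) {
        this.latency = latency;
        // Until the first readback completes, draw everything.
        this.latest = new Uint8Array(objectCount).fill(1);
    }

    /** Record the occlusion result requested during `frame`. */
    submit(frame: number, visible: Uint8Array): void {
        this.pending.push({ frame, visible });
    }

    /** Visibility flags usable at `frame` (the newest result old enough). */
    resolve(frame: number): Uint8Array {
        while (
            this.pending.length > 0 &&
            this.pending[0].frame + this.latency <= frame
        ) {
            this.latest = this.pending.shift()!.visible;
        }
        return this.latest;
    }
}
```

Draw calls are then skipped on the CPU for every object whose flag is 0. Because the flags describe the view from a few frames ago, a fast camera move can reveal an object before its flag flips back to 1, which is the pop-in mentioned above.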
Regarding the cost of generating the depth pyramid, it always depends on multiple factors, with memory bandwidth and overall GPU load being the most important.
A GPU is a complex system, and resource reuse plays a big role. If, after a draw call, you reuse the same resources in subsequent draw calls, they may still reside in cache and can be accessed very quickly. However, if the cache is already saturated, the GPU needs to evict or reorganize data before loading new resources, which introduces additional cost.
Because of this, the cost is never truly constant. Even for the same operation, such as building a depth pyramid, the performance can vary depending on cache state, memory pressure, and what other work the GPU is doing at that moment.
A lightweight BVH structure has been added to the library.
Instanced Static Mesh and Hierarchical Instanced Static Mesh are implemented with support for updating individual instance transforms, rebuilding the BVH, changing color and other parameters, as well as shadow LODs.
BVH is used to accelerate LOD calculation and frustum culling, delivering very high performance. The solution has been tested with one million instanced objects. Memory usage is highly optimized on both CPU and GPU.
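For readers unfamiliar with the technique, here is a generic sketch of BVH-accelerated frustum culling. It is a textbook version under stated assumptions, not the library's data layout: nodes are plain objects rather than flat typed arrays, planes are `(nx, ny, nz, d)` with the inside defined by `n·p + d >= 0`, and all names are hypothetical.

```typescript
// Generic sketch; the library's actual BVH layout is not shown in this thread.
type Vec3 = [number, number, number];
type Plane = [number, number, number, number]; // inside when n·p + d >= 0

interface BvhNode {
    min: Vec3;
    max: Vec3;
    left?: BvhNode;
    right?: BvhNode;
    instances?: number[]; // present on leaves only
}

const OUTSIDE = 0;
const INTERSECT = 1;
const INSIDE = 2;

function classify(node: BvhNode, planes: Plane[]): number {
    let result = INSIDE;
    for (let i = 0; i < planes.length; i++) {
        const p = planes[i];
        let far = p[3];  // plane distance of the box corner farthest along the normal
        let near = p[3]; // plane distance of the opposite corner
        for (let k = 0; k < 3; k++) {
            far += p[k] * (p[k] >= 0 ? node.max[k] : node.min[k]);
            near += p[k] * (p[k] >= 0 ? node.min[k] : node.max[k]);
        }
        if (far < 0) return OUTSIDE; // whole box is behind this plane
        if (near < 0) result = INTERSECT;
    }
    return result;
}

function cullBvh(node: BvhNode, planes: Plane[], out: number[], parentInside = false): void {
    let inside = parentInside;
    if (!inside) {
        const c = classify(node, planes);
        if (c === OUTSIDE) return; // rejects the entire subtree at once
        inside = c === INSIDE;     // children need no further plane tests
    }
    if (node.instances) {
        // Leaves that only intersect are accepted conservatively.
        for (const id of node.instances) out.push(id);
        return;
    }
    if (node.left) cullBvh(node.left, planes, out, inside);
    if (node.right) cullBvh(node.right, planes, out, inside);
}
```

The speed-up comes from the two early exits: a subtree fully outside is dropped with one test, and a subtree fully inside is accepted without testing its children.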
We implemented depth-based object sorting using a highly optimized radix sort algorithm. The system has been tested on millions of instances and demonstrates excellent performance. Special attention was given to minimizing memory allocations by relying exclusively on typed arrays.
As a further optimization, when maxBits is below 32 the minimum value is subtracted from every key to reduce the number of significant bits (maxBits) that must be sorted, further improving performance.
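A minimal sketch of the idea, assuming depth has already been quantized to unsigned 32-bit keys: an LSD radix sort over 8-bit digits that subtracts the minimum key so that only the bits of the remaining range need passes. Unlike the description above, this sketch always subtracts the minimum, and `radixSortIndices` is a hypothetical name rather than the DepthQueue API.

```typescript
// Sketch only: sorts indices by ascending uint32 key using typed arrays.
function radixSortIndices(keys: Uint32Array): Uint32Array {
    const n = keys.length;
    let indices = new Uint32Array(n);
    let scratch = new Uint32Array(n);
    if (n === 0) return indices;

    let min = keys[0];
    let max = keys[0];
    for (let i = 0; i < n; i++) {
        indices[i] = i;
        if (keys[i] < min) min = keys[i];
        if (keys[i] > max) max = keys[i];
    }

    // Significant bits of the range after subtracting the minimum.
    const range = max - min;
    const bits = range === 0 ? 0 : 32 - Math.clz32(range);
    const counts = new Uint32Array(256);

    // One stable counting pass per 8-bit digit; fewer bits means fewer passes.
    for (let shift = 0; shift < bits; shift += 8) {
        counts.fill(0);
        for (let i = 0; i < n; i++) {
            counts[((keys[indices[i]] - min) >>> shift) & 255]++;
        }
        let sum = 0; // exclusive prefix sum turns counts into start offsets
        for (let d = 0; d < 256; d++) {
            const c = counts[d];
            counts[d] = sum;
            sum += c;
        }
        for (let i = 0; i < n; i++) {
            const idx = indices[i];
            const digit = ((keys[idx] - min) >>> shift) & 255;
            scratch[counts[digit]++] = idx;
        }
        const tmp = indices;
        indices = scratch;
        scratch = tmp;
    }
    return indices;
}
```

For keys such as 1000 to 1003 the range is 3, so a single pass is enough instead of the four passes a full 32-bit sort would need.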
The class is called DepthQueue, as it operates alongside IndexQueue. This pairing allows us to eliminate redundant passes when computing minimums, maximums, and other auxiliary data used for optimizations.
The screenshot shows a demo scene with ~1 million objects, where sorting and LOD selection take around 3 ms.
We are also close to completing HierarchicalInstancer and ClusterWorld. ClusterWorld builds a spatial grid that accelerates frustum culling, occlusion culling, and LOD processing across the entire scene.
HierarchicalInstancer supports an optimization where a single object can be rendered with one drawCommand: if all LODs are stored in a single buffer, multiple draw calls are merged into one. It also supports multiple MeshInstances per LOD (with different materials, etc.). The component is designed to work with prefabs.