Introducing playcanvas-opti-pixel, a library for optimizing graphics rendering in the PlayCanvas Engine. It is designed to improve the performance of WebGL2 and WebGPU applications through occlusion queries, a hierarchical Z-buffer (HZB), and efficient GPU buffer management.
Key Features
Occlusion Queries (WebGL2): Fast culling of invisible objects using occlusion queries, reducing GPU load and increasing FPS in complex scenes.
HZB Construction (WebGL2 | WebGPU): Hierarchical Z-Buffer (depth mip pyramid) for multi-level depth testing, delivering accurate occlusion with minimal computation.
Quad Data Textures: Store structured data (e.g., positions or attributes) in compact square textures for fast shader access.
Index Upload to GPU Buffers: Optimized index transfer to storage buffers with support for dynamic updates and minimal copying.
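To make the Quad Data Textures idea concrete, here is a minimal sketch of how records can be laid out in a square texture. The helper names (squareSide, texelCoord, packRecords) and the one-RGBA-texel-per-record layout are assumptions made for illustration, not the library's actual API.

```typescript
// Hypothetical helpers; not part of playcanvas-opti-pixel's API.

/** Smallest square side (in texels) that can hold `count` texels. */
function squareSide(count: number): number {
  return Math.max(1, Math.ceil(Math.sqrt(count)));
}

/** Maps a linear record index to integer texel coordinates in a square texture. */
function texelCoord(index: number, side: number): { x: number; y: number } {
  return { x: index % side, y: Math.floor(index / side) };
}

/** Packs records of up to 4 floats each into RGBA float texel data (assumed layout). */
function packRecords(records: number[][]): { side: number; data: Float32Array } {
  const side = squareSide(records.length);
  const data = new Float32Array(side * side * 4); // unused texels stay zero
  records.forEach((r, i) => data.set(r.slice(0, 4), i * 4));
  return { side, data };
}
```

A shader can then recover record i with the same mapping, for example by fetching the texel at (i % side, i / side) with texelFetch in GLSL ES 3.00.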
The HZB construction algorithm has been simplified further: unnecessary conditions were removed, and the HZB resolution is now adjusted to a power of two. The occlusion test formula was also revised. Building the HZB is now faster, at the cost of a slightly more complex test. On balance, overall performance improved.
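A rough sketch of why a power-of-two resolution removes special cases: every level halves exactly, so each texel always covers a full 2×2 block of the level below and no odd-edge handling is needed. The function names are hypothetical, and rounding down (rather than up) to a power of two is an assumption; the library may choose differently.

```typescript
interface MipSize { w: number; h: number; }

/** Largest power of two <= n, for 1 <= n < 2^31. Rounding down is an assumption. */
function floorPow2(n: number): number {
  return 1 << (31 - Math.clz32(n));
}

/** Sizes of all HZB levels, from the power-of-two base down to 1x1. */
function hzbMipSizes(width: number, height: number): MipSize[] {
  let w = floorPow2(width);
  let h = floorPow2(height);
  const sizes: MipSize[] = [{ w, h }];
  while (w > 1 || h > 1) {
    w = Math.max(1, w >> 1); // exact halving, never an odd remainder
    h = Math.max(1, h >> 1);
    sizes.push({ w, h });
  }
  return sizes;
}
```

For a 1920×1080 depth buffer this gives a 1024×512 base level and 11 levels in total.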
Additionally, as an experiment, HZB construction was moved to a microtask queued after requestAnimationFrame. This allows the frame to be released while the HZB is built between frames. On the devices tested, this showed a performance improvement of up to 30%. Further testing is still required, as this may affect input event handling.
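The scheduling pattern described here can be sketched roughly as follows. The names scheduleFrame, render, and buildHzb are hypothetical, and the frame scheduler is passed in as a parameter so the sketch also runs outside a browser; this is not the library's actual code.

```typescript
type FrameRequest = (cb: (time: number) => void) => number;

/** Renders inside the frame callback and defers the HZB build to a microtask. */
function scheduleFrame(
  requestFrame: FrameRequest,
  render: (time: number) => void,
  buildHzb: () => void,
): void {
  requestFrame((time) => {
    render(time);
    // A microtask queued here runs once the current callback has returned.
    // In browsers, queueMicrotask(buildHzb) is an equivalent way to queue it.
    Promise.resolve().then(buildHzb);
  });
}
```

In a browser the first argument would be something like (cb) => requestAnimationFrame(cb). Because microtasks run before the browser gets back to its task queue, a long HZB build can delay input events, which is the concern raised above.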
The algorithm for WebGPU has been reworked using this approach. HZB generation now uses batching: up to 4 mip levels are processed in a single pass. The method is based on the HZB generation approach used in Unreal Engine.
As a result, the number of compute shader dispatches has been reduced by up to 4×.
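The dispatch arithmetic can be illustrated with a small sketch. planBatches is a hypothetical helper, and it assumes level 0 is the source depth so that generated levels start at 1.

```typescript
interface MipBatch { firstLevel: number; levelCount: number; }

/** Groups the mip levels to generate into dispatches of at most `maxPerDispatch` levels. */
function planBatches(levelsToGenerate: number, maxPerDispatch = 4): MipBatch[] {
  const batches: MipBatch[] = [];
  // Assumption: level 0 is the source depth, so generated levels are 1..levelsToGenerate.
  for (let first = 1; first <= levelsToGenerate; first += maxPerDispatch) {
    const remaining = levelsToGenerate - first + 1;
    batches.push({ firstLevel: first, levelCount: Math.min(maxPerDispatch, remaining) });
  }
  return batches;
}
```

With 10 levels to generate, this yields 3 dispatches instead of 10 (levels 1 to 4, 5 to 8, and 9 to 10). The full 4× reduction is reached only when the level count is a multiple of 4, for example 12 levels in 3 dispatches.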
I will share performance benchmarks and comparisons later.
The computeMipBatch algorithm for HZB generation has been improved: multiple mip levels can now be generated within a single dispatch. HZB generation is also configurable now, so you can choose whether to compute the HZB using a compute shader or a pixel shader.
Nice one! I’ve built something similar myself (fully GPU-driven via emulated mesh shaders) and I’d love to know how it compares. I got stuck on proving it improved anything in real-world tests, though. Massive improvements in benchmarks but not so much in complex examples.
Hi Adrian, thank you for visiting our library and welcome!
HZB is a fairly complex and expensive technique: building the depth pyramid is itself a costly operation, so it needs to be used carefully. On WebGPU, as long as there is no multi_draw_indirect support, this technique is unlikely to give a noticeable performance boost, except in scenes with really heavy geometry.
On WebGL, however, HZB is a good alternative to hardware occlusion queries. With a very large number of occlusion checks (500+), the cost of building the HZB pays for itself and the approach becomes cheaper overall, especially on iPhones and PCs. On some Android devices, though, tests show that it can be slower.
In addition, with an HZB tester on the GPU you can reuse the depth pyramid not only for occlusion checks but also for other tasks, such as LOD calculation, which gives a tangible advantage over classic approaches. This is demonstrated in the tester here: GetFlags.glsl.ts
The conversion of an HZB level into a depth texture with a 2× scale does increase the number of pixels, but it scales well under parallel execution. In this approach, each thread performs only four texture reads, which is considered optimal for mobile GPUs. It avoids memory bandwidth bottlenecks and, more importantly, eliminates branching. This reduces unnecessary computations and minimizes thread divergence and idle time.
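As a CPU reference for the four-reads-per-thread pattern, the sketch below produces one output texel per loop iteration from exactly four source reads, with no conditionals in the inner body. The name downsampleMax is hypothetical, and taking the maximum assumes that larger depth values are farther away; with reversed-Z the reduction would be a minimum instead.

```typescript
/**
 * Downsamples one square, power-of-two depth level to half resolution.
 * Assumes larger depth = farther, so the max keeps the farthest (conservative) value.
 */
function downsampleMax(src: Float32Array, size: number): Float32Array {
  const half = size >> 1;
  const dst = new Float32Array(half * half);
  for (let y = 0; y < half; y++) {
    for (let x = 0; x < half; x++) {
      const i = 2 * y * size + 2 * x; // top-left texel of the 2x2 source block
      // Exactly four reads per output texel, no branching.
      dst[y * half + x] = Math.max(src[i], src[i + 1], src[i + size], src[i + size + 1]);
    }
  }
  return dst;
}
```

On the GPU, each iteration of the inner loop corresponds to one fragment or compute invocation writing one texel of the next level.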
Batching multiple mip levels into a single dispatch, on the other hand, performs poorly on mobile devices. The main issue is the large volume of data being processed, which, without compression, becomes a bottleneck and degrades performance. This approach works well primarily on desktop GPUs and high-end devices such as iPhone and iMac. For other cases, a pixel shader implementation was chosen.