OptiPixel library

Wagner · May 1, 2026, 5:12pm

Introducing playcanvas-opti-pixel — a powerful library for optimizing graphics rendering in the PlayCanvas Engine. It is specifically designed to boost the performance of WebGL2 and WebGPU applications using advanced techniques such as occlusion queries, HZB, and efficient GPU buffer management.

Key Features

Occlusion Queries (WebGL2): Fast culling of invisible objects using occlusion queries, reducing GPU load and increasing FPS in complex scenes.
HZB Construction (WebGL2 | WebGPU): Hierarchical Z-Buffer for hierarchical depth testing, delivering accurate occlusion with minimal computation.
Quad Data Textures: Store structured data (e.g., positions or attributes) in compact square textures for fast shader access.
Index Upload to GPU Buffers: Optimized index transfer to storage buffers with support for dynamic updates and minimal copying.
BVH: Bounding volume hierarchy that accelerates frustum culling and LOD selection.
Instanced Static Mesh: Mesh Instancing with Fast Frustum Culling (BVH acceleration)
Hierarchical Instanced Static Mesh: Mesh Instancing with Fast Frustum Culling, LODs and Shadows LODs (BVH acceleration)
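
To illustrate the Quad Data Textures idea from the list above, here is a minimal sketch of how a linear item index can map onto a square texture. The helper names (quadTextureSize, indexToTexel, packPositions) are hypothetical and are not taken from the library; the layout (row-major, one RGBA texel per item) is an assumption.

```typescript
// Hypothetical helper: side length of the smallest square texture holding `count` texels.
function quadTextureSize(count: number): number {
    return Math.max(1, Math.ceil(Math.sqrt(count)));
}

// Map a linear item index to integer texel coordinates (row-major layout, assumed).
// A GLSL ES 3.00 shader could fetch the same texel with
// texelFetch(tex, ivec2(index % size, index / size), 0).
function indexToTexel(index: number, size: number): { x: number; y: number } {
    return { x: index % size, y: Math.floor(index / size) };
}

// Pack xyz positions into an RGBA float array, one texel per item; unused texels stay zero.
function packPositions(positions: Array<[number, number, number]>): { size: number; data: Float32Array } {
    const size = quadTextureSize(positions.length);
    const data = new Float32Array(size * size * 4);
    positions.forEach((p, i) => {
        data[i * 4 + 0] = p[0];
        data[i * 4 + 1] = p[1];
        data[i * 4 + 2] = p[2];
        data[i * 4 + 3] = 1;
    });
    return { size, data };
}
```

A square layout keeps both texture dimensions small, which matters because devices limit the maximum texture width and height.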

npm i playcanvas-opti-pixel

npm i playcanvas-opti-pixel-examples

Wagner · May 3, 2026, 6:21pm

Split tester shader to chunks
Add WebGPU HZB Tester
Add extra data for WebGL HZB tester queue
Add more examples

Wagner · May 4, 2026, 11:29am

Wagner · May 7, 2026, 8:19am

Update: Added a WebGL2 example that simultaneously calculates occlusion, LOD, and visibility in pixels (the maximum rectangle the object covers on the screen).

Wagner · May 10, 2026, 1:53pm

Update:

The HZB construction algorithm was further simplified: unnecessary conditions were removed, and the resolution was adjusted to a power of two. The occlusion test formula was also revised. This reduced the time required to build the HZB, although it slightly increased the complexity of the test itself. As a result, a trade-off was achieved that improved overall performance.
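
For readers new to HZB, here is a CPU-side sketch of the general idea, not the library's shader. It assumes a standard depth range (0 = near, 1 = far), so each coarser mip keeps the farthest depth of its 2×2 source block; a reversed-Z setup would keep the minimum instead. The names floorPowerOfTwo, downsampleMax and buildHzb are hypothetical, and rounding the resolution down (rather than up) is also an assumption.

```typescript
interface DepthMip { width: number; height: number; data: Float32Array }

// Largest power of two that is <= v (for v >= 1).
function floorPowerOfTwo(v: number): number {
    return 1 << (31 - Math.clz32(Math.max(1, Math.floor(v))));
}

// Build the next mip: every texel stores the farthest depth of a 2x2 source block.
function downsampleMax(src: DepthMip): DepthMip {
    const width = Math.max(1, src.width >> 1);
    const height = Math.max(1, src.height >> 1);
    const data = new Float32Array(width * height);
    for (let y = 0; y < height; y++) {
        for (let x = 0; x < width; x++) {
            // Clamp so that non-square pyramids (e.g. 2x1) stay in bounds.
            const x0 = Math.min(x * 2, src.width - 1);
            const x1 = Math.min(x * 2 + 1, src.width - 1);
            const y0 = Math.min(y * 2, src.height - 1);
            const y1 = Math.min(y * 2 + 1, src.height - 1);
            data[y * width + x] = Math.max(
                src.data[y0 * src.width + x0], src.data[y0 * src.width + x1],
                src.data[y1 * src.width + x0], src.data[y1 * src.width + x1]
            );
        }
    }
    return { width, height, data };
}

// Full pyramid from the base level down to 1x1.
function buildHzb(base: DepthMip): DepthMip[] {
    const mips: DepthMip[] = [base];
    let current = base;
    while (current.width > 1 || current.height > 1) {
        current = downsampleMax(current);
        mips.push(current);
    }
    return mips;
}
```

With power-of-two dimensions every texel has exactly four source texels, so the inner loop needs no edge-case branches; the clamping above only matters for non-square levels.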

Additionally, as an experiment, buffer construction was moved to a microtask after requestAnimationFrame. This approach allows the frame to be released while building the HZB between frames. However, further testing is required, as this may affect input event handling. On the devices tested, this approach showed a performance improvement of up to 30%.

Add float16 support for mobile devices.
Remove branching conditions.

github.com/AlexAPPi/playcanvas-opti-pixel

src/OcclusionCulling/HZB/Webgl/WebglHierarchicalZBuffer.frag.glsl.ts

hzb_webgpu_v2

export default `

    uniform int uReadScreenDepth;
    uniform float uReadLevel;
    uniform vec2 uInvSize;
    uniform vec2 uInputViewportMaxBound;
    uniform vec4 uDispatchThreadIdToBufferUV;
    uniform sampler2D uDepthMip;

    varying vec2 uv0;

    #ifdef FLOAT_WORKAROUND
    #include "floatAsUintPS"
    #endif

    float convertDepth(vec4 value) {

        #ifdef FLOAT_WORKAROUND
            float workaroundValue = uint2float(value);
        #else

This file has been truncated.

Wagner · May 13, 2026, 8:59am

The algorithm for WebGPU has been reworked using this approach. Thanks to this, HZB generation now uses batching—up to 4 mip levels are processed in a single pass. The method is based on the HZB generation approach used in Unreal Engine.

As a result, the number of compute shader dispatches has been reduced by up to 4×.

I will share performance benchmarks and comparisons later.

Wagner · May 16, 2026, 7:48am

The computeMipBatch algorithm for HZB generation has been improved: multiple mip levels can now be generated within a single dispatch. It is now configurable, and you can choose whether to compute HZB using a compute shader or a pixel shader.

github.com/AlexAPPi/playcanvas-opti-pixel

src/OcclusionCulling/HZB/Webgpu/WebgpuHierarchicalZBuffer.ts

hzb_webgpu_v2

import pc from "../../../engine.js";
import type { IHierarchicalZBuffer } from "../IHierarchicalZBuffer.js";
import vertexCodeVS from "./WebgpuHierarchicalZBuffer.vert.wgsl.js";
import fragmentCodeVS from "./WebgpuHierarchicalZBuffer.frag.wgsl.js";
import computeCodeCS from "./WebgpuHierarchicalZBuffer.comp.wgsl.js";
import { getCameraDepthTexture } from "../../../Extras/CameraHelpers.js";

const workgroupSizeX: number = 8;
const workgroupSizeY: number = 8;

export class WebgpuHierarchicalZBuffer implements IHierarchicalZBuffer {

    private _debugName: string = 'HZB';
    private _enabled: boolean = false;
    private _device: pc.WebgpuGraphicsDevice;
    private _screenWidth: number = 0;
    private _screenHeight: number = 0;
    private _width: number = 0;
    private _height: number = 0;
    private _mipLevels: number = 0;

This file has been truncated.

const hzb = new WebgpuHierarchicalZBuffer(device, true, 4, "MyCustomHZB");

hzb.maxMipBatchSize = 1; // render [1] mip level in 1 pass (dispatch)
hzb.maxMipBatchSize = 2; // render [2] mip levels in 1 pass (dispatch)
hzb.maxMipBatchSize = 3; // render [3] mip levels in 1 pass (dispatch)
hzb.maxMipBatchSize = 4; // render [4] mip levels in 1 pass (dispatch)
hzb.useCompute = false; // Render by pixel shader
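
To make the effect of maxMipBatchSize concrete, here is a rough sketch of how mip levels could be grouped into dispatches. This is not the library's computeMipBatch; planMipBatches and workgroupCount are hypothetical helpers, and the assumption is that levels are numbered from 0 and each batch covers consecutive levels. The 8×8 workgroup size matches workgroupSizeX and workgroupSizeY in the file excerpt above.

```typescript
const WORKGROUP_SIZE = 8; // same value as workgroupSizeX / workgroupSizeY in the excerpt

interface MipBatch { firstMip: number; count: number }

// Group consecutive mip levels into batches of at most `maxMipBatchSize` (clamped to 1..4).
function planMipBatches(mipLevels: number, maxMipBatchSize: number): MipBatch[] {
    const size = Math.min(4, Math.max(1, Math.floor(maxMipBatchSize)));
    const batches: MipBatch[] = [];
    for (let firstMip = 0; firstMip < mipLevels; firstMip += size) {
        batches.push({ firstMip, count: Math.min(size, mipLevels - firstMip) });
    }
    return batches;
}

// Number of workgroups needed to cover a width x height target.
function workgroupCount(width: number, height: number): [number, number] {
    return [Math.ceil(width / WORKGROUP_SIZE), Math.ceil(height / WORKGROUP_SIZE)];
}
```

Under these assumptions a 10-level pyramid needs 10 dispatches with a batch size of 1 and 3 dispatches with a batch size of 4, which is where the "up to 4×" reduction comes from.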

Adriaaaaan · May 18, 2026, 7:23am

Nice one! I’ve built something similar myself (fully GPU driven via emulated mesh shaders) and I’d love to know how it compares. I got stuck on proving it improved anything in real world tests though. Massive improvements in benchmarks but not so much in complex examples.

Wagner · May 18, 2026, 8:09am

Hi Adrian, thank you for visiting our library and welcome!

HZB is a fairly complex and expensive technique: building the depth pyramid itself is a costly operation, so you need to approach its use carefully. On WebGPU, as long as there is no multi_draw_indirect support, this technique is unlikely to give a noticeable performance boost, except in scenes with really heavy geometry. On WebGL, however, HZB is a good alternative to hardware occlusion queries: with a very large number of occlusion checks (500+), the cost of building the HZB is paid off and even becomes cheaper overall, especially on iPhones and PCs, although on some Android devices tests show that this approach can lose in performance. At the same time, when using an HZB tester on the GPU, you can reuse the depth pyramid not only for occlusion checks but also for other tasks, such as LOD calculation, which gives a tangible advantage over classic approaches; this is demonstrated in the tester here: GetFlags.glsl.ts
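
As a rough illustration of what an HZB test does (a sketch, not the code in GetFlags.glsl.ts): pick the mip level at which the object's screen rectangle spans at most two texels per axis, read the 2×2 texels it touches, and compare them with the nearest depth of the object's bounds. The function names are hypothetical, and the sketch assumes a standard depth range where the pyramid stores the farthest depth per texel.

```typescript
// Smallest mip level whose texel size (2^level pixels) is at least the rectangle's longest side,
// clamped to the available levels. A rect of that size touches at most 2x2 texels.
function selectHzbMip(rectWidthPx: number, rectHeightPx: number, mipCount: number): number {
    const longest = Math.max(rectWidthPx, rectHeightPx, 1);
    let level = 0;
    while (level < mipCount - 1 && (1 << level) < longest) {
        level++;
    }
    return level;
}

// The object is hidden only if its nearest point lies behind the farthest occluder depth
// stored in every texel it overlaps (standard depth: larger means farther).
function isOccluded(objectNearestDepth: number, hzbSamples: number[]): boolean {
    const farthestOccluder = Math.max(...hzbSamples);
    return objectNearestDepth > farthestOccluder;
}
```

The same screen-rectangle size can also drive LOD selection, which is one way the depth pyramid pass can be reused for more than occlusion.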

The conversion of an HZB level into a depth texture with a 2× scale does increase the number of pixels, but it scales well under parallel execution. In this approach, each thread performs only four texture reads, which is considered optimal for mobile GPUs. It avoids memory bandwidth bottlenecks and, more importantly, eliminates branching. This reduces unnecessary computations and minimizes thread divergence and idle time.

Batching multiple mip levels into a single dispatch, on the other hand, performs poorly on mobile devices. The main issue is the large volume of data being processed, which, without compression, becomes a bottleneck and degrades performance. This approach works well primarily on desktop GPUs and high-end devices such as iPhone and iMac. For other cases, a pixel shader implementation was chosen.

Adrian_Meredith · May 19, 2026, 2:25pm

I don’t think it’s the HZB, as presumably that’s a fixed cost, right? Yes, you’re probably right that the scene is unlikely to be complex enough. My synthetic scene with ~4000 Stanford bunnies shows huge gains, but it’s likely that once materials are in the mix it’s not quite right (it uses emulated bindless materials via texture arrays). For example, if I move out of the scene and everything is culled, my very low-performance Chromebook is fast, but if I’m in front of a wall and most things are culled, it’s no different from when I move away from the wall and not much is culled. Occlusion culling does improve performance most of the time, except in this scenario. I even have a wireframe debug mode to prove they’re not rendered.

It’s been ages since I worked on it, maybe it’s time to have another review.

Wagner · May 19, 2026, 6:21pm

As I mentioned earlier, until there is proper support for multi-draw indirect, the engine will still issue draw calls even for objects that have been culled. This is essentially how HZB works in WebGPU right now.

When you turn the camera away, frustum culling kicks in and reduces the number of draw calls. However, when you’re facing a wall and the bunnies are occluded, draw calls for those bunnies are still being executed. Because of that, you’re likely hitting a draw call throughput bottleneck rather than a shading or geometry cost.

That’s why there’s little to no performance difference between “mostly occluded by a wall” and “fully visible” cases—despite your wireframe debug confirming they aren’t actually rendered.
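
A small sketch of why culled objects still cost a draw call in this setup. WebGPU's drawIndexedIndirect reads five 32-bit values per draw (indexCount, instanceCount, firstIndex, baseVertex, firstInstance). The assumption here, which is mine and not confirmed by the thread, is that a GPU culling pass marks an occluded object by writing instanceCount = 0 into its slot; the CPU-side loop cannot see that and still issues one call per slot. All helper names are hypothetical.

```typescript
// One indirect entry: indexCount, instanceCount, firstIndex, baseVertex, firstInstance.
const INDIRECT_ENTRY_U32 = 5;

// Hypothetical culling output: a culled object keeps its slot but gets instanceCount = 0.
function writeIndirectEntry(args: Uint32Array, slot: number, indexCount: number, firstIndex: number, visible: boolean): void {
    const o = slot * INDIRECT_ENTRY_U32;
    args[o + 0] = indexCount;
    args[o + 1] = visible ? 1 : 0; // instanceCount
    args[o + 2] = firstIndex;
    args[o + 3] = 0;               // baseVertex
    args[o + 4] = 0;               // firstInstance
}

// Stand-in for the CPU-side loop: one indirect draw per slot, whether it is visible or not.
function encodeDraws(slotCount: number, drawIndexedIndirect: (byteOffset: number) => void): void {
    for (let slot = 0; slot < slotCount; slot++) {
        drawIndexedIndirect(slot * INDIRECT_ENTRY_U32 * 4);
    }
}

const indirectArgs = new Uint32Array(3 * INDIRECT_ENTRY_U32);
writeIndirectEntry(indirectArgs, 0, 36, 0, true);
writeIndirectEntry(indirectArgs, 1, 36, 36, false); // occluded
writeIndirectEntry(indirectArgs, 2, 36, 72, false); // occluded

let issuedDraws = 0;
encodeDraws(3, () => { issuedDraws++; });
console.log(issuedDraws); // 3: every slot is still submitted, even though only one draws anything
```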

If you’re interested in how indirect draw is currently handled in the engine, you can check the implementation here:

github.com/playcanvas/engine

src/platform/graphics/webgpu/webgpu-graphics-device.js

acab2058a


      
                  this.pipeline = pipeline;
                  passEncoder.setPipeline(pipeline);
              }
          }
          
          if (indexBuffer) {
              passEncoder.setIndexBuffer(indexBuffer.impl.buffer, indexBuffer.impl.format);
          }
          
          // draw
          if (drawCommands) { // indirect draw path
          
              const storage = drawCommands.impl?.storage ?? this.indirectDrawBuffer;
              const indirectBuffer = storage.impl.buffer;
              const drawsCount = drawCommands.count;
          
              // TODO: when multiDrawIndirect is supported, we can use it here instead of a loop
              for (let d = 0; d < drawsCount; d++) {
                  const indirectOffset = (drawCommands.slotIndex + d) * _indirectEntryByteSize;
                  if (indexBuffer) {
                      passEncoder.drawIndexedIndirect(indirectBuffer, indirectOffset);

You could also try integrating my WebGL-based implementation, where draw calls for culled objects are completely skipped. The tradeoff is a 2–3 frame delay due to asynchronous GPU readback, which can cause objects to pop in noticeably. This approach is typically described as Async GPU Readback or a round-robin occlusion strategy.
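
The pop-in trade-off can be sketched with a small ring of pending results. This is a simplified model, not the library's readback code: DelayedVisibility is a hypothetical class, it assumes results arrive after a fixed number of frames, and it treats every object as visible until its first result arrives.

```typescript
// Models visibility results that become usable `latencyFrames` frames after they were requested.
class DelayedVisibility {
    private readonly ring: Array<Uint8Array | null>;
    private latest: Uint8Array;
    private frame = 0;

    constructor(objectCount: number, latencyFrames: number) {
        this.ring = new Array<Uint8Array | null>(Math.max(1, latencyFrames)).fill(null);
        // Conservative default: everything is drawn until a result proves otherwise.
        this.latest = new Uint8Array(objectCount).fill(1);
    }

    // Call once per frame with the result requested this frame (1 = visible, 0 = occluded).
    submit(result: Uint8Array): void {
        const slot = this.frame % this.ring.length;
        const ready = this.ring[slot]; // requested `latencyFrames` frames ago
        if (ready) {
            this.latest = ready;
        }
        this.ring[slot] = result;
        this.frame++;
    }

    // Visibility as known on the CPU right now, i.e. a few frames old.
    isVisible(id: number): boolean {
        return this.latest[id] === 1;
    }
}
```

Because the CPU only ever sees stale results, an object that becomes visible keeps being skipped for a few frames, which is the pop-in described above.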

Wagner · May 19, 2026, 7:20pm

Regarding the cost of generating the depth pyramid, it always depends on multiple factors, with memory bandwidth and overall GPU load being the most important.

A GPU is a complex system, and resource reuse plays a big role. If, after a draw call, you reuse the same resources in subsequent draw calls, they may still reside in cache and can be accessed very quickly. However, if the cache is already saturated, the GPU needs to evict or reorganize data before loading new resources, which introduces additional cost.

Because of this, the cost is never truly constant. Even for the same operation, such as building a depth pyramid, the performance can vary depending on cache state, memory pressure, and what other work the GPU is doing at that moment.

Wagner · May 25, 2026, 11:48am

A lightweight BVH structure has been added to the library.

Instanced Static Mesh and Hierarchical Instanced Static Mesh are implemented with support for updating individual instance transforms, rebuilding the BVH, changing color and other parameters, as well as shadow LODs.

BVH is used to accelerate LOD calculation and frustum culling, delivering very high performance. The solution has been tested with one million instanced objects. Memory usage is highly optimized on both CPU and GPU.
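
For context, this is roughly how a BVH speeds up frustum culling: a node that lies entirely outside one plane rejects its whole subtree, and a node entirely inside accepts its whole subtree without further tests. The sketch below is a generic pointer-based version with hypothetical types; the library's BVH is described as lightweight and memory-optimized, so its actual layout is likely different.

```typescript
type Vec3 = [number, number, number];
type Plane = [number, number, number, number]; // nx, ny, nz, d; a point is inside when n·p + d >= 0

interface BvhNode { min: Vec3; max: Vec3; left?: BvhNode; right?: BvhNode; instances?: number[] }

// 0 = outside, 1 = intersecting, 2 = fully inside.
function classify(node: BvhNode, planes: Plane[]): number {
    let inside = true;
    for (const [nx, ny, nz, d] of planes) {
        // Corner farthest along the plane normal: if it is behind the plane, the whole box is.
        const fx = nx >= 0 ? node.max[0] : node.min[0];
        const fy = ny >= 0 ? node.max[1] : node.min[1];
        const fz = nz >= 0 ? node.max[2] : node.min[2];
        if (nx * fx + ny * fy + nz * fz + d < 0) return 0;
        // Corner nearest along the normal: if it is behind, the box straddles the plane.
        const cx = nx >= 0 ? node.min[0] : node.max[0];
        const cy = ny >= 0 ? node.min[1] : node.max[1];
        const cz = nz >= 0 ? node.min[2] : node.max[2];
        if (nx * cx + ny * cy + nz * cz + d < 0) inside = false;
    }
    return inside ? 2 : 1;
}

// Accept every instance in a subtree without further plane tests.
function collect(node: BvhNode, out: number[]): void {
    if (node.instances) out.push(...node.instances);
    if (node.left) collect(node.left, out);
    if (node.right) collect(node.right, out);
}

function cull(node: BvhNode, planes: Plane[], out: number[]): void {
    const state = classify(node, planes);
    if (state === 0) return;                      // whole subtree rejected
    if (state === 2) { collect(node, out); return; }
    if (node.instances) out.push(...node.instances); // intersecting leaf: keep conservatively
    if (node.left) cull(node.left, planes, out);
    if (node.right) cull(node.right, planes, out);
}
```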

Wagner · May 27, 2026, 8:51pm

We implemented depth-based object sorting using a highly optimized radix sort algorithm. The system has been tested on millions of instances and demonstrates excellent performance. Special attention was given to minimizing memory allocations by relying exclusively on typed arrays.

An optimization is applied to reduce the number of significant bits (maxBits) by subtracting the minimum value when maxBits is below 32, further improving performance.
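
The bit-reduction idea can be sketched as follows. This is not DepthQueue itself: sortIndicesByDepth is a hypothetical function, it assumes all depths are non-negative (so their float32 bit patterns sort in the same order as the values), and for simplicity it always subtracts the minimum key instead of doing so only when maxBits is below 32.

```typescript
// Returns instance indices ordered by ascending depth, using an LSD radix sort on typed arrays.
function sortIndicesByDepth(depths: Float32Array): Uint32Array {
    const n = depths.length;
    // Reinterpret the float bits as unsigned integers (order-preserving for depths >= 0).
    const bits = new Uint32Array(depths.buffer, depths.byteOffset, n);

    let min = 0xffffffff;
    let max = 0;
    for (let i = 0; i < n; i++) {
        if (bits[i] < min) min = bits[i];
        if (bits[i] > max) max = bits[i];
    }

    // Subtracting the minimum shrinks the key range, so fewer significant bits need sorting.
    const keys = new Uint32Array(n);
    for (let i = 0; i < n; i++) keys[i] = bits[i] - min;
    const range = n > 0 ? max - min : 0;
    const maxBits = range === 0 ? 0 : 32 - Math.clz32(range);

    let src = new Uint32Array(n);
    let dst = new Uint32Array(n);
    for (let i = 0; i < n; i++) src[i] = i;

    const counts = new Uint32Array(256);
    // One stable counting pass per 8 bits, skipping passes above maxBits.
    for (let shift = 0; shift < maxBits; shift += 8) {
        counts.fill(0);
        for (let i = 0; i < n; i++) counts[(keys[src[i]] >>> shift) & 0xff]++;
        let sum = 0;
        for (let b = 0; b < 256; b++) {
            const c = counts[b];
            counts[b] = sum; // exclusive prefix sum: first output position of bucket b
            sum += c;
        }
        for (let i = 0; i < n; i++) {
            const index = src[i];
            dst[counts[(keys[index] >>> shift) & 0xff]++] = index;
        }
        const swap = src;
        src = dst;
        dst = swap;
    }
    return src;
}
```

When all depths fall in a narrow range, the upper key bytes are identical after the subtraction, so whole passes are skipped.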

The class is called DepthQueue, as it operates alongside IndexQueue. This pairing allows us to eliminate redundant passes when computing minimums, maximums, and other auxiliary data used for optimizations.

The screenshot shows a demo scene with ~1 million objects, where sorting and LOD selection take around 3 ms.

We are also close to completing HierarchicalInstancer and ClusterWorld. ClusterWorld builds a spatial grid that accelerates frustum culling, occlusion culling, and LOD processing across the entire scene.

HierarchicalInstancer supports an optimization where a single object can be rendered with one drawCommand: if all LODs are stored in a single buffer, multiple draw calls are merged into one. It also supports multiple MeshInstances per LOD (with different materials, etc.). The component is designed to work with prefabs.

github.com/AlexAPPi/playcanvas-opti-pixel

src/Extras/DepthQueue.ts

0272ac254


      
          protected _optimizationMaxBitsReady: boolean;
          protected _maxBits: number;
          protected _count: number;
          protected _min: number;
          protected _max: number;
          
          public get count() { return this._count; }
          public get min() { return this._min; }
          public get max() { return this._max; }
          
          public constructor(capacity: number) {
              this._tempIndices1 = new Uint32Array(capacity);
              this._tempIndices2 = new Uint32Array(capacity);
              this._minMaxBuffer = new ArrayBuffer(8);
              this._minMaxDataF = new Float32Array(this._minMaxBuffer);
              this._minMaxDataU = new Uint32Array(this._minMaxBuffer);
              this._buffer = new ArrayBuffer(capacity * 4);
              this._dataF = new Float32Array(this._buffer);
              this._dataU = new Uint32Array(this._buffer);
              this.clear();
          }

github.com/AlexAPPi/playcanvas-opti-pixel

src/Extras/GPUIndexQueue.ts

0272ac254


      
          public get buffer() { return this._buffer; }
          public get size() { return this._indexQueue.size; }
          public get dirty() { return this._indexQueue.dirty; }
          public get count() { return this._indexQueue.count; }
          public get indexes() { return this._indexQueue.indexes; }
          public get itemSize() { return this._indexQueue.itemSize; }
          public get extraSize() { return this._indexQueue.extraSize; }
          public get capacity() { return this._indexQueue.capacity; }
          public get isUint32() { return this._indexQueue.isUint32 }
          
          public constructor(device: pc.GraphicsDevice, indexManager: IndexManager, instancing: boolean, extraSize: number = 0) {
              this._device = device;
              this._instancing = instancing;
              this._indexQueue = new IndexQueueEx(indexManager, extraSize);
              this._recreateKeyBuffer();
          }
          
          protected _getBufferFormat() {
          
              const type = this._indexQueue.isUint32 ? pc.TYPE_UINT32 : pc.TYPE_UINT16;
              const semantic = this._instancing ? instancingIndexSemantic : positionSemantic;

Wagner · June 1, 2026, 10:24am

HierarchicalInstancer:

Add CrossFade for switching between LOD levels. In the shader, you can use noise or smooth transitions—it’s up to you.
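
One possible way to drive such a cross-fade, written on the CPU for clarity. The parameter names (switchDistance, fadeRange) and both functions are assumptions for illustration and are not the HierarchicalInstancer API; in practice the per-pixel decision would live in the shader.

```typescript
// Fade factor in [0, 1]: 0 = only the current LOD, 1 = only the next LOD.
// The transition is centered on `switchDistance` and spans `fadeRange` world units.
function crossFadeFactor(distance: number, switchDistance: number, fadeRange: number): number {
    if (fadeRange <= 0) {
        return distance >= switchDistance ? 1 : 0; // hard switch
    }
    const t = (distance - (switchDistance - fadeRange / 2)) / fadeRange;
    return Math.min(1, Math.max(0, t));
}

// Dithered variant: a per-pixel noise value in [0, 1) decides which LOD owns the pixel,
// so each pixel is drawn by exactly one of the two LODs and no blending is needed.
function drawsPixel(isNextLod: boolean, fade: number, noise: number): boolean {
    return isNextLod ? noise < fade : noise >= fade;
}
```

A smooth (alpha-blended) transition would use the same fade factor directly as the blend weight instead of comparing it with noise.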

Wagner · June 2, 2026, 9:22pm

Testing the component on trees: the scene contains 100k trees.

Wagner · June 10, 2026, 9:11pm

We revised the occlusion result readback strategy. Instead of polling readiness via timeouts, the readback is now performed during the frameUpdate phase. This reduced the number of misses from 3–4+ frames down to 1–2.

As a result, the effective latency relative to when the occlusion query is issued is now close to 1–2 frames. Previously, due to relying on an outdated HZB from the previous frame, the total readback delay reached 4–5 frames. This has now been reduced to 2–3 frames.

Testing showed no side effects, provided that the readback is executed in frameUpdate before the rest of the engine logic runs.

Wagner · June 16, 2026, 9:00pm

We have updated our WebGPU HZB tester and introduced several improvements and new tools.

The tester now supports working with custom buffers and enables occlusion testing using HZB for local storage.

Key changes:

Optimized the BitSet class.
Separated storage logic for AABB centers and half-extents — see the AABBStore class.
Added the IndirectDataBuffer class, which supports batched data updates with configurable maximum batch size in bytes.
Improved overall system performance.
Added new debugging tools.
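
As an illustration of batched updates with a byte limit, here is a sketch that merges dirty items into contiguous upload ranges no larger than a given size. It is not the IndirectDataBuffer implementation; planUploadBatches and its parameters are hypothetical, and it assumes fixed-size items stored back to back.

```typescript
interface ByteRange { offset: number; size: number }

// Merge dirty item indices into contiguous byte ranges of at most `maxBatchBytes` each.
function planUploadBatches(dirtyItems: number[], itemBytes: number, maxBatchBytes: number): ByteRange[] {
    const sorted = Array.from(new Set(dirtyItems)).sort((a, b) => a - b);
    const maxItems = Math.max(1, Math.floor(maxBatchBytes / itemBytes));
    const batches: ByteRange[] = [];
    let start = -1;
    let count = 0;
    for (const item of sorted) {
        const contiguous = count > 0 && item === start + count;
        if (contiguous && count < maxItems) {
            count++; // extend the current batch
            continue;
        }
        if (count > 0) {
            batches.push({ offset: start * itemBytes, size: count * itemBytes });
        }
        start = item;
        count = 1;
    }
    if (count > 0) {
        batches.push({ offset: start * itemBytes, size: count * itemBytes });
    }
    return batches;
}
```

Each range could then be uploaded with a single buffer write (for example GPUQueue.writeBuffer on WebGPU), so the limit bounds how much data one write moves.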

Feedback and suggestions are welcome.

    function setIndirect(meshInstance: pc.MeshInstance, tester: IGPUIndirectDrawOcclusionCullingTester, drawData: DrawData[]) {

        const numSlots  = drawData.length;
        const firstSlot = meshInstance.mesh.device.getIndirectDrawSlot(numSlots);

        for (let i = 0; i < numSlots; i++) {
            const slot = firstSlot + i;
            const queueItem = drawData[i];
            const id = queueItem.id;
            const primitive = queueItem.primitive;
            const instanceCount = 1;
            const firstInstance = 0;
            tester.enqueue(id, slot, primitive, instanceCount, firstInstance);
        }

        meshInstance.setIndirect(null, firstSlot, numSlots);
    }

    ...
    // Fill the AABB store and create the buffers (setup elided)
    let aabbStore: IAABBStore;
    let indirectDataBuffer: IndirectDataBuffer;
    let queue: GPUIndexQueue;
    ...

    // viewProjection and cameraPosition are assumed to come from the active camera (defined elsewhere)
    function testOcclusionCulling(meshInstance: pc.MeshInstance, tester: IGPUIndirectDrawOcclusionCullingTester, drawData: DrawData[]) {

        const numSlots = drawData.length;
        const drawCommands = meshInstance.setMultiDraw(null, numSlots);

        queue.clear();

        for (let slot = 0; slot < numSlots; slot++) {
            const queueItem = drawData[slot];
            const id = queueItem.id;
            const primitive = queueItem.primitive;
            const instanceCount = 1;
            const firstInstance = 0;
            // or take from drawCommands
            indirectDataBuffer.tryEnqueueUpdate(id, primitive, instanceCount, firstInstance);
            queue.enqueue(id, slot);
        }

        // Update buffers
        indirectDataBuffer.update();
        queue.update();

        // Run HZB test and write new indirect data to drawCommands buffer
        tester.test(numSlots, viewProjection, cameraPosition, aabbStore, queue.buffer, indirectDataBuffer, drawCommands.impl.storage);
    }

Wagner · July 3, 2026, 8:30pm

We have made a final decision to discontinue support for GPU occlusion queries (WebGPU Queries) in our project.

After extensive testing and performance comparisons, we have fully transitioned to using HZB (Hierarchical Z-Buffer) as our primary occlusion method. This approach has demonstrated more stable results, better scalability, and more predictable behavior across different devices.

As a result, all files related to WebGPU Queries (WebgpuQueries *) will be removed from the library.

If you rely on this functionality, please take this change into account when updating.

Wagner · July 5, 2026, 8:38pm

WEBGL2: WebglReadbackBuffer update

We conducted an in-depth study of buffer copy behavior in ANGLE and analyzed the Chromium source code. Based on this, we improved the GPU-to-CPU readback pipeline using a generalized approach. Previously, our implementation lagged by 2–3 frames; now it is reduced to 1 frame, and only rarely reaches 2.

We also want to share our approach to reading 4096 Uint32Array elements from Transform Feedback buffers. Testing was performed across a wide range of devices, all of which demonstrated stable and solid results.