Yes, I am working on a cluster system. If a cluster is hidden, its rendering, as well as the rendering of all objects inside it, can be disabled. In this way, the world is divided into cubes, and if one of these cubes is not visible, it is automatically hidden along with all the objects it contains.
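For context, here is the idea in a minimal TypeScript sketch; the types and names are illustrative, not the actual engine code:

// Minimal sketch of the cluster idea (hypothetical types, not the real implementation):
// the world is partitioned into fixed-size cubes, and hiding a cube disables
// rendering for every object assigned to it.
interface ClusterObject { enabled: boolean; }

class Cluster {
    readonly objects = new Set<ClusterObject>();
    private _visible = true;

    setVisible(visible: boolean): void {
        if (this._visible === visible) return;
        this._visible = visible;
        // Hiding the cluster hides all contained objects in one pass.
        for (const obj of this.objects) obj.enabled = visible;
    }
}

// Clusters are addressed by their integer cube coordinates.
const clusters = new Map<string, Cluster>();
const clusterKey = (x: number, y: number, z: number) => `${x},${y},${z}`;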
I am going to improve performance further using a prediction algorithm. It already works and does not require an engine update, but there are a few points I would like to suggest for improvement; I will create a PR soon.
During the process of rendering optimization, I implemented an occlusion culling system based on a Hi-Z buffer, and I’d like to share some observations on why this approach is significantly more efficient than traditional occlusion queries on mobile devices.
Main advantages of the Hi-Z approach
- Reduced GPU stalls: When using occlusion queries, the CPU has to wait for a response from the GPU, which creates synchronization points and stalls the rendering pipeline. Hi-Z checks are performed entirely on the GPU, and the results can be used immediately without costly blocking.
- Hierarchical structure: The Hi-Z buffer stores depth in a mipmapped texture, which allows large regions of the scene to be quickly discarded at higher levels, without going down to per-pixel accuracy.
- Good scalability: This method works well both with large numbers of objects and large-scale scenes, where massive mesh rejection is critical.
- Efficiency on mobile GPUs: Mobile chips handle occlusion queries poorly due to weak mechanisms for asynchronous CPU–GPU communication. Hi-Z, on the other hand, relies on texture sampling and shader arithmetic, which fit much better with their architectures.
Why is it faster?
Occlusion queries require feedback from the GPU, and on mobile devices this often causes noticeable delays. Hi-Z relies on the existing depth buffer, builds a mip hierarchy (usually in a single additional mipmap generation pass), and all checks boil down to texture lookups inside shaders. So instead of CPU↔GPU sync stalls, we get purely shader-based operations that run very fast.
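To make the test concrete, here is a CPU-side reference of the check the shader performs. This is a simplified sketch: `sampleHiZMax` and the screen-rect math are assumptions, and it presumes the mip chain stores the conservative maximum depth per region.

// Reference of the Hi-Z visibility test (simplified sketch).
// sampleHiZMax(x, y, level) is assumed to return the conservative *max*
// depth stored in the Hi-Z mip chain at the given texel and level.
function isVisibleHiZ(
    rectMinPx: [number, number], rectMaxPx: [number, number], // screen-space AABB, pixels
    nearestDepth: number,                                     // closest depth of the object
    sampleHiZMax: (x: number, y: number, level: number) => number
): boolean {
    const w = rectMaxPx[0] - rectMinPx[0];
    const h = rectMaxPx[1] - rectMinPx[1];
    // Pick the mip level where the rect covers only a few texels.
    const level = Math.ceil(Math.log2(Math.max(w, h, 1)));
    const scale = 1 << level;
    let occluderDepth = 0;
    for (let y = Math.floor(rectMinPx[1] / scale); y <= Math.floor(rectMaxPx[1] / scale); y++) {
        for (let x = Math.floor(rectMinPx[0] / scale); x <= Math.floor(rectMaxPx[0] / scale); x++) {
            occluderDepth = Math.max(occluderDepth, sampleHiZMax(x, y, level));
        }
    }
    // Visible (conservatively) if the object's nearest point is not behind
    // the farthest occluder depth covering its footprint.
    return nearestDepth <= occluderDepth;
}

On the GPU this boils down to a handful of texelFetch calls per object, which is exactly why there is no CPU↔GPU round trip.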
As a result, the Hi-Z buffer not only provides better performance but also more predictable rendering behavior, which is especially critical on mobile platforms.
I’d like to point out the excellent API for depth buffer processing. Thanks!
Sounds fantastic, way to go. Have you managed to get this to work on WebGL as well, or are indirect draw calls a requirement here?
Yes, it works on both WebGL2 and WebGPU.
I’ve implemented a series of optimizations in WebGL2, and the results turned out better than expected.
Now it’s possible to add a virtually unlimited number of occlusion checks for different objects, which was a huge breakthrough in discarding hidden elements in the scene.
On top of that, I managed to run frustum tests directly on the GPU — another step towards highly efficient real-time rendering.
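For illustration, this is the classic plane/AABB test the GPU version computes; a CPU-side sketch under the assumption of a column-major view-projection matrix, not the engine's actual shader code:

// CPU-side reference of the frustum test (the real check runs in a shader).
// `m` is a column-major view-projection matrix as a flat 16-element array.
function aabbInFrustum(m: Float32Array, center: [number, number, number], half: [number, number, number]): boolean {
    const row = (i: number) => [m[i], m[4 + i], m[8 + i], m[12 + i]];
    const [r0, r1, r2, r3] = [row(0), row(1), row(2), row(3)];
    const planes = [
        r3.map((v, i) => v + r0[i]), r3.map((v, i) => v - r0[i]), // left, right
        r3.map((v, i) => v + r1[i]), r3.map((v, i) => v - r1[i]), // bottom, top
        r3.map((v, i) => v + r2[i]), r3.map((v, i) => v - r2[i]), // near, far
    ];
    for (const [a, b, c, d] of planes) {
        // Signed distance of the AABB's most positive vertex w.r.t. the plane.
        const dist = a * center[0] + b * center[1] + c * center[2] + d
            + Math.abs(a) * half[0] + Math.abs(b) * half[1] + Math.abs(c) * half[2];
        if (dist < 0) return false; // AABB is entirely outside this plane
    }
    return true;
}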
Impressive work, I’m amazed.
A debug tool has been added that lets you check which mipmap level is used for the occlusion check, as well as the pixel area it covers. The verification algorithm has also been improved.
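The numbers the debugger shows can be derived roughly like this (an illustrative sketch, not the actual tool's code):

// Screen-space footprint of a bounding box and the Hi-Z mip it maps to.
function debugFootprint(widthPx: number, heightPx: number) {
    const pixelArea = widthPx * heightPx;
    const mipLevel = Math.ceil(Math.log2(Math.max(widthPx, heightPx, 1)));
    return { pixelArea, mipLevel };
}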
An update has been made: terrain rendering is now performed with a single draw call using the new rendering type MultiDrawInstancingAccelerator, which requires support for the GL_ANGLE_multi_draw extension. This improvement does not affect the functionality of previous rendering methods.
@see MultiDrawPatchInstancing.mts
Improved algorithm for generating MultiDraw sets.
Drawing is completed in 1 call.
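For anyone curious, here is roughly what that single call looks like through the WEBGL_multi_draw extension (the WebGL binding of GL_ANGLE_multi_draw); the counts and offsets below are illustrative, not the component's actual buffers:

declare const gl: WebGL2RenderingContext;
const patchCount = 256;       // illustrative
const indicesPerPatch = 384;  // illustrative

const ext = gl.getExtension('WEBGL_multi_draw');
if (ext) {
    // One entry per terrain patch: index count and byte offset into the index buffer.
    const counts = new Int32Array(patchCount).fill(indicesPerPatch);
    const offsets = new Int32Array(patchCount);
    for (let i = 0; i < patchCount; i++) offsets[i] = i * indicesPerPatch * 2; // UNSIGNED_SHORT = 2 bytes
    // All patches in one call instead of patchCount separate drawElements calls.
    ext.multiDrawElementsWEBGL(gl.TRIANGLES, counts, 0, gl.UNSIGNED_SHORT, offsets, 0, patchCount);
}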
We are renaming IPatchInstancing to IPatchDrawAccelerator.
@see PatchMultiDrawAccelerator.mts
Special thanks to @LeXXik for help.
We are expecting the release of a new engine version with multi-draw support and will base our work on it.
WebGPU demonstrates outstanding performance: rendering the landscape takes only 0.3 ms on an integrated UHD Graphics 730, maintaining a stable 75 fps.
I would not completely trust the GPU timer. Not that it’s miles off, but on WebGPU we can only measure the duration of render and compute passes, nothing else. Compare that to WebGL, where the whole frame is measured, including data uploads, stalls, and the like.
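For context, pass-duration timing on WebGPU goes through timestamp queries, roughly like this; a sketch of the standard API, not the engine's profiler:

// Sketch: measuring one render pass with timestamp queries.
// Assumes `device` was requested with the 'timestamp-query' feature.
declare const device: GPUDevice;

const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
const resolveBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC });
const readBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });

const encoder = device.createCommandEncoder();
const pass = encoder.beginRenderPass({
    colorAttachments: [/* ... */],
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
});
// ... draw calls ...
pass.end();
encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
encoder.copyBufferToBuffer(resolveBuffer, 0, readBuffer, 0, 16);
device.queue.submit([encoder.finish()]);

await readBuffer.mapAsync(GPUMapMode.READ);
const [begin, end] = new BigUint64Array(readBuffer.getMappedRange());
// Timestamps are in nanoseconds; this covers the pass only, not the frame.
console.log(`pass duration: ${Number(end - begin) / 1e6} ms`);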
What’s the best way to measure WebGPU performance today? Something like RenderDoc or PIX?
The most up-to-date knowledge we have is here.
Announcement!
As mentioned earlier, we are discontinuing support for version 1 of the engine and fully transitioning to version 2. The new base version on which the component will be built is engine version 2.12. All code related to supporting version 1 will be removed in order to simplify and reduce the overall codebase.
News:
We are preparing an improved occlusion checking system that supports different methods, which can be combined for optimal performance. For example, HZB can be used for lightweight models, while Occlusion Queries are better suited for large objects.
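A possible selection policy, sketched with made-up thresholds (this is an illustration of the idea, not the component's actual heuristic):

// Combine the two methods: cheap HZB texture test for lightweight models,
// hardware occlusion queries for large occludees where accuracy pays off.
function pickTester(screenPixelArea: number): 'hzb' | 'query' {
    const LARGE_OBJECT_AREA = 64 * 64; // px, assumption
    return screenPixelArea < LARGE_OBJECT_AREA ? 'hzb' : 'query';
}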
Currently, HZB-based culling at the primitive level is being tested in WebGL2/WebGPU. Using getBufferSubData is considered undesirable: even with a minimal amount of data, the read takes about 3 ms. Although it finishes before the next frame is rendered, it introduces additional overhead that can negatively affect performance.
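A standard way to take that readback off the critical path is the WebGL2 fence pattern below; this is a sketch of the general technique, not necessarily what the component does:

declare const gl: WebGL2RenderingContext;

// Poll a fence and only call getBufferSubData once the GPU has passed it,
// so the read never blocks on unfinished work.
function readWhenReady(buffer: WebGLBuffer, dst: Uint32Array, onDone: () => void): void {
    const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0)!;
    gl.flush(); // make sure the fence is actually submitted
    const poll = () => {
        const status = gl.clientWaitSync(sync, 0, 0); // non-blocking check
        if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
            gl.deleteSync(sync);
            gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buffer);
            gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, dst);
            gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
            onDone();
        } else {
            requestAnimationFrame(poll); // try again next frame
        }
    };
    poll();
}

The trade-off is latency: results arrive a frame or two late, which is usually acceptable for culling decisions.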
See:
> OcclusionCulling/BoundingBoxQueue.mts
> OcclusionCulling/HierachicalZBuffer.mts
> OcclusionCulling/HierachicalZBufferDebugger.mts
> OcclusionCulling/HierachicalZBufferTester.mts
> OcclusionCulling/OcclusionQueriesTester.mts
> OcclusionCulling/IOcclusionCullingTester.mts
/** Unique identifier issued by the tester's lock() call. */
export type TUnicalId = number;
/**
 * Interface for working with an occlusion culling testing system.
 * Allows registering BoundingBox objects, enqueueing them for testing,
 * and checking if the object is occluded by other scene geometry.
 */
export interface IOcclusionCullingTester {
    /**
     * Registers a BoundingBox for subsequent occlusion testing.
     * @param boundingBox - The bounds of the object in local or world coordinates.
     * @param matrix - Optional transformation matrix (if a local BoundingBox is used).
     * @returns A unique identifier for the registered object.
     */
    lock(boundingBox: pcx.BoundingBox, matrix?: pcx.Mat4): TUnicalId;

    /**
     * Releases a previously registered identifier and removes associated data.
     * @param id - The unique identifier obtained from the lock() call.
     */
    unlock(id: TUnicalId): void;

    /**
     * Adds the object to the queue for occlusion testing.
     * @param id - The unique identifier returned earlier by the lock() method.
     */
    enqueue(id: TUnicalId): void;

    /**
     * Checks the result of the last occlusion test for the specified object.
     * @param id - The unique identifier of the object.
     * @returns true if the object is occluded, otherwise false.
     */
    isOccluded(id: TUnicalId): boolean;
}
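A minimal usage sketch of this interface; the tester instance and bounding box values are made up:

declare const tester: IOcclusionCullingTester;

// Register once, when the object enters the scene.
const bounds = new pcx.BoundingBox(new pcx.Vec3(0, 0, 0), new pcx.Vec3(1, 1, 1));
const id = tester.lock(bounds);

// Each frame: schedule a new test and consume the most recent result.
tester.enqueue(id);
if (!tester.isOccluded(id)) {
    // object is (conservatively) visible; submit it for rendering
}

// When the object is destroyed:
tester.unlock(id);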
This likely very much depends on how busy the GPU is. If it’s not busy, the readback can be fast (3 ms), but under heavy load it only completes after the already submitted rendering is done, and can take a lot longer.
In WebGPU, as always, everything is neat and precise: indirect draw calls solve our problems perfectly.
In WebGL2, the most efficient solution for primitive culling at the moment turned out to be performing the culling directly in the vertex shader:
...
uniform uint uOcclusionId;
uniform highp usampler2D uOcclusionTests;

void main() {
    // Fetch this object's occlusion result from the tests texture.
    uint occluded = texelFetch(uOcclusionTests, ivec2(uOcclusionId, 0), 0).r;
    if (occluded == 1u) {
        // Move the vertex outside the clip volume so the primitive is discarded.
        gl_Position = vec4(2.0, 2.0, 2.0, 1.0);
        return;
    }
    ...
}
...
I hear you … WebGL is very limited here.
Doing it in the vertex shader brings disadvantages too … it’s hard to say if that’s worth the win in many cases; likely only in some specialized ones.
A few more ideas that could work well. It might be possible to find a compromise by using the tester not only for occlusion detection but also, for example, for determining the model’s LOD or similar tasks. This could significantly reduce CPU workload.
I’m using a buffer with the uint32 type, which allows combining up to 32 different flags:
uint flags = 0u;
flags |= (1u << k);                      // set flag k, e.g. OCCLUDED, IN_FRUSTUM, LOD_0, LOD_1, LOD_2, ...
flags &= ~(1u << k);                     // clear flag k
bool isSet = (flags & (1u << k)) != 0u;  // test flag k
This way, you can record and check various object states in a single pass — from frustum and occlusion tests to LOD level selection.
attribute uint aBoundingBoxIndex;
flat out uint out_flags;
uniform sampler2D uDataTexture;
uniform sampler2D uHZB;
uniform mat4 uMatrixViewProjection;
uniform vec2 uScreenSize;
uniform vec2 uHZBSize;
${getDepthVS}
${pixelFractionalCheckFnVS}
${cullBoundingBoxVS}
${getBoundingBoxPropsVS}
void main(void) {
    float instanceDepth;
    float hizDepth;
    vec3 boundingBoxCenter;
    vec3 boundingBoxHalfExtents;
    getBoundingBoxProps(aBoundingBoxIndex, boundingBoxCenter, boundingBoxHalfExtents);
    int cullStatus = cullBoundingBox(boundingBoxCenter, boundingBoxHalfExtents, uMatrixViewProjection,
        uScreenSize, uHZBSize, instanceDepth, hizDepth);
    out_flags = uint(cullStatus);
}
`;
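On the CPU side, matching bit constants could look like this; the names are assumptions that mirror the flag idea above, not the repo's actual enum:

// Bit layout mirroring the shader's out_flags (illustrative names).
const enum CullFlag {
    OCCLUDED   = 1 << 0,
    IN_FRUSTUM = 1 << 1,
    LOD_0      = 1 << 2,
    LOD_1      = 1 << 3,
    LOD_2      = 1 << 4,
}

function isFlagSet(flags: number, flag: CullFlag): boolean {
    return (flags & flag) !== 0;
}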


