So right now there are a few ways you can go. If you really do need the GUI rendered in WebGL, then yes, you need scene data for it, though I'm not sure you want any physics on it. Technically, entities are very lightweight and don't cost much RAM or CPU. It is mostly imagery that will consume your RAM on mobile.
You could build the UI using DOM as well. I prefer it that way, but it is up to you guys.
Regarding levels: there is currently a multi-scene system, but you will have to get used to the way it works.
You definitely want to keep levels as separately loadable pieces of hierarchy. So if you simply load a separate JSON file, you can parse it into entities:
// data is the loaded scene JSON
var parser = new pc.SceneParser(this.app);
var parent = parser.parse(data);
This returns the parent, which is the root node of that scene. To load the scene JSON, you need to know where it is located, which you can find using the Network profiler in Dev Tools. It will be something like ID.json, where ID is the ID of the scene, which you can find in the Editor URL while editing that specific scene.
I know, there could be a better workflow for this.
There is actually an idea that has been around for a long time, Async Hierarchy: it would allow you to mark a specific entity as Async, so that it is not loaded while it is disabled, and is loaded once it is enabled or requested by a special component.
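As a rough sketch of the ID.json lookup described above, something like the following could work. The helper names (sceneIdFromEditorUrl, sceneJsonName, loadLevel) and the baseUrl parameter are made up for illustration, and I am assuming the scene ID is the trailing number in the Editor URL. Check the real file location in the Network profiler for your own project.

```javascript
// Hypothetical helper: pull the scene ID out of an Editor URL.
// Assumes the ID is the run of digits at the end of the URL path.
function sceneIdFromEditorUrl(url) {
    var path = url.split(/[?#]/)[0];        // drop query string and hash
    var match = path.match(/(\d+)\/?$/);    // trailing digits, optional slash
    return match ? match[1] : null;
}

// Builds the file name mentioned above: "<ID>.json".
function sceneJsonName(id) {
    return id + '.json';
}

// Hypothetical loader: fetches the scene JSON and passes it to a callback.
// baseUrl is whatever folder the Network profiler shows for your project.
function loadLevel(baseUrl, id, parse) {
    return fetch(baseUrl + '/' + sceneJsonName(id))
        .then(function (response) {
            if (!response.ok) {
                throw new Error('Failed to load scene ' + id + ': ' + response.status);
            }
            return response.json();
        })
        .then(parse);
}

// Possible usage inside a script, reusing the parser from the post:
// loadLevel(baseUrl, id, function (data) {
//     var parser = new pc.SceneParser(app);
//     return parser.parse(data);
// });
```

Passing the parse step in as a callback keeps the loading part independent of the engine, so you can swap in whatever parsing your engine version supports.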
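To make the idea a bit more concrete, here is a minimal sketch in plain JavaScript of how "load on enable" could behave. AsyncNode is a made-up stand-in, not a PlayCanvas class or an existing feature, and loadChildren is an assumed function that returns a Promise of the child data.

```javascript
// Hypothetical stand-in for an "async" entity; not part of the engine.
function AsyncNode(loadChildren) {
    this._loadChildren = loadChildren; // function returning a Promise of children
    this._pending = null;              // cached Promise so loading happens only once
    this.children = [];
    this.enabled = false;              // disabled by default, so nothing is loaded yet
}

// Enabling the node triggers the load the first time only.
AsyncNode.prototype.enable = function () {
    var self = this;
    self.enabled = true;
    if (!self._pending) {
        self._pending = self._loadChildren().then(function (children) {
            self.children = children;
            return self;
        });
    }
    return self._pending;
};

// Disabling keeps the already loaded children in memory.
AsyncNode.prototype.disable = function () {
    this.enabled = false;
};
```

In this sketch a level would stay as a disabled node that costs nothing until something enables it, and enabling it a second time reuses the first load.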