This page contains a series of ideas, each of which should be tied to a specific JIRA where the discussion / resolution will occur.
Right now, JFX has two threads: the UI thread and the render thread. The render thread currently works in direct mode: it traverses the render graph hierarchy and issues graphics commands to the card. This incurs latency on some calls when the render thread needs to wait for the graphics card to execute the current operation. When a node is rendered, the output is normally a request to run a certain shader and a set of vertices for the shader. The proposal is to record that output in a command buffer: rather than talking directly to the card, the command buffer saves a list of commands and their arguments and then puts the final result together to give to the graphics card.
How does this improve performance? Since graphics commands are saved, the Java code that computes vertices does not need to run. More importantly, threads can compute command buffers concurrently and not incur latency from the graphics card.
Modern CPUs have many more cores than we currently take advantage of, so multi-threaded rendering is a necessity. Using a command buffer, multiple threads can process branches of the render graph hierarchy. A single render thread is responsible for executing the command buffers. While command buffer threads are executing, they can request resources from the render thread so that when it comes time to execute the buffer, textures and other resources have already been created.
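The record-then-execute split described above can be sketched in plain Java. This is only an illustration under stated assumptions: `CommandBufferSketch`, `CommandBuffer`, `DrawCommand`, and `recordBranch` are hypothetical names, not Prism classes, and the "execute" step just counts commands where a real render thread would issue graphics calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these types exist in Prism.
final class CommandBufferSketch {

    /** One recorded request: a shader name plus the vertices it should consume. */
    static final class DrawCommand {
        final String shader;
        final float[] vertices;

        DrawCommand(String shader, float[] vertices) {
            this.shader = shader;
            this.vertices = vertices;
        }
    }

    /** Saves commands and their arguments instead of talking to the card. */
    static final class CommandBuffer {
        private final List<DrawCommand> commands = new ArrayList<>();

        void draw(String shader, float... vertices) {
            commands.add(new DrawCommand(shader, vertices));
        }

        List<DrawCommand> commands() {
            return commands;
        }
    }

    /** Stand-in for walking one branch of the render graph: one quad per node. */
    static CommandBuffer recordBranch(int nodeCount) {
        CommandBuffer buffer = new CommandBuffer();
        for (int i = 0; i < nodeCount; i++) {
            float x = i;
            buffer.draw("image", x, 0f, x + 1f, 0f, x + 1f, 1f, x, 1f);
        }
        return buffer;
    }

    /** Workers record branches concurrently; the calling thread replays them in order. */
    static int recordAndExecute(int[] branchSizes, int workerThreads) {
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        try {
            List<Future<CommandBuffer>> pending = new ArrayList<>();
            for (final int size : branchSizes) {
                Callable<CommandBuffer> task = () -> recordBranch(size);
                pending.add(workers.submit(task));
            }
            int executed = 0;
            for (Future<CommandBuffer> future : pending) {
                // Blocks only until this particular branch has been recorded.
                for (DrawCommand command : future.get().commands()) {
                    executed++; // a real render thread would issue the graphics call here
                }
            }
            return executed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Recording a command buffer failed", e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(recordAndExecute(new int[] {3, 5, 2}, 2) + " commands executed");
    }
}
```

Replaying the buffers in submission order keeps draw order deterministic even though recording happens in parallel. The sketch leaves out the resource requests from worker threads to the render thread.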
Reducing State Switches ("super shader")
This idea is based on the fact that with Region caching enabled, almost everything we do is rendering images and text. Right now, the first time a checkbox is rendered (for example), we first render it to an image, store the image in a cache, and thereafter whenever we have to render the checkbox we do so by rendering the cached image (simplified, but you get the point). When we render text, we are also rendering images, but with a different shader. At the moment that means that to render a checkbox, we first set up the shader for rendering from an image, render, and then switch to the text shader and render text. If you have a page with 20 check boxes, we end up doing 40 state switches.
Further, every UI control is essentially the exact same thing – either it is only images, or it is images + text.
The first idea, then, is to have a single shader which can handle images and text. If this is possible, it would mean that we wouldn't have to perform state switching for most of a normal business UI. Most of the UI is made up of Regions (controls), Text, and Images. All of those could be handled by a single shader.
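The arithmetic behind the 20-checkbox example can be checked with a small counting sketch. It is a toy model, not Prism code: shaders are plain strings, `StateSwitchSketch` and its methods are hypothetical, and binding the first shader is counted as a switch so that the result matches the figure quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: a draw is identified only by the shader it needs.
final class StateSwitchSketch {

    /** Counts how often the bound shader changes while replaying draws in order. */
    static int countShaderSwitches(List<String> shaderPerDraw) {
        int switches = 0;
        String bound = null; // nothing bound yet, so the first draw counts as a switch
        for (String shader : shaderPerDraw) {
            if (!shader.equals(bound)) {
                switches++;
                bound = shader;
            }
        }
        return switches;
    }

    /** Each checkbox is one cached-image draw followed by one text draw. */
    static List<String> checkBoxDraws(int count, String imageShader, String textShader) {
        List<String> draws = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            draws.add(imageShader);
            draws.add(textShader);
        }
        return draws;
    }

    public static void main(String[] args) {
        System.out.println("separate shaders: "
                + countShaderSwitches(checkBoxDraws(20, "image", "text")));
        System.out.println("single shader:    "
                + countShaderSwitches(checkBoxDraws(20, "super", "super")));
    }
}
```

With separate image and text shaders every draw forces a rebind; with one combined shader only the initial bind remains.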
The second idea is that we could have a pre-baked image cache for Modena that we simply upload at startup. In this way we avoid the initial rasterization pass entirely (that is, we don't have to first draw to an image and then draw from the image to the back buffer).
The third idea is to add 9-slice support to Region for those cases where it can be supported, such that we don't have to redraw things just because they are taller but can still use the cached images.
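To make the 9-slice idea concrete, here is a minimal sketch of how the destination rectangles could be derived when a cached image is reused at a different size. The class and method names are hypothetical, and the sketch assumes the target is at least as large as the combined insets; it does not say anything about how Region would decide that 9-slicing is applicable.

```java
// Hypothetical helper; not part of Region or Prism.
final class NineSliceSketch {

    /** Splits one axis into {start, stretchable center, end} lengths. */
    static double[] sliceAxis(double targetLength, double startInset, double endInset) {
        double center = Math.max(0, targetLength - startInset - endInset);
        return new double[] {startInset, center, endInset};
    }

    /**
     * Returns nine destination rectangles as {x, y, width, height}, row by row.
     * Corners keep their cached size; edges stretch along one axis; the center
     * stretches along both.
     */
    static double[][] destinationRects(double width, double height,
                                       double top, double right, double bottom, double left) {
        double[] cols = sliceAxis(width, left, right);
        double[] rows = sliceAxis(height, top, bottom);
        double[][] rects = new double[9][];
        double y = 0;
        for (int r = 0; r < 3; r++) {
            double x = 0;
            for (int c = 0; c < 3; c++) {
                rects[r * 3 + c] = new double[] {x, y, cols[c], rows[r]};
                x += cols[c];
            }
            y += rows[r];
        }
        return rects;
    }

    public static void main(String[] args) {
        // A region that grew to 100 x 40 with 4 px top/bottom and 6 px left/right insets.
        for (double[] rect : destinationRects(100, 40, 4, 6, 4, 6)) {
            System.out.println(java.util.Arrays.toString(rect));
        }
    }
}
```

The source rectangles in the cached image would be computed the same way from the cached image's own size, so a taller control only changes the destination rectangles and the cached image stays valid.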
Preliminary testing with CheckBox seems to indicate a potential 6x improvement in performance for this case (where you have a hundred or so check boxes on the scene). The numbers for TableView were only marginally better – perhaps due to the overhead in CSS / Layout related to the table, although this analysis is speculative.
Preserve the Back Buffer
If we are not updating the whole screen on each pulse then we could benefit from preserving the framebuffer. We would need to explicitly clear dirty regions before rendering to them.
Optimize String Measuring
It is no secret that the cost of string measuring can have a huge effect on performance. String measuring operations are called often in FX to determine the preferred size of controls, and layout happens often in FX as application code changes the contents of controls.
It's easy to see the same strings being measured over and over again. We could fix the callers to cache/call less or cache way down deep inside of Prism.
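One way the "cache way down deep" option might look is a small bounded LRU cache in front of the real measuring call. This is a sketch under assumptions: `TextWidthCache` is a hypothetical class, the font is reduced to a string key, and the measurer is passed in as a function because the actual Prism measuring entry point is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical cache; not thread-safe, so it assumes a single measuring thread.
final class TextWidthCache {
    private final ToDoubleBiFunction<String, String> measurer; // (fontKey, text) -> width
    private final Map<String, Double> cache;
    private int misses;

    TextWidthCache(final int maxEntries, ToDoubleBiFunction<String, String> measurer) {
        this.measurer = measurer;
        // accessOrder = true turns the map into an LRU list; the override bounds its size.
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > maxEntries;
            }
        };
    }

    double width(String fontKey, String text) {
        // Simplified key; a real cache would use a structured key to avoid collisions.
        String key = fontKey + "\n" + text;
        Double cached = cache.get(key);
        if (cached == null) {
            misses++;
            cached = measurer.applyAsDouble(fontKey, text);
            cache.put(key, cached);
        }
        return cached;
    }

    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        // Fake measurer: 7 units per character, standing in for the expensive call.
        TextWidthCache widths = new TextWidthCache(256, (font, text) -> text.length() * 7.0);
        widths.width("System 12", "OK");
        widths.width("System 12", "OK");
        System.out.println("expensive measurements: " + widths.misses());
    }
}
```

The cache has to be bounded because application text is open-ended; the capacity of 256 in `main` is an arbitrary illustration, not a recommendation.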
Implement Hardware Layers
We could be taking advantage of hardware layers to speed up composition.
This is not strictly a performance optimization, but silky smooth animation makes a program seem faster and look more polished.
Reducing Redundant Relayout
I believe we presently do much more work per scene than is required when a single component nested deep in the structure of the scene graph has changed its preferred size and requires layouts to execute. The way this is supposed to work at present is that, when a Node calls requestLayout, we assume that its preferred size, min size, or max size may have changed in a way that affects how the node is laid out in its container.
If the container is a Group, then during the layout pass when the child is resized to its new preferred size, this change in size might also impact the size of the Group. If the Group is a child of a layout container, the change in the group size will also impact the layout of items within that parent layout container, and require another layout pass.
In the normal course of affairs, when a node's requestLayout is called, it walks up the tree marking each parent in the tree as also needing to have layout applied. This is because the change in the pref width of a button may in fact impact the pref width of its parent container which may affect the pref width of the parent container's parent container, and so on. If one of those parent containers is a layout root (such as the content pane of a ScrollPane) then we don't walk any further up in the hierarchy since we know a change to the nodes within the layout root will have no impact on the pref width / height of the layout root.
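The upward walk described above can be sketched with a toy node type. `LayoutNode` and its fields are hypothetical stand-ins for the real scene graph classes; the sketch only models the dirty-marking rule from this paragraph, including the stop at a layout root.

```java
// Hypothetical stand-in for a scene graph node; only layout dirtiness is modeled.
final class LayoutNode {
    final String name;
    final LayoutNode parent;
    final boolean layoutRoot;
    boolean needsLayout;

    LayoutNode(String name, LayoutNode parent, boolean layoutRoot) {
        this.name = name;
        this.parent = parent;
        this.layoutRoot = layoutRoot;
    }

    /** Marks this node and its ancestors dirty, stopping at the nearest layout root. */
    void requestLayout() {
        LayoutNode node = this;
        while (node != null) {
            node.needsLayout = true;
            if (node.layoutRoot) {
                // Changes below a layout root cannot affect its own pref size,
                // so nothing above it needs to be laid out again.
                break;
            }
            node = node.parent;
        }
    }

    public static void main(String[] args) {
        LayoutNode scene = new LayoutNode("scene root", null, false);
        LayoutNode content = new LayoutNode("ScrollPane content", scene, true);
        LayoutNode box = new LayoutNode("box", content, false);
        LayoutNode button = new LayoutNode("button", box, false);

        button.requestLayout();
        for (LayoutNode n : new LayoutNode[] {scene, content, box, button}) {
            System.out.println(n.name + " needsLayout=" + n.needsLayout);
        }
    }
}
```

In this model the layout root itself is marked dirty, since its children still need to be laid out, but the scene root above it is left untouched.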
With this basic understanding, a few ideas come to mind:
- Verify that all of the Nodes (such as Controls) which can be layout roots are properly identified as such. For example, the TableView, ListView, and TreeView should be layout roots.
- When running a layout pass, have the ability to determine whether a dirty layout node's pref / min / max size has changed. If not, then we have no need to run layout on this node, but can proceed to asking the dirty layout children to lay out themselves.
- During a typical layout pass, the parent asks each child for its prefWidth, prefHeight, minWidth, minHeight, maxWidth, and maxHeight (or maybe 4 of those 6). It then proceeds to perform the layout algorithm 3 or more times. Suppose I have root R and layout container L with children C1-C3. When R attempts to lay itself out, it first asks L for its prefWidth (say). To figure out its pref width, L must get the pref width / height of C1-C3 and perform the layout algorithm. R then asks L for its minWidth (say), and L must then ask for the min width of C1-C3 and run the layout algorithm again to figure out its min width. And so forth. Multiple passes over complex layout algorithms are likely hurting us substantially. In retrospect, I might have said that min/max was never computed, only ever specified manually, which would probably have had a big positive impact on performance. Nevertheless, the fact that we have to run the algorithm multiple times even when not strictly necessary is probably a cause of poor performance.
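The last two ideas above both come down to not recomputing a size that cannot have changed. A minimal sketch of that, assuming a toy container whose pref width is simply the sum of its children's: `SizedNode`, `computeCount`, and the summing "layout algorithm" are all hypothetical, chosen only to make the number of recomputations observable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node whose pref width is its own width plus that of its children.
final class SizedNode {
    private final List<SizedNode> children = new ArrayList<>();
    private SizedNode parent;
    private double ownWidth;
    private double cachedPrefWidth = -1; // -1 means "not computed since last invalidation"
    int computeCount;                    // how often the "layout algorithm" actually ran

    SizedNode(double ownWidth) {
        this.ownWidth = ownWidth;
    }

    SizedNode add(SizedNode child) {
        child.parent = this;
        children.add(child);
        invalidate();
        return this;
    }

    void setOwnWidth(double width) {
        if (width != ownWidth) { // an unchanged size invalidates nothing
            ownWidth = width;
            invalidate();
        }
    }

    private void invalidate() {
        for (SizedNode node = this; node != null; node = node.parent) {
            node.cachedPrefWidth = -1;
        }
    }

    double prefWidth() {
        if (cachedPrefWidth < 0) {
            computeCount++;
            double total = ownWidth;
            for (SizedNode child : children) {
                total += child.prefWidth();
            }
            cachedPrefWidth = total;
        }
        return cachedPrefWidth;
    }

    public static void main(String[] args) {
        SizedNode c1 = new SizedNode(10), c2 = new SizedNode(20), c3 = new SizedNode(30);
        SizedNode l = new SizedNode(0).add(c1).add(c2).add(c3);
        l.prefWidth();
        l.prefWidth(); // answered from the cache
        c2.setOwnWidth(25);
        System.out.println("prefWidth=" + l.prefWidth() + ", computed " + l.computeCount + " times");
    }
}
```

Repeated queries between changes cost nothing in this model, and a change in one child leaves its siblings' cached values intact. The same pattern would apply to min and max sizes.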
Preinitialize Controls to Well-Known CSS Default Values
Rather than running CSS at start up, precompute the defaults and initialize FX to have these values. This should improve start up time.
Investigate Native GUI Timer, Pulses and Event Flow
Especially when the system is stressed, FPS can be sensitive to the timing of events and the pulse timer. Right now, it is undefined when native GUI events, FX pulse events, and runLater() actions happen; the only guarantee is that flooding the system with runLater() calls will not starve native GUI events.
There is some anecdotal evidence that suggests using the native GUI timer on OS X improves performance.
Support Instancing in JavaFX / Better Texture Caching
Right now, it is possible to ask FX to cache a node. This causes the node to be rendered to a texture on the graphics card, and the texture is retained for future draws. This can make drawing of the node much faster, provided that the node is not changing and is drawn a lot. Turning caching on is not always a win and needs to be done carefully by the application programmer (if at all).
There is evidence that application caching is more performant than system caching. Application code that renders static content to an image and then uses the same image in many different nodes is effectively caching. The image is represented as a single texture and that texture is on the graphics card.
Instancing would allow the application programmer to declare that identical nodes are shared in the render tree. This would allow the system to cache and optimize drawing.
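The sharing that instancing would enable can be illustrated with a toy texture registry, in which identical content is uploaded once and every node that declares the same content receives the same texture. Everything here is hypothetical: `SharedTextureSketch`, `Texture`, and the string content key are illustrative, and no such API exists in JavaFX today.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry; models "one texture per distinct content", nothing more.
final class SharedTextureSketch {

    /** Stand-in for a texture living on the graphics card. */
    static final class Texture {
        final int id;
        final String contentKey;

        Texture(int id, String contentKey) {
            this.id = id;
            this.contentKey = contentKey;
        }
    }

    private final Map<String, Texture> textures = new HashMap<>();
    private int uploads;

    /** Returns the shared texture for this content, rasterizing and uploading at most once. */
    Texture textureFor(String contentKey) {
        Texture texture = textures.get(contentKey);
        if (texture == null) {
            uploads++; // a real implementation would rasterize and upload here
            texture = new Texture(uploads, contentKey);
            textures.put(contentKey, texture);
        }
        return texture;
    }

    int uploads() {
        return uploads;
    }

    public static void main(String[] args) {
        SharedTextureSketch registry = new SharedTextureSketch();
        for (int i = 0; i < 100; i++) {
            registry.textureFor("check-box:selected");
        }
        System.out.println("uploads for 100 identical nodes: " + registry.uploads());
    }
}
```

This mirrors what the application-level image trick already achieves by hand; the difference with instancing would be that the system, rather than the application, knows the nodes are identical.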