~eliasnaur/gio


Performance expectations

Dominik Honnef <dominik@honnef.co>
Details
Message ID
<871qvbdx5r.fsf@honnef.co>
Hi,

I wanted to inquire what Gio's expected performance is. Seeing how it
targets writing GUIs, is it not expected to handle thousands of
paint.FillShape ops in a single frame?

In a demo[1] I am drawing 4900 10x10 rectangles, and the performance isn't
very satisfying.

I am running on Linux and X11, so I am getting the OpenGL renderer. With
that, by default, I get 32 ms per frame, as measured by io/profile.
Setting GODEBUG=cgocheck=0 reduces that to 7.8 ms.

  1. The massive impact of the cgo pointer checking makes me wonder how
     often we call into OpenGL per frame. Surely we're not uploading and
     drawing each shape separately?
  
  2. Even at 7.8 ms this is way slower than I would've hoped.

To compare with Vulkan I've patched my copy of gio and reenabled Vulkan
on X11. With that I get 7.4 ms per frame by default, and 7.1 ms with
GODEBUG=cgocheck=0. Better, but see point 2 above.

In comparison, glxgears (which is about as much a benchmark as 4900
rectangles are) renders at 20,000 frames per second, or 50 µs per frame.
It admittedly has much less work to do than Gio, but does it have 100x
less to do? For a more comparable example, complex games run at 60 fps
and software using Dear ImGui and rendering UIs in complexity similar to
my demo run at thousands of fps.

What's suspicious is the distribution of time in the profile data.
Consider this random sample:

    {tot:  7.7ms draw:4.6198s gpu:  200µs st:     0s cov:  200µs}

Let's ignore the 'draw' field, that one seems to be monotonically
increasing (although I do not know why.) – If 'gpu', 'st' and 'cov' take
a combined 400µs, where are we spending our remaining time? Note that
this is with vsync disabled (forced via vblank_mode=0, and confirmed
working by the fact that this monitor runs at 60 Hz, not 128 Hz.) I'm not
sure if 'st' isn't captured at all, or finishes too fast to be measured.

FWIW, I created this benchmark after I ran into performance problems
in a real application I am writing, which renders execution traces,
which can contain thousands of visible spans. I'm hoping these
performance problems can be solved, as I'm otherwise quite enjoying the
experience of using Gio.

Cheers.

[1] https://play.golang.org/p/1WUZPAz6IV6
Details
Message ID
<CAMAFT9UKN-aN9FkuSZEZRQQxKS=NrOX=65Zd6uuMe1KFA-bs=A@mail.gmail.com>
In-Reply-To
<871qvbdx5r.fsf@honnef.co> (view parent)
Hi Dominik

Thank you for your demonstration and careful analysis.

On Sun, 26 Jun 2022 at 21:47, Dominik Honnef <dominik@honnef.co> wrote:
>
> Hi,
>
> I wanted to inquire what Gio's expected performance is. Seeing how it
> targets writing GUIs, is it not expected to handle thousands of
> paint.FillShape ops in a single frame?
>
> In a demo[1] I am drawing 4900 10x10 rectangles, and the performance isn't
> very satisfying.
>
> I am running on Linux and X11, so I am getting the OpenGL renderer. With
> that, by default, I get 32 ms per frame, as measured by io/profile.
> Setting GODEBUG=cgocheck=0 reduces that to 7.8 ms.
>
>   1. The massive impact of the cgo pointer checking makes me wonder how
>      often we call into OpenGL per frame. Surely we're not uploading and
>      drawing each shape separately?
>
>   2. Even at 7.8 ms this is way slower than I would've hoped.
>
> To compare with Vulkan I've patched my copy of gio and reenabled Vulkan
> on X11. With that I get 7.4 ms per frame by default, and 7.1 ms with
> GODEBUG=cgocheck=0. Better, but see point 2 above.
>
> In comparison, glxgears (which is about as much a benchmark as 4900
> rectangles are) renders at 20,000 frames per second, or 50 µs per frame.
> It admittedly has much less work to do than Gio, but does it have 100x
> less to do? For a more comparable example, complex games run at 60 fps
> and software using Dear ImGui and rendering UIs in complexity similar to
> my demo run at thousands of fps.
>

glxgears is not a great comparison, because of its low scene complexity.
As for 3D seemingly more performant than 2D, see [0] and [1]. Basically,
GPUs are not designed for 2D (CPUs less so, of course).

Dear ImGui is a good example of trading rendering quality and flexibility for
much better performance. If you only need simple shapes and a sprite sheet
of characters, Dear ImGui does very well.

> What's suspicious is the distribution of time in the profile data.
> Consider this random sample:
>
>     {tot:  7.7ms draw:4.6198s gpu:  200µs st:     0s cov:  200µs}
>
> Let's ignore the 'draw' field, that one seems to be monotonically
> increasing (although I do not know why.) – If 'gpu', 'st' and 'cov' take
> a combined 400µs, where are we spending our remaining time? Note that
> this is with vsync disabled (forced via vblank_mode=0, and confirmed
> working by the fact that this monitor runs at 60 Hz, not 128 Hz.) I'm not
> sure if 'st' isn't captured at all, or finishes too fast to be measured.
>
> FWIW, I created this benchmark after I ran into performance problems
> in a real application I am writing, which renders execution traces,
> which can contain thousands of visible spans. I'm hoping these
> performance problems can be solved, as I'm otherwise quite enjoying the
> experience of using Gio.
>
> Cheers.
>
> [1] https://play.golang.org/p/1WUZPAz6IV6

Gio is designed to render complex shapes with high quality anti-aliasing.
Thousands of axis-aligned rectangles are probably its worst case scenario :)

That said, our default renderer is not particularly good, and is basically
unchanged since I wrote it years ago based on the Pathfinder Rust project. We
have a version of piet-gpu implemented behind the GIORENDERER=forcecompute
environment setting, with which I hope to someday get Gio to render as fast as
described in

https://raphlinus.github.io/rust/graphics/gpu/2020/06/13/fast-2d-rendering.html

Unfortunately, various driver and GPU API issues have prevented me from moving
forward with making the compute renderer the default. There is some hope that
I can soon make another attempt because Raph has been working on a revised
version of piet-gpu with much better compatibility.

In any case, I'm sure there is a bunch of performance tuning available to us,
but there is little point in optimizing the compute renderer while it's hidden
behind a flag, and the old renderer is destined to be removed.

However, a vector renderer given thousands of simple rectangles with no need for
anti-aliasing will never perform as well as a specialized implementation.

Your case is not completely hopeless. I can achieve < 1ms frames on my
machine by cheating: one single path and caching through a macro[2]. Perhaps
with a cached shape per color you can get closer to acceptable performance in
the short term.

Longer term, I'd love to make your case work better by optimizing our renderer
to recognize cases with many simple shapes and a bunch of more or less static
text.

Elias

[0] https://blog.mecheye.net/2019/05/why-is-2d-graphics-is-harder-than-3d-graphics/
[1] https://raphlinus.github.io/rust/graphics/gpu/2019/05/08/modern-2d.html
[2] https://go.dev/play/p/cbWFIvw5XvV
Details
Message ID
<CANtNKfo7AeDQ6LYXNFJ5awPUz4XXQCQc=eyB4rX=sLv3LSr00w@mail.gmail.com>
In-Reply-To
<CAMAFT9UKN-aN9FkuSZEZRQQxKS=NrOX=65Zd6uuMe1KFA-bs=A@mail.gmail.com> (view parent)
With regards to cgocheck, one significant optimization would be to avoid passing
the whole glFunctions
(https://git.sr.ht/~eliasnaur/gio/tree/main/item/internal/gl/gl_unix.go#L37).
Go needs to check all pointery things passed into a C func.
So, if it were just a single func ptr, it wouldn't need to check all the fields.

+ Egon

Dominik Honnef <dominik@honnef.co>
Details
Message ID
<87y1xicy0m.fsf@honnef.co>
In-Reply-To
<CAMAFT9UKN-aN9FkuSZEZRQQxKS=NrOX=65Zd6uuMe1KFA-bs=A@mail.gmail.com> (view parent)
Thank you for your detailed response.

Elias Naur <mail@eliasnaur.com> writes:

> glxgears is not a great comparison, because of its low scene complexity.
> As for 3D seemingly more performant than 2D, see [0] and [1]. Basically,
> GPUs are not designed for 2D (CPUs less so, of course).
>
> Dear ImGui is a good example of trading rendering quality and flexibility for
> much better performance. If you only need simple shapes and a sprite sheet
> of characters, Dear ImGui does very well.

I was somewhat familiar with the difficulties of 2D rendering on modern
GPUs, but had assumed that people solved it. Then again, I haven't
benchmarked software rendering. Maybe my expectations for 2D rendering
are way off and Gio is already performing better than CPU-based options?

> Unfortunately, various driver and GPU API issues have prevented me from moving
> forward with making the compute renderer the default. There is some hope that
> I can soon make another attempt because Raph has been working on a revised
> version of piet-gpu with much better compatibility.

That's probably for the best. I gave it a try on my system (Linux, X11,
AMD RX 6700 XT). With OpenGL, I saw frame times around 300ms. With
Vulkan, my system froze for several seconds, and afterwards Gio failed
with "compute: shader program failed with error 1065353216".

> Your case is not completely hopeless. I can achieve < 1ms frames on my
> machine by cheating: one single path and caching through a macro[2]. Perhaps
> with a cached shape per color you can get closer to acceptable performance in
> the short term.

Unfortunately, the real use case has rectangles of different widths, so
that approach probably doesn't apply?

Can paths be disconnected? In that case I could attempt to draw all
spans of the same color in a single operation.

>
> Longer term, I'd love to make your case work better by optimizing our renderer
> to recognize cases with many simple shapes and a bunch of more or less static
> text.

Of course you then run into the trouble where a minor change to the
rendered UI can drastically change the performance, once the
optimization no longer applies.
Details
Message ID
<CANtNKfpyebYjN6cBoZE25k+iNvBt5A0_v4T9f3=SEtiEUEZ11g@mail.gmail.com>
In-Reply-To
<87y1xicy0m.fsf@honnef.co> (view parent)
Yes, paths can be disconnected.

https://go.dev/play/p/fy5scYZnd0Y


Details
Message ID
<CAMAFT9VC4DAL8NW2bs7De953tttjz3M4g962H4bqMoeectZ8=Q@mail.gmail.com>
In-Reply-To
<CANtNKfo7AeDQ6LYXNFJ5awPUz4XXQCQc=eyB4rX=sLv3LSr00w@mail.gmail.com> (view parent)
On Mon, 27 Jun 2022 at 09:55, Egon Elbre <egonelbre@gmail.com> wrote:
>
> With regards to cgocheck, one significant optimization would be to avoid passing
> the whole glFunctions
> (https://git.sr.ht/~eliasnaur/gio/tree/main/item/internal/gl/gl_unix.go#L37).
> Go needs to check all pointery things passed into a C func.
> So, if it were just a single func ptr, it wouldn't need to check all the fields.
>

Thank you for noticing! I've done as you suggested in commit dab79680, bringing
OpenGL performance on par with Vulkan.

Elias
Details
Message ID
<CAMAFT9V9FnRra8uYjwX1zhCeg=wOSwwNnh+++TXMtm01NO=cnQ@mail.gmail.com>
In-Reply-To
<87y1xicy0m.fsf@honnef.co> (view parent)
On Mon, 27 Jun 2022 at 10:26, Dominik Honnef <dominik@honnef.co> wrote:
>
> Thank you for your detailed response.
>
> Elias Naur <mail@eliasnaur.com> writes:
>
> > glxgears is not a great comparison, because of its low scene complexity.
> > As for 3D seemingly more performant than 2D, see [0] and [1]. Basically,
> > GPUs are not designed for 2D (CPUs less so, of course).
> >
> > Dear ImGui is a good example of trading rendering quality and flexibility for
> > much better performance. If you only need simple shapes and a sprite sheet
> > of characters, Dear ImGui does very well.
>
> I was somewhat familiar with the difficulties of 2D rendering on modern
> GPUs, but had assumed that people solved it. Then again, I haven't
> benchmarked software rendering. Maybe my expectations for 2D rendering
> are way off and Gio is already performing better than CPU-based options?
>

If Gio is not already much faster than a CPU renderer, it can most certainly be
made so. I'd even say that a GPU-based renderer is a requirement for a GUI
toolkit to perform well on devices with relatively slow CPUs and high-resolution
displays, which describes most mobile devices.

Also note that most/all GUI toolkits rely on your UI staying mostly static from
frame to frame, whereas Gio mostly does not: it renders everything from scratch
and assumes anti-aliased complex shapes everywhere. I believe optimizing Gio
performance is a question of specialization and re-using more of the previous
frame.

> > Your case is not completely hopeless. I can achieve < 1ms frames on my
> > machine by cheating: one single path and caching through a macro[2]. Perhaps
> > with a cached shape per color you can get closer to acceptable performance in
> > the short term.
>
> Unfortunately, the real use case has rectangles of different widths, so
> that approach probably doesn't apply?
>

Different widths (or entirely different shapes) are not a problem. The only
problem is that shapes with different fills (colors) cannot share a path.

> Can paths be disconnected? In that case I could attempt to draw all
> spans of the same color in a single operation.
>

Yes.

> >
> > Longer term, I'd love to make your case work better by optimizing our renderer
> > to recognize cases with many simple shapes and a bunch of more or less static
> > text.
>
> Of course you then run into the trouble where a minor change to the
> rendered UI can drastically change the performance, once the
> optimization no longer applies.
>

Yes, but my claim is that every GUI toolkit relies on some level of
frame-to-frame coherence. The immediate mode programming model makes it easier
to program a dynamic UI, but the performance hit applies nevertheless. Gio's job
is to automatically detect and exploit coherence, without burdening the API.

Elias
Dominik Honnef <dominik@honnef.co>
Details
Message ID
<87v8smctsd.fsf@honnef.co>
In-Reply-To
<871qvbdx5r.fsf@honnef.co> (view parent)
I've implemented batching of rectangles by color, and also tested the
latest commit that improves the cgo pointer checking performance. Here
are my results:


Old commit = 72669e19bc294837b84f8f8a1c5c64dab869f8ab
New commit = dab796808acab7948946ae45891686e5f3b4ad95

| Rect batching | Renderer | Cgo checks | Commit | Time (ms) |
|---------------+----------+------------+--------+-----------|
| No            | OpenGL   | Enabled    | Old    |        63 |
| No            | OpenGL   | Disabled   | Old    |      15.3 |
| No            | Vulkan   | Enabled    | Old    |      14.7 |
| No            | Vulkan   | Disabled   | Old    |      14.3 |
|               |          |            |        |           |
| No            | OpenGL   | Enabled    | New    |      14.8 |
| No            | OpenGL   | Disabled   | New    |      14.5 |
| No            | Vulkan   | Enabled    | New    |      14.3 |
| No            | Vulkan   | Disabled   | New    |      14.1 |
|               |          |            |        |           |
| Yes           | OpenGL   | Enabled    | Old    |       4.3 |
| Yes           | OpenGL   | Disabled   | Old    |       3.5 |
| Yes           | Vulkan   | Enabled    | Old    |       3.8 |
| Yes           | Vulkan   | Disabled   | Old    |       3.8 |
|               |          |            |        |           |
| Yes           | OpenGL   | Enabled    | New    |       3.5 |
| Yes           | OpenGL   | Disabled   | New    |       3.5 |
| Yes           | Vulkan   | Enabled    | New    |       3.8 |
| Yes           | Vulkan   | Disabled   | New    |       3.8 |

The cgo change has an impressive impact. Batching also works very well,
and I'm willing to call my initial problems user error. Drawing ~10
shapes is clearly a better approach than drawing thousands of shapes.

I'm excited to see where performance will go in the future.