What's your benchmark result?

I want to know how my benchmark result compares to others in the real world, and whether it is actually good, so I can try to do better. What are the limitations of this benchmark test?
I would like to hear tips and tricks for getting the best results (besides the obvious practices), along with your display type, display interface, MCU, and any other relevant information. Mine is:

MCU: STM32 H7
Display: Ili9488
Display interface: FMC, DBI type B 16-bit RGB565.
GPU: GPU is enabled in the config file, and regular DMA is used to transfer data pixels to FMC display interface.
Compiler optimization: -O2 (“Optimize More”).
Buffers: 2 buffers, each with size = DISP_HOR_RES * 70.
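
For anyone comparing setups, here is a rough sketch of how much RAM a buffer configuration like this costs. The post only gives the buffer height (70 lines) and the RGB565 format; the 480-pixel horizontal resolution is an assumption (an ILI9488 in landscape), and `draw_buf_bytes` is a made-up helper, not an LVGL function.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values: only BUF_LINES, RGB565 and BUF_COUNT come from the post. */
#define DISP_HOR_RES  480u  /* assumption: ILI9488 used in landscape */
#define BUF_LINES     70u
#define BYTES_PER_PX  2u    /* RGB565 */
#define BUF_COUNT     2u

/* Total RAM needed for `count` draw buffers of `lines` rows each. */
static size_t draw_buf_bytes(uint32_t hor_res, uint32_t lines,
                             uint32_t bytes_per_px, uint32_t count)
{
    return (size_t)hor_res * lines * bytes_per_px * count;
}
```

Under these assumptions the two buffers take 134400 bytes (about 131 KiB), which is worth checking against the SRAM region they are placed in.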

picture

Wow, these are pretty good results!

Hi, I’ve managed to get 143 FPS (weighted) on STM32F746-Discovery board.

My setup:

  • Core: Cortex-M7
  • System clock: 200MHz
  • SDRAM clock: 100MHz
  • Lvgl buffer located in DTCMRAM (64KB)
  • LCD 480x272 driven by LTDC with framebuffer in SDRAM
  • ICache & DCache enabled
  • buffer flushing by DMA2D
  • ‘-Os’ code optimization

Video:

Code:

Wow, very nice!

How did moving the LVGL buffer to DTCMRAM affect the performance? Is it slower in “normal RAM”?

Actually it looks like there is no difference compared to “normal RAM”, but DTCMRAM is not cacheable, so I don’t have to bother with invalidating the cache when using DMA2D :wink:
With ‘-O3’ optimization it gets even more FPS: 167, but that’s obvious.
Anyway I’m happy with that performance :wink:
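
To illustrate the DTCM point, here is a minimal sketch of a check that decides whether a buffer needs cache maintenance before a DMA2D transfer. The DTCM base address and the 64 KB size are what I would expect on an STM32F746, but treat them as assumptions and verify them against the reference manual for your part; both function names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed STM32F746 memory map: 64 KB of DTCM RAM at 0x20000000. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (64u * 1024u)

/* True if the whole buffer [addr, addr + len) lies inside DTCM. */
static bool buf_in_dtcm(uint32_t addr, size_t len)
{
    if (len == 0 || len > DTCM_SIZE) {
        return false;
    }
    return addr >= DTCM_BASE && (addr - DTCM_BASE) <= DTCM_SIZE - len;
}

/* DTCM bypasses the D-cache, so only buffers outside it need a
 * clean/invalidate before a DMA master touches them. */
static bool needs_cache_maintenance(uint32_t addr, size_t len)
{
    return !buf_in_dtcm(addr, len);
}
```

In a real flush callback, the `true` branch is where a call such as `SCB_CleanDCache_by_Addr()` would go.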

I wanted to share my experience here as well.
Warning: wall of text ahead, but I hope people may find this useful.

I am using a custom board with STM32H723 @ 550 MHz, with 16 MiB SDRAM at 130 MHz
Display: 800x480 24-bit parallel via LTDC
LVGL release 8.3.11

lv_demo_benchmark FPS: 82
Bonus feature: Adaptive refresh rate (!!)

The main bottleneck is SDRAM bandwidth
Currently using ARGB8888 everywhere, because the LVGL DMA2D implementation does not support RGB888. Since we are bandwidth limited, RGB888 support would most likely improve performance by another 25%. We could disable DMA2D, but we want to be able to give CPU time to other RTOS tasks while rendering (semaphore in wait_cb).
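
The 25% figure can be checked with a little arithmetic. This sketch only counts bytes per full frame at 800x480; it assumes memory traffic scales with pixel size and ignores burst alignment and LTDC fetch overhead, so the real gain may differ. The helper names are made up.

```c
#include <stdint.h>

/* Bytes needed for one full frame at the given pixel size. */
static uint32_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_px)
{
    return width * height * bytes_per_px;
}

/* Integer percentage of bytes saved when going from one size to another. */
static uint32_t saving_percent(uint32_t from_bytes, uint32_t to_bytes)
{
    return (from_bytes - to_bytes) * 100u / from_bytes;
}
```

For 800x480, ARGB8888 needs 1536000 bytes per frame and RGB888 needs 1152000, i.e. 25% less data to move over the SDRAM bus.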

Buffers
Dual SRAM buffers for LVGL+DMA2D rendering (800*40 pixels).
AND
Dual framebuffers in external SDRAM (offset by 4 MiB, so they are placed in different internal banks). Not only does this result in tear-free rendering but actually improves worst-case performance (full-screen updates) when compared to a single framebuffer (see: AN4861 4.5.3). There is a catch though, see below.
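
A sketch of the framebuffer placement described above. It assumes the SDRAM is mapped at 0xC0000000, that the 16 MiB device has four internal banks, and that the bank address bits are the top bits of the address, so each bank covers 4 MiB. Your FMC mapping and SDRAM geometry may differ, and all names here are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: SDRAM mapped at 0xC0000000, 4 internal banks of 4 MiB. */
#define SDRAM_BASE       0xC0000000u
#define SDRAM_BANK_SIZE  (4u * 1024u * 1024u)
#define SDRAM_BANKS      4u

#define FB_BYTES  (800u * 480u * 4u)              /* ARGB8888 frame */
#define FB0_ADDR  (SDRAM_BASE)
#define FB1_ADDR  (SDRAM_BASE + SDRAM_BANK_SIZE)  /* offset by 4 MiB */

/* Internal SDRAM bank that an address falls into under this mapping. */
static uint32_t sdram_internal_bank(uint32_t addr)
{
    return ((addr - SDRAM_BASE) / SDRAM_BANK_SIZE) % SDRAM_BANKS;
}

/* True if a framebuffer starting at `addr` stays inside a single bank. */
static bool fb_in_one_bank(uint32_t addr)
{
    return sdram_internal_bank(addr) == sdram_internal_bank(addr + FB_BYTES - 1u);
}
```

With this layout the LTDC reads one bank while rendering writes to another, which is the situation AN4861 recommends.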

Flush with MDMA
DMA2D is already in use for rendering, but we still want to copy in parallel. So we use MDMA.
The MDMA “repeated block transfer” is perfectly suited to do a 2D copy, started in the flush_cb.
The burst size should be set to some higher value, e.g. MDMA_DEST_BURST_16BEATS, but be careful not to starve LTDC.
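
To make the 2D copy concrete, here is a plain-C software model of what the repeated block transfer does. It is not MDMA configuration code. Each "block" is one row of the rendered area, the block count is the number of rows, and the destination pointer advances by the framebuffer stride after every block. The function and parameter names are assumptions for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software model of a repeated block transfer used as a 2D copy.
 * src:        tightly packed rendered area (row_bytes per row)
 * dst:        address of the area's top-left pixel in the framebuffer
 * dst_stride: bytes per framebuffer line
 * rows:       number of blocks (rows) to copy */
static void copy_area_2d(uint8_t *dst, size_t dst_stride,
                         const uint8_t *src, size_t row_bytes, size_t rows)
{
    for (size_t r = 0; r < rows; r++) {
        memcpy(dst, src, row_bytes);  /* one block = one row */
        src += row_bytes;             /* source is contiguous */
        dst += dst_stride;            /* destination skips to the next line */
    }
}
```

On the hardware, the same three quantities would map to the block length, the block repeat count, and the destination address update applied after each block.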

Catch
This setup does not seem to be directly supported by LVGL (2x internal buffer + 2x framebuffer).
The main problem seems to be that after a buffer swap, some pixels need to be copied between the two framebuffers in external SDRAM to keep them in sync before the next change is rendered. Since we are bandwidth limited, redrawing the entire screen can actually give better performance than trying to reuse those pixels.

For this, we use:
lv_obj_invalidate(lv_scr_act()); → _lv_disp_refr_timer(NULL); and friends.
In a way, this is optimized for full-screen redrawing (full-screen menu scrolling, etc.). The disadvantage is lower efficiency for small updates, and a slowdown when there is a lot of overdraw, for example the birthday date picker in the widgets demo.

However, we can add an optimization where smaller updates still go directly to the front buffer without a full redraw.
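
One possible way to pick between the two paths is a simple area threshold. This is only a sketch: the function name and the threshold are hypothetical, and the right cut-off has to be measured on the actual hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether an update is large enough to justify a full redraw into
 * the back buffer, or small enough to draw straight into the front buffer.
 * threshold_pct is a tuning value, e.g. 25 for "a quarter of the screen". */
static bool use_full_redraw(uint32_t dirty_px, uint32_t screen_px,
                            uint32_t threshold_pct)
{
    return (uint64_t)dirty_px * 100u >= (uint64_t)screen_px * threshold_pct;
}
```

Drawing small updates directly into the front buffer can tear, so the threshold is a trade-off between tearing risk and SDRAM bandwidth.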

Adaptive refresh rate
This lets everything stay smooth even if you dip below the refresh rate of the screen.
It is actually quite simple to implement:

We have set up the line event interrupt in the LTDC to trigger just before the scanout of the new frame begins.
Depending on your display timings, this could be e.g. line 510 (blanking must be accounted for).
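
As a rough illustration of how such a line number comes about, the sketch below adds up the vertical timings. The timing values in the usage note are purely hypothetical, picked only because they happen to give 510. It also assumes the line counter starts at 0 at the beginning of vertical sync; check your display datasheet and the LTDC chapter of the reference manual.

```c
#include <stdint.h>

/* Vertical display timings, in lines. */
typedef struct {
    uint32_t vsync;   /* vertical sync width */
    uint32_t vbp;     /* vertical back porch */
    uint32_t active;  /* visible lines       */
    uint32_t vfp;     /* vertical front porch */
} v_timing_t;

/* First line after the visible area, counting from the start of vsync.
 * Assumption: this is where the line event should fire. */
static uint32_t line_event_pos(const v_timing_t *t)
{
    return t->vsync + t->vbp + t->active;
}

/* Total lines per frame; the line event position must be below this. */
static uint32_t total_lines(const v_timing_t *t)
{
    return t->vsync + t->vbp + t->active + t->vfp;
}
```

For example, a hypothetical panel with vsync = 10, vbp = 20, active = 480 and vfp = 15 would put the event at line 510 out of 525.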

In the interrupt, we check if LVGL has finished rendering. lv_disp_flush_is_last() is helpful here.
If it has, just swap the buffers as normal.

If it has not finished rendering, we temporarily stop the clock to the display:
__HAL_RCC_LTDC_CLK_DISABLE();
Careful: LTDC registers can no longer be accessed after this.

Then, after rendering is done, re-enable the clock:
__HAL_RCC_LTDC_CLK_ENABLE();
and then swap the buffers.
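
The steps above can be modelled as a small state machine. This is a hardware-free sketch of the decision logic only, with invented names. In real code the stall and resume points are where `__HAL_RCC_LTDC_CLK_DISABLE()` and `__HAL_RCC_LTDC_CLK_ENABLE()` would be called, and the swap would reprogram the LTDC layer address.

```c
#include <stdbool.h>

typedef enum { ACT_NONE, ACT_SWAP, ACT_STALL } refresh_action_t;

typedef struct {
    bool rendering;      /* LVGL is in the middle of drawing a frame */
    bool frame_ready;    /* a finished frame is waiting to be shown  */
    bool clock_stalled;  /* display clock is currently stopped       */
} refresh_state_t;

/* LVGL starts drawing a new frame. */
static void on_render_start(refresh_state_t *s)
{
    s->rendering = true;
}

/* Line event interrupt: fires just before the next scanout. */
static refresh_action_t on_line_event(refresh_state_t *s)
{
    if (s->rendering) {
        s->clock_stalled = true;   /* real code: stop the LTDC clock */
        return ACT_STALL;
    }
    if (s->frame_ready) {
        s->frame_ready = false;
        return ACT_SWAP;           /* finished in time: normal swap */
    }
    return ACT_NONE;               /* nothing new to show */
}

/* LVGL finished the last flush of the frame. */
static refresh_action_t on_render_done(refresh_state_t *s)
{
    s->rendering = false;
    if (s->clock_stalled) {
        s->clock_stalled = false;  /* real code: restart the clock first */
        return ACT_SWAP;           /* then swap right away */
    }
    s->frame_ready = true;         /* wait for the next line event */
    return ACT_NONE;
}
```

Since the LTDC registers cannot be accessed while its clock is stopped, the swap in the stalled path has to happen after the clock is re-enabled.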

297 FPS
16 cores (32 threads) @ 4.2 GHz
I want to say I had the resolution at 800x600x24

I know I was cheating…

@Rubeer
The adaptive refresh rate is ingenious!