Benchmark with v7

Hello everyone.
Does anyone have experience using LVGL version 7 on an ARM platform with a high-resolution screen and 16-bit color, without a graphics accelerator (GPU)? On my hardware (ARM Cortex-A5 at 498 MHz, DDR2 SDRAM at 166 MHz, 800x480 RGB565 TFT LCD with a memory-mapped frame buffer) I get very strange results. For example, in the benchmark demo below, the second column is fps when compiled with no optimization at all, and the third column is fps when compiled with full speed optimization:

 Benchmark report:                             
 Rectangle                                31    32  
 Rectangle + opa                          6     7   
 Rectangle rounded                        20    21  
 Rectangle rounded + opa                  5     6   
 Circle                                   6     10  
 Circle + opa                             4     5   
 Border                                   33    60  
 Border + opa                             30    34  
 Border rounded                           23    33  
 Border rounded + opa                     21    32  
 Circle border                            6     10  
 Circle border + opa                      6     8   
 Border top                               32    60  
 Border top + opa                         32    40  
 Border left                              12    19  
 Border left + opa                        11    14  
 Border top + left                        10    14  
 Border top + left + opa                  9     13  
 Border left + right                      10    14  
 Border left + right + opa                9     12  
 Border top + bottom                      30    33  
 Border top + bottom + opa                23    34  
 Shadow small                             6     8   
 Shadow small + opa                       7     8   
 Shadow small offset                      4     5   
 Shadow small offset + opa                4     5   
 Shadow large                             4     4   
 Shadow large + opa                       3     4   
 Shadow large offset                      2     3   
 Shadow large offset + opa                2     3   
 Image RGB                                38    71  
 Image RGB + opa                          7     7   
 Image ARGB                               15    22  
 Image ARGB + opa                         10    11  
 Image chorma keyed                       10    11  
 Image chorma keyed + opa                 6     7   
 Image indexed                            5     9   
 Image indexed + opa                      4     6   
 Image alpha only                         8     13  
 Image alpha only + opa                   5     8   
 Image RGB recolor                        4     6   
 Image RGB recolor + opa                  3     3   
 Image ARGB recolor                       6     8   
 Image ARGB recolor + opa                 5     5   
 Image chorma keyed recolor               6     8   
 Image chorma keyed recolor + opa         4     5   
 Image indexed recolor                    4     6   
 Image indexed recolor + opa              3     4   
 Image RGB rotate                         2     3   
 Image RGB rotate + opa                   1     2   
 Image RGB rotate anti aliased            0     1   
 Image RGB rotate anti aliased + opa      0     1   
 Image ARGB rotate                        2     3   
 Image ARGB rotate + opa                  2     3   
 Image ARGB rotate anti aliased           0     1   
 Image ARGB rotate anti aliased + opa     0     1   
 Image RGB zoom                           3     6   
 Image RGB zoom + opa                     3     4   
 Image RGB zoom anti aliased              1     2   
 Image RGB zoom anti aliased + opa        1     1   
 Image ARGB zoom                          3     5   
 Image ARGB zoom + opa                    3     4   
 Image ARGB zoom anti aliased             1     1   
 Image ARGB zoom anti aliased + opa       1     2   
 Text small                               8     12  
 Text small + opa                         7     11  
 Text medium                              8     12  
 Text medium + opa                        7     11  
 Text large                               8     12  
 Text large + opa                         7     11  
 Text small compressed                    7     10  
 Text small compressed + opa              7     10  
 Text medium compressed                   5     9   
 Text medium compressed + opa             5     8   
 Text large compressed                    3     5   
 Text large compressed + opa              3     4   
 Line                                     11    16  
 Line + opa                               11    15  
 Arc think                                9     13  
 Arc think + opa                          9     12  
 Arc thick                                9     13  
 Arc thick + opa                          8     11  
 Substr. rectangle                        4     4   
 Substr. rectangle + opa                  4     4   
 Substr. border                           20    22  
 Substr. border + opa                     21    22  
 Substr. shadow                           2     3   
 Substr. shadow + opa                     2     2   
 Substr. image                            6     8   
 Substr. image + opa                      5     6   
 Substr. line                             9     11  
 Substr. line + opa                       9     11  
 Substr. arc                              7     9   
 Substr. arc + opa                        7     9   
 Substr. text                             5     7   
 Substr. text + opa                       5     6   

The data show that on some of the heavier tests optimization makes almost no difference. On others it doubles the speed, which is of course still not enough, but is at least explainable. The compiler is IAR ARM 7.8. Earlier, on version 6, I used my own implementation of color mixing, which gave a 30% speedup with heavy use of shadows and transparency. Now, with the same build, the change in speed is within the measurement error.

Hi,

With STM32F769, 200 MHz, 800x480 TFT, -O3, no GPU I measured way better results. For example:

  • Rectangle: 145 FPS
  • Circle: 43 FPS
  • Shadow small offset: 16 FPS
  • Image RGB rotate anti aliased: 10 FPS
  • Text large compressed: 13 FPS
  • Arc thick: 37 FPS

So this MCU is 2.5 times slower but still produced 4-5 times better results.

If optimization has such a small effect, the bottleneck seems to be somewhere else.

  • Where do you store the display buffers (lv_disp_buf_t) and what is their size?
  • How do you flush the rendered image to the display?

I use full double buffering. Both buffers are located in a cached area of DDR2; however, disabling caching for these areas makes little difference: the frame rate drops by 1-2 fps. The screen update is implemented in the standard way: the DMA page is switched and an interrupt is generated on VSYNC.


static volatile uint8_t vsync = 0;
static lv_disp_t  *lvgl_D850_disp = NULL;
static lv_disp_drv_t *lvgl_D850_disp_drv = NULL;
static void LcdISR (void) {
    static volatile uint32_t status = 0; 
    CHIPPS_AIC pAIC = CHIP_BASE_AIC;
    pAIC->AIC_IVR = 0;
    status = LCDC->LCDC_BASEISR;
   
    // IMPORTANT!!!
    // Inform the graphics library that you are ready with the flushing
    if (lvgl_D850_disp_drv) lv_disp_flush_ready (lvgl_D850_disp_drv);
    vsync = 1;
}

void lccdBase_SetBufAddress (uint32_t addr, size_t size) {
    if (size) {
        CP15_flush_dcache_for_dma (addr, addr + size - 1);
    }
    sLayer              *pLD   = pLayer (LCDD_BASE);
    Lcdc                *pHw = LCDC;
    sLCDCDescriptor     *pTD   = &pLD->dmaD;
    pTD->addr = (uint32_t)addr;
    pTD->ctrl = LCDC_HEOCTRL_DFETCH;
    pTD->next = (uint32_t)pTD;
}


static void disp_flush_cb (lv_disp_drv_t * disp_drv, const lv_area_t * area, lv_color_t * color_p) {
    lccdBase_SetBufAddress ((uint32_t)color_p, disp_drv->ver_res * disp_drv->hor_res * 2);
    lvgl_D850_disp_drv = disp_drv;
}

What kind of memory is in this STM32F769?

It should be faster to place a smaller (e.g. 1/5 screen-sized) buffer in internal RAM, because there are a lot of reads and writes into it, so a short access time is critical. Besides, in true double buffering mode, when one frame buffer is flushed LVGL needs to copy the updated areas into the other one to keep them synchronized.
With an internal buffer, just memcpy the buffer to the inactive frame buffer in the flush function.

In v7 I’ve added lv_disp_flush_is_last(). With this you can check whether the area being flushed is the last one of the refresh. If so, you are good to swap the frame buffers.
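A minimal sketch of the copy step described above, written against stand-in types rather than the real LVGL headers: `px_t` stands for a 16-bit `lv_color_t`, `area_t` mirrors the corner fields of `lv_area_t`, and `copy_area_to_fb()` is a hypothetical helper name. It assumes the small draw buffer holds the rows of the area back to back.

```c
#include <stdint.h>
#include <string.h>

#define HOR_RES 800
#define VER_RES 480

typedef uint16_t px_t;                               /* stand-in for a 16-bit lv_color_t */
typedef struct { int16_t x1, y1, x2, y2; } area_t;   /* mirrors lv_area_t, corners inclusive */

/* Copy one rendered area from the small draw buffer into a full frame buffer.
 * The source rows are packed (stride = area width); the destination stride
 * is the full screen width, so the copy has to be done row by row. */
void copy_area_to_fb(px_t *fb, const area_t *a, const px_t *src)
{
    int32_t w = a->x2 - a->x1 + 1;                   /* area width in pixels */
    for (int32_t y = a->y1; y <= a->y2; y++) {
        memcpy(&fb[y * HOR_RES + a->x1], src, (size_t)w * sizeof(px_t));
        src += w;                                    /* next packed source row */
    }
}
```

In a real flush_cb this would be called with the inactive frame buffer as `fb`, the buffers would be swapped only when lv_disp_flush_is_last() reports the last area, and lv_disp_flush_ready() would be called once the copy is done.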

I would like to clarify: is the STM32F769 module one of the Discovery boards? If so, which one? I just cannot understand such a huge difference in performance. After all, the LCD buffer is located in SDRAM and the update is done by DMA, isn't it?

This is the board: https://www.st.com/en/evaluation-tools/32f769idiscovery.html
The MCU is connected to a display controller (which has frame buffer) via MIPI-DSI

The driver works like this:

  • Have 1 frame buffer in external RAM (it’s not the display controller’s frame buffer)
  • Have 2 1/10 display-sized buffers for LVGL
  • In the flush_cb I copy the areas to the frame buffer in SDRAM with DMA
  • After the last area it waits for the tearing effect signal and sends the content of the SDRAM frame buffer to the frame buffer in the display controller. (This part is tricky in practice because the DSI is too slow to send the frame buffer in one piece.)

The performance is not much worse without GPU.

I’ve updated the master branch with a related optimization. Does it improve the performance?

All the same, I do not understand how this can be. When simulating on a computer (Core-i5, 3.2GHz, Win64, SDL2, 800x480x16), I get 193 fps on the rectangles and 109 fps on the rectangles with transparency. How can you get a similar result (145 fps) on a 200 MHz controller? The total amount of data is 800x480x2x145 = 111360000 bytes. Dividing the clock frequency by that volume gives 1.79 cycles per byte. Even taking into account the superscalar M7 core, that is a very strange result. The recent changes have had practically no effect on the speed on my controller.

  1. It counts only the refreshed areas. With rectangles it’s roughly half of the screen.
  2. Pixel count matters, not byte count. Painting a pixel RED takes only 1 cycle.

So it’s 0.5 screen x 800 x 480 x 145 = 27M -> 200MHz / 27M = 7 cycles / px
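The estimate above can be reproduced with a few lines of C. This is only a back-of-the-envelope helper; `cycles_per_pixel()` and its parameters are names made up for this illustration, and `coverage` is the fraction of the screen that is actually refreshed per frame.

```c
#include <stdint.h>

/* CPU cycles available per drawn pixel.
 * cpu_hz   : core clock in Hz
 * hor, ver : screen resolution in pixels
 * coverage : fraction of the screen redrawn per frame (1.0 = whole screen)
 * fps      : measured frames per second */
double cycles_per_pixel(double cpu_hz, uint32_t hor, uint32_t ver,
                        double coverage, double fps)
{
    double pixels_per_second = (double)hor * (double)ver * coverage * fps;
    return cpu_hz / pixels_per_second;
}
```

With the numbers from this thread, cycles_per_pixel(200e6, 800, 480, 0.5, 145) comes out at roughly 7.2, and at roughly 3.6 if the whole screen were counted instead of half of it.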

Can you try this:

  • Use only 1 frame buffer
  • Set one 1/10 screen-sized display buffer for LVGL.
  • in flush_cb just memcpy the rendered image to the one frame buffer
  • ignore VSYNC now

I tried to do this in version 6. The speed is about the same, but very ugly on animations, swipe, etc.

Could you try in v7 too?

In general, is the speed the same in v6 and v7?

Hi!
I seem to have found the cause of the slowdown on my controller. In general terms, it was mistakes in the MMU settings. Now I get the following benchmark results:

 Benchmark report:
 Rectangle                                67  
 Rectangle + opa                          33  
 Rectangle rounded                        59  
 Rectangle rounded + opa                  31  
 Circle                                   34  
 Circle + opa                             10  
 Border                                   78  
 Border + opa                             77  
 Border rounded                           77  
 Border rounded + opa                     77  
 Circle border                            32  
 Circle border + opa                      23  
 Border top                               77  
 Border top + opa                         77  
 Border left                              59  
 Border left + opa                        59  
 Border top + left                        48  
 Border top + left + opa                  32  
 Border left + right                      43  
 Border left + right + opa                31  
 Border top + bottom                      77  
 Border top + bottom + opa                77  
 Shadow small                             22  
 Shadow small + opa                       23  
 Shadow small offset                      19  
 Shadow small offset + opa                18  
 Shadow large                             17  
 Shadow large + opa                       15  
 Shadow large offset                      13  
 Shadow large offset + opa                14  
 Image RGB                                71  
 Image RGB + opa                          34  
 Image ARGB                               70  
 Image ARGB + opa                         34  
 Image chorma keyed                       52  
 Image chorma keyed + opa                 34  
 Image indexed                            34  
 Image indexed + opa                      23  
 Image alpha only                         35  
 Image alpha only + opa                   23  
 Image RGB recolor                        27  
 Image RGB recolor + opa                  22  
 Image ARGB recolor                       34  
 Image ARGB recolor + opa                 23  
 Image chorma keyed recolor               34  
 Image chorma keyed recolor + opa         23  
 Image indexed recolor                    23  
 Image indexed recolor + opa              22  
 Image RGB rotate                         22  
 Image RGB rotate + opa                   15  
 Image RGB rotate anti aliased            9   
 Image RGB rotate anti aliased + opa      7   
 Image ARGB rotate                        22  
 Image ARGB rotate + opa                  21  
 Image ARGB rotate anti aliased           10  
 Image ARGB rotate anti aliased + opa     9   
 Image RGB zoom                           34  
 Image RGB zoom + opa                     23  
 Image RGB zoom anti aliased              14  
 Image RGB zoom anti aliased + opa        12  
 Image ARGB zoom                          33  
 Image ARGB zoom + opa                    23  
 Image ARGB zoom anti aliased             14  
 Image ARGB zoom anti aliased + opa       14  
 Text small                               34  
 Text small + opa                         34  
 Text medium                              35  
 Text medium + opa                        34  
 Text large                               35  
 Text large + opa                         33  
 Text small compressed                    28  
 Text small compressed + opa              26  
 Text medium compressed                   23  
 Text medium compressed + opa             23  
 Text large compressed                    14  
 Text large compressed + opa              14  
 Line                                     40  
 Line + opa                               36  
 Arc think                                34  
 Arc think + opa                          30  
 Arc thick                                34  
 Arc thick + opa                          28  
 Substr. rectangle                        33  
 Substr. rectangle + opa                  28  
 Substr. border                           70  
 Substr. border + opa                     70  
 Substr. shadow                           13  
 Substr. shadow + opa                     14  
 Substr. image                            34  
 Substr. image + opa                      25  
 Substr. line                             34  
 Substr. line + opa                       34  

So it is of course much better, but still not up to the Cortex-M7. It is not clear why.

Do you have a chance to run a profiler to see what the MCU does?

Unfortunately, no. A JTAG adapter that can profile the program is prohibitively expensive, and it connects via the ETM interface, which my board does not have. The main problem was that the controller's internal SRAM must also be cached, even though it is documented as memory that runs at the system frequency without wait states. Since the stacks are located in this memory, the speed dropped catastrophically.

So in the end, how fast is the internal SRAM? And how fast can you access the code?
If program reads are also a bottleneck, you can set LV_ATTRIBUTE_FAST_MEM. A few functions are prefixed with this to allow placing them into internal RAM.
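A rough sketch of how such a definition might look in lv_conf.h. The assumptions here are mine: the IAR branch relies on the `__ramfunc` keyword, `.fast_code` is a hypothetical section name that the linker script would have to map to internal RAM, and `fill_px()` is a made-up function that only shows where the prefix goes.

```c
#include <stdint.h>

/* Pick a placement attribute per toolchain; fall back to nothing. */
#if defined(__ICCARM__)
#  define LV_ATTRIBUTE_FAST_MEM __ramfunc
#elif defined(__GNUC__) && defined(__ELF__)
#  define LV_ATTRIBUTE_FAST_MEM __attribute__((section(".fast_code")))
#else
#  define LV_ATTRIBUTE_FAST_MEM   /* no special placement */
#endif

/* Hypothetical hot loop, prefixed the same way as the LVGL functions are. */
LV_ATTRIBUTE_FAST_MEM void fill_px(uint16_t *dst, uint16_t color, uint32_t count)
{
    while (count--) *dst++ = color;
}
```

Whether this helps depends on how slow instruction fetches really are on the target; with the instruction cache enabled the gain may be small.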

So far, with full optimization and double buffering, the following results are obtained:

 Benchmark report:
 Rectangle                                68  
 Rectangle + opa                          33  
 Rectangle rounded                        59  
 Rectangle rounded + opa                  31  
 Circle                                   32  
 Circle + opa                             10  
 Border                                   77  
 Border + opa                             77  
 Border rounded                           77  
 Border rounded + opa                     77  
 Circle border                            34  
 Circle border + opa                      23  
 Border top                               78  
 Border top + opa                         77  
 Border left                              59  
 Border left + opa                        59  
 Border top + left                        45  
 Border top + left + opa                  32  
 Border left + right                      39  
 Border left + right + opa                31  
 Border top + bottom                      77  
 Border top + bottom + opa                77  
 Shadow small                             22  
 Shadow small + opa                       23  
 Shadow small offset                      20  
 Shadow small offset + opa                19  
 Shadow large                             15  
 Shadow large + opa                       15  
 Shadow large offset                      13  
 Shadow large offset + opa                14  
 Image RGB                                71  
 Image RGB + opa                          34  
 Image ARGB                               70  
 Image ARGB + opa                         35  
 Image chorma keyed                       54  
 Image chorma keyed + opa                 31  
 Image indexed                            33  
 Image indexed + opa                      23  
 Image alpha only                         35  
 Image alpha only + opa                   23  
 Image RGB recolor                        29  
 Image RGB recolor + opa                  22  
 Image ARGB recolor                       34  
 Image ARGB recolor + opa                 23  
 Image chorma keyed recolor               34  
 Image chorma keyed recolor + opa         23  
 Image indexed recolor                    23  
 Image indexed recolor + opa              22  
 Image RGB rotate                         22  
 Image RGB rotate + opa                   15  
 Image RGB rotate anti aliased            9   
 Image RGB rotate anti aliased + opa      7   
 Image ARGB rotate                        22  
 Image ARGB rotate + opa                  21  
 Image ARGB rotate anti aliased           10  
 Image ARGB rotate anti aliased + opa     9   
 Image RGB zoom                           35  
 Image RGB zoom + opa                     23  
 Image RGB zoom anti aliased              14  
 Image RGB zoom anti aliased + opa        11  
 Image ARGB zoom                          33  
 Image ARGB zoom + opa                    22  
 Image ARGB zoom anti aliased             14  
 Image ARGB zoom anti aliased + opa       14  
 Text small                               34  
 Text small + opa                         34  
 Text medium                              35  
 Text medium + opa                        34  
 Text large                               34  
 Text large + opa                         34  
 Text small compressed                    28  
 Text small compressed + opa              28  
 Text medium compressed                   23  
 Text medium compressed + opa             23  
 Text large compressed                    15  
 Text large compressed + opa              14  
 Line                                     38  
 Line + opa                               36  
 Arc think                                34  
 Arc think + opa                          31  
 Arc thick                                33  
 Arc thick + opa                          28  
 Substr. rectangle                        33  
 Substr. rectangle + opa                  30  
 Substr. border                           77  
 Substr. border + opa                     77  
 Substr. shadow                           13  
 Substr. shadow + opa                     14  
 Substr. image                            34  
 Substr. image + opa                      24  
 Substr. line                             34  
 Substr. line + opa                       35  
 Substr. arc                              26  
 Substr. arc + opa                        26  
 Substr. text                             30  
 Substr. text + opa                       27  

This will not help in my case. With caching enabled there is not much difference between internal SRAM and SDRAM, especially with sequential access. And there is no other memory. Before this controller I ran LVGL on the AT91SAM9G45, but things were less demanding there, and the screen was significantly smaller: 480x272. In version 6, for 16-bit color, the following implementation of the blender gave a significant speedup:

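static inline lv_color_t lv_color_mix(lv_color_t c1, lv_color_t c2, uint8_t mix)
{
    lv_color_t ret;
    mix >>= 4;                                  // reduce the ratio from 0..255 to 0..15
    uint32_t src = c2.full;
    uint32_t dst = c1.full;
    // Spread RGB565 over 32 bits (G in the upper half, R and B in the lower half)
    // so that every channel has free bits above it for the multiply
    src = (src | (src << 16)) & 0x07E0F81F;
    dst = (dst | (dst << 16)) & 0x07E0F81F;
    // Blend all three channels with one multiply-add, then drop the fractional bits
    dst = ((dst * mix + src * (15 - mix)) >> 4) & 0x07E0F81F;
    ret.full = dst | (dst >> 16);               // fold back into RGB565
    return ret;
}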

What are the results with one frame buffer and an LVGL display buffer (e.g. 1/8 screen sized) in internal RAM, and lv_memcpy in disp_flush?

Wow, that looks interesting. Can you describe in a few sentences how it works, or send a link about it?

It might still be faster, but maybe you can’t see the difference due to other issues. E.g. a 5 ms gain is not that much compared to 100 ms, but a lot compared to 30 ms.

This is a well-known trick from the era before graphics accelerators. Embryonic SIMD, so to speak :smile:. In essence you get a vector operation. The resolution of the transparency is very coarse, but the eye is a very imperfect instrument and does not notice such things. I could not find where I originally saw it. I remember it was somewhere on https://www.compuphase.com.


https://issue.life/questions/18937701
// Fast RGB565 pixel blending
// Found in a pull request for the Adafruit framebuffer library. Clever!
// https://github.com/tricorderproject/arducordermini/pull/1/files#diff-d22a481ade4dbb4e41acc4d7c77f683d
color alphaBlendRGB565( uint32_t fg, uint32_t bg, uint8_t alpha ){
    // Alpha converted from [0..255] to [0..31]
    alpha = ( alpha + 4 ) >> 3;

    // Converts  0000000000000000rrrrrggggggbbbbb
    //     into  00000gggggg00000rrrrr000000bbbbb
    // with mask 00000111111000001111100000011111
    // This is useful because it makes space for a parallel fixed-point multiply
    bg = (bg | (bg << 16)) & 0b00000111111000001111100000011111;
    fg = (fg | (fg << 16)) & 0b00000111111000001111100000011111;

    // This implements the linear interpolation formula: result = bg * (1.0 - alpha) + fg * alpha
    // This can be factorized into: result = bg + (fg - bg) * alpha
    // alpha is in Q1.5 format, so 0.0 is represented by 0, and 1.0 is represented by 32
    uint32_t result = (fg - bg) * alpha; // parallel fixed-point multiply of all components
    result >>= 5;
    result += bg;
    result &= 0b00000111111000001111100000011111; // mask out fractional parts
    return (color)((result >> 16) | result); // contract result
}
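To check that the packed arithmetic really behaves like three independent channel blends, here is a self-contained variant with a plain per-channel reference next to it. It is a sketch, not the code from either snippet above: it uses the sum form `fg * a + bg * (32 - a)` so that no subtraction is involved, and the names `blend_rgb565()` and `blend_rgb565_ref()` are made up for this example.

```c
#include <stdint.h>

#define RGB565_SPREAD_MASK 0x07E0F81Fu   /* 00000gggggg00000rrrrr000000bbbbb */

/* Packed blend: alpha 0 gives bg, alpha 255 gives fg. */
uint16_t blend_rgb565(uint16_t fg, uint16_t bg, uint8_t alpha)
{
    uint32_t a = ((uint32_t)alpha + 4u) >> 3;   /* 0..255 -> 0..32 */
    uint32_t f = ((uint32_t)fg | ((uint32_t)fg << 16)) & RGB565_SPREAD_MASK;
    uint32_t b = ((uint32_t)bg | ((uint32_t)bg << 16)) & RGB565_SPREAD_MASK;
    /* Each channel has at least 5 free bits above it, so a weight of up to 32
     * cannot carry into the neighbouring channel. */
    uint32_t r = ((f * a + b * (32u - a)) >> 5) & RGB565_SPREAD_MASK;
    return (uint16_t)(r | (r >> 16));           /* fold back into RGB565 */
}

/* Reference: the same formula applied to one channel at a time. */
uint16_t blend_rgb565_ref(uint16_t fg, uint16_t bg, uint8_t alpha)
{
    uint32_t a  = ((uint32_t)alpha + 4u) >> 3;
    uint32_t r  = ((((uint32_t)fg >> 11) & 31u) * a + (((uint32_t)bg >> 11) & 31u) * (32u - a)) >> 5;
    uint32_t g  = ((((uint32_t)fg >> 5)  & 63u) * a + (((uint32_t)bg >> 5)  & 63u) * (32u - a)) >> 5;
    uint32_t bl = (((uint32_t)fg & 31u) * a + ((uint32_t)bg & 31u) * (32u - a)) >> 5;
    return (uint16_t)((r << 11) | (g << 5) | bl);
}
```

The packed version does one multiply-add for all three channels, which is where the speedup comes from on a core without SIMD; the price is that alpha is reduced to 33 levels.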

A 1/8-screen buffer is too large to place in internal memory (800 * 480 / 8 * 2 = 96000 bytes) when there is only 128k of it in total. Therefore the buffer is in SDRAM. I don’t understand why double buffering (second column) is slower than one frame buffer with the 1/8-screen buffer (first column). In addition, in the slower case the results look clamped (i.e., different tests give the same values):

 Benchmark report:
 Rectangle                                78  67
 Rectangle + opa                          42  33
 Rectangle rounded                        70  59
 Rectangle rounded + opa                  38  31
 Circle                                   43  32
 Circle + opa                             10  10
 Border                                   100 78
 Border + opa                             92  77
 Border rounded                           90  77
 Border rounded + opa                     84  77
 Circle border                            43  33
 Circle border + opa                      34  23
 Border top                               102 77
 Border top + opa                         100 77
 Border left                              66  59
 Border left + opa                        65  59
 Border top + left                        59  50
 Border top + left + opa                  42  33
 Border left + right                      58  40
 Border left + right + opa                41  33
 Border top + bottom                      96  77
 Border top + bottom + opa                91  77
 Shadow small                             32  22
 Shadow small + opa                       32  23
 Shadow small offset                      22  20
 Shadow small offset + opa                20  19
 Shadow large                             15  18
 Shadow large + opa                       15  14
 Shadow large offset                      13  13
 Shadow large offset + opa                13  14
 Image RGB                                154 71
 Image RGB + opa                          48  34
 Image ARGB                               82  70
 Image ARGB + opa                         51  35
 Image chorma keyed                       74  52
 Image chorma keyed + opa                 46  34
 Image indexed                            47  34
 Image indexed + opa                      34  23
 Image alpha only                         51  35
 Image alpha only + opa                   37  23
 Image RGB recolor                        44  27
 Image RGB recolor + opa                  26  22
 Image ARGB recolor                       48  34
 Image ARGB recolor + opa                 36  23
 Image chorma keyed recolor               53  34
 Image chorma keyed recolor + opa         37  23
 Image indexed recolor                    38  23
 Image indexed recolor + opa              28  22
 Image RGB rotate                         31  22
 Image RGB rotate + opa                   21  15
 Image RGB rotate anti aliased            10  9 
 Image RGB rotate anti aliased + opa      8   7 
 Image ARGB rotate                        28  22
 Image ARGB rotate + opa                  24  21
 Image ARGB rotate anti aliased           10  10
 Image ARGB rotate anti aliased + opa     10  9 
 Image RGB zoom                           49  34
 Image RGB zoom + opa                     33  23
 Image RGB zoom anti aliased              16  14
 Image RGB zoom anti aliased + opa        14  11
 Image ARGB zoom                          44  33
 Image ARGB zoom + opa                    37  23
 Image ARGB zoom anti aliased             16  14
 Image ARGB zoom anti aliased + opa       16  14
 Text small                               55  34
 Text small + opa                         51  34
 Text medium                              55  35
 Text medium + opa                        52  34
 Text large                               55  35
 Text large + opa                         52  34
 Text small compressed                    39  28
 Text small compressed + opa              38  28
 Text medium compressed                   32  23
 Text medium compressed + opa             30  23
 Text large compressed                    17  14
 Text large compressed + opa              15  14
 Line                                     67  42
 Line + opa                               61  36
 Arc think                                48  36
 Arc think + opa                          44  30
 Arc thick                                47  34
 Arc thick + opa                          42  27
 Substr. rectangle                        39  33
 Substr. rectangle + opa                  37  30
 Substr. border                           84  77
 Substr. border + opa                     83  77
 Substr. shadow                           14  13
 Substr. shadow + opa                     15  14
 Substr. image                            51  34
 Substr. image + opa                      41  25
 Substr. line                             56  34
 Substr. line + opa                       56  35
 Substr. arc                              40  26
 Substr. arc + opa                        39  26
 Substr. text                             44  32
 Substr. text + opa                       44  25

That’s pretty cool! Thanks for sharing! I’m thinking about where else we can take advantage of this trick!

Then try 1/10 or 1/15 screen size. It would be important to see how much it matters if the display buffer is placed in internal RAM.

Double buffering is slower because LVGL needs to synchronize the frame buffers, which involves an extra copy. Let’s say you have 2 texts on the screen, "a" and "b".

  1. Initially both FBs contain the same text.
  2. You change "a" to "A". The FB currently used for drawing will contain "A" and "b", but the other one still has "a" and "b".
  3. The FBs are swapped; now "A" and "b" are displayed.
  4. You change "b" to "B". The current drawing buffer still contains "a" and "b", so simply changing "b" to "B" is not enough.
  5. To solve this, all changed areas are copied to the new draw buffer after the flush.

On a PC the whole screen is usually redrawn, so there is no such issue.
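The five steps can be played through with a toy model. This is only an illustration with invented names (`disp_model_t`, `change_text()`, `displayed()`); each "frame buffer" is just two characters, and the `sync` flag switches the extra copy on or off.

```c
typedef struct {
    char fb[2][2];   /* two frame buffers, each holding the two texts */
    int  draw;       /* index of the buffer currently drawn into */
} disp_model_t;

/* Buffer that is on screen right now (the one not being drawn into). */
const char *displayed(const disp_model_t *d)
{
    return d->fb[d->draw ^ 1];
}

/* Redraw one text, swap the buffers, and optionally copy the changed
 * part into the new draw buffer so that both buffers stay identical. */
void change_text(disp_model_t *d, int slot, char value, int sync)
{
    int shown = d->draw;
    d->fb[shown][slot] = value;                      /* render into the draw buffer */
    d->draw ^= 1;                                    /* swap: rendered buffer goes on screen */
    if (sync) {
        d->fb[d->draw][slot] = d->fb[shown][slot];   /* the extra copy */
    }
}
```

Starting from "a", "b" in both buffers and changing first "a" to "A" and then "b" to "B", the model shows "A", "B" with the copy enabled, and the stale "a", "B" without it. That copy is the cost that single buffering does not have.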

Hello!
I tried turning off VSYNC with double buffering. The change in frame rate is extremely small. I also tried placing a 1/16-screen buffer in the internal SRAM. Animations became absolutely terrible, but the speed did not change. There is a simple explanation for this: both the internal and the external memory now go through the MMU / cache. If the internal SRAM is not cached, it is very slow. I did not find an explanation for this in the controller documentation. Probably this memory is equivalent to L2, i.e. not closely coupled to the core. ARM does have a memory type of that kind, TCM (Tightly-Coupled Memory), but this is not it.