Hello everyone.
Does anyone have experience using LVGL version 7 on an ARM platform with a high-resolution screen and 16-bit color, without a graphics accelerator (GPU)? On my hardware (ARM Cortex-A5 at 498 MHz, DDR2 SDRAM at 166 MHz, memory-mapped 800x480 RGB565 TFT LCD) I get very strange results. For example, here is the benchmark demo: the second column is fps when built without any optimization, the third column is fps when built with full speed optimization:
Benchmark report:
Rectangle 31 32
Rectangle + opa 6 7
Rectangle rounded 20 21
Rectangle rounded + opa 5 6
Circle 6 10
Circle + opa 4 5
Border 33 60
Border + opa 30 34
Border rounded 23 33
Border rounded + opa 21 32
Circle border 6 10
Circle border + opa 6 8
Border top 32 60
Border top + opa 32 40
Border left 12 19
Border left + opa 11 14
Border top + left 10 14
Border top + left + opa 9 13
Border left + right 10 14
Border left + right + opa 9 12
Border top + bottom 30 33
Border top + bottom + opa 23 34
Shadow small 6 8
Shadow small + opa 7 8
Shadow small offset 4 5
Shadow small offset + opa 4 5
Shadow large 4 4
Shadow large + opa 3 4
Shadow large offset 2 3
Shadow large offset + opa 2 3
Image RGB 38 71
Image RGB + opa 7 7
Image ARGB 15 22
Image ARGB + opa 10 11
Image chorma keyed 10 11
Image chorma keyed + opa 6 7
Image indexed 5 9
Image indexed + opa 4 6
Image alpha only 8 13
Image alpha only + opa 5 8
Image RGB recolor 4 6
Image RGB recolor + opa 3 3
Image ARGB recolor 6 8
Image ARGB recolor + opa 5 5
Image chorma keyed recolor 6 8
Image chorma keyed recolor + opa 4 5
Image indexed recolor 4 6
Image indexed recolor + opa 3 4
Image RGB rotate 2 3
Image RGB rotate + opa 1 2
Image RGB rotate anti aliased 0 1
Image RGB rotate anti aliased + opa 0 1
Image ARGB rotate 2 3
Image ARGB rotate + opa 2 3
Image ARGB rotate anti aliased 0 1
Image ARGB rotate anti aliased + opa 0 1
Image RGB zoom 3 6
Image RGB zoom + opa 3 4
Image RGB zoom anti aliased 1 2
Image RGB zoom anti aliased + opa 1 1
Image ARGB zoom 3 5
Image ARGB zoom + opa 3 4
Image ARGB zoom anti aliased 1 1
Image ARGB zoom anti aliased + opa 1 2
Text small 8 12
Text small + opa 7 11
Text medium 8 12
Text medium + opa 7 11
Text large 8 12
Text large + opa 7 11
Text small compressed 7 10
Text small compressed + opa 7 10
Text medium compressed 5 9
Text medium compressed + opa 5 8
Text large compressed 3 5
Text large compressed + opa 3 4
Line 11 16
Line + opa 11 15
Arc think 9 13
Arc think + opa 9 12
Arc thick 9 13
Arc thick + opa 8 11
Substr. rectangle 4 4
Substr. rectangle + opa 4 4
Substr. border 20 22
Substr. border + opa 21 22
Substr. shadow 2 3
Substr. shadow + opa 2 2
Substr. image 6 8
Substr. image + opa 5 6
Substr. line 9 11
Substr. line + opa 9 11
Substr. arc 7 9
Substr. arc + opa 7 9
Substr. text 5 7
Substr. text + opa 5 6
The data shows that in some of the heavy tests optimization has almost no effect, while in others it doubles the speed, which is of course still not enough, but at least explainable. The compiler is IAR ARM 7.8. Earlier, on version 6, I used my own implementation of color mixing, which gave a 30% speedup with heavy use of shadows and transparency. Now, in the same setup, the change in speed is comparable to the measurement error.
I use full double buffering. Both buffers are located in a cached region of DDR2; however, disabling caching for these regions makes little difference: the frame rate drops by 1-2 fps. The screen update function is implemented in the standard way: the DMA pages are switched, with an interrupt generated on VSYNC.
It should be faster to place a smaller (e.g. 1/5 screen sized) buffer in internal RAM, because there are a lot of reads and writes to it, so a shorter access time is critical. Besides, in true double buffering mode, when one frame buffer is flushed, LVGL needs to copy the updated areas into the other to keep them synchronized.
With an internal buffer, just memcpy the buffer to the inactive frame buffer in the flush function.
In v7 I’ve added lv_disp_flush_is_last(). With this you can check whether it was the last refreshed area. If so, you are good to swap the frame buffers.
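The flow described above could look roughly like the following self-contained model. It does not call LVGL: `area_t`, `flush_model()` and the `is_last` parameter are hypothetical stand-ins for `lv_area_t`, the driver's flush callback and the result of `lv_disp_flush_is_last()`. A real callback would also have to call `lv_disp_flush_ready()` and program the LCD controller instead of just flipping an index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FB_WIDTH  800
#define FB_HEIGHT 480

/* Hypothetical stand-in for lv_area_t: inclusive pixel coordinates. */
typedef struct { int x1, y1, x2, y2; } area_t;

/* Two full frame buffers; on the target these would live in SDRAM. */
static uint16_t framebuffers[2][FB_WIDTH * FB_HEIGHT];
static int inactive_fb = 1; /* index of the buffer that is NOT being scanned out */

/* Copy one rendered area from the small draw buffer into the inactive frame buffer. */
static void copy_area_to_fb(const area_t *a, const uint16_t *px)
{
    assert(a->x1 >= 0 && a->y1 >= 0 && a->x2 < FB_WIDTH && a->y2 < FB_HEIGHT);
    assert(a->x1 <= a->x2 && a->y1 <= a->y2);

    int w = a->x2 - a->x1 + 1;
    uint16_t *fb = framebuffers[inactive_fb];
    for (int y = a->y1; y <= a->y2; y++) {
        /* The draw buffer holds the area as tightly packed rows. */
        memcpy(&fb[y * FB_WIDTH + a->x1], px, (size_t)w * sizeof(uint16_t));
        px += w;
    }
}

/* Model of a flush callback: copy the area, swap buffers after the last one. */
static void flush_model(const area_t *a, const uint16_t *px, bool is_last)
{
    copy_area_to_fb(a, px);
    if (is_last) {
        /* Real hardware: set the new scan-out address here and wait for VSYNC. */
        inactive_fb ^= 1;
    }
}
```

Note that this sketch only copies the areas redrawn in the current refresh, so after the swap the newly inactive buffer does not contain them; keeping the two frame buffers in sync is a separate step.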
I would like to clarify: is the STM32F769 module one of the Discovery boards? If so, which one? I just cannot understand such a huge difference in performance. After all, the LCD buffer is located in SDRAM and the update is done via DMA?
Have 1 frame buffer in external RAM (it’s not the display controller’s frame buffer)
Have two 1/10 display sized buffers for LVGL
In the flush_cb I copy the areas to the frame buffer in SDRAM with DMA
After the last area it waits for the tearing effect signal and sends the content of the SDRAM frame buffer to the frame buffer in the display controller. (This part is tricky in practice because the DSI is too slow to send the frame buffer in one piece.)
The performance is not much worse without GPU.
I’ve updated the master branch with a related optimization. Does it improve the performance?
All the same, I do not understand how this can be. When simulating on a computer (Core-i5, 3.2GHz, Win64, SDL2, 800x480x16), I get 193 fps on the rectangles and 109 fps on the rectangles with transparency. How can you get a similar result (145 fps) on a 200 MHz controller? The total amount of data is 800x480x2x145 = 111360000 bytes per second. Dividing the clock frequency by this volume gives 1.79 cycles per byte, i.e. about 3.6 cycles per pixel. Even taking into account the superscalar M7 core, this is a very strange result. The recent changes have had practically no effect on the speed on my controller.
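To make the units of this estimate explicit, here is a small sketch of the arithmetic. The function names are made up for illustration, and it assumes the whole screen is redrawn on every frame, as the estimate above does.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes written per second if every frame redraws the whole screen. */
static double bytes_per_second(uint32_t width, uint32_t height,
                               uint32_t bytes_per_px, double fps)
{
    return (double)width * height * bytes_per_px * fps;
}

/* CPU cycles available per pixel under the same assumption. */
static double cycles_per_pixel(double cpu_hz, uint32_t width,
                               uint32_t height, double fps)
{
    assert(width > 0 && height > 0 && fps > 0.0);
    return cpu_hz / ((double)width * height * fps);
}
```

With the numbers from the post (200 MHz, 800x480, 2 bytes per pixel, 145 fps) this gives 111360000 bytes per second and roughly 3.6 cycles per pixel, which is twice the per-byte figure.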
Hi!
I seem to have found the cause of the slowdown on my controller. In general terms, it was mistakes in the MMU settings. Now we get the following benchmark results:
Benchmark report:
Rectangle 67
Rectangle + opa 33
Rectangle rounded 59
Rectangle rounded + opa 31
Circle 34
Circle + opa 10
Border 78
Border + opa 77
Border rounded 77
Border rounded + opa 77
Circle border 32
Circle border + opa 23
Border top 77
Border top + opa 77
Border left 59
Border left + opa 59
Border top + left 48
Border top + left + opa 32
Border left + right 43
Border left + right + opa 31
Border top + bottom 77
Border top + bottom + opa 77
Shadow small 22
Shadow small + opa 23
Shadow small offset 19
Shadow small offset + opa 18
Shadow large 17
Shadow large + opa 15
Shadow large offset 13
Shadow large offset + opa 14
Image RGB 71
Image RGB + opa 34
Image ARGB 70
Image ARGB + opa 34
Image chorma keyed 52
Image chorma keyed + opa 34
Image indexed 34
Image indexed + opa 23
Image alpha only 35
Image alpha only + opa 23
Image RGB recolor 27
Image RGB recolor + opa 22
Image ARGB recolor 34
Image ARGB recolor + opa 23
Image chorma keyed recolor 34
Image chorma keyed recolor + opa 23
Image indexed recolor 23
Image indexed recolor + opa 22
Image RGB rotate 22
Image RGB rotate + opa 15
Image RGB rotate anti aliased 9
Image RGB rotate anti aliased + opa 7
Image ARGB rotate 22
Image ARGB rotate + opa 21
Image ARGB rotate anti aliased 10
Image ARGB rotate anti aliased + opa 9
Image RGB zoom 34
Image RGB zoom + opa 23
Image RGB zoom anti aliased 14
Image RGB zoom anti aliased + opa 12
Image ARGB zoom 33
Image ARGB zoom + opa 23
Image ARGB zoom anti aliased 14
Image ARGB zoom anti aliased + opa 14
Text small 34
Text small + opa 34
Text medium 35
Text medium + opa 34
Text large 35
Text large + opa 33
Text small compressed 28
Text small compressed + opa 26
Text medium compressed 23
Text medium compressed + opa 23
Text large compressed 14
Text large compressed + opa 14
Line 40
Line + opa 36
Arc think 34
Arc think + opa 30
Arc thick 34
Arc thick + opa 28
Substr. rectangle 33
Substr. rectangle + opa 28
Substr. border 70
Substr. border + opa 70
Substr. shadow 13
Substr. shadow + opa 14
Substr. image 34
Substr. image + opa 25
Substr. line 34
Substr. line + opa 34
So it is of course much better, but still not up to the Cortex-M7. It is not clear why.
Unfortunately no. A JTAG adapter capable of profiling the program is outrageously expensive, and it connects via the ETM interface, which is not available on my board. The main problem was that the internal SRAM of the controller must also be cached, even though it is specified as memory running at the system frequency with no wait states. Since the stacks are located in this memory, the result was a catastrophic drop in speed.
So how fast is the internal SRAM in the end? And how fast can you access the code?
If program reads are also a bottleneck, you can set LV_ATTRIBUTE_FAST_MEM. A few functions are prefixed with it so that they can be placed in internal RAM.
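As a rough illustration, the macro could be defined in lv_conf.h along these lines for a GCC-style toolchain. The section name `.fast_code` is hypothetical: the linker script would have to map it to internal RAM and the startup code would have to copy it there. Other compilers (IAR included) use their own placement syntax, so treat this as a sketch, not a drop-in setting.

```c
/* lv_conf.h fragment (sketch). ".fast_code" is an assumed section name that
 * must be defined in the linker script and located in internal RAM. */
#if defined(__GNUC__)
#define LV_ATTRIBUTE_FAST_MEM __attribute__((section(".fast_code")))
#else
/* Unknown toolchain: leave the attribute empty so the code runs from its default location. */
#define LV_ATTRIBUTE_FAST_MEM
#endif
```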
So far, with full optimization and double buffering, the following results are obtained:
Benchmark report:
Rectangle 68
Rectangle + opa 33
Rectangle rounded 59
Rectangle rounded + opa 31
Circle 32
Circle + opa 10
Border 77
Border + opa 77
Border rounded 77
Border rounded + opa 77
Circle border 34
Circle border + opa 23
Border top 78
Border top + opa 77
Border left 59
Border left + opa 59
Border top + left 45
Border top + left + opa 32
Border left + right 39
Border left + right + opa 31
Border top + bottom 77
Border top + bottom + opa 77
Shadow small 22
Shadow small + opa 23
Shadow small offset 20
Shadow small offset + opa 19
Shadow large 15
Shadow large + opa 15
Shadow large offset 13
Shadow large offset + opa 14
Image RGB 71
Image RGB + opa 34
Image ARGB 70
Image ARGB + opa 35
Image chorma keyed 54
Image chorma keyed + opa 31
Image indexed 33
Image indexed + opa 23
Image alpha only 35
Image alpha only + opa 23
Image RGB recolor 29
Image RGB recolor + opa 22
Image ARGB recolor 34
Image ARGB recolor + opa 23
Image chorma keyed recolor 34
Image chorma keyed recolor + opa 23
Image indexed recolor 23
Image indexed recolor + opa 22
Image RGB rotate 22
Image RGB rotate + opa 15
Image RGB rotate anti aliased 9
Image RGB rotate anti aliased + opa 7
Image ARGB rotate 22
Image ARGB rotate + opa 21
Image ARGB rotate anti aliased 10
Image ARGB rotate anti aliased + opa 9
Image RGB zoom 35
Image RGB zoom + opa 23
Image RGB zoom anti aliased 14
Image RGB zoom anti aliased + opa 11
Image ARGB zoom 33
Image ARGB zoom + opa 22
Image ARGB zoom anti aliased 14
Image ARGB zoom anti aliased + opa 14
Text small 34
Text small + opa 34
Text medium 35
Text medium + opa 34
Text large 34
Text large + opa 34
Text small compressed 28
Text small compressed + opa 28
Text medium compressed 23
Text medium compressed + opa 23
Text large compressed 15
Text large compressed + opa 14
Line 38
Line + opa 36
Arc think 34
Arc think + opa 31
Arc thick 33
Arc thick + opa 28
Substr. rectangle 33
Substr. rectangle + opa 30
Substr. border 77
Substr. border + opa 77
Substr. shadow 13
Substr. shadow + opa 14
Substr. image 34
Substr. image + opa 24
Substr. line 34
Substr. line + opa 35
Substr. arc 26
Substr. arc + opa 26
Substr. text 30
Substr. text + opa 27
This will not help in my case. With caching enabled, there is not much difference between internal SRAM and SDRAM, especially with sequential access. And there is no other memory. Before this controller, I ran LVGL on the AT91SAM9G45, but things were less demanding there, and the screen was significantly smaller: 480x272. In version 6, for 16-bit color, the following blender implementation gave a significant speedup:
What are the results with one frame buffer and an LVGL display buffer (e.g. 1/8 screen sized) in internal RAM, and lv_memcpy in disp_flush?
Wow, that looks interesting. Can you describe in a few sentences how it works, or send a link about it?
It might still be faster, but maybe you can’t see the difference due to other issues. E.g. a 5 ms gain is not that much compared to 100 ms, but a lot compared to 30 ms.
This is a well-known trick from the era before graphics accelerators: embryonic SIMD operations, so to speak. In essence, you get a vector operation. The resolution of the transparency value is very coarse, but the eye is a very imperfect instrument and does not notice such things. I could not find where I originally saw it; I remember it was somewhere on https://www.compuphase.com.
// Fast RGB565 pixel blending
// Found in a pull request for the Adafruit framebuffer library. Clever!
// https://github.com/tricorderproject/arducordermini/pull/1/files#diff-d22a481ade4dbb4e41acc4d7c77f683d
#include <stdint.h>

// RGB565 pixel type; assumed here, the original post does not show its definition
typedef uint16_t color;

color alphaBlendRGB565( uint32_t fg, uint32_t bg, uint8_t alpha ){
    // Alpha converted from [0..255] to [0..32]
    alpha = ( alpha + 4 ) >> 3;
    // Converts 0000000000000000rrrrrggggggbbbbb
    // into     00000gggggg00000rrrrr000000bbbbb
    // with mask 00000111111000001111100000011111 (0x07E0F81F)
    // This is useful because it makes space for a parallel fixed-point multiply
    bg = (bg | (bg << 16)) & 0x07E0F81Fu;
    fg = (fg | (fg << 16)) & 0x07E0F81Fu;
    // This implements the linear interpolation formula: result = bg * (1.0 - alpha) + fg * alpha
    // This can be factorized into: result = bg + (fg - bg) * alpha
    // alpha is in Q1.5 format, so 0.0 is represented by 0, and 1.0 is represented by 32
    uint32_t result = (fg - bg) * alpha; // parallel fixed-point multiply of all components
    result >>= 5;
    result += bg;
    result &= 0x07E0F81Fu; // mask out fractional parts
    return (color)((result >> 16) | result); // contract result
}
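For comparison, a plain per-channel blend with the full 8-bit alpha might look like the sketch below. The function name is made up, and this is not LVGL's own blender; it only shows what the packed version replaces: three separate multiply-and-divide steps per pixel instead of one multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Reference blend: every RGB565 channel is interpolated on its own,
 * using the full 8-bit alpha and rounding to the nearest value. */
static uint16_t blend_rgb565_reference(uint16_t fg, uint16_t bg, uint8_t alpha)
{
    /* Unpack the 5-6-5 channels. */
    uint32_t f_r = (fg >> 11) & 0x1F, f_g = (fg >> 5) & 0x3F, f_b = fg & 0x1F;
    uint32_t b_r = (bg >> 11) & 0x1F, b_g = (bg >> 5) & 0x3F, b_b = bg & 0x1F;
    uint32_t inv = 255u - alpha;

    /* result = (fg * alpha + bg * (255 - alpha)) / 255, rounded */
    uint32_t r = (f_r * alpha + b_r * inv + 127u) / 255u;
    uint32_t g = (f_g * alpha + b_g * inv + 127u) / 255u;
    uint32_t b = (f_b * alpha + b_b * inv + 127u) / 255u;

    /* Sanity check: interpolated channels stay within their bit widths. */
    assert(r <= 0x1F && g <= 0x3F && b <= 0x1F);
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

The packed trick trades the 8-bit alpha for 33 levels and gets all three channels out of a single 32-bit multiply, which is where the speedup on a CPU without SIMD would come from.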
A 1/8 screen buffer cannot be placed in the internal memory (800 * 480 / 8 * 2 = 96000 bytes, and there is only 128k of internal memory), so the buffer is in SDRAM. I don’t understand why double buffering (second column) is slower. In addition, where it is slower, the results are clamped (i.e., different tests give the same values):
Benchmark report:
Rectangle 78 67
Rectangle + opa 42 33
Rectangle rounded 70 59
Rectangle rounded + opa 38 31
Circle 43 32
Circle + opa 10 10
Border 100 78
Border + opa 92 77
Border rounded 90 77
Border rounded + opa 84 77
Circle border 43 33
Circle border + opa 34 23
Border top 102 77
Border top + opa 100 77
Border left 66 59
Border left + opa 65 59
Border top + left 59 50
Border top + left + opa 42 33
Border left + right 58 40
Border left + right + opa 41 33
Border top + bottom 96 77
Border top + bottom + opa 91 77
Shadow small 32 22
Shadow small + opa 32 23
Shadow small offset 22 20
Shadow small offset + opa 20 19
Shadow large 15 18
Shadow large + opa 15 14
Shadow large offset 13 13
Shadow large offset + opa 13 14
Image RGB 154 71
Image RGB + opa 48 34
Image ARGB 82 70
Image ARGB + opa 51 35
Image chorma keyed 74 52
Image chorma keyed + opa 46 34
Image indexed 47 34
Image indexed + opa 34 23
Image alpha only 51 35
Image alpha only + opa 37 23
Image RGB recolor 44 27
Image RGB recolor + opa 26 22
Image ARGB recolor 48 34
Image ARGB recolor + opa 36 23
Image chorma keyed recolor 53 34
Image chorma keyed recolor + opa 37 23
Image indexed recolor 38 23
Image indexed recolor + opa 28 22
Image RGB rotate 31 22
Image RGB rotate + opa 21 15
Image RGB rotate anti aliased 10 9
Image RGB rotate anti aliased + opa 8 7
Image ARGB rotate 28 22
Image ARGB rotate + opa 24 21
Image ARGB rotate anti aliased 10 10
Image ARGB rotate anti aliased + opa 10 9
Image RGB zoom 49 34
Image RGB zoom + opa 33 23
Image RGB zoom anti aliased 16 14
Image RGB zoom anti aliased + opa 14 11
Image ARGB zoom 44 33
Image ARGB zoom + opa 37 23
Image ARGB zoom anti aliased 16 14
Image ARGB zoom anti aliased + opa 16 14
Text small 55 34
Text small + opa 51 34
Text medium 55 35
Text medium + opa 52 34
Text large 55 35
Text large + opa 52 34
Text small compressed 39 28
Text small compressed + opa 38 28
Text medium compressed 32 23
Text medium compressed + opa 30 23
Text large compressed 17 14
Text large compressed + opa 15 14
Line 67 42
Line + opa 61 36
Arc think 48 36
Arc think + opa 44 30
Arc thick 47 34
Arc thick + opa 42 27
Substr. rectangle 39 33
Substr. rectangle + opa 37 30
Substr. border 84 77
Substr. border + opa 83 77
Substr. shadow 14 13
Substr. shadow + opa 15 14
Substr. image 51 34
Substr. image + opa 41 25
Substr. line 56 34
Substr. line + opa 56 35
Substr. arc 40 26
Substr. arc + opa 39 26
Substr. text 44 32
Substr. text + opa 44 25
That’s pretty cool! Thanks for sharing! I’m thinking about where else we can take advantage of this trick!
Then try 1/10 or 1/15 screen size. It’d be important to see how much it matters if the display buffer is placed in internal RAM.
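For reference, the memory needed for such partial buffers at 800x480 with 16-bit color can be worked out with a helper like this one (a hypothetical function, not part of LVGL):

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a draw buffer covering 1/divisor of the screen. */
static uint32_t draw_buf_bytes(uint32_t hor_res, uint32_t ver_res,
                               uint32_t bytes_per_px, uint32_t divisor)
{
    assert(divisor > 0);
    return hor_res * ver_res / divisor * bytes_per_px;
}
```

For this display that gives 96000 bytes for 1/8, 76800 bytes for 1/10 and 51200 bytes for 1/15 of the screen, to be weighed against whatever else (stacks, for instance) has to share the internal RAM.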
Double buffering is slower because LVGL needs to synchronize the frame buffers, which involves an extra copy. Let’s say you have 2 texts on the screen, "a" and "b".
Initial state: both FBs contain the same text
You change "a" to "A". The FB currently used for drawing will contain "A" and "b", but the other one still has "a" and "b".
The FBs are swapped; now "A" and "b" are displayed
You change "b" to "B". The current drawing buffer still contains "a" and "b", so simply changing "b" to "B" is not enough.
To solve this, all changed areas are copied to the new draw buffer after the flush.
On PC, usually the whole screen is redrawn, so there is no such issue.
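The "a"/"b" example above can be replayed with a toy model in which the screen is just two characters. Everything here is made up for illustration (it is not LVGL code); the `sync` flag switches the extra copy on and off.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SCREEN_CELLS 2 /* the toy "screen" holds two characters */

static char fb[2][SCREEN_CELLS + 1];
static int draw_idx; /* buffer being drawn into; the other one is displayed */

/* Both frame buffers start out with the same content: "a" and "b". */
static void reset_screen(void)
{
    strcpy(fb[0], "ab");
    strcpy(fb[1], "ab");
    draw_idx = 0;
}

/* Redraw one cell, swap the buffers, then optionally copy the changed
 * cell into the new draw buffer so both stay synchronized. */
static void update_cell(int cell, char value, bool sync)
{
    assert(cell >= 0 && cell < SCREEN_CELLS);
    fb[draw_idx][cell] = value; /* render the change */
    int shown = draw_idx;       /* this buffer goes on screen */
    draw_idx ^= 1;              /* swap */
    if (sync)
        fb[draw_idx][cell] = fb[shown][cell]; /* the extra copy */
}

/* Content of the buffer currently on screen. */
static const char *displayed(void)
{
    return fb[draw_idx ^ 1];
}
```

Changing "a" to "A" and then "b" to "B" without the copy leaves "aB" on screen, because the second change was drawn over a stale buffer; with the copy the screen shows "AB".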
Hello!
I tried turning off VSYNC with double buffering. The change in frame rate is extremely small. I also tried placing a 1/16 buffer in the internal SRAM. Animations looked absolutely terrible, but the speed did not change. There is a simple explanation for this: both the internal and the external memory are now accessed through the MMU/cache. If the internal SRAM is not cached, it is very slow. I did not find an explanation for this in the controller documentation. Probably this memory is equivalent to L2, i.e. not tightly coupled to the core. ARM does have such a memory type, TCM (Tightly-Coupled Memory), but this is not it.