Enabling/Disabling DMA2D doesn't have any impact on the performance

samirdamadi · October 20, 2021, 11:00am

Hello,

I’m using LVGL on my STM32H750B-DK board. I used this project (which targets the STM32F746G-DK) as a template and ported it to the STM32H750B-DK about 6 or 7 months ago, so the project is based on LVGL v7.

I designed a simple GUI in which 4 images rotate in an infinite loop; the other images and labels have fixed positions on the screen.

Now I’m getting around 20-33 FPS (with compiler optimization set to “none”), and the thing is: enabling or disabling DMA2D (with the LV_USE_GPU_STM32_DMA2D define in lv_conf.h) makes no difference to the FPS or the CPU usage, which is pretty strange.

I checked the draw functions with the debugger and it looks like everything is all right: when DMA2D is enabled, all of its related functions are called and the DMA2D register values are modified, but when it’s disabled, neither of those things happens, which means DMA2D really is disabled.

So what is the problem here? Rendering with DMA2D support should be much faster than software rendering, so why do I see no difference?

I’d much appreciate any help.

Regards,
SAM.

kisvegabor · October 20, 2021, 11:39am

Hi,

I made many measurements when I implemented the DMA2D support, and I found that it usually really isn’t faster than LVGL’s software rendering.

DMA2D supports only very simple operations:

fill an area
copy an area
blend two areas with opacity
and color format conversion during these operations, but we don’t use it

For fill area and copy area, DMA2D apparently cannot work faster than a well-written memset or memcpy.

For blending, DMA2D could be faster, but SW rendering is very well optimized here and has the same performance.
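To make the comparison concrete, here is a minimal sketch of what the software side of "fill" and "blend with opacity" looks like for RGB565 pixels. It is not LVGL's actual implementation; the function names (sw_fill_rgb565, mix_rgb565, sw_blend_rgb565) are made up for illustration, and a real renderer would be more heavily optimized.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a w x h rectangle. Rows of the destination are `stride` pixels apart. */
void sw_fill_rgb565(uint16_t *dst, size_t stride, size_t w, size_t h, uint16_t color)
{
    for (size_t y = 0; y < h; y++) {
        uint16_t *row = dst + y * stride;
        for (size_t x = 0; x < w; x++) {
            row[x] = color;
        }
    }
}

/* Mix two RGB565 pixels per channel: opa = 255 gives fg, opa = 0 gives bg. */
uint16_t mix_rgb565(uint16_t fg, uint16_t bg, uint8_t opa)
{
    uint32_t inv = 255u - opa;
    /* "+ 127" rounds to nearest before the division by 255. */
    uint32_t r = (((fg >> 11) & 0x1Fu) * opa + ((bg >> 11) & 0x1Fu) * inv + 127u) / 255u;
    uint32_t g = (((fg >> 5) & 0x3Fu) * opa + ((bg >> 5) & 0x3Fu) * inv + 127u) / 255u;
    uint32_t b = ((fg & 0x1Fu) * opa + (bg & 0x1Fu) * inv + 127u) / 255u;
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Blend n source pixels onto the destination with a single opacity value. */
void sw_blend_rgb565(uint16_t *dst, const uint16_t *src, size_t n, uint8_t opa)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = mix_rgb565(src[i], dst[i], opa);
    }
}
```

Loops like these operate on a small buffer in internal RAM, so the CPU can run them at full speed, which leaves DMA2D little room to win on these operations.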

It’s important to note that RGB565 with an alpha channel is a custom LVGL format and therefore not supported by DMA2D directly. To use ARGB images with DMA2D, ARGB8888 images need to be used together with color format conversion. However, with LV_COLOR_DEPTH 32 and ARGB8888 images, DMA2D can be used directly (and LVGL really handles this case separately), so it can be slightly faster.

I also saw the TouchGFX videos where they compare cases with and without DMA2D, and there is a huge difference in performance. I made some tests with similar UIs, and LVGL’s SW rendering really has the same performance as the DMA2D case in the videos. So it seems their SW rendering is very slow.

Another thing is how CPU usage is measured. STM examples probably count the CPU as idle while DMA2D is working in the background and the CPU is waiting for it. That’s not the case for LVGL, as it simply counts the time spent in lv_task_handler.
LVGL lets you use the spare time provided by DMA2D by calling disp_drv->gpu_wait_cb while DMA2D is busy.
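The effect of the two accounting methods can be shown with a small sketch. The struct, the function names, and the timings used with them are all hypothetical and only illustrate the arithmetic; they are not taken from LVGL or from any STM example.

```c
/* Hypothetical timings for one measurement period, in milliseconds. */
typedef struct {
    unsigned cpu_work_ms;  /* CPU actively rendering */
    unsigned gpu_wait_ms;  /* CPU waiting for DMA2D to finish */
    unsigned period_ms;    /* length of the measurement period */
} frame_timing_t;

/* Everything spent inside the handler counts as busy, including the wait. */
unsigned usage_counting_wait(const frame_timing_t *t)
{
    return 100u * (t->cpu_work_ms + t->gpu_wait_ms) / t->period_ms;
}

/* Time spent waiting for DMA2D counts as idle. */
unsigned usage_excluding_wait(const frame_timing_t *t)
{
    return 100u * t->cpu_work_ms / t->period_ms;
}
```

With 4 ms of CPU work and 6 ms of waiting in a 20 ms period, the first method reports 50% and the second 20%, even though the frame takes just as long to render in both cases.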

There could be other factors that hugely affect the performance. Rendering directly to the external RAM will be very slow. Maybe that’s the case in STM examples by default. However, LVGL uses smaller draw buffer(s), typically located in the internal RAM, and copies only the final result into the frame buffer in external RAM. It’s way faster.

I hope this summary helps.

PS: Please correct me if I stated something incorrectly about DMA2D.

samirdamadi · October 20, 2021, 12:25pm

Yes, I had nearly come to the same conclusion while searching through the drawing source code to find out what is going on. For example, in the functions of “lv_draw_blend.c”, DMA2D is not used most of the time; the general structure is something like this:

    /*Simple fill (maybe with opacity), no masking*/
    if(mask_res == LV_DRAW_MASK_RES_FULL_COVER) {
        /* DMA2D being used here */
    }
    /*Masked*/
    else {
        /* DMA2D never being used here */
    }

and I realized that the application didn’t enter the first branch most of the time and entered the “Masked” branch instead, which confirms that “DMA2D only supports very simple operations”.

Yes, you’re right, I found out that LVGL handles ARGB8888 completely separately (by looking at “lv_draw_img”, for example), so I tested with LV_COLOR_DEPTH 32 and there was still no major difference in performance.

I’m using 2 draw buffers, each 40 pixels high (40 x 480), in the SRAM, and it’s much faster than 2 true screen-sized draw buffers located in SDRAM; I tested it a few days ago.
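For a sense of the memory involved, the buffer sizes can be worked out with a small helper. This is only a sketch: draw_buf_bytes is a made-up name, and the 272-line panel height and 2 bytes per pixel (LV_COLOR_DEPTH 16) are assumptions, since only the 480-pixel width and 40-line buffer height are given above.

```c
#include <stddef.h>

/* Total bytes needed for `buf_count` draw buffers of width_px x height_px pixels. */
size_t draw_buf_bytes(size_t width_px, size_t height_px,
                      size_t bytes_per_px, size_t buf_count)
{
    return width_px * height_px * bytes_per_px * buf_count;
}
```

Under those assumptions, draw_buf_bytes(480, 40, 2, 2) gives 76,800 bytes for the two partial buffers, while draw_buf_bytes(480, 272, 2, 2) gives 522,240 bytes for two full-screen buffers, which is why the latter usually end up in external SDRAM.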

Anyways thanks for your explanation, definitely helped me out.