V8 display driver (double buffer) low FPS & high CPU

Description

Having converted a project from V7 (where it worked normally) to V8, the performance monitor is showing ~3 FPS at ~80% CPU. I have confirmed that this is actually the case (the GUI task is busy) with an independent analysis tool (event analyser). The driver is configured for 2 full-size buffers.

Target

NXP RT1064 under IAR Embedded Workbench. The display is 1024x768.

What do you want to achieve?

Similar operation to V7.

What have you tried so far?

Setting disp_drv.full_refresh = 0 results in the screen switching between the previous and current frames every ~300 ms, and from what I can tell it is not the correct setting when using 2 full-size buffers.

Any guidance or suggestions are welcome.

Code to reproduce

Here is my flush function, which has not changed from V7:

static void FlushDisplay( lv_disp_drv_t* disp_drv, const lv_area_t* area, lv_color_t* color_p )
{
    /* Point the eLCDIF at the freshly rendered buffer; the new address
     * is latched by the controller at the next frame. */
    ELCDIF_SetNextBufferAddr( LCDIF, (uint32_t)color_p );

    /* IMPORTANT!!!
     * Inform the graphics library that you are ready with the flushing*/
    lv_disp_flush_ready( disp_drv );
}

And here is the driver initialisation:

static lv_disp_draw_buf_t disp_buf;  // Drawing buffer.
static lv_disp_drv_t disp_drv;       // Descriptor of a display driver

void lv_port_disp_init( void )
{
    lv_disp_draw_buf_init( &disp_buf, LCDIF_Buffer[ 0 ], LCDIF_Buffer[ 1 ], LCDIF_PANEL_HEIGHT * LCDIF_PANEL_WIDTH );

    InitLcd();
    InitLcdBackLight();

    /* Register the display in LVGL */
    lv_disp_drv_init( &disp_drv ); /*Basic initialization*/

    /*Set the resolution of the display*/
    disp_drv.hor_res = LCD_WIDTH;
    disp_drv.ver_res = LCD_HEIGHT;

    /*Used to copy the buffer's content to the display*/
    disp_drv.flush_cb = FlushDisplay;
    disp_drv.full_refresh = 1;  // Needed for double-buffered mode to prevent repeated partial refreshes.

    /*Set a display buffer*/
    disp_drv.draw_buf = &disp_buf;

    /*Finally register the driver and configure the theme.*/
    lv_disp_t* display = lv_disp_drv_register( &disp_drv );
    if( display )
    {
#if LV_USE_THEME_ALC
#include "GUI/alc_theme.h"
        display->theme = alc_theme_init( display, ALC_LIGHT_MODE );
#endif
    }
}

@kisvegabor Since the RT1064 is part of the certified hardware for LVGL, it would be helpful to publish which version of LVGL the hardware was tested with. I can only assume this board was tested with V7, since V8 would surely have exhibited the problems that I describe above.

Publishing a benchmark along with LVGL certified hardware would be helpful (as suggested in this post) to many who are reviewing the library and still in the process of choosing hardware.
It also substantiates the operation of LVGL on the target platform and may well assist in debugging across different versions of LVGL (as I have encountered). Still no resolution to this BTW so I am having to stick with V7.

Hi,

Have you used exactly the same disp_buf configuration in v7 and v8? There shouldn’t be such a large difference; in fact, v8 should be faster.

Some ideas:

  • Be sure O3 optimization is enabled. It should double the FPS.
  • Using 2 buffers in external RAM is typically very slow. Instead, I suggest using only 1 frame buffer in external RAM plus a smaller (e.g. 1/5 screen-sized) draw buffer in internal RAM for LVGL, and in flush_cb using lv_memcpy or normal memcpy (whichever is faster) to copy the rendered area to the frame buffer. In this case disp_drv.full_refresh is not required (a sketch follows below).
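For illustration, a flush_cb along those lines might look like this. This is a sketch only; s_frameBuf, s_drawBuf, FlushPartial and the reuse of LCD_WIDTH/LCD_HEIGHT are assumed names, not code from the project above:

#include "lvgl.h"

/* Sketch of the 1-frame-buffer approach: LVGL renders into a small draw
 * buffer in internal RAM and flush_cb copies the rendered area into the
 * frame buffer in external RAM. */
static lv_color_t s_frameBuf[ LCD_HEIGHT ][ LCD_WIDTH ];        /* external RAM */
static lv_color_t s_drawBuf[ ( LCD_HEIGHT / 5 ) * LCD_WIDTH ];  /* internal RAM */

static void FlushPartial( lv_disp_drv_t* disp_drv, const lv_area_t* area, lv_color_t* color_p )
{
    lv_coord_t w = lv_area_get_width( area );
    lv_coord_t y;

    /* Copy the rendered area, line by line, into the frame buffer. */
    for( y = area->y1; y <= area->y2; y++ )
    {
        lv_memcpy( &s_frameBuf[ y ][ area->x1 ], color_p, w * sizeof( lv_color_t ) );
        color_p += w;
    }

    lv_disp_flush_ready( disp_drv );
}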

I wanted to avoid it, as SDKs get outdated very quickly, but it would be really helpful to examine at least how the driver is set up. I can send the project on Monday.

Thanks. I look forward to reviewing the project used for testing.

Item #1: I don’t feel relying on compiler optimisation to achieve acceptable performance is the solution. This code worked fine under V7.
Item #2: Two external buffers were used previously in V7 and worked perfectly, so there is clearly some significant change in V8 with respect to updating/refreshing regions.

BTW, if you are not already aware, the example projects included with NXP’s SDK are based on V7 and configured to use full-size double buffers for all LVGL certified boards from NXP.

It just came to mind that there is one difference which might be important in your case. In v7, if you used 2 buffers, LVGL did the following:

  • Update each invalidated area in the frame buffer
  • Call the flush_cb
  • Copy the redrawn areas to the other buffer (now inactive) to keep the 2 buffers synchronized.

I dropped this method because it wasn’t standard, and the extra copy operation also caused a huge slowdown if the whole screen was updated (the whole screen was copied).

In v8, full_refresh always refreshes the whole screen. So even if you just drag a slider, the whole screen will be redrawn.

Could it be related to your problem?

This must be it. As mentioned, I use full-size double buffers (with PXP hardware acceleration) and simply swap the pointer in the flush callback, which was all fine in V7. It sounds like V8 and full-size double buffers don’t play well together.
It would be most helpful if you could provide the code used for testing the RT1064 certified hardware which uses V8 with a single buffer, so I can see how it’s done. I have tried implementing it but must have something wrong, since I get visual “noise” in the refresh region.

@kisvegabor Still waiting for you to send me the project.

The dev board has only a 480x272 display, so I could place both frame buffers into internal RAM.

For your large screen, for example, this config should be used (a minimal sketch follows the list):

  • 1 framebuffer in external RAM
  • 1 LVGL draw buffer (1/10 screen sized) that is copied to the frame buffer in the flush_cb
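A minimal sketch of that configuration for the 1024x768 panel. The names, the 1/10 sizing and the FlushPartial callback from the sketch earlier in the thread are assumptions, not the certification project’s code:

#include "lvgl.h"

#define DRAW_BUF_LINES  ( LCD_HEIGHT / 10 )  /* ~1/10 of the screen */

static lv_color_t s_drawBuf[ DRAW_BUF_LINES * LCD_WIDTH ];  /* place in internal RAM */
static lv_disp_draw_buf_t disp_buf;
static lv_disp_drv_t disp_drv;

void DispInitPartial( void )
{
    /* Single draw buffer: the second buffer is NULL. */
    lv_disp_draw_buf_init( &disp_buf, s_drawBuf, NULL, DRAW_BUF_LINES * LCD_WIDTH );

    lv_disp_drv_init( &disp_drv );
    disp_drv.hor_res  = LCD_WIDTH;
    disp_drv.ver_res  = LCD_HEIGHT;
    disp_drv.draw_buf = &disp_buf;
    disp_drv.flush_cb = FlushPartial;  /* copies the rendered area into the frame buffer */
    /* Neither full_refresh nor direct_mode is set in this configuration. */
    lv_disp_drv_register( &disp_drv );
}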

Can you please provide the actual code that you used?
I’m not sure why you don’t just publish the test project used for board certification.
If you had I would probably not be pestering you with questions :wink:

Attached :slight_smile:
https://drive.google.com/file/d/1ueGFCbxssNfAYvXz8H85ka8PfdCODBRM/view?usp=sharing

Here is also a new littlevgl_support.c (10.1 KB) file with a different driver implementation. It should be easy to use on your 1024x768 display too.

Thanks. I’ll take a look and let you know how it goes.

As previously mentioned, the full double-buffered approach is unusable due to low FPS and high CPU usage. The alternative single-buffer approach does work (i.e. low CPU @ 33 FPS) BUT falls down when refreshing the whole screen, which takes nearly 1 second and is very obvious visually.

I have based the single-buffer implementation along the lines of your attached file, lvgl_support.c.
I would really like to get this resolved as it is preventing me from upgrading to V8.

So for a full-screen refresh, 1 frame buffer is slower than 2 frame buffers?

With 1 frame buffer, what is the size of the draw buffer? It should be ~1/10 screen-sized and placed in internal RAM.

Could you send a video of the 1-frame-buffer full-screen redraw?

So the 2-buffer configuration is more expensive overall: slow and unresponsive due to high CPU (not the case on V7, but I understand you’ve made some changes to post-flush synchronisation).
The 1-buffer configuration is good up until a full-screen refresh is required, at which point it becomes visually slow (banding the size of the smaller draw buffer, which is 1/5 of the screen).

I’ll capture a video of both so you can see what I’m dealing with.


I’m finding the same thing: the double-buffer frame rate from v7 has fallen dramatically in v8. I had to set full_refresh = 1 as well, to force the whole buffer to be dumped. So I changed the display handling from double-buffer frame switching to a single buffer (full_refresh = 0).

Here are my flush callback functions.

The provided software copy approach worked for me at about 7 FPS:

/* Copy the rendered area, line by line, into the frame buffer (vram[2]). */
uint32_t w = lv_area_get_width(area);
uint32_t y;
for(y = area->y1; y <= area->y2 && y < disp_drv->ver_res; y++)
{
  lv_memcpy(&vram[2][y * LV_HOR_RES_MAX + area->x1], color_p, w * sizeof(lv_color_t));
  color_p += w;
}
lv_disp_flush_ready(disp_drv);

Switching to the (unused) GPU code works at about 8 FPS:

lv_gpu_stm32_dma2d_copy((lv_color_t *)&vram[2][(area->y1 * LV_HOR_RES_MAX) + area->x1],
                        LV_HOR_RES_MAX,
                        color_p,
                        lv_area_get_width(area),
                        lv_area_get_width(area),
                        lv_area_get_height(area));
lv_disp_flush_ready(disp_drv);

Changing the code so that lv_disp_flush_ready is not called until the DMA completes should speed things up, but this only gave me 9 FPS:

static lv_disp_drv_t * dma_disp_drv = NULL;

void DMA2D_IRQHandler(void)
{
    HAL_DMA2D_IRQHandler(&hdma2d);
}

static void TxComplete(DMA2D_HandleTypeDef * hdma2d)
{
    /* The DMA transfer has finished, so LVGL may now reuse the draw buffer. */
    lv_disp_flush_ready(dma_disp_drv);
}

static void tft_flush_cb(lv_disp_drv_t * disp_drv, const lv_area_t * area, lv_color_t * color_p)
{
    uint32_t w = lv_area_get_width(area);
    uint32_t h = lv_area_get_height(area);

    dma_disp_drv = disp_drv;  /* Remember the driver for the completion callback. */

    hdma2d.Init.Mode = DMA2D_M2M;
    hdma2d.Init.OutputOffset = LV_HOR_RES_MAX - w;

    if(HAL_DMA2D_Init(&hdma2d) == HAL_OK)
    {
        if(HAL_DMA2D_ConfigLayer(&hdma2d, 1) == HAL_OK)
        {
            HAL_DMA2D_RegisterCallback(&hdma2d, HAL_DMA2D_TRANSFERCOMPLETE_CB_ID, TxComplete);
            lv_color_t * d = &vram[2][(area->y1 * LV_HOR_RES_MAX) + area->x1];
            if(HAL_DMA2D_Start_IT(&hdma2d, (uint32_t)color_p, (uint32_t)d, w, h) == HAL_OK)
            {
                return;  /* TxComplete will call lv_disp_flush_ready. */
            }
        }
    }

    /* If any HAL call failed, signal ready anyway so LVGL does not block forever. */
    lv_disp_flush_ready(disp_drv);
}

The next thing to try would be to maintain the two display buffers outside the LVGL framework and update each buffer from the flush callback.
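For illustration, such a flush_cb might look like this. It is a sketch only, reusing the vram naming from the snippets above; writing every area into both buffers doubles the copy cost but keeps them in sync:

#include "lvgl.h"

/* Sketch: two screen-sized frame buffers maintained outside LVGL.
 * Every flushed area is copied into BOTH buffers so they never go out
 * of sync; pointing the display at a buffer is left to platform code. */
static lv_color_t vram[2][LV_VER_RES_MAX * LV_HOR_RES_MAX];

static void flush_both_cb(lv_disp_drv_t * disp_drv, const lv_area_t * area, lv_color_t * color_p)
{
    uint32_t w = lv_area_get_width(area);
    uint32_t y;

    for(y = area->y1; y <= area->y2; y++)
    {
        /* Same line-by-line copy as above, but into each frame buffer. */
        lv_memcpy(&vram[0][y * LV_HOR_RES_MAX + area->x1], color_p, w * sizeof(lv_color_t));
        lv_memcpy(&vram[1][y * LV_HOR_RES_MAX + area->x1], color_p, w * sizeof(lv_color_t));
        color_p += w;
    }

    lv_disp_flush_ready(disp_drv);
}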

Hi @kisvegabor,

Hope you are well!

I have been absent for a long period now due to massive work commitments which are still ongoing!

I would like to add to this post: I can confirm that my Zynq implementation using standard WXGA+ (1440x900) is also only managing a few frames per second with version 8. I have had to revert to version 7 for the time being, as I have no time to look into this right now. My current driver uses double buffering with high-speed DMA transfers, so the bottleneck at first glance is caused by the copying of the entire buffer to the second buffer in version 8, versus the selective updating (only the parts that have changed) of the second buffer in version 7. I am unsure of the reasons for the change of methodology in the version 8 implementation, but I am sure there must be some good reason?

Also, I have written a basic VNC server that runs on my platform, based on full-screen refreshes at this stage. This puts a high load on the CPU even on version 7; it has absolutely no chance of working on version 8. There may be a way to hook into the drawing engine to help this along: having access to the drawing of the rectangles for the screen refresh, to pass to the VNC server, would be good, but I haven’t had a chance to dig into your code to see if this is feasible.

If you have any suggestions on how to improve this situation, which appears to be affecting more people as time goes on, it would be much appreciated. If it can’t be resolved, I think I will have to freeze my GUI at version 7 for my current projects :slight_smile:

Kind Regards,

Pete

Hi @pete-pjb and @xennex,

To be sure we are on the same page I’d like to clarify the difference between v7 and v8 in this regard.

v7
If you set 2 screen-sized buffers, v7 worked in a special “true double-buffered mode”. It worked like this:

  1. Render invalid areas to the inactive buffer
  2. Swap the buffers
  3. Copy the redrawn areas to the new inactive buffer so that both have the same content.

I considered it wasteful, especially with screen-sized animations, because there the whole screen was copied and then overwritten with new content.

v8
There are 2 modes that can be set explicitly:

  • full_refresh: LVGL always redraws the whole screen. So no copying is required.
  • direct_mode: LVGL redraws only the dirty areas in a screen-sized buffer. So NO full-screen refresh, but the buffers are not synchronized. It’s useful if you can send full frames to a GPU to display. (See the snippet below.)
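In driver terms, assuming two screen-sized buffers were passed to lv_disp_draw_buf_init, the two modes are just flags on the driver:

/* Illustration only: with two screen-sized buffers registered via
 * lv_disp_draw_buf_init(), pick exactly one of the two v8 modes: */
disp_drv.full_refresh = 1;  /* always redraw the whole screen; no buffer sync needed */
/* ...or... */
disp_drv.direct_mode = 1;   /* redraw only dirty areas in place; buffers NOT synced  */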

IMO the performance issue is largest if you set full_refresh but only a small area changes. In v7 this was fast (small rendering + small copy), but in v8 it’s slower (large rendering).

I was thinking about adding a v7-like mode to v8, but in v8 we started to think more abstractly. I.e. the draw buffer can be anything: an lv_color_t array, an SDL texture on the GPU, a specially packed bitmap, or an internal, non-accessible buffer. So we can’t simply memcpy it.

However, in the flush_cb you can mimic v7’s behavior like this:

  1. Set 2 full-screen buffers with full_refresh = 0, direct_mode = 1
  2. In flush_cb check if lv_disp_flush_is_last(drv) == true
  3. If so, copy the areas:
my_set_active_buffers(color_p);
lv_disp_t * disp = _lv_refr_get_disp_refreshing();
uint16_t i;
for(i = 0; i < disp->inv_p; i++) {
    if(disp->inv_area_joined[i]) continue;

    my_copy_area(disp->inv_areas[i]);
}

I haven’t tested it but I’d be happy to help with it further if you can help with testing on embedded HW.
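For reference, a fuller sketch of such a flush_cb follows (untested, as noted above; my_lcd_set_active_buffer() and the s_buf[] bookkeeping are placeholders for the platform-specific parts):

#include "lvgl.h"

/* Untested sketch of a v7-style flush_cb for 2 full-size buffers with
 * direct_mode = 1. On the last flush of a refresh cycle the display is
 * switched to the freshly drawn buffer and the redrawn areas are copied
 * into the other (now inactive) buffer, keeping the two synchronized.
 * my_lcd_set_active_buffer() is a placeholder for the platform specifics
 * and s_buf[] is assumed to be set to the two frame buffers at init. */
static lv_color_t * s_buf[2];
static void my_lcd_set_active_buffer(lv_color_t * buf);

static void flush_direct_cb(lv_disp_drv_t * drv, const lv_area_t * area, lv_color_t * color_p)
{
    LV_UNUSED(area);  /* in direct mode the buffer already holds the full frame */

    if(lv_disp_flush_is_last(drv))
    {
        my_lcd_set_active_buffer(color_p);  /* show the freshly drawn buffer */

        /* Copy each (non-joined) invalidated area into the other buffer. */
        lv_disp_t * disp = _lv_refr_get_disp_refreshing();
        lv_color_t * dst = (color_p == s_buf[0]) ? s_buf[1] : s_buf[0];
        uint16_t i;
        for(i = 0; i < disp->inv_p; i++)
        {
            if(disp->inv_area_joined[i]) continue;

            const lv_area_t * a = &disp->inv_areas[i];
            lv_coord_t w = lv_area_get_width(a);
            lv_coord_t y;
            for(y = a->y1; y <= a->y2; y++)
            {
                lv_memcpy(&dst[y * drv->hor_res + a->x1],
                          &color_p[y * drv->hor_res + a->x1],
                          w * sizeof(lv_color_t));
            }
        }
    }

    lv_disp_flush_ready(drv);
}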

Hi @kisvegabor,

Many thanks for your explanation and suggested solution; I will take a look at this when I can and report back here. (This also looks like a good way to reduce the CPU load for my VNC server implementation, as I can pass the data to my network buffers as small rectangles, as opposed to doing entire screen refreshes all the time!)
Currently I am in the middle of a fairly demanding bare-metal USB hardware/driver exercise and it is sapping all of my time and energy, so it may be a while before I come back with the results. I will also keep an eye on this topic to see if @xennex or @jupeos have any comments.

Kind Regards,

Pete


Hi @kisvegabor,

Hope you are well!

I can confirm I have managed to implement a method based on your post which closely emulates the V7 methodology, and the performance is very similar to before. Thank you for the suggestion.

I won’t mark it as the solution as yet as it would be interesting to see if @xennex and @jupeos have had similar results…

Kind Regards,

Pete