LVGL is very slow - STM32F7 SSD1963 P16 - Advice on how to optimise it?

What MCU/Processor/Board and compiler are you using?

Nucleo 144 STM32F767ZI

What LVGL version are you using?

7.11.0

What do you want to achieve?

I want to increase LVGL performance on a Nucleo 144 F767ZI with a 7" SSD1963 display. Currently, scrolling and animations are too slow.

What have you tried so far?

I have tried adjusting the display driver (my_disp_flush) and increasing the refresh rate, but I can’t seem to speed things up.

I have also tried disabling the ICache and DCache in case that was slowing things down.

Code to reproduce

This draws everything correctly but is very slow:

void my_disp_flush(lv_disp_drv_t *disp, const lv_area_t *area, lv_color_t *color_p)
{
  uint16_t c;
  tft.setWindow(area->x1, area->y1, area->x2, area->y2); // set the working window
  for (int y = area->y1; y <= area->y2; y++) {
    for (int x = area->x1; x <= area->x2; x++) {
      c = color_p->full;
      tft.pushColor(c, 1);
      color_p++;
    }
  }
  lv_disp_flush_ready(disp); // tell lvgl that flushing is done
}

Screenshot and/or video

Video attached

Nucleo 144 LVGL demo video.zip (2.3 MB)

I have just realised that disabling the ICache and DCache was actually slowing things down; re-enabling them has sped things up.
That was my mistake: I tested several combinations and left the caches disabled.

I still get a bit of screen tearing but not as bad as shown in the video above.

Try doing the tft window update using dma and calling disp_flush_ready in the dma complete interrupt.
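
A minimal sketch of that pattern, independent of any particular HAL: the flush callback only starts the transfer and returns, and the transfer-complete handler is what reports back to LVGL. Everything here is a hypothetical stand-in so it can run on a PC: `Display` plays the role of lv_disp_drv_t, `flush_ready` the role of lv_disp_flush_ready(), `start_dma_transfer` would program a real DMA stream, and `dma_complete_isr` would be the real DMA interrupt handler.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for lv_disp_drv_t; only tracks whether a flush is pending.
struct Display {
    bool flush_pending = false;
};

// Stand-in for lv_disp_flush_ready(): the draw buffer may be reused after this.
void flush_ready(Display* disp) { disp->flush_pending = false; }

// Display whose flush is currently in flight (shared with the interrupt handler).
static Display* volatile g_active_disp = nullptr;

// Hypothetical: on hardware this would configure and enable a DMA stream
// (memory -> display data address). Here it only records the request.
static const uint16_t* g_dma_src = nullptr;
static size_t g_dma_len = 0;
void start_dma_transfer(const uint16_t* src, size_t count) {
    g_dma_src = src;
    g_dma_len = count;
}

// Flush callback: set the window (omitted), start the transfer, return at once.
void my_disp_flush_dma(Display* disp, const uint16_t* pixels, size_t count) {
    disp->flush_pending = true;
    g_active_disp = disp;
    start_dma_transfer(pixels, count);
    // No flush_ready() here: the DMA is still reading the buffer.
}

// Transfer-complete interrupt handler: only now is the buffer released.
void dma_complete_isr() {
    Display* d = g_active_disp;
    if (d != nullptr) {
        g_active_disp = nullptr;
        flush_ready(d);
    }
}
```

The benefit is that the CPU is free to render the next area while the transfer runs, which only helps if LVGL has a second buffer to render into.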

What kind of connection do you have to the TFT: SPI? 8-bit parallel? 16-bit parallel? Something else? Different kinds of connections vary in their maximum speed. Also, DMA and double buffering will let rendering and transfer to the TFT happen at the same time.

As for your code, you can optimize that loop. Also, I don’t know how tft.pushColor() is implemented, but you can probably optimize it as well (e.g. there is no need to always pass a ‘1’ in your innermost loop).

You can see here an example of an optimized loop, only one level of loop, and it performs multiple pixels in one iteration.
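
A rough sketch of what such a loop can look like (this is not the linked code): the pixel count is computed once, a single flat loop sends four pixels per iteration, and a short tail loop handles the remainder. `write_data16` and `g_sent` are hypothetical stand-ins for the display data port so the sketch runs on a PC.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the display data port so the sketch runs on a PC.
// On hardware this would be a write to the memory-mapped data register.
static std::vector<uint16_t> g_sent;
static inline void write_data16(uint16_t value) { g_sent.push_back(value); }

// Sends every pixel of the area using one flat loop, four pixels per iteration.
void push_pixels(const uint16_t* px, int x1, int y1, int x2, int y2) {
    size_t total = static_cast<size_t>(x2 - x1 + 1) *
                   static_cast<size_t>(y2 - y1 + 1);

    size_t blocks = total / 4;  // number of full groups of four pixels
    while (blocks--) {
        write_data16(px[0]);
        write_data16(px[1]);
        write_data16(px[2]);
        write_data16(px[3]);
        px += 4;
    }

    size_t rest = total % 4;    // 0..3 leftover pixels
    while (rest--) {
        write_data16(*px++);
    }
}
```

The x and y coordinates are only needed to compute the count, because the controller advances its own write position inside the window.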

I am using 16-bit parallel. I have just got FMC working, which has considerably improved the performance. I believe my control method for the display driver can be better optimised,
for instance by removing the ‘1’ passed to pushColor in my inner loop.

I will look at your examples and see if I can utilise that and improve on how I am doing it now.

I have attached a video of the current performance utilising FMC, which I am quite impressed with now.

Nucleo 144 LVGL FMC demo video.zip (1.9 MB)

What about optimising the loop so I can push 32 pixels in one iteration instead of individually like below?

It’s quicker to redraw the screen, but it seems to leave stray pixels on the screen when scrolling.

void my_disp_flush(lv_disp_drv_t *disp, const lv_area_t *area, lv_color_t *color_p)
{
  static uint16_t temp[32];
  int i = 0;
  tft.setWindow(area->x1, area->y1, area->x2, area->y2); // set the working window

  for (int y = area->y1; y <= area->y2; y++) {
    for (int x = area->x1; x <= area->x2; x++) {
      temp[i] = color_p->full;
      i++;
      if (i == 32) {
        tft.pushColors(temp, i);
        i = 0;
      }
      color_p++;
    }
  }
  lv_disp_flush_ready(disp); // tell lvgl that flushing is done
}

At first glance, this loop will fail if there are fewer than 32 pixels in total to redraw, since you only send pixels once i == 32.

That is true. I’ve not yet seen a case where fewer than 32 pixels were updated, but an issue could arise from it.
I thought I was being clever storing an array to better optimise it.

I will stick with the below as it’s working well and the speed is good.
I can use some examples to work on the driver and eventually improve.

void my_disp_flush(lv_disp_drv_t *disp, const lv_area_t *area, lv_color_t *color_p)
{
  uint16_t c;
  tft.setWindow(area->x1, area->y1, area->x2, area->y2); // set the working window
  for (int y = area->y1; y <= area->y2; y++) {
    for (int x = area->x1; x <= area->x2; x++) {
      c = color_p->full;
      tft.pushColor(c);
      color_p++;
    }
  }
  lv_disp_flush_ready(disp); // tell lvgl that flushing is done
}

More precisely, the function fails whenever ((area_width * area_height) % 32) != 0, because the leftover pixels are never sent.

But it should work if you add an if (i != 0) {…} block just after the for loops:

  for (...) {
    for (...) {
    }
  }
 
  if (i != 0) {
     tft.pushColors (temp, i);
  }

  lv_disp_flush_ready (disp); // tell lvgl that flushing is done 
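
For reference, a complete version of the chunked loop with the tail flush, written so it can be run on a PC. `pushColors` here is a hypothetical stand-in with the same signature as the library method, and `g_sent` stands in for the display; the window setup and lv_disp_flush_ready() call are left out.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the display: records every pixel that is sent.
static std::vector<uint16_t> g_sent;

// Stand-in with the same signature as the library's pushColors().
void pushColors(uint16_t* data, uint8_t len) {
    while (len--) {
        g_sent.push_back(*data++);
    }
}

// Sends the area in chunks of 32 pixels, then flushes whatever is left over.
void flush_chunked(const uint16_t* px, int x1, int y1, int x2, int y2) {
    static uint16_t temp[32];
    uint8_t i = 0;

    for (int y = y1; y <= y2; y++) {
        for (int x = x1; x <= x2; x++) {
            temp[i++] = *px++;
            if (i == 32) {          // buffer full: send it
                pushColors(temp, i);
                i = 0;
            }
        }
    }

    if (i != 0) {                   // tail: 1..31 pixels still in the buffer
        pushColors(temp, i);
    }
}
```

With a 5 x 7 area (35 pixels) this sends one full chunk of 32 followed by a tail of 3, so no pixel is dropped.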

You observed a better performance with FMC.
When the SSD1963 is connected as a memory mapped device, you can also try to
transfer the pixel data via DMA.

Nevertheless, you should omit the intermediate buffer temp[32].
The pixel data is already stored in a sequential memory area (*color_p),
so copying it from one buffer to another doesn’t really make sense.
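
A sketch of what that could look like, assuming a 16-bit colour depth so that the LVGL buffer can be treated as an array of uint16_t (that cast is an assumption and is not shown here). Since the pushColors() posted in this thread takes a uint8_t length, the data is handed over in pieces of at most 255 pixels. `pushColors`, `g_sent` and `g_calls` are stand-ins so the sketch runs on a PC.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the display and for the library's pushColors().
static std::vector<uint16_t> g_sent;
static size_t g_calls = 0;
void pushColors(uint16_t* data, uint8_t len) {
    ++g_calls;
    while (len--) {
        g_sent.push_back(*data++);
    }
}

// Sends `total` pixels straight from the source buffer, with no copy.
// Each call is limited to 255 pixels because the length parameter is uint8_t.
void flush_direct(uint16_t* px, size_t total) {
    while (total > 0) {
        uint8_t n = (total > 255) ? 255 : static_cast<uint8_t>(total);
        pushColors(px, n);
        px += n;
        total -= n;
    }
}
```

For 600 pixels this results in three calls (255 + 255 + 90) instead of nineteen calls plus a copy of every pixel.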

Do you have the source code of tft.pushColors()? Can you post it here?

Thank you for that information.

The library is called GxTFT, I was in contact with the developer trying to get the STM32F7 chipset to work.
It now works directly with FMC which has made a great difference compared to bitbashing.

The library is great as everything is separated with classes, it makes adding new hardware easier.

tft.pushColors()

void GxTFT::pushColors(uint16_t *data, uint8_t len)
{
  IO.startTransaction();
  while (len--) {
    uint16_t color = *data++;
    IO.writeData16(color);
  }
  IO.endTransaction();
}

Are IO.startTransaction() and IO.endTransaction() real methods that are called for each pixel? That sounds very inefficient, so you may have a lot of room for optimization here. The same goes for writeData16(): consider inlining whatever it does into your own render loop.

No, they do no real work in this situation: as we’re using FMC they are just empty functions, but if another user was using regular bit-bashing then they would contain real code.

void GxIO_STM32Nucleo144_FSMC::startTransaction()
{
} 
void GxIO_STM32Nucleo144_FSMC::endTransaction()
{
}

The library separates the TFT controller from the IO (MCU) because it supports lots of different setups. It has to cover all scenarios, so some functions are relevant to a given setup whereas others are not.
This is where I can modify it to use the relevant bits and disregard the rest.

Initialising FMC

void GxIO_STM32Nucleo144_FSMC::init()
{
  RCC->AHB1ENR |= 0x00000078; // enable GPIOD, GPIOE, GPIOF and GPIOG interface clock
  volatile uint32_t t = RCC->AHB1ENR; // delay

  GPIOD->AFR[0] = (GPIOD->AFR[0] & ~PD_AFR0_MASK) | PD_AFR0_FSMC;
  GPIOD->AFR[1] = (GPIOD->AFR[1] & ~PD_AFR1_MASK) | PD_AFR1_FSMC;
  GPIOD->MODER = (GPIOD->MODER & ~PD_MODE_MASK) | PD_MODE_FSMC;
  GPIOD->OSPEEDR = (GPIOD->OSPEEDR & ~PD_MODE_MASK) | PD_OSPD_FSMC;
  GPIOD->OTYPER &= ~PD_MODE_MASK;
  GPIOD->PUPDR &= ~PD_MODE_MASK;

  GPIOE->AFR[0] = (GPIOE->AFR[0] & ~PE_AFR0_MASK) | PE_AFR0_FSMC;
  GPIOE->AFR[1] = (GPIOE->AFR[1] & ~PE_AFR1_MASK) | PE_AFR1_FSMC;
  GPIOE->MODER = (GPIOE->MODER & ~PE_MODE_MASK) | PE_MODE_FSMC;
  GPIOE->OSPEEDR = (GPIOE->OSPEEDR & ~PE_MODE_MASK) | PE_OSPD_FSMC;
  GPIOE->OTYPER &= ~PE_MODE_MASK;
  GPIOE->PUPDR &= ~PE_MODE_MASK;

  RCC->AHB3ENR |= 0x00000001;
  t = RCC->AHB1ENR; // delay
  (void)(t);

  FMC_Bank1->BTCR[0] = 0x000010D9;
  FMC_Bank1->BTCR[1] = (DATAST << 8) | ADDSET;
  // swap FMC_REGION used to an address range that has data cache disabled
  SYSCFG->MEMRMP |= SYSCFG_MEMRMP_SWP_FMC_0;

  digitalWrite(_bl, LOW);
  pinMode(_bl, OUTPUT);
}

FMC region

//#define FMC_REGION ((uint32_t)0x60000000) 
// Bank1 FMC NOR/PSRAM
// swap FMC_REGION used to an address range that has data cache disabled
// see e.g. https://community.st.com/s/question/0D50X00009XkWQESA3/stm32h743ii-fmc-8080-lcd-spurious-writes

#define FMC_REGION ((uint32_t)0xC0000000) // Bank1 FMC NOR/PSRAM swapped with SDRAM 
#define CommandAccess FMC_REGION
#define DataAccess (FMC_REGION + 0x80000)

writeData16

void GxIO_STM32Nucleo144_FSMC::writeData16(uint16_t d, uint32_t num)
{
  while (num > 0)
  {
    *(uint16_t*)DataAccess = d;
    num--;
  }
}
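
One thing that may be worth checking in a loop like this: the pointer is not volatile, so an optimising compiler is allowed to merge repeated stores of the same value to the same address. A minimal sketch of the same loop with a volatile-qualified pointer follows; `write_repeated` is a hypothetical name and not part of GxTFT.

```cpp
#include <cstdint>

// Writes the same 16-bit value `num` times to a memory-mapped register.
// `volatile` makes every store an observable side effect, so the compiler
// has to emit all of them instead of collapsing them into one.
inline void write_repeated(volatile uint16_t* port, uint16_t d, uint32_t num) {
    while (num > 0) {
        *port = d;
        num--;
    }
}
```

On the target it would be called with the data address, e.g. write_repeated(reinterpret_cast<volatile uint16_t*>(DataAccess), color, count).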

Has anyone got any examples of LV drivers in action with LVGL?

I am not sure what exactly you are looking for but if just an example LV drivers that work, mine are here https://github.com/zapta/simple_stepper_motor_analyzer/tree/master/platformio/src/display .

tft_driver is the low level driver and lv_adapter is the interface to LVGL. Running on an 84 MHz STM32F401CE, the pixel transfer rate is 10M x 16-bit pixels per second. No DMA, just bit banging, with on-the-fly conversion from the LVGL 8-bit color depth to the TFT 16-bit color depth. This barely achieves smooth horizontal scrolling of the chart in my app, and is plenty fast for all the other screens.
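
To illustrate the kind of conversion involved (this is not taken from the linked repository, which may well use a lookup table instead), here is a sketch that expands an 8-bit pixel to 16 bits. It assumes the 8-bit format is RGB332 (red in the top three bits, blue in the bottom two) and the TFT expects RGB565; `rgb332_to_rgb565` is a hypothetical name.

```cpp
#include <cstdint>

// Expands an 8-bit RGB332 pixel to 16-bit RGB565.
// Each channel is widened by repeating its top bits, so that
// full intensity stays full intensity (0xFF maps to 0xFFFF).
inline uint16_t rgb332_to_rgb565(uint8_t c) {
    uint16_t r3 = (c >> 5) & 0x07;   // 3-bit red
    uint16_t g3 = (c >> 2) & 0x07;   // 3-bit green
    uint16_t b2 = c & 0x03;          // 2-bit blue

    uint16_t r5 = static_cast<uint16_t>((r3 << 2) | (r3 >> 1));
    uint16_t g6 = static_cast<uint16_t>((g3 << 3) | g3);
    uint16_t b5 = static_cast<uint16_t>((b2 << 3) | (b2 << 1) | (b2 >> 1));

    return static_cast<uint16_t>((r5 << 11) | (g6 << 5) | b5);
}
```

Because there are only 256 possible inputs, the results can also be precomputed once into a 256-entry table, which turns the per-pixel work into a single array lookup.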

Thank you,

That’s exactly what I was after, I wanted an example to observe: I was trying to find one on github.
I fancy rewriting what I’ve got now using LV drivers with FMC on the Nucleo-F767ZI.

My current setup is working well but I want to neaten things up on the code.

So I’ve been working on optimising the display using FMC directly with my current setup.

Using DataAccess (0xC0080000) / CommandAccess (0xC0000000) on FMC the below display flush is working well, when updating a big portion of the screen - animations can stutter a little but overall it seems quite good for what I need it for.
…but if I can make it better/ faster, I would like to carry on working to improve it.

The benchmark test now shows a ‘Weighted FPS’ of 19, up from 17, so I have gained a marginal 2 FPS. The display is an 800 x 480 SSD1963 and the processor is an STM32F767ZI.

void my_disp_flush(lv_disp_drv_t *disp, const lv_area_t *area, lv_color_t *color_p)
{

    /***create a working window*/
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2a);            //CommandAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x1 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x1 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x2 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x2 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2b);            //CommandAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y1 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y1 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y2 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y2 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2c);            //CommandAccess
    /***************************/
    for (int y = area->y1; y <= area->y2; y++) {
      for (int x = area->x1; x <= area->x2; x++) {
        *(uint16_t*)((uint32_t)0xC0080000) = color_p->full; //DataAccess
        color_p++;
     }
  }
  lv_disp_flush_ready(disp); // tell lvgl that flushing is done 
}
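
The window setup above could also be pulled into a small helper with named constants, which keeps the flush function short without adding calls in the pixel loop. This is only a sketch: `write_command`, `write_data` and `g_bus` are stand-ins for the writes to CommandAccess and DataAccess so that it runs on a PC, and the constant names are made up.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins for the two memory-mapped addresses so the sketch runs on a PC.
// On the target these would be writes to CommandAccess and DataAccess.
struct BusWrite {
    bool is_command;
    uint16_t value;
};
static std::vector<BusWrite> g_bus;
static inline void write_command(uint16_t c) { g_bus.push_back({true, c}); }
static inline void write_data(uint16_t d) { g_bus.push_back({false, d}); }

// Command codes as used in the flush function above (names are hypothetical).
constexpr uint16_t CMD_SET_COLUMN = 0x2A;
constexpr uint16_t CMD_SET_PAGE = 0x2B;
constexpr uint16_t CMD_WRITE_MEM = 0x2C;

// Sends one command followed by start and end, each as high byte then low byte.
static inline void write_range(uint16_t cmd, uint16_t start, uint16_t end) {
    write_command(cmd);
    write_data(start >> 8);
    write_data(start & 0xFF);
    write_data(end >> 8);
    write_data(end & 0xFF);
}

// Selects the rectangle that the following pixel writes will fill.
void set_window(uint16_t x1, uint16_t y1, uint16_t x2, uint16_t y2) {
    write_range(CMD_SET_COLUMN, x1, x2);
    write_range(CMD_SET_PAGE, y1, y2);
    write_command(CMD_WRITE_MEM);
}
```

The bus traffic is the same eleven writes as before, so this is about readability rather than speed.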

Has anyone got any examples of using the disp_drv.gpu_fill_cb instead of disp_drv.flush_cb?

Any other suggestions that I could give a go, to increase the performance?

With the latest 7.x versions, there is no point using gpu_fill_cb on STM32 as LVGL includes built-in support for the DMA2D engine. You can try enabling it in lv_conf.h.

LVGL will still use flush_cb to copy the final buffer to the display; DMA2D is used to accelerate operations on the working buffer.

Hi embeddedt,

Thanks for the reply, I’ve got a few questions on DMA2D.

I did attempt to enable DMA2D but I am not sure whether it worked; for some reason it didn’t like
LV_GPU_DMA2D_CMSIS_INCLUDE “stm32f767xx.h”
I went to lv_gpu_stm32_dma2d.c and manually included “stm32f767xx.h” instead of using the definition in lv_conf.h, and it compiled fine after that.

So once it’s enabled do I need to initialise it or is all of that automatically handled by LVGL?

What other changes do I have to make to my_disp_flush as I was a bit confused on how LVGL would automatically copy the frame buffer directly?

I assume I would still need to set a working window? This is the part where I’m not sure on how to implement it properly?

I only have single buffer, should I add a dual buffer to the equation?

void my_disp_flush(lv_disp_drv_t *disp, const lv_area_t *area, lv_color_t *color_p)
{

    /***create a working window*/
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2a);            //CommandAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x1 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x1 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x2 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->x2 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2b);            //CommandAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y1 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y1 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y2 >> 8);   //DataAccess
    *(uint16_t*)((uint32_t)0xC0080000) = (area->y2 & 0xFF); //DataAccess
    *(uint16_t*)((uint32_t)0xC0000000) = (0x2c);            //CommandAccess
    /***************************/
    for (int y = area->y1; y <= area->y2; y++) {
      for (int x = area->x1; x <= area->x2; x++) {
        *(uint16_t*)((uint32_t)0xC0080000) = color_p->full; //DataAccess
        color_p++;
     }
  }
  lv_disp_flush_ready(disp); // tell lvgl that flushing is done 
}

The correct way of setting in lv_conf.h in your case would be:

#define LV_GPU_DMA2D_CMSIS_INCLUDE "stm32f7xx.h"

And within the IDE you set a preprocessor define (defined Symbols) for the specific controller type.
In your case STM32F767xx.
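
Put together, the configuration might look roughly like this. The LV_USE_GPU_STM32_DMA2D switch is quoted from memory of the 7.x lv_conf.h template, so please check it against your copy; the -D flag is the usual GCC way of setting a symbol and is only needed if your IDE does not set it for you.

```cpp
/* lv_conf.h (LVGL 7.x): enable the built-in DMA2D support */
#define LV_USE_GPU_STM32_DMA2D      1
#define LV_GPU_DMA2D_CMSIS_INCLUDE  "stm32f7xx.h"

/* Build settings (not in lv_conf.h): define the device symbol, e.g.
 *   -DSTM32F767xx
 * so that stm32f7xx.h pulls in stm32f767xx.h for you. */
```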

So in every other file where you need the STM include, you only include stm32f7xx.h (and maybe stm32f7xx_hal_conf.h).

I have just noticed that the comments in lv_conf.h should be changed and extended.