Black flickering boxes intermittently appearing on display when scrolling

Hi all,

I have been trying to get to the bottom of an issue for a very long time. My hardware is the AMD Zynq UltraScale+ MPSoC, on which I have implemented a DisplayPort driver and use XHCI USB for the touch screen. I am running LVGL on a home-grown SMP FreeRTOS port on the application processor, which has 2 x 1.2 GHz A53 cores, and two separate instances of FreeRTOS on the two R5 cores for some control processes.

The display interface uses double-buffered (full-screen-sized buffers) direct mode with DMA, and the hardware automatically changes DMA buffer addresses only during V-sync. I am also using a full-HD 1080p, ARGB8888 configuration, which I think might be a bit of a corner case.

Since I first brought up my code on the hardware just over a year ago I have had an issue where black rectangles appear when the display is being scrolled by dragging on the touch screen. I have tried implementing the driver in many ways and it always ends up with the same result, so I am now wondering if there may be some problem with the LVGL library.
I have attached a video of the problem; sorry it is not great quality, as I had to compress it a lot to make it uploadable here. It is slightly worse than the video shows because my camera’s frame rate is not great, so it doesn’t capture a true representation of how frequently the rectangles appear, but I think it’s enough to see the issue. The rectangles are also always confined to the area of the screen that is actually scrolling, never anywhere else.

If anyone has any ideas about how to go about debugging this issue I would be most appreciative, as I am pretty much out of ideas right now. I find I can’t reproduce the problem when stepping through the code, for example, so it’s all very difficult to deal with. Any suggestions please? :woozy_face:

(Video attachment: Tabview_flicker)

Thank you,

Kind Regards,

Pete

My guess:

It seems that LVGL invalidates areas of the scrolling region so it can render them again and, in the process, writes black pixels into those rectangles.

It should simply render the new image without writing black pixels onto the invalidated rectangle background areas.

Maybe some sync issues too.

Couple of questions…

What is the render mode set to in LVGL?

How are your buffers arranged, what size are they, etc.?

From the look of it, you are experiencing what is called image tearing. This is typically caused by a buffer being written to at the same time it is being read from.

Where are you calling lv_display_flush_ready from? This should be done from the V-sync callback, but it should only be called a single time each time the buffer is replaced, and you need to make sure that it is only called after the new buffer has been rendered. Depending on how the callback is set up, you may have to put a kind of hack in place to work out which buffer just finished rendering. I had to do this kind of thing with the RGB driver for the ESP32 MCUs, because there is no way to know from inside the callback which buffer has just finished being sent to the display; there is no public API for that. You might be facing the same kind of setup.
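
To make that concrete, here is a minimal sketch (not Pete’s driver; apart from the LVGL calls, every name is made up) of deferring lv_display_flush_ready() to a V-sync handler so it is called exactly once per buffer swap:

static lv_display_t	*pending_disp = NULL;

void my_flush_cb( lv_display_t *disp, const lv_area_t *area, uint8_t *px_map ) {
	(void)area;
	pending_disp = disp;	/* remember which display is waiting for a swap            */
	/* px_map is the buffer LVGL just finished rendering - exactly the information  */
	/* a V-sync callback often cannot give you.                                      */
	/* ...program the scan-out hardware to switch to px_map at the next V-sync...   */
	/* Do NOT call lv_display_flush_ready() here.                                    */
}

/* Called once per V-sync, after the hardware has actually latched the new buffer */
void on_vsync( void ) {
	if( pending_disp != NULL ) {
		lv_display_t *d = pending_disp;
		pending_disp = NULL;	/* ensure it is reported only once per swap */
		lv_display_flush_ready( d );
	}
}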

The reason you only see it when you scroll is because of how much is being updated and how long it takes LVGL to render. If it takes LVGL longer to render a section than it takes to send the data to the display, that is when you will usually end up with an overlap. It took me a while to perfect a driver that allows LVGL to render fast and swap out the buffers properly.

Since you are running a processor that has multiple cores, here is what I recommend.

Here is a flow chart that shows running 2 tasks, 2 DMA full frame buffers, 2 partial frame buffers in internal RAM, a couple of binary semaphores and an event group. The flow chart shows what needs to be done and when; it only shows the loops in those tasks and the VSYNC ISR callback.
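
In case the flow chart doesn’t come across here, this is a rough sketch of the two task loops it describes (all task and variable names are illustrative, not taken from the chart or Pete’s code; the semaphore plumbing is shown in later snippets):

/* Task 0: LVGL task - renders into the two partial buffers */
void lvgl_task( void *arg ) {
	(void)arg;
	for( ;; ) {
		uint32_t wait_ms = lv_timer_handler();	/* render; the flush cb runs once per dirty area */
		vTaskDelay( pdMS_TO_TICKS( wait_ms ) );
	}
}

/* Task 1: copy task - keeps the two full DMA buffers up to date */
void copy_task( void *arg ) {
	(void)arg;
	for( ;; ) {
		/* 1. block until the flush callback deposits a partial buffer               */
		/* 2. copy that partial area into the back (not displayed) full buffer       */
		/* 3. call lv_display_flush_ready() so LVGL can render the next partial      */
		/* 4. if it was the last partial of the frame, flag the VSYNC ISR to swap    */
		/*    the full buffers, wait for the swap, then copy the newly displayed     */
		/*    buffer into the now idle one so both stay in sync                      */
	}
}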

I have code to copy the partial buffer data to the full buffers if you need it. It will also do rotation at the same time, and if you are using RGB565, dithering can be tossed in there as well.

The 2 full buffers still get created in DMA memory. The partial buffers do not; internal RAM is going to be faster, so I recommend placing the partial buffers in internal memory if they will fit. You want to size the partial buffers so that it takes no more than 2-3 partial buffers for an update run to be finished (see the sizing example below).
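
As a back-of-the-envelope example for this panel (1080p ARGB8888), reading the guideline as roughly a third of the screen per partial buffer for a worst-case full-screen update; the exact split is a judgement call:

#define HOR_RES			1920
#define VER_RES			1080
#define BYTES_PER_PX		4						/* ARGB8888 */
#define FULL_BUF_SIZE		( HOR_RES * VER_RES * BYTES_PER_PX )		/* 8,294,400 bytes, ~7.9 MiB  */
#define PARTIAL_LINES		( VER_RES / 3 )					/* 360 lines per partial       */
#define PARTIAL_BUF_SIZE	( HOR_RES * PARTIAL_LINES * BYTES_PER_PX )	/* 2,764,800 bytes, ~2.6 MiB  */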

A frame is considered to be a full buffer. When dealing with partial buffers it may take more than one partial buffer to update all of the data that makes up a frame. To find out whether the partial buffer being flushed is the last one in the run, you call lv_display_flush_is_last; when this returns true, you know you can swap the full buffers. The trick here is that the full buffers need to contain ALL of the data being seen, while with partial buffers LVGL only sends the sections that were updated to the flush callback, so we need to copy the data from one full buffer to the other before we write any new data to it. We are allowed to access the same areas of memory if we are only reading the data, and that is not going to cause any corruption. So while a full buffer is being transmitted by DMA we can also copy from that buffer to the other full buffer. We do this right after the buffers get swapped in the VSYNC ISR. We don’t need to worry about LVGL rendering to them, because LVGL only knows about the partial buffers.
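
In code, that post-swap synchronisation boils down to something like this sketch (front, back and FULL_BUF_SIZE are illustrative names; copying only the areas that changed in the previous frame would be a further optimisation over the plain memcpy):

#include <string.h>	/* memcpy */

/* runs in the copy task, right after the VSYNC ISR has switched scan-out to 'front' */
static void sync_full_buffers( uint8_t *back, const uint8_t *front ) {
	/* reading 'front' while the DMA engine is scanning it out is safe: nothing writes to it */
	memcpy( back, front, FULL_BUF_SIZE );	/* 'back' now matches everything on screen */
	/* the partial copies for the next frame are then written into 'back' */
}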

That copy of the full buffers is the only place that has the potential to cause a stall in task 0, and if it does happen it would be for an exceedingly short amount of time. If you look at how the locks are done, LVGL is able to deposit a buffer to be copied and then go about its merry way, but if it circles back around to have the second partial buffer copied before the first copy has finished, that is when it will get stalled. So LVGL is able to render into 2 partial buffers before any stall could occur. If you have the partial buffers sized too small it could happen, but the stall would only be for a very small amount of time.

Primarily, this can’t be automatic; it has to be based on the next buffer being fully rendered and ready.

Hi @Baldhead ,

Thanks for the comments. I think I agree with your statements about the rendering process and may try to see what is happening with the writing of black pixels… Although I have also considered that some kind of memory/cache issue may be responsible for this; the system has 8 GB of DDR4 RAM where all the buffers reside, so that could be a source of the issue as well.

Kind Regards,

Pete

Hi @kdschlosser ,

Thank you for your detailed response; you must have spent considerable time contemplating my dilemma, and it is much appreciated. I think you have helped to cement my understanding of the way things operate. I’m using direct mode with 2 full-size screen buffers, and my system doesn’t have different types of memory like an ESP32 does… it’s a multicore 64-bit ARM device with 8 GB of DDR4 RAM. Here are excerpts from my code:

Firstly my flush callback…

void dp_flush( lv_display_t *disp_drv, const lv_area_t *area, uint8_t *data ) {

	uint32_t	event = 0;
	if( lv_display_flush_is_last( disp_drv ) ) {  // Only update the buffer when it is the last part
		if( ( (uintptr_t)data == cpu0_globals->dp_data.dma_fb1.Address ) ) {
			cpu0_globals->dp_data.dma_next_buf = &cpu0_globals->dp_data.dma_fb1;
		} else {
			cpu0_globals->dp_data.dma_next_buf = &cpu0_globals->dp_data.dma_fb2;
		}
		if( cpu0_globals->dp_data.trained ) {    // Only bother if display port interface is trained up
			vPortEnableInterrupt(DPDMA_INTR_ID);	// Enable V-Synch interrupt 
			xTaskNotifyWait( pdTRUE, 0xFFFFFFFF, &event, portMAX_DELAY );	// Wait for V-Synch
			lv_display_flush_ready(disp_drv);
		} else	lv_display_flush_ready(disp_drv);
	} else lv_display_flush_ready(disp_drv);
}

Here’s the interrupt handler…

static void OTG_dma_irq_handler( XDpDma *InstancePtr ) {

	BaseType_t			xHigherPriorityTaskWoken = pdFALSE;
	uint32_t 			RegVal = XDpDma_ReadReg( InstancePtr->Config.BaseAddr, XDPDMA_ISR );

	if( ( RegVal & XDPDMA_ISR_VSYNC_INT_MASK ) != 0U ) {	// V-Sync Handling
		XDpDma_WriteReg( InstancePtr->Config.BaseAddr, XDPDMA_ISR, XDPDMA_ISR_VSYNC_INT_MASK );
		if( InstancePtr->Gfx.FrameBuffer != cpu0_globals->dp_data.dma_next_buf ) {	// We only switch buffers during V-sync 
			XDpDma_DisplayGfxFrameBuffer( InstancePtr, cpu0_globals->dp_data.dma_next_buf );
		}
		if( InstancePtr->Gfx.TriggerStatus == XDPDMA_RETRIGGER_EN) {  // Retrigger if running already
			XDpDma_SetupChannel(InstancePtr, GraphicsChan );
			XDpDma_ReTrigger(InstancePtr, GraphicsChan);
		} else if( InstancePtr->Gfx.TriggerStatus == XDPDMA_TRIGGER_EN ) {  // Setup and start if not running
			XDpDma_SetupChannel( InstancePtr, GraphicsChan );
			XDpDma_SetChannelState( InstancePtr, GraphicsChan, XDPDMA_ENABLE );
			XDpDma_Trigger( InstancePtr, GraphicsChan );
		}
		vPortDisableInterrupt( DPDMA_INTR_ID );	// Disable interrupt until flush enables again
		xTaskNotifyFromISR( cpu0_globals->gui.gui_task, HPD_V_SYNC, eSetBits, &xHigherPriorityTaskWoken );    // Unblock flush task
		portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
	}
}

Maybe this approach is wrong; could it be that the buffer is being written to whilst it is being transferred? The more I think about this as I write it up, maybe I should try a different approach…

Thinking about the code as it currently stands… LVGL renders the buffer it has chosen and then calls the flush callback, specifying the buffer which needs to be assigned to the DMA transfers. My code enables the V-sync interrupt from the flush callback and waits until the buffer has been selected and enabled; once the interrupt has switched the buffer and kicked the flush callback, the flush callback then calls lv_display_flush_ready(disp_drv);. Maybe on some occasions the flush callback is calling lv_display_flush_ready(disp_drv); and LVGL is switching buffers and rendering before the DMA has finished? (The CPU runs at 1.2 GHz, so it is pretty fast!) I think I need to waggle some hardware pins and get a better sense of the timing with an oscilloscope to see if things are somehow out of sequence here. What do you think?

Kind Regards,

Pete

Hi @Marian_M ,

Thank you for your reply. I have not described the way the system works in enough detail. When I said that the hardware automatically changes the DMA buffer addresses only during V-sync, I meant this: the hardware has a DMA engine which continuously copies the contents of the screen buffers to the display at 30 fps Full HD. It uses descriptors to describe the buffers, and there are two predefined descriptors in my configuration which describe the location and size of the two buffers that LVGL draws to (I am using 2 full-screen-size buffers in direct mode). To switch the DMA buffer you tell the DMA engine which buffer to use by specifying the descriptor to use at any point in time, and the DMA engine only switches descriptors the first time a V-sync occurs after the request is made, so in theory there should be no screen tearing etc. I hope that better explains the way the system works.

Kind Regards,

Pete

The problem is that you are calling flush ready from the flush callback function, so the buffers are not staying in sync. The other issue is the stall you have in the flush callback. This is causing a slowdown in the code which may end up being seen as a split second of an area that has not been rendered.

You have enough memory to use the 4-buffer approach. You are going to move the responsibility of keeping the buffers in sync to a completely different core. By doing that you will improve the rendering speeds greatly. Right now LVGL is keeping the buffers in sync, and that increases the amount of time it takes LVGL to render.

Write the code exactly like what is seen in the flow chart, and you will be able to use 4 buffers: 2 full buffers and 2 partial buffers. LVGL will be set to partial mode using the 2 partial buffers, and the partials get copied to the full buffer in a different task. The thing that ties it all together is the function you call to check whether a partial buffer is the last of a frame update. If it is, that is when you set the flag to have the buffers swapped out in the VSYNC ISR.
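
For reference, the LVGL v9 side of that setup is only a couple of calls. This is a sketch of an assumed wiring, reusing the buffer names from the sizing example and Pete’s dp_flush, not a drop-in change:

void setup_lvgl_partial( void *partial_buf1, void *partial_buf2 ) {
	lv_display_t *disp = lv_display_create( 1920, 1080 );
	lv_display_set_color_format( disp, LV_COLOR_FORMAT_ARGB8888 );
	lv_display_set_buffers( disp, partial_buf1, partial_buf2, PARTIAL_BUF_SIZE,
				LV_DISPLAY_RENDER_MODE_PARTIAL );	/* 2 partial buffers, partial mode */
	lv_display_set_flush_cb( disp, dp_flush );	/* dp_flush now only hands the rendered area to the copy task */
}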

Below are links to how I went about doing this very same thing for the ESP32. The code is written for MicroPython but it will give you a really good idea of what I am doing.

One of the really important things that I have done in the code I wrote is that I moved all of the driver initialization code to that second task. It may seem strange to do that, but it’s actually not if you know the mechanics of the driver. The VSYNC ISR takes place at the end of each transmission of the buffer, and this ISR happens on the core that the drivers were initialized from. Having that take place on the same core that LVGL is rendering on is going to do what? It is going to slow down the rendering. By moving that ISR to another core, another performance gain can be had.
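
As a sketch only: with mainline FreeRTOS SMP (configUSE_CORE_AFFINITY enabled, vTaskCoreAffinitySet available) that split could look like the snippet below. A home-grown SMP port may expose a different affinity mechanism, and on a GIC-based part like the Zynq the interrupt’s target core is set in the GIC distributor rather than implicitly following the core that installed it, as it does on the ESP32.

void start_copy_task( void ) {
	TaskHandle_t copy_task_h;

	xTaskCreate( copy_task, "copy", 4096, NULL, tskIDLE_PRIORITY + 3, &copy_task_h );
	vTaskCoreAffinitySet( copy_task_h, ( 1 << 1 ) );	/* pin to core 1, away from the LVGL task on core 0 */
	/* copy_task then performs the DisplayPort/DPDMA initialisation; route the V-sync */
	/* interrupt to core 1 as well so its handling never steals rendering time        */
}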

Here is the task function that handles copying the partial buffer to the full buffer and also handles the syncing of the 2 full buffers…

Here is the link to the function that gets called from inside of the flush callback.
It handles acquiring and releasing the locks and setting the data needed to copy the partial to the full buffer.

Here is the VSYNC ISR callback function. You can see how I am handling the test to see whether the buffer actually needs to be swapped out. I don’t actually make the call to the driver to swap the buffers, because I do not know if any memory allocation occurs in the driver, so I let the copy task handle calling that function from outside the context of the ISR. I just stall the copy task until the VSYNC ISR occurs, which lets the copy task know it is good to go to sync the full buffers.

There is a main lock that guards the copy task so it doesn’t end up spinning its wheels if there is no work to be done. It also prevents the copy task from accessing data structures that the LVGL task might be populating. That guard gets released once the data has been populated. There is a second lock that gets acquired before the data gets populated; this lock only gets released once the copy task has finished collecting the last data that was placed. While I do not believe that LVGL would be able to render faster than the copy task can collect the data, it is simply an assurance that it cannot happen. You could technically replace these locks with a queue and pass the data through the queue; I never set that up because of the additional complexity involved.
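
A minimal sketch of that two-lock hand-off, assuming FreeRTOS binary semaphores (every name here is illustrative; this is not the linked MicroPython/ESP32 code):

#include <stdbool.h>
#include "FreeRTOS.h"
#include "semphr.h"
#include "lvgl.h"

static SemaphoreHandle_t	work_ready;	/* "main" lock: the copy task sleeps on this              */
static SemaphoreHandle_t	slot_free;	/* "second" lock: guards the shared descriptor below;     */
						/* both binary semaphores; slot_free is given once at startup */
static struct {
	lv_area_t	area;			/* copied by value: the pointer LVGL passes is not valid later */
	uint8_t		*px_map;
	bool		last;			/* true when lv_display_flush_is_last() said so            */
} slot;

/* LVGL side, called from inside the flush callback */
void deposit_flush( lv_display_t *disp, const lv_area_t *area, uint8_t *px_map ) {
	xSemaphoreTake( slot_free, portMAX_DELAY );	/* stalls only if the copy task is behind */
	slot.area   = *area;
	slot.px_map = px_map;
	slot.last   = lv_display_flush_is_last( disp );
	xSemaphoreGive( work_ready );			/* wake the copy task and return to LVGL  */
}

/* copy-task side, called from the copy task loop */
void collect_flush( void ) {
	xSemaphoreTake( work_ready, portMAX_DELAY );	/* sleep until there is work              */
	/* take a local copy of 'slot' here...                                                   */
	xSemaphoreGive( slot_free );			/* LVGL may now deposit the next partial  */
	/* ...then copy the partial into the back full buffer and call                           */
	/* lv_display_flush_ready() so LVGL can carry on rendering                               */
}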

You have to take a leap of faith and trust me on this. I know it is going to be a large investment of time to make the changes needed, but I can tell you with 100% certainty that doing this WILL fix your issue, and you will also get a performance boost as a side effect.

Hi @kdschlosser ,

Many thanks again for your time, I really appreciate your input. This is an interesting approach; the only downside I see is the rendering of the buffer followed by an extra memory copy, but the net result might still be better than what is going on right now… I will definitely take a serious look at your proposal and explore its feasibility and possible implementation within my system. I have FreeRTOS running, so task sync, mutual exclusion, queues etc. are all easy to achieve.
I am also wondering if there is an opportunity to change the way LVGL handles flushing etc. to find a better way to enable the creation of more efficient drivers… more hooks, or better control over when and where things happen, especially in the multi-core department. Since my early adoption of LVGL back in the days of version 6 (note: I have updated my projects to version 9 now and use the master branch for my ongoing development to keep the latest projects bang up to date), LVGL has introduced events generated at the various stages of rendering and flushing, which I may be able to use as hooks to alter the timing/triggering of everything. I am also aware you can delete the rendering timer and call the rendering function manually from your own code, but I haven’t experimented with that much yet either. I will probably take a look at this approach as well.
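
For what it’s worth, a sketch of both routes in v9, assuming the display events and refresh-timer calls present on current master (on_render_start/on_render_ready are just placeholder callbacks, e.g. to toggle a GPIO for the oscilloscope idea above):

static void on_render_start( lv_event_t *e ) { (void)e; /* e.g. set a GPIO high for the scope */ }
static void on_render_ready( lv_event_t *e ) { (void)e; /* all dirty areas rendered: GPIO low */ }

void install_render_hooks( lv_display_t *disp ) {
	lv_display_add_event_cb( disp, on_render_start, LV_EVENT_RENDER_START, NULL );
	lv_display_add_event_cb( disp, on_render_ready, LV_EVENT_RENDER_READY, NULL );
}

/* alternatively, delete the periodic refresh timer and drive rendering yourself */
void take_over_refresh( lv_display_t *disp ) {
	lv_display_delete_refr_timer( disp );
	/* ...later, from your own task, whenever it is time to render: */
	lv_refr_now( disp );
}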

I will keep you posted on how things go…

Kindest Regards,

Pete :smile:

It’s faster, trust me. It’s faster because you are able to move the VSYNC to a core that LVGL is not running on, and LVGL doesn’t need to keep the buffers in sync at all. You don’t need to render the entire screen each time an update occurs; you are able to render only the small area that updated. And the most important thing is that the buffers are kept synchronized properly, so you don’t end up with the issue you are having.

There are some added benefits as well. You can add rotation with only a very minimal impact on performance. If you are using RGB565 you can also add in dithering, once again with next to no impact on performance. This is because those things can be done at the same time that the partial buffer is being copied to a full buffer.
