LVGL double buffering with ESP32-S3 RGB panel – how to draw asynchronously to non-visible buffer?

Hi everyone,

I’m working on a display project with LVGL 9.4 on an ESP32-S3, and I’d like to get some feedback from the community on the correct way to handle double buffering and asynchronous drawing.

Hardware:
– ESP32-S3 (Adafruit Qualia board with PSRAM)
– 720×720 display
– Parallel RGB565 interface

Software:
– ESP-IDF
– LVGL 9.4
– Graphics created with SquareLine Studio (working fine)

The display is configured using the ESP-IDF RGB LCD driver with double buffering, similar to the setup described here:
https://docs.espressif.com/projects/esp-idf/en/stable/esp32s3/api-reference/peripherals/lcd/rgb_lcd.html

One frame buffer is currently being displayed by the RGB peripheral, while the other buffer is used for rendering.

Now my question is about LVGL’s drawing model:

– I want LVGL to render asynchronously into the buffer that is NOT currently being displayed
– LVGL itself also supports double buffering
– However, LVGL still calls the flush_cb, where pixel data is copied again into the display buffer

This makes me unsure about the correct architecture:

– Should LVGL’s draw buffer be directly mapped to the RGB back buffer?
– Is it possible (or recommended) to fully avoid copying in flush_cb and just swap buffers?
– How do you properly synchronize LVGL rendering with the RGB panel’s VSYNC / buffer swap?

At the moment everything works visually (colors, sizes, performance are fine), but I want to make sure I’m using LVGL and the ESP-IDF RGB driver in the intended and most efficient way.

I can share code snippets if needed.

What am I missing?
Thanks in advance for any insights or best practices!
Cheers
HaenZ

it’s not going to happen asynchronously. It needs to be a synchronized thing.

Please don’t get hung up on the wording:

I’m not trying to make LVGL render unsynchronized with the display scanout. By “asynchronous” I mean: LVGL should render into the back buffer while the RGB peripheral is scanning out the front buffer, and once rendering is finished, the buffers are swapped in a controlled, tear-free way (e.g. at VSYNC).

In other words:
– No mid-frame updates
– No tearing or ghosting
– Rendering and scanout happen in parallel, but buffer ownership is synchronized

My uncertainty is mainly about LVGL’s flush model in this setup. LVGL already supports double buffering internally, but it still expects to call flush_cb, where pixels are typically copied into a display buffer. With the ESP-IDF RGB driver already running in double-buffered mode, this feels redundant.

What I’m trying to understand is:
– Whether LVGL’s draw buffer can be the RGB driver’s back buffer directly
– Whether flush_cb can be reduced to a “render done / swap requested” signal instead of a memcpy
– How to correctly synchronize the LVGL flush completion with the RGB panel’s VSYNC buffer swap to avoid tearing
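To make the "swap instead of memcpy" idea concrete, here is a minimal sketch of the logic I have in mind, with the ESP-IDF and LVGL specifics replaced by hypothetical stand-ins (`flush_cb`, `on_vsync`, and the flags are all illustrative, not real APIs):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real ESP-IDF/LVGL objects. */
static int  front = 0;             /* buffer index scanned out by the panel */
static bool swap_pending = false;  /* set by flush, consumed at VSYNC       */
static bool flush_done = false;    /* models signalling flush completion    */

/* flush_cb does no memcpy: rendering already happened in the back
 * buffer, so it only requests a pointer swap at the next VSYNC. */
void flush_cb(void)
{
    swap_pending = true;
}

/* Called once per VSYNC by the RGB driver. */
void on_vsync(void)
{
    if (swap_pending) {
        front = 1 - front;      /* back buffer becomes the visible one  */
        swap_pending = false;
        flush_done = true;      /* old front buffer is now free to draw */
    }
}
```

The question is essentially whether this shape is compatible with LVGL's flush model, or whether I'm fighting it.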

Everything works visually right now, but I want to make sure the architecture is correct and that I’m not fighting LVGL’s intended rendering model.

If the correct answer is “LVGL expects ownership of the final framebuffer and copying is unavoidable”, that’s fine — I just want to be sure that’s actually the case with RGB panels on ESP32-S3.

Thanks for clarifying.

The RGB driver is tricky because RGB displays have no internal GRAM, so a constant stream of data has to be sent to the display. There is no callback that fires only once, the first time a buffer is transmitted. With an RGB display you have to use the VSYNC callback, and that callback fires every single time buffer data is sent, even when the same buffer is transmitted back to back. So you need to build in some logic to keep track of which buffer is being sent and which one is being rendered to.

What makes it a pain is that, from inside the VSYNC callback, there is no official way to identify which buffer has just finished transmitting: the structure that holds that information is declared and defined in the driver's C source file, not in a header file. Thankfully it is a pointer that gets passed to the callback function, so all you need to do is declare/define the same structure in your own C source file where the callback function is located. Now you can access the proper location in memory for the index of the buffer that has just finished transmitting.

By keeping a record of which buffer has finished transmitting, you will know when a new buffer has finished, so you call lv_display_flush_ready only if the index number you have stored differs from the one that has just finished sending. If you make more than one call to lv_display_flush_ready for the same buffer, it can and usually will trip the ready marker for a buffer that has not actually finished transmitting, and LVGL will end up writing to a buffer that is also being sent at the same time, which causes data corruption.
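That bookkeeping — only signal LVGL when the transmitted buffer index actually changes — can be sketched as self-contained logic. `notify_flush_ready()` below is a hypothetical stand-in for the real LVGL call:

```c
#include <assert.h>

static int last_sent = -1;    /* index of the last buffer we acknowledged */
static int ready_calls = 0;   /* counts stand-in flush-ready signals      */

static void notify_flush_ready(void) { ready_calls++; }

/* Called from the VSYNC callback with the index of the buffer that
 * just finished transmitting. The same index repeats back to back
 * while no new frame is pushed, so only a change is acknowledged. */
void on_vsync_buffer_done(int sent_index)
{
    if (sent_index != last_sent) {
        last_sent = sent_index;
        notify_flush_ready();   /* exactly once per new buffer */
    }
}
```

The guard against repeated indices is what prevents the double flush-ready call described above.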

I have code I wrote that is, hands down, the fastest RGB driver written for the ESP32 line of MCUs. It lets LVGL render into partial buffers, which greatly reduces LVGL's workload, and I keep the full buffers in sync using the second core, which is also where the data from the partial buffers gets copied into a full buffer. Rotation is handled during that same copy, so you get two birds with one stone there.

One thing that does happen is a bottleneck in the SPI bus shared by the flash and PSRAM. With two cores reading and writing flash and RAM, and DMA simultaneously reading from RAM to transmit the buffer data, the 8-lane SPI bus has a hard time keeping up. Luckily, Espressif has added some settings that relax this somewhat. The first is getting a board with 16 MB or 32 MB of octal flash. This is important because you want both the flash and the PSRAM to be octal; that way you can overclock the SPI bus, which gives you an effective clock of 240 MHz vs. the 80 MHz default. There is another setting you can turn on that copies your entire program into PSRAM when the ESP32 boots. The benefit is that you no longer have to share the bus with the flash memory when running your code, so the bus to the PSRAM runs at full tilt. The only time the bus would be shared is when you need to access stored files — data files or images such as PNGs you might be using.

Here is the link to the code for the driver I wrote. You will want to pull the information from these 2 areas…

The driver is written for use with MicroPython but you can pull out what is needed to get a standalone driver working.

As you already know, the RGB driver in the ESP-IDF allocates 2 full frame buffers. You do not want to allocate 2 more for LVGL to render into. The reason is that if you pass one of the LVGL buffers to the driver, it ends up copying the entire buffer into whichever of its own buffers is idle, and all of this happens on the same core LVGL is running on, so it really hammers performance. You can collect the buffers the driver allocates and pass those directly to LVGL, but then LVGL is going to copy the data from one buffer to the other to keep them in sync, and that takes a lot of time. It is better to offload that work to the second core, which is what the driver I linked to above does.


If I understand your approach correctly, your code still uses LVGL in the “standard” way, but it takes over the entire double-buffer management itself and distributes the work across both cores. That is a very solid design choice.

Unfortunately I’m not very comfortable with Python, but I already had the code translated to ESP-IDF C and still need to test it. Conceptually, though, the approach makes a lot of sense, especially to offload expensive copy operations from LVGL.

One additional observation that might be interesting in this context:
I found that draw_bitmap() in the ESP-IDF RGB controller effectively only swaps the buffer pointer. This is easy to verify with two (or even three) fully populated full-color framebuffers. From a hardware perspective I can reach the full pixel clock, so in theory ~60 fps on my panel.

With that in mind, I’m wondering whether part of the partial-buffer copy overhead could be avoided by letting LVGL render directly into the inactive RGB back buffer. Obviously both full framebuffers would need to be kept progressively consistent to avoid ghosting, and a pure LVGL full-render mode would likely be too slow.

This is just a thought at this point, but it might point toward a useful hybrid approach.

Thanks again for the detailed explanation and for sharing insights into your driver.
I’ll let you know

What you have to remember: if you have a 16-lane RGB display and 2 buffers, both in DMA memory, then transmitting a buffer is not a blocking call, so while one buffer is being transmitted LVGL can render into the other. But the connection to the PSRAM is only 8-lane SPI, which in and of itself makes it impossible to get anywhere over a 40 MHz effective clock to the display. That is a theoretical number, and it will actually be lower because of transmission overhead, but let's call it 40 MHz.

Then you add reading from one buffer while rendering to the other, so now two things are using the SPIRAM, which cuts the SPI speed in half between read and write operations. Now you have 8 lanes at 40 MHz feeding 16 lanes to the display, so your effective clock ends up being 20 MHz right out of the gate, and that is a best-case scenario. If anything is running on the second core — like WiFi, when you are connected and receiving data — that is another hit. If code is being loaded from flash memory, that is another hit too, because the flash and SPIRAM share the same bus.
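To put rough numbers on this for the 720×720 RGB565 panel from the original post: scanout alone needs about 62 MB/s before any rendering or sync copies touch the PSRAM. The exact bus budget depends on the PSRAM mode and clock, so treat these as order-of-magnitude figures, not driver constants:

```c
#include <stdint.h>

/* Scanout bandwidth for a 720x720 RGB565 panel at 60 fps. */
enum { H = 720, V = 720, BYTES_PER_PX = 2, FPS = 60 };

uint32_t frame_bytes(void) { return H * V * BYTES_PER_PX; }   /* 1,036,800 bytes */
uint32_t scanout_bps(void) { return frame_bytes() * FPS; }    /* ~62 MB/s        */

/* Every displayed frame must ALSO be written into PSRAM once
 * (render or sync copy), so the bus sees at least read + write
 * traffic for each frame. */
uint32_t bus_floor_bps(void) { return scanout_bps() * 2; }    /* ~124 MB/s       */
```

Against an 8-lane PSRAM bus that is already being shared, it is easy to see where the headroom goes.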

It’s very easy to overload the SPI bus for the flash and SPIRAM. That is where the bottleneck is going to be, 100% of the time, when using an RGB display with a 16-lane connection.

The reason you don’t want LVGL rendering directly to the buffers is that the buffers have to stay in sync. LVGL can't render new data into only a single buffer; it has to render it into both, because both buffers need identical data before new data is written to one of them. That is a giant performance hit.

Using the second core to handle the full frame buffers means partial buffers are what LVGL actually renders into, so LVGL only has to render a single time. The partial buffers are also small enough to be allocated in internal memory, so how fast LVGL is able to render is not hindered by any bottleneck on the SPIRAM bus.

The cool thing about using the second core to handle the transmitting of the buffers: as soon as the pointer swap takes place, the data being transmitted is also copied into the buffer that was just swapped out. I can read from a transmitting buffer without any worry of data corruption. Once that copy has finished, any partial buffers LVGL has rendered can be copied into the now fully synced idle buffer. Because LVGL renders into 2 partial buffers, a stall never occurs: by the time the second partial buffer has been rendered, the sync has finished and the waiting partial has been copied.

Another thing I built into the driver is how the swapping of the full buffers takes place. It doesn’t happen with each partial buffer that gets rendered: if there is enough to render that it spans multiple partial buffers, the full buffers are only swapped after the final partial buffer has been copied. This completely removes any possibility of tearing.
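The second-core sequence described above — swap, re-sync the idle buffer from the still-readable transmitting one, then place a waiting partial render on top — can be modeled as a tiny simulation. All names here are illustrative stand-ins, not the actual driver API:

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

#define FB_PX 16                       /* toy frame-buffer size */
static uint8_t fb[2][FB_PX];
static int tx = 0;                     /* buffer being transmitted */

/* Runs on the second core after the pointer swap: the new idle
 * buffer is brought up to date from the (still readable) transmitting
 * buffer, then a pending partial render is applied on top of it. */
void sync_after_swap(const uint8_t *partial, int off, int len)
{
    int idle = 1 - tx;
    memcpy(fb[idle], fb[tx], FB_PX);        /* re-sync full buffer */
    memcpy(&fb[idle][off], partial, len);   /* place partial data  */
}

/* Pointer swap at VSYNC: idle buffer becomes the transmitting one. */
void swap(void) { tx = 1 - tx; }
```

The key property is that the copy reads from the transmitting buffer (safe, since scanout only reads) and writes only to the idle one.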


@kdschlosser
This really looks like the end-game solution for RGB panels on the ESP32. It’s by far the most complete and technically consistent explanation I’ve seen on this topic.

I don’t fully “see through” all of LVGL’s internal mechanics yet, but your description aligns exactly with the constraints I kept running into without being able to properly articulate them. At this point, I’m comfortable accepting that understanding every LVGL detail isn’t strictly necessary if the surrounding architecture is sound.

I’ll need a couple of days to integrate and test your code on my setup. I’m currently using lvgl-port, which works reasonably well, but clearly leaves performance on the table compared to what you describe.

Once I have this running and validated, would you be okay with me sharing your overall approach with the wider community (with proper attribution, of course)?

Thanks again for taking the time to explain this in such depth.

HaenZ

If you have any issues porting the code over, let me know and I will help you out with it.

Hey kd,

I managed to get this running! As far as I can tell, it’s currently about 50-100% faster than lvgl_port.

However, I’m experiencing tearing issues. My guess is that Claude.ai didn’t get everything quite right. Would you mind taking a look at it? You can find my current files here: GitHub - HaenZ33/esp32s3bus_with_lvgl: Just a quick demo repo for lvgl forum

Thanks!

Well, I can tell you that the code isn’t using anything from what I wrote. I am not sure what is going on with that code, because it looks like it's just a bunch of function declarations.

Thanks for taking a look.

I guess this approach is a bit outside my usual area of expertise, and the ESP32-S3 dual-core setup is fairly unique in this regard. Given that, I may skip digging deeper into this for now and might switch to an ESP32-P4 for more CPU power. A dual-core setup would also be beneficial here, however…

Back to the original question though: what would you consider a good approach to double buffering in this context? Maybe anyone else can share their expertise.

Appreciate you taking the time to comment.
HaenZ

After spending even more time and thoughts on the topic: I realized that the ESP32-S3’s GDMA engine can transfer framebuffer data in PSRAM without consuming significant CPU time. Based on this, I sketched a diagram of how the double-buffering and GDMA transfer would interact with 3 buffers.

Can someone confirm whether this setup is valid? Do you see any issues with it, or reasons it might not work as expected?
lvgl_gdma.pdf (81.1 KB)

The issue is not with transferring. When you use an RGB display, the entire display has to be updated in a never-ending loop; there needs to be a constant stream of data. You use double buffering with DMA so that while one buffer is being streamed, the other can be updated by LVGL. The trick with this process is that the buffer being written to needs to hold identical data to the one that is transmitting before any new data is written. This forces LVGL to render twice, and that is where the slowdown occurs. You can trick the mechanics in LVGL into using partial buffers, so LVGL only renders things that have changed, and that stops LVGL from having to render twice. But you then have to handle copying the partial buffers into the full buffers, positioning that data correctly when you perform the copy, and keeping the full buffers in sync. That is where the second core of the ESP32 comes into play.
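Copying a partial buffer into the full frame buffer and "positioning that data correctly" is essentially a rectangle blit. A minimal version for RGB565 might look like this (sizes and names are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Blit a w*h RGB565 partial buffer into a full frame buffer of
 * width fb_w pixels, with its top-left corner at (x, y). Each row
 * of the partial buffer is contiguous, so one memcpy per row. */
void blit_partial(uint16_t *fb, int fb_w,
                  const uint16_t *part, int x, int y, int w, int h)
{
    for (int row = 0; row < h; row++) {
        memcpy(&fb[(y + row) * fb_w + x],
               &part[row * w],
               (size_t)w * sizeof(uint16_t));
    }
}
```

This is the per-area work that gets offloaded to the second core so LVGL's core never pays for it.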

Throwing buffers at this is not going to solve the render-twice problem. The only way to stop it from occurring is to handle keeping the buffers in sync yourself.

I don't know whether caching the bounding boxes of the updated areas would be beneficial in terms of performance. I believe there is a threshold in the number of areas versus just copying the entire buffer when keeping the buffers in sync. It would alleviate some of the PSRAM bandwidth bottleneck, but because the sync process involves iterating over the buffer to some degree, with enough areas being copied it could end up costing more. To make it as efficient as possible, a calculation would be needed to determine the proximity of the areas to each other and decide to copy one larger block instead of a bunch of smaller ones.

I guess it wouldn’t be that hard to check whether the bounding boxes cover a contiguous area. This is something I may put some work into to see how it comes out. I will also separate the driver I wrote so it can be used standalone, without MicroPython. I would still be able to tie it into MicroPython afterwards, and it would be a better design if I did that.
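A first cut at that proximity decision could just compare pixel counts: merge two dirty rectangles into their bounding box when that costs at most some slack over copying them separately. The rectangle struct and threshold here are assumptions for illustration, not anything from LVGL or the driver:

```c
#include <stdbool.h>

typedef struct { int x1, y1, x2, y2; } rect_t;   /* inclusive coords */

static int area(rect_t r)       { return (r.x2 - r.x1 + 1) * (r.y2 - r.y1 + 1); }
static int max_i(int a, int b)  { return a > b ? a : b; }
static int min_i(int a, int b)  { return a < b ? a : b; }

/* Merge two dirty rectangles if copying their bounding box adds at
 * most `slack` extra pixels over copying both of them separately. */
bool should_merge(rect_t a, rect_t b, int slack)
{
    rect_t bb = { min_i(a.x1, b.x1), min_i(a.y1, b.y1),
                  max_i(a.x2, b.x2), max_i(a.y2, b.y2) };
    return area(bb) <= area(a) + area(b) + slack;
}
```

Adjacent areas merge for free; far-apart areas stay separate, which matches the intuition about contiguous regions above.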

I may start with the existing RGB driver in the ESP-IDF and modify it so it does things the way they should be done for RGB displays.

I think there is a misunderstanding here. If you look closely at my diagram, the intent of my approach should become clearer.

This is not about “throwing more buffers at the problem.” The key points are:

  • LVGL only ever renders continuously into a single draw buffer. There is no LVGL-side double buffering, so LVGL does not need to render twice.
  • GDMA is responsible for copying the rendered content into the display framebuffer (one of the two, whichever is not being displayed at the moment) located in PSRAM.
  • The display itself still uses two framebuffers to avoid tearing, but these buffers are not filled by LVGL directly.

The advantage of this approach is that the display framebuffers do not need to be kept in sync by LVGL. LVGL always renders sequentially into its own buffer, and GDMA performs a single copy per frame into the currently inactive display framebuffer.

Because of this separation, the usual problem of having to keep two LVGL-rendered buffers identical does not occur, and the “render twice” penalty is avoided.

Or am I wrong?

The question still is: is GDMA as fast as I want it to be? Or are there more bottlenecks to deal with?
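For clarity, the per-frame flow I have in mind reduces to this (all names are placeholders, not real APIs):

```c
#include <stdbool.h>

/* Proposed per-frame flow:
 *   LVGL buf --(GDMA copy)--> inactive display FB --(swap at VSYNC)--> visible */
static int  visible = 0;        /* display FB currently scanned out */
static bool copy_done = false;  /* GDMA finished copying this frame? */

/* GDMA copies the single LVGL draw buffer into FB[1 - visible]. */
void gdma_copy_frame(void) { copy_done = true; }

/* At VSYNC, swap only once the copy has landed — no tearing. */
void vsync_swap(void)
{
    if (copy_done) {
        visible = 1 - visible;
        copy_done = false;
    }
}
```

LVGL only ever touches its own draw buffer, so no LVGL-side sync or double render is needed; the open question is purely whether GDMA has the bandwidth.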

The buffers need to be in sync because of the display, not because of LVGL.

I am not really following your diagram.

This is how the process needs to work no matter how many buffers you are using…

buffer 1 is transmitting
buffer 2 is being rendered to

buffer 2 is finished being rendered to
buffer 2 is transmitting

buffer 1 has either the same thing rendered to it that was just rendered to buffer 2 or buffer 2 needs to be copied to buffer 1

buffer 1 is being rendered to

buffer 1 rendering is finished
buffer 1 starts transmitting

buffer 2 has either the same thing rendered to it that was just rendered to buffer 1 or buffer 1 needs to be copied to buffer 2

this loop continues endlessly

if you add another buffer into the chain it would look like this…

buffer 1 is transmitting
buffer 2 is being rendered to
buffer 3 is queued to be rendered to

buffer 2 is finished being rendered to
buffer 2 starts transmitting
buffer 3 has the data that was rendered to buffer 2 copied into it
buffer 3 is being rendered to
buffer 1 is queued to be rendered to

buffer 3 is finished being rendered to
buffer 3 starts transmitting
buffer 1 has the data that was rendered to buffers 2 and 3 copied into it
buffer 1 is being rendered to
buffer 2 is queued to be rendered to

buffer 1 is finished being rendered to
buffer 1 starts transmitting
buffer 2 has the data that was rendered to buffers 3 and 1 copied into it
buffer 2 is being rendered to
buffer 3 is queued to be rendered to

this loop continues endlessly
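The two-buffer loop above can be written as a small state machine, with frame numbers standing in for rendered content (this is just a model of the sequence, not driver code):

```c
/* Two-buffer render/transmit/sync loop. content[] holds the "frame
 * number" each buffer currently contains; tx is the transmitting one. */
static int content[2] = { 0, 0 };
static int tx = 0;

/* One iteration of the endless loop: render frame `f` into the idle
 * buffer, swap it in for transmission, then bring the swapped-out
 * buffer back in sync (the render-twice / copy step). */
void render_frame(int f)
{
    int idle = 1 - tx;
    content[idle] = f;    /* render new frame into idle buffer */
    tx = idle;            /* idle buffer starts transmitting   */
    content[1 - tx] = f;  /* sync the swapped-out buffer       */
}
```

After every iteration both buffers hold the same frame, which is exactly the invariant the display requires before the next render may begin.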

GDMA is only as fast as the memory transport. If a buffer is in PSRAM, it is only going to be as fast as the PSRAM and its connection to the CPU allow, and most times it is half that speed or lower: DMA doesn’t require the CPU, so you can have two things accessing the PSRAM at the same time, which cuts the speed in half. Technically you can have three, because there are two cores plus the DMA controller.

The internal RAM is far faster because it isn’t connected over an SPI bus. However, the amount of memory available is limited to 320K at most, which in most cases is not enough to hold 2 buffers for an RGB display.

The absolute best way to handle the buffers with RGB displays is what I have done: 2 full buffers in DMA memory in PSRAM and 2 partial buffers allocated in internal RAM. This means LVGL doesn't have to sync buffers or render twice, and there is also a speed boost in rendering because the partial buffers live in internal memory. Those partial buffers do not need to be allocated in DMA memory, because no DMA is involved there: the copying is handled by the second core while the first core runs LVGL and renders into the other partial buffer.
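The "two partial buffers so LVGL never stalls" idea reduces to a ping-pong: while the copy core drains one partial buffer, LVGL renders into the other. Again, the names here are illustrative only:

```c
#include <stdbool.h>

static bool busy[2] = { false, false };  /* is this partial being copied? */
static int  render_idx = 0;              /* partial LVGL renders into     */

/* LVGL core: finish rendering the current partial buffer, hand it to
 * the copy core, and immediately move on to the other one. */
int submit_partial(void)
{
    int done = render_idx;
    busy[done] = true;            /* copy core now owns it        */
    render_idx = 1 - render_idx;  /* keep rendering, no stall     */
    return done;
}

/* Copy core: partial buffer has been blitted into the full FB. */
void copy_finished(int idx) { busy[idx] = false; }
```

As long as a blit finishes before LVGL fills the second partial buffer, `busy[render_idx]` is always false when LVGL needs it, so rendering never waits on the PSRAM copy.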

The short and skinny: you will always have a bottleneck at the PSRAM when using an RGB display. There is simply no way around it; it's a hardware limitation.