Flushing to multiple displays in parallel

Project overview

Hobby project for an extended car display cluster with a wireless display controller using multiple displays and an OBD2-based CAN sniffer.

The project is up and running on prototype hardware already, so basic LVGL setup etc is complete.

What do you want to achieve?

Faster display updates. =)

What have you tried so far?

The SPI bus is already running at its 80 MHz maximum, so there is not much I can do there. Each display's unmasked resolution is 240x240, so there is up to 225 KB to push for each full update (2 screens, 16-bit color). A theoretical speed of 40-ish FPS at 80 Mbps (minus overhead) might therefore be possible. Did I do that math correctly?
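The arithmetic above can be checked with a short sketch. It assumes plain single-lane SPI (one bit per clock), RGB565 pixels, full-screen updates on both displays and no protocol overhead, so it gives an upper bound rather than a measured frame rate. The function names are made up for this illustration.

```python
# Back-of-the-envelope frame rate for two displays on one shared SPI bus.
WIDTH, HEIGHT = 240, 240
BYTES_PER_PIXEL = 2          # 16-bit RGB565
DISPLAYS = 2
SPI_CLOCK_HZ = 80_000_000    # single-lane SPI: one bit per clock cycle


def full_frame_bytes(width, height, bytes_per_pixel, displays):
    """Bytes needed to repaint every pixel on all displays once."""
    return width * height * bytes_per_pixel * displays


def max_fps(spi_clock_hz, frame_bytes):
    """Upper bound on full updates per second, ignoring all overhead."""
    return spi_clock_hz / (frame_bytes * 8)


frame_bytes = full_frame_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL, DISPLAYS)
print(frame_bytes, frame_bytes / 1024)                 # bytes and KiB per full update
print(round(max_fps(SPI_CLOCK_HZ, frame_bytes), 1))    # shared bus, both displays
print(round(max_fps(SPI_CLOCK_HZ, frame_bytes // 2), 1))  # one display per dedicated bus
```

Under these assumptions the shared bus tops out at a little over 43 full updates per second, and a dedicated bus per display doubles that bound.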

That is with a shared SPI bus like I use today. The question is whether LVGL supports flushing to multiple displays in parallel using DMA if I use a dedicated SPI bus for each. If it does, are there any caveats to be aware of?

I have searched both the forums and the documentation without finding any clear yes/no on that question, sorry if I missed something obvious. It is quite some work to re-route the prototype hardware if there is no point in it.

Additional questions

There are of course a ton of optimizations to do on the UI rendering side, but those are more general questions so I bet there are good threads on that. Feel free to give tips-n-tricks though, especially how to handle the round outer custom gauge not causing major screen flushes due to its shape. Or if there is something smart to prevent flushing pixels outside of the circular displays.

I also wouldn’t mind pointers to good up to date tips and tricks for using the PSRAM in LVGL to offload the display buffer and still be able to DMA from it, using Arduino Core. I have seen some discussions about it, but it is hard to know what is still relevant with newer LVGL and Arduino versions. The current planned hardware target for the displays (ESP32-S3R2) only has quad SPI for the PSRAM unfortunately, would octal SPI help a lot?

Environment

  • IDE: VSCode
  • Platform: Arduino via PlatformIO
  • LVGL version: 9.3.0
  • Driver: LovyanGFX (for no particular reason)
  • Wireless: ESP-NOW from gateway to one or more display units.

Hardware

  • Screens: 2x GC9A01, round 240x240 pixels, SPI
  • Devboard displays: Waveshare ESP32-S3 Pico with 2MB PSRAM
  • Devboard CAN gateway: ESP32 classic devboard
  • CAN interface on gateway: CAN-Shield from MrDIY

Oh this is a good one.

First thing is to get away from using LovyanGFX.

Second thing is to use DMA memory transfers and use 2 frame buffers for each of the displays.
Third thing is to move the display drivers and the sending of the buffers over to the core that you don’t have LVGL running on.

If you used an ESP32-S3 with 8MB of memory and 16MB of octal flash you would be able to overclock the SPIRAM so it runs at an effective speed of 240 MHz instead of 80 MHz. You would also have enough SPIRAM to load the whole program into by setting CONFIG_SPIRAM_XIP_FROM_PSRAM to y. That will allow faster access to the SPIRAM because access is no longer being shared with the flash as the code executes.

Basically you are going to need to use FreeRTOS to start a task that handles the display drivers. You lock that task to core 1 and start a task that runs the LVGL code on core 0. You will need to use semaphores to handle sending commands and buffer data so you don’t have overlapping access from both cores.

Your question isn't about LVGL or VSCode, and the platform isn't Android…
But simply put, yes: DMA with two dedicated SPI buses can speed up refresh to around 75 FPS or more with QSPI displays… All the power and the limits are defined by the chosen MCU and hardware, not by LVGL…

Your question isn't about LVGL or VSCode, and the platform isn't Android…

  • Thank you for pointing out my typo, it should of course be Arduino in the “Environment” list like the rest of the post. I fixed it now.
  • I don’t think I asked anything about VSCode, was it wrong to mention the IDE used under “Environment”?
  • How isn’t this a question about LVGL? I asked if LVGL handles flushing multiple displays in parallel. At least for me it isn’t obvious from the documentation that it does.

But simply put, yes: DMA with two dedicated SPI buses can speed up refresh to around 75 FPS or more with QSPI displays…

I never said the displays are QSPI, they are just regular SPI devices. If they were QSPI I wouldn’t have a problem at all at 80MHz.

LVGL does not handle the flush itself, it only calls the callback; the flushing is your job. LVGL actually renders the changes for all displays in one thread, but the flushing can be done in parallel if you write the code for that in the callback. The latest LVGL version has multithreaded rendering support, so you can render in parallel too, but my point is that more power requires more cores, which again comes down to the chosen MCU etc.

And with QSPI I meant that the hardware you chose is what limits you…

Thanks for the details @kdschlosser !

First thing is to get away from using LovyanGFX.

No problem, I only used it because it just worked. Is it considered bad practice to use it with LVGL or is it just too limited?

Second thing is to use DMA memory transfers and use 2 frame buffers for each of the displays.

Should I interpret that as LVGL has no problem flushing multiple displays in parallel?
Edit: @Marian_M answered that while I wrote the reply. It should work. :+1:

Basically you are going to need to use FreeRTOS to start a task that handles the display drivers. You lock that task to core 1 and start a task that runs the LVGL code on core 0.

I was already thinking about going dual-core to offload the ESP-NOW data processing and CAN frame decoding, leaving the second core dedicated to LVGL. The data flow between the two would also be trivial, so there would be very little locking.

Would it really make much difference moving the display driver over to a different core from LVGL when using DMA transfers? The display driver I wrote myself for another display on a Pi Pico LVGL project was so thin that it had almost no overhead for flush. I was literally just sending a command or two to the display, then setting up a DMA transfer, and I was done.

Talking about RTOS, I was planning to learn Zephyr since we use it at work. Would that be a viable RTOS in this scenario?

If you used an ESP32-S3 with 8MB of memory and 16MB of octal flash you would be able to overclock the SPIRAM so it runs at an effective speed of 240 MHz instead of 80 MHz.

It was just today I noticed that the 2MB version of the ESP32-S3 only has a quad connection for the PSRAM, unlike the larger S3 variants, so I might simply upgrade. I liked the small form factor of the Waveshare Pico though. Thanks for the tip.

When reading the ESP32-S3 docs I also noticed this tidbit:

  • Quad PSRAM only supports STR mode, while Octal PSRAM only supports DTR mode.

Since you seem to know a fair bit about this: does that mean that the 2MB version actually has only a fourth of the PSRAM bandwidth compared to the larger versions? 4 vs 8 bit and single vs double data rate.

Thanks in advance for any answers, I know it was a lot of questions above. =)

It does make a difference, but in a roundabout kind of way. The DMA buffer is only so large, and it is not large enough to fit the entire amount of data being sent into a single “transaction”. So while the transfer itself is non-blocking, loading the data for the next transaction is not. The CPU is required to do that, and that code is executed inside an ISR context. So when you are sending data to multiple displays like you want to do, you are going to end up with a heck of a lot of ISRs occurring. ISRs are actually quite expensive because of the time needed to save the current CPU state, load the code that is to run in the ISR context, run that code, and then restore the saved CPU state and resume from it. Because the ESP32-S3 you are using has dual cores, you can move the ISRs to the other core so LVGL will not get interrupted while it is running. To do this, the display drivers need to be started on the other core and the buffer data passed to those drivers there. It does make a difference.
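To get a feel for how many of those reloads are involved, here is a rough sketch. The 32 KiB transaction limit is purely a hypothetical value chosen for illustration; the real limit depends on how the SPI driver and its DMA are configured. The 40 FPS target is taken from the estimate earlier in the thread.

```python
import math


def dma_transactions(frame_bytes, max_transfer_bytes):
    """Number of DMA transactions needed to send one frame buffer."""
    return math.ceil(frame_bytes / max_transfer_bytes)


FRAME_BYTES = 240 * 240 * 2          # one 240x240 RGB565 frame
ASSUMED_MAX_TRANSFER = 32 * 1024     # hypothetical per-transaction limit
DISPLAYS = 2
TARGET_FPS = 40

per_frame = dma_transactions(FRAME_BYTES, ASSUMED_MAX_TRANSFER)
reloads_per_second = per_frame * DISPLAYS * TARGET_FPS
print(per_frame, reloads_per_second)
```

Each reload is a point where the CPU has to step in, so a smaller transaction limit or more displays raises the interrupt load proportionally.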

You can still push ESP-NOW over to that core as well, along with a bunch of other things. You will need to get your task management right and pass buffers between the running tasks correctly. You cannot let a task simply sit there and spin its wheels; you have to yield properly so that other tasks get a chance to run. The trick is to keep the loops from taking a long time to run, so if there is a larger amount of work to do, breaking it into chunks is always best.

Stop using the Arduino SDK. It doesn’t use a complete version of the ESP-IDF and it has a very large amount of wrapper code that makes code execution slower. My suggestion is to use the build system that is built into the ESP-IDF. It is not that hard to use, but it does take a while to get a handle on it. Having someone explain what to do makes it really easy. The documentation is not the clearest and could be better, which is why it helps if someone is able to lend a hand.

LovyanGFX is not a driver. It is a rendering framework, and it is written in a manner that allows it to render. So when you are writing a bitmap to what you believe is the display, you are actually not. The bitmap, which is actually LVGL frame buffer data, gets copied to another buffer, and that is the buffer that gets used to send the data to the display. That copying doesn’t need to happen and it slows things down big time.

You have the ability to create multiple displays in LVGL, or you can do some kung fu in your flush function to handle writing the data to both displays using a single “display” in LVGL. If you want to go the route of using a single display, you would set the resolution of the LVGL display to a width of 480. Data written to x coordinates below 240 would go to one display, and data written to x coordinates of 240 and above would go to the second display.
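The single-display route comes down to splitting each flushed rectangle at x = 240. Below is a minimal sketch of that split, written in Python only to show the coordinate logic. It assumes inclusive area coordinates (as LVGL areas are) and two 240-pixel-wide panels side by side; `split_area` is a hypothetical helper, not an LVGL function.

```python
PANEL_WIDTH = 240  # each physical panel is 240 pixels wide


def split_area(x1, y1, x2, y2, panel_width=PANEL_WIDTH):
    """Split an inclusive rectangle on a 2 * panel_width wide virtual display.

    Returns a list of (panel_index, x1, y1, x2, y2) tuples whose x
    coordinates are local to the panel they belong to.
    """
    parts = []
    if x1 < panel_width:
        # Part of the area lies on the first panel; clip it at the seam.
        parts.append((0, x1, y1, min(x2, panel_width - 1), y2))
    if x2 >= panel_width:
        # Part of the area lies on the second panel; shift x to panel-local.
        left = max(x1, panel_width) - panel_width
        parts.append((1, left, y1, x2 - panel_width, y2))
    return parts


print(split_area(200, 10, 279, 19))   # area straddling the seam
print(split_area(300, 0, 479, 239))   # area entirely on the second panel
```

One thing to keep in mind with this approach: when an area straddles the seam, each row in the rendered buffer contains pixels for both panels, so the per-panel data is not contiguous and has to be sent row by row or repacked first.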

There are some tricks you can use if you want to utilize 2 LVGL displays. Setting the user data for the display is an important one. What you pass as the user data is the reference to the display driver, so when you need to pass along the buffer that has been rendered you are automatically passing it to the correct display driver, because the handle is attached to that specific LVGL display. You will only need a single flush function to handle it. You also only need to create a single task, running on the second core, to handle both display drivers. There is no need to have 2 tasks to do that work.

This can get quite technical so I will do my best to explain it.

The SPIRAM and flash use the same SPI bus, so that’s #1. You are only going to be able to speed things up as far as the slowest item attached to that bus allows.

With the 2MB SPIRAM there are only 4 lanes connected to it. The same goes for the flash. So let's look at some theoretical numbers… 4 lanes running at 80 MHz means you are able to transfer 320,000,000 bits per second. Now this might seem fast, but watch this. First let's turn it into bytes, so divide that by 8 and you get 40,000,000. Now remember you have reads and writes occurring at the same time, so divide that in half. Now you are at 20,000,000. Uh oh, there are 2 devices connected to the bus, the flash AND the RAM, so divide that in half once again… Now you are down to 10,000,000. See how fast that number disappeared?

So most of the time, read and write operations to the SPIRAM are going to occur at these speeds, especially when using DMA transfers. This happens because the call to send the buffer data is non-blocking, so the flash is going to be accessed to load the code that needs to run. With a blocking call you would get better transfer speed, but you lose the ability to render at the same time as the data is being transferred.

If you bump up to the version that has octal SPIRAM but the flash is still using a quad SPI connection, you are only going to be able to tweak the SPI to what the slowest device is able to handle. You will have the ability to overclock to 120 MHz, but you will not have the ability to use DTR mode, so the best you will get for speed is 120 MHz. Remember the bus sharing between the SPIRAM and the flash? Well, that quad flash is going to take a while to load code, and that ends up becoming the bottleneck. Only one device is able to access the bus at a given time, so while code is being read from flash, access to the SPIRAM stops. There are several ways to handle this, and one of them is to load all of the code in flash into SPIRAM when the ESP32-S3 boots up. This is going to eat up quite a bit of SPIRAM, so if you have a lot of compiled code you are going to see your available SPIRAM drop by the size of the compiled binary.

If you get the S3 that has octal SPIRAM and also octal flash, you remove that bottleneck and you also get to double the effective SPI bus clock speed by turning on DTR. So instead of running at 120 MHz, the bus has an effective clock rate of 240 MHz. Let's see what those numbers look like.

8 lanes * 240,000,000 = 1,920,000,000 bits / 8 = 240,000,000 bytes * 0.5 * 0.5 = 60,000,000…

That is 6 times faster than what you get from the quad SPI. Compared to what a display can take, this is what it looks like.

Single display on a quad SPI bus running at 80 MHz

4 * 80,000,000 / 8 = 40,000,000 bytes

With the octal setup the memory side is above that number, instead of being lower than what the display is able to handle for a transmission speed. Being lower is not good, because you are only going to be able to feed data to the display as fast as you are able to read it from memory. Being under that number means your transfer speeds will be lower because of the bottleneck at the SPIRAM.

Having the displays attached to the same SPI bus is going to slow things down a lot: the maximum transfer speed will be cut in half. My suggestion is to have a separate bus for each display. It is going to put you upside down again, but having a max transfer of 30,000,000 to each display is going to be faster than having a max speed of 20,000,000 to each display. That max is going to be affected by whether or not LVGL is actually rendering to a buffer and by whether code is being loaded from flash, whereas with the 2 displays attached to the same bus you are not going to get any better than 20,000,000 per display no matter what is going on with the ESP32-S3. So if there is no access taking place to the flash and LVGL happens to not be rendering to a buffer, your SPIRAM access speed when transferring just jumped up to 120,000,000, which means you will be able to max out the transfer speed of 40,000,000 on each bus and still have some room left over for accessing the SPIRAM.

These are theoretical numbers which you won’t actually see and it’s intended to give an understanding of the mechanics taking place at the hardware level.
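The same back-of-the-envelope model can be written down as a small function, which makes it easy to try other combinations. This is only a sketch of the simplified reasoning above (halve once for simultaneous reads and writes, halve again for the flash sharing the bus); it is not a model of how the ESP32-S3 memory controller actually arbitrates, and the function name is made up for the example.

```python
def effective_bytes_per_second(lanes, clock_hz, double_data_rate=False,
                               sharing_factors=(0.5, 0.5)):
    """Theoretical bytes per second under the simplified sharing model.

    sharing_factors: one multiplier per thing competing for the bus
    (here: read/write overlap, and flash sharing the bus with the RAM).
    """
    bits_per_second = lanes * clock_hz * (2 if double_data_rate else 1)
    bytes_per_second = bits_per_second / 8
    for factor in sharing_factors:
        bytes_per_second *= factor
    return bytes_per_second


quad = effective_bytes_per_second(4, 80_000_000)                           # quad SPIRAM, STR
octal = effective_bytes_per_second(8, 120_000_000, double_data_rate=True)  # octal SPIRAM, DTR
print(quad, octal, octal / quad)

# Memory-side budget per display when two buses are fed at the same time.
print(octal / 2)
# Memory-side speed when nothing else is competing except read/write overlap.
print(effective_bytes_per_second(8, 120_000_000, True, sharing_factors=(0.5,)))
```

Passing an empty `sharing_factors` tuple gives the raw bus figure, which is how the 40,000,000 bytes for a quad bus at 80 MHz is obtained.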

My suggestion is to get the 8MB of octal SPIRAM and the 16MB or 32MB of octal flash. Have the S3 load the program into SPIRAM on boot, which is not going to be an issue because you have more than enough memory to do that. Set up the tasks using FreeRTOS and allocate 2 frame buffers for each display; the buffers need to be allocated in DMA memory.

I actually ordered two Espressif devkits with the ESP32-S3-WROOM-2-N32R16V module yesterday. From what I could read in the data sheets, only the WROOM-2 revision has octal flash as well as octal PSRAM. And 16MB of PSRAM is… a lot.

I guess it is back to creating a new prototype PCB when it arrives, to see how much of what you mentioned above I can actually pull off. The scope sure did grow on this relatively modest hack. =)

One last question though. As I mentioned above, my work uses Zephyr, so it would make a lot of sense to try it out at home as well to learn it. The LVGL docs even seem to indicate that flushing from a thread is already implemented. What are your thoughts @kdschlosser ?

The ESP32 SDK (ESP-IDF) is written specifically to use a modified version of FreeRTOS. I recommend using that modified version because it has been altered specifically to work with the ESP32 line of MCUs. It is already packaged with the SDK, so nothing other than the SDK itself needs to be downloaded to get it to work.

It’s really easy to use because it has been modified in a manner that alters the default behavior of FreeRTOS while leaving most of the API intact. That means the documentation for FreeRTOS still applies. They have added additional functions, which are documented in the ESP-IDF.

I can give you some code that makes synchronization easier by simplifying the use of semaphores. Task management is really simple with FreeRTOS, and the folks over at Espressif extended the capabilities by adding code that allows you to target specific cores to run tasks on. So it’s really a breeze to use.

You don’t want to use the built-in “OS” feature of LVGL. It is written to use FreeRTOS, yes, but not the ESP-IDF’s modified version of it. While it does work, it’s not written to get the best possible performance.

I can give you a hand with the code aspect if you want. Once you see how the flow of the program needs to be set up you will understand exactly what I am talking about.

LVGL is NOT thread/task safe. No matter what you read in the LVGL docs. If it says that somewhere in the LVGL docs could you please post me a link to it so I can forward that information to the author to have it corrected because it must be a typo.

LVGL flushing from multiple frame buffers has a mechanism built in to keep LVGL from rendering to a buffer that is currently being transmitted. But it doesn’t make LVGL thread safe. You cannot update any LVGL widgets while LVGL is rendering to a frame buffer as it will cause memory corruption.

Everything done in LVGL must occur in the same thread, including the call to the flush function. What happens with the frame buffer data inside the flush function and beyond is what we do have control over: handling the buffer properly, passing a pointer to it to the task running on the other core, keeping the data flow in sync, and restricting access to the buffer to a single thread/task at a time.
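The handoff described here can be modeled in a few lines. The sketch below uses Python threads purely to show the synchronization pattern; it is not ESP-IDF or LVGL code, and every name in it is hypothetical. One thread stands in for the LVGL task, which renders into two alternating buffers and calls a flush function. A second thread stands in for the display task on the other core. A semaphore makes sure a buffer is never reused while it is still being sent.

```python
import queue
import threading


def display_task(flush_queue, transfer_idle, sent):
    """Stands in for the task on the other core that owns the display drivers."""
    while True:
        item = flush_queue.get()
        if item is None:              # shutdown marker, only needed for this demo
            return
        display_id, buf = item
        sent.append((display_id, bytes(buf)))  # stand-in for a completed DMA transfer
        transfer_idle.release()       # the buffer may now be rendered into again


def flush_cb(flush_queue, transfer_idle, display_id, buf):
    """Called from the rendering thread: hand the buffer over, do not send it here."""
    transfer_idle.acquire()           # wait until the previous transfer has finished
    flush_queue.put((display_id, buf))


def run_demo(frames=4):
    flush_queue = queue.Queue()
    transfer_idle = threading.Semaphore(1)
    sent = []
    worker = threading.Thread(target=display_task,
                              args=(flush_queue, transfer_idle, sent))
    worker.start()
    buffers = [bytearray(4), bytearray(4)]    # two frame buffers for one display
    for frame in range(frames):
        buf = buffers[frame % 2]
        buf[:] = bytes([frame]) * len(buf)    # "render" into the free buffer
        flush_cb(flush_queue, transfer_idle, 0, buf)
    flush_queue.put(None)
    worker.join()
    return sent


print(run_demo())
```

Because the rendering thread only blocks when the previous transfer is still in flight, rendering frame N+1 overlaps with sending frame N, which is the point of having two buffers. The `display_id` field is there so that one display task could serve both displays.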

I can point you to an example of what I am talking about, but it will be very hard to understand what is happening because it is written to be used from a high-level programming language. There is a lot of extra care taken in my use case that would not be needed if the code is compile-time code and not runtime code.

You asked for better performance, and in order to get that you are going to need to make some changes to both the hardware and the software. I will guarantee you that you will not be unhappy with the result.

There is one last performance boost that can be achieved, but it would require rewriting chunks of LVGL in order to make it work. LVGL is written in a manner that only allows a single “instance” of it. What I mean by that is you cannot have 2 LVGL task managers running at the same time, so you will not be able to render to 2 different frame buffers in parallel. This is because of how LVGL has been written and its use of global variables to store the state of different things. The code would need to be changed to remove the use of the global variables, and then you could have 2 separate instances of LVGL running on 2 different cores, handling 2 different displays at the same time. This would mean more RAM use and more CPU use, but running 2 separate instances of LVGL at the same time would be the fastest thing that could be done.

If you wanted to get that advanced, you would end up having issues with WiFi and Bluetooth because of how CPU intensive those things are. Those are what gets run on the second core, and this is what the ESP-IDF does internally. Dealing with DMA IO transfers is not something that would cause a problem for running those things on the same core.

The documentation isn’t claiming that rendering in LVGL on Zephyr is used over multiple threads, it specifically talks about the flushing of buffers here, also note the text at the end about newer versions of Zephyr.

The main question is simply about being pragmatic with the limited time resources of a small hobby project. I mainly opted for ESP32 in this project since I could get ahold of a finished and well made CAN-shield and because learning to use ESP32s would be useful at work. The work aspect would apply to Zephyr as well.

I really appreciate you taking time to give your long and detailed replies, I can feel that you are really enthusiastic about these things, which is great. I on the other hand need to reel in my ambitions a bit, for now.

At least my little project reached a milestone today with the can-gateway and prototype display working in the car to show the live engine RPMs.

As far as my long responses go, it’s what I do. Better to explain something in full than play 20 questions.

It’s not being enthusiastic… it’s being thorough.