LVGL performance dropped when switching from v7 to v9. How can I improve it?


What MCU/Processor/Board and compiler are you using?

i.MX RT1050/1060

What LVGL version are you using?

LVGL 7.6 as part of Zephyr 2.7
LVGL 9.3 as part of Zephyr 4.1

What do you want to achieve?

Reduce the time required to generate the framebuffer

What have you tried so far?

The UI is 768x1024 pixels.
My program has several screens that I switch between while working. To switch, I use the device's physical buttons, which are bound to an LV_INDEV_TYPE_BUTTON input device.
The first screen has about 20 elements: one is a 768x689 BMP, several are PNGs (img widgets), and the rest are plain LVGL objects (base, text).
The second screen has about 50 elements, some of them hidden. The screen object is an img widget with a 768x1024 BMP as its src. At any one moment the screen may display about 10 PNGs (img widgets) and about 20 base and text widgets.
The third screen has about 60 elements, some of them also hidden. The screen object is an img widget with a 768x1024 BMP as its src. At any one moment the screen may display about 10 PNGs (img widgets) and about 10 base and text widgets.

When the user presses a button, I prepare the new screen: I show or hide images depending on the device status and set the text and colors. After that I call lv_scr_load() to switch to the newly prepared screen.

With LVGL v7.6, the time between detecting the button click and calling lv_scr_load() is about 60-70 ms when the screen is first activated and 30-50 ms on subsequent activations. I observe similar values with LVGL v9.3.
With LVGL v7.6, the time between the lv_scr_load() call and the ELCDIF_SetNextBufferAddr() call in display_flush() is about 160-260 ms on the first activation and 105-110 ms on subsequent activations. With LVGL v9.3, the first activation takes 450-650 ms and subsequent activations take 270-320 ms.
As far as I understand, after the lv_scr_load() call LVGL fills a new framebuffer inside lv_task_handler(), and in v9.3 this takes 2-3 times as long as in v7.
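
For reference, here is a minimal sketch of how intervals like these can be collected. The names span_stats_t, span_begin() and span_end() are hypothetical and not part of LVGL or Zephyr. The sketch assumes a free-running 32-bit microsecond timestamp is available (on Zephyr this could be derived from k_cycle_get_32(), which is my assumption here, not a requirement).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: tracks the min/max of a repeated interval
 * measured from raw 32-bit microsecond timestamps. */
typedef struct {
    uint32_t start_us; /* timestamp taken at the start of the interval */
    uint32_t min_us;   /* shortest interval seen so far */
    uint32_t max_us;   /* longest interval seen so far */
    uint32_t samples;  /* number of completed intervals */
} span_stats_t;

/* Record the start of an interval. */
void span_begin(span_stats_t *s, uint32_t now_us)
{
    assert(s != NULL);
    s->start_us = now_us;
}

/* Close the interval, update min/max, and return its length. */
uint32_t span_end(span_stats_t *s, uint32_t now_us)
{
    assert(s != NULL);
    /* Unsigned subtraction stays correct across one counter wraparound. */
    uint32_t elapsed = now_us - s->start_us;

    if (s->samples == 0 || elapsed < s->min_us) {
        s->min_us = elapsed;
    }
    if (s->samples == 0 || elapsed > s->max_us) {
        s->max_us = elapsed;
    }
    s->samples++;
    return elapsed;
}
```

In my setup the equivalent of span_begin() sits right before lv_scr_load() and the equivalent of span_end() sits in display_flush() just before ELCDIF_SetNextBufferAddr(). A zero-initialized span_stats_t is a valid starting state.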

Now I can clearly see a delay when switching screens after pressing a button, whereas in v7 it was unnoticeable.
All images (BMP, PNG) are stored in a read-only filesystem whose contents are located in RAM.

The call stack when accessing the BMP file contents in v7.6 looks like this:
[image: call stack, v7.6]
The call stack when accessing the same file in v9.3:

All data (OS heap, LVGL heap, file contents) is stored in external SDRAM, accessed via SEMC.

How can I reduce this time?

I took some measurements.
v7.6: lv_img_design(), located in src\lv_widgets\lv_img.c, takes 25.1 ms when accessing the 768x689 BMP file.
v9.3: draw_image(), located in src\widgets\image\lv_image.c (as far as I understand, it serves the same purpose as lv_img_design() in v7.6), takes 51.5 ms when accessing the same BMP.
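
To put the numbers above side by side, a small sketch of the slowdown arithmetic. The function name slowdown() is hypothetical; it only divides the v9.3 time by the v7.6 time.

```c
#include <assert.h>

/* Hypothetical helper: how many times slower the new measurement is
 * compared to the old one. Both arguments are in the same unit (ms). */
double slowdown(double old_ms, double new_ms)
{
    assert(old_ms > 0.0);
    return new_ms / old_ms;
}
```

With my measurements, slowdown(25.1, 51.5) is about 2.05 for the single BMP draw. For the whole frame, comparing the range ends pairwise, slowdown(105.0, 270.0) is about 2.57 and slowdown(110.0, 320.0) is about 2.91 on subsequent activations, while slowdown(160.0, 450.0) is about 2.81 and slowdown(260.0, 650.0) is 2.5 on first activation. So the whole frame appears to slow down somewhat more than the single image draw does, which may mean the BMP draw is not the only contributor.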