How to use multiple threads to render the screen in parallel?

Description

I have tried setting LV_DRAW_SW_DRAW_UNIT_CNT to 2 so that draw tasks are created on different CPU cores, but I found that almost all of the work is done by only one task.

I want to know how to use multiple threads to render the screen in parallel. Am I using it wrong? Are there any other requirements besides setting the LV_DRAW_SW_DRAW_UNIT_CNT value?

What MCU/Processor/Board and compiler are you using?

ESP32

What LVGL version are you using?

V9.1

What do you want to achieve?

I want to improve the FPS of the LCD display through parallel rendering.

What have you tried so far?

I have tried setting LV_DRAW_SW_DRAW_UNIT_CNT to 2 so that draw tasks are created on different CPU cores, but I found that almost all of the work is done by only one task.

Code to reproduce

#define LV_DRAW_SW_DRAW_UNIT_CNT 2


Hi, I face the same issue with an i.MX93 EVK board + Yocto Linux imx-6.1.55-2.2.0.

I added
#define LV_USE_OS LV_OS_PTHREAD
#define LV_DRAW_SW_DRAW_UNIT_CNT 2

I also enabled:
#define LV_USE_PARALLEL_DRAW_DEBUG 1

What I observe with the LVGL benchmark demo is:

  • There are indeed 2 threads doing the drawing, but they run on the same core (according to the top command).
  • There is no additional CPU power: one core is busy with the two drawing threads, while the second i.MX93 core sits unused.

Reading “LVGL v9 is released” on LVGL’s Blog:
“Next week, I plan to conduct further tests with an i.MX 9 multi-core MPU board to assess the benefits of parallel software rendering”

@kisvegabor, are we doing something wrong configuring LVGL?
Did you get time to test on an i.MX9 multi-core MPU?

Regards

Hi, thanks for your reply.

My configuration is
#define LV_USE_OS LV_OS_FREERTOS
#define LV_DRAW_SW_DRAW_UNIT_CNT 2

Regarding this issue, I noticed something, though I’m not sure it is the real cause. There are two fields, preferred_draw_unit_id and preference_score, in the _lv_draw_task_t structure. I found that preferred_draw_unit_id is set to DRAW_UNIT_ID_SW in the evaluate function, and that DRAW_UNIT_ID_SW is passed to lv_draw_get_next_available_task() in the dispatch function. Does this mean that when multiple tasks are in a waiting state, the code will prefer to hand them to the specified draw unit?

struct _lv_draw_task_t {

    lv_draw_task_t * next;

    lv_draw_task_type_t type;

    /**
     * The area where to draw
     */
    lv_area_t area;

    /**
     * The real draw area. E.g. for shadow, outline, or transformed images it's different from `area`
     */
    lv_area_t _real_area;

    /** The original area which is updated*/
    lv_area_t clip_area_original;

    /**
     * The clip area of the layer is saved here when the draw task is created.
     * As the clip area of the layer can be changed as new draw tasks are added its current value needs to be saved.
     * Therefore during drawing the layer's clip area shouldn't be used as it might be already changed for other draw tasks.
     */
    lv_area_t clip_area;

    volatile int state;              /*int instead of lv_draw_task_state_t to be sure its atomic*/

    void * draw_dsc;

    /**
     * The ID of the draw_unit which should take this task
     */
    uint8_t preferred_draw_unit_id;

    /**
     * Set to which extent `preferred_draw_unit_id` is good at this task.
     * 80: means 20% better (faster) than software rendering
     * 100: the default value
     * 110: means 10% worse (slower) than software rendering
     */
    uint8_t preference_score;
};

#define DRAW_UNIT_ID_SW 1

static int32_t evaluate(lv_draw_unit_t * draw_unit, lv_draw_task_t * task)
{
    LV_UNUSED(draw_unit);

    switch(task->type) {
        case LV_DRAW_TASK_TYPE_IMAGE:
        case LV_DRAW_TASK_TYPE_LAYER: {
                lv_draw_image_dsc_t * draw_dsc = task->draw_dsc;

                /* not support skew */
                if(draw_dsc->skew_x != 0 || draw_dsc->skew_y != 0) {
                    return 0;
                }

                bool transformed = draw_dsc->rotation != 0 || draw_dsc->scale_x != LV_SCALE_NONE ||
                               draw_dsc->scale_y != LV_SCALE_NONE ? true : false;

                bool masked = draw_dsc->bitmap_mask_src != NULL;
                if(masked && transformed)  return 0;

                lv_color_format_t cf = draw_dsc->header.cf;
                if(masked && (cf == LV_COLOR_FORMAT_A8 || cf == LV_COLOR_FORMAT_RGB565A8)) {
                    return 0;
                }
            }
            break;
        default:
            break;
    }

    if(task->preference_score >= 100) {
        task->preference_score = 100;
        task->preferred_draw_unit_id = DRAW_UNIT_ID_SW;
    }

    return 0;
}


static int32_t dispatch(lv_draw_unit_t * draw_unit, lv_layer_t * layer)
{
    LV_PROFILER_BEGIN;
    lv_draw_sw_unit_t * draw_sw_unit = (lv_draw_sw_unit_t *) draw_unit;

    /*Return immediately if it's busy with draw task*/
    if(draw_sw_unit->task_act) {
        LV_PROFILER_END;
        return 0;
    }

    lv_draw_task_t * t = NULL;
    t = lv_draw_get_next_available_task(layer, NULL, DRAW_UNIT_ID_SW);
    if(t == NULL) {
        LV_PROFILER_END;
        return -1;
    }

    void * buf = lv_draw_layer_alloc_buf(layer);
    if(buf == NULL) {
        LV_PROFILER_END;
        return -1;
    }

    t->state = LV_DRAW_TASK_STATE_IN_PROGRESS;
    draw_sw_unit->base_unit.target_layer = layer;
    draw_sw_unit->base_unit.clip_area = &t->clip_area;
    draw_sw_unit->task_act = t;

#if LV_USE_OS
    /*Let the render thread work*/
    if(draw_sw_unit->inited) lv_thread_sync_signal(&draw_sw_unit->sync);
#else
    execute_drawing_unit(draw_sw_unit);
#endif
    LV_PROFILER_END;
    return 1;
}

The above is just my guess, and I also want to know the mechanism of parallel rendering.

Regards

Hey,

I was fighting quite a bit with Yocto, but in the end I couldn’t test it on i.MX :frowning:

LVGL just uses pthread to create the threads, and I assumed they would be assigned to different cores automatically. (At least that’s what happens on my Linux notebook.)

Could you try this?

This page of the docs describes how the new rendering pipeline works. If there are 2 draw units with the same ID, LVGL tries the first one; if it’s busy, it tries the next one. If that one is also busy, it waits for a draw task to complete and tries to dispatch again.

The easiest way to go about this is to create a thread on each core you want to use and have those threads stall on a mutex. When the flush callback is called for a display, you store the buffer pointer in a variable the thread can access and then release the mutex so the thread can flush the buffer.

Since you are using the ESP32, it is a bit more involved if you use the esp_lcd component and DMA memory. With the RGB panel driver it gets even more complicated, because the driver creates the buffers and you have to get access to them. This is very doable on the ESP32, but I believe LVGL is still only able to process one display’s data at a time, and if you are using DMA memory a flush thread becomes pointless: the processor isn’t involved in sending the data to the display, so the flush doesn’t block. DMA is really what you should be using.

I would suggest using the ESP32-S3, because its SPIRAM can be used as DMA memory, whereas on the other variants it cannot. If you are not using an S3, only a very small amount of DMA memory is available: just enough for double buffering a 480x320 16-bit display with the frame buffers set to 1/10th of the RGB frame size, which comes out to 61,440 bytes for both frame buffers.

Here is an example written for MicroPython using SDL2 as the display…

This function gets called from the flush callback. It hands over the buffer to be written and unlocks the mutex so the thread can write the data to the display.

mp_lcd_err_t sdl_tx_color(mp_obj_t obj, int lcd_cmd, void *color, size_t color_size, int x_start, int y_start, int x_end, int y_end)
{
    LCD_UNUSED(x_start);
    LCD_UNUSED(y_start);
    LCD_UNUSED(x_end);
    LCD_UNUSED(y_end);
    LCD_UNUSED(color_size);

    mp_printf(&mp_plat_print, "sdl_tx_color started\n");

    mp_lcd_sdl_bus_obj_t *self = MP_OBJ_TO_PTR(obj);

    /* Wait for the previous transfer, then hand over the new buffer */
    while (!self->trans_done) {}
    self->trans_done = false;
    self->panel_io_config.buf_to_flush = color;
    SDL_UnlockMutex(self->panel_io_config.mutex);

    if (self->callback != mp_const_none && mp_obj_is_callable(self->callback)) {
        mp_call_function_n_kw(self->callback, 0, 0, NULL);
    }

    mp_printf(&mp_plat_print, "sdl_tx_color finished\n");

    return LCD_OK;
}

This is the code the thread runs.

int flush_thread(void *self_in)
{
    mp_printf(&mp_plat_print, "flush_thread running\n");

    mp_lcd_sdl_bus_obj_t *self = (mp_lcd_sdl_bus_obj_t *)self_in;
    void *buf;
    int pitch;

    while (!self->panel_io_config.exit_thread) {
        /* Stall here until sdl_tx_color releases the mutex */
        SDL_LockMutex(self->panel_io_config.mutex);
        mp_printf(&mp_plat_print, "flush_thread started\n");

        if (self->panel_io_config.buf_to_flush != NULL) {
            pitch = self->panel_io_config.width * self->panel_io_config.bytes_per_pixel;
            buf = self->panel_io_config.buf_to_flush;
            SDL_UpdateTexture(self->texture, NULL, buf, pitch);
            SDL_RenderClear(self->renderer);
            SDL_RenderCopy(self->renderer, self->texture, NULL, NULL);
            SDL_RenderPresent(self->renderer);
        }

        self->trans_done = true;
        mp_printf(&mp_plat_print, "flush_thread finished\n");
    }
    mp_printf(&mp_plat_print, "flush_thread ended\n");
    return 0;
}

When the display is initialized, before creating the thread, the mutex is created and then locked. That way, when the thread starts and tries to lock the mutex, it stalls. This keeps the thread from consuming all of the CPU time on that core.

I added a second guard for writing the display buffer data, which you can see in the code as self->trans_done = true;. That second guard is needed because I have to call a Python function in order to call lv_display_flush_ready, and in MicroPython I can only call a Python function from C code on the main thread. If you look at the first code example, you will see a stall until trans_done is set to true. Once that happens I can set the pointer to the buffer, release the mutex, and then make the callback telling LVGL the buffer transfer is done, even though it really isn’t. In your code you would not have to do this: you can call lv_display_flush_ready directly from the thread that is flushing the buffer, because no allocation is happening there.