Synchronising flush with vsync

Description

I have LVGL working and the results are good. I have a slight tearing effect during animation which I was looking to reduce. I cannot use true double buffering, as my target will only have the on-board RAM of the STM32F429, with a 320 x 240 display.

I would like to synchronize the flush / buffer updates to the line interrupt of the LTDC peripheral (the interrupt will fire at a programmable display line). What I don’t understand at the moment is how to work out when LVGL has completed a frame update. I’m also not sure of the best way to hold up LVGL while I wait for the next line interrupt.

I have two ideas. In the flush function I could spin on a flag which is set in the line interrupt, but I would need to know in the flush function whether or not LVGL has finished updating the frame, which I don’t think I can determine at present. Or I could synchronize the calling of lv_task_handler() to the line interrupt, ensuring everything runs in step with the frame write, but this seems very heavy-handed.
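Something like this is what I have in mind for the first idea (untested sketch; the sync line, the hltdc handle and the copy routine are placeholders from my own setup):

extern LTDC_HandleTypeDef hltdc;                 // from the HAL LTDC init
#define SYNC_LINE 239                            // placeholder: a line near the end of the visible area

static volatile bool ltdc_line_hit = false;

// HAL_LTDC_ProgramLineEvent(&hltdc, SYNC_LINE) is called once at init to arm the first event
void HAL_LTDC_LineEventCallback(LTDC_HandleTypeDef *ltdc)
{
    ltdc_line_hit = true;
    HAL_LTDC_ProgramLineEvent(ltdc, SYNC_LINE);  // the HAL disables the line interrupt, so re-arm it
}

static void my_flush_cb(lv_disp_drv_t *disp_drv, const lv_area_t *area, lv_color_t *color_p)
{
    // this waits on *every* flush, because I can't tell which flush is the last one of the frame
    ltdc_line_hit = false;
    while (!ltdc_line_hit) {
    }

    copy_area_to_framebuffer(area, color_p);     // placeholder copy into the LTDC frame buffer
    lv_disp_flush_ready(disp_drv);
}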

Does anyone have experience with this and can share some ideas?

Thanks

What MCU/Processor/Board and compiler are you using?

STM32F429, on the discovery board at the moment, but the final target will not have the off-board RAM.

What do you want to achieve?

Synchronized frame updates with display frame write.

Hi,

There is no simple built-in method for this, but one could be added.
A correct solution would be to add a sync_cb (which can return LV_RES_OK or LV_RES_INV) to the display driver struct and have the refresh spin on sync_cb until it is safe to start drawing.

What do you think about it?

A quick and dirty solution would be to add this synchronization manually to disp_refr_task in lv_refr.c.
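Roughly, from the driver's side it could look like this (hypothetical, of course; neither the field nor the call exists yet):

extern volatile bool vsync_flag;                 // set from the tear/line interrupt

// hypothetical: LVGL would call sync_cb repeatedly before a refresh
// and only start drawing once it returns LV_RES_OK
static lv_res_t my_sync_cb(lv_disp_drv_t *disp_drv)
{
    (void)disp_drv;
    return vsync_flag ? LV_RES_OK : LV_RES_INV;
}

disp_drv.sync_cb = my_sync_cb;                   // proposed field, not in the driver struct yet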

sync_cb sounds like a good idea. At the moment I am synchronizing lv_task_handler() to the LTDC line interrupt, just to see if it helped with the tearing, and to be honest it didn’t really. I realise that without double frame buffers it can’t be eliminated.
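Concretely, the main loop is just gated on the line-interrupt flag (simplified):

while (1) {
    while (!ltdc_line_hit) {
        __WFI();                 // sleep until the LTDC line interrupt fires
    }
    ltdc_line_hit = false;

    lv_task_handler();           // everything, including the flush, now runs just after the sync line
}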

I still think this is a good idea and could be used for all types of LCD controllers that provide tearing or line position feedback.

On which line do you synchronize? I suppose it should be the last one.

I have played with a few different interrupt lines to see what the difference was. There is minimal difference in my project, as the real problem is that the frame update takes longer than the LCD refresh, so whatever I do there is always a partial frame update.

I am currently working on getting the DMA2D integrated; I have done the fill but not the blend yet. This could decrease the frame update time to a point where tearing synchronization becomes more important.

As an aside, I noticed that in V6.12 of the library the check to use the GPU for a fill is based on fill width, whereas the GPU is capable of using the line offset to fill any rectangle, so an area-based check is more appropriate (plus I’ve found the setup code for the DMA2D to be very short, about 10 lines, so it’s rarely not worth using).
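i.e. something more like this, where GPU_SIZE_LIMIT is a made-up threshold to tune per target:

// decide on width * height rather than width alone
bool use_gpu = lv_area_get_size(fill_area) > GPU_SIZE_LIMIT;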

In v7 DMA2D is integrated right into LVGL and you can just enable it in lv_conf.h.

That is interesting. It does seem a bit overkill to use the ST HAL though. For example, to start a DMA2D fill you only need to do this:

typedef struct dma2d__fill_t {
    uint32_t colour;                    // colour to fill (must adhere to colour format in options)
    uint32_t destination;               // destination address
    uint16_t destinationOffset;         // line offset in pixels (add to end of each line)
    uint16_t pixelsPerLine;             // pixels per line
    uint16_t numberOfLines;             // number of lines
} dma2d__fill_t;

void
dma2d__fill(UNUSED uint8_t dma2d, dma2d__fill_t *options, dma2d__interrupts_t *interrupts)
{
    // wait for any previous transfer to finish before touching the registers
    while (BIT__TEST(DMA2D->CR, DMA2D_CR_START_Pos)) {
    }

    DMA2D->CR = 0x30000;                // register-to-memory mode (fill)
    DMA2D->OMAR = options->destination;
    DMA2D->OCOLR = options->colour;
    DMA2D->OOR = options->destinationOffset;
    DMA2D->NLR = (options->pixelsPerLine << DMA2D_NLR_PL_Pos)
        | (options->numberOfLines << DMA2D_NLR_NL_Pos);

    // start transfer
    BIT__SET(DMA2D->CR, DMA2D_CR_START_Pos);
}

This isn’t quite complete (it doesn’t set up interrupts or deal with errors), but it does show the minimum needed to get things working.
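The blend case shouldn’t be much longer either; roughly something like this (untested sketch in the same style, assuming RGB565 buffers and a constant opacity applied to the foreground):

typedef struct dma2d__blend_t {
    uint32_t foreground;                // source pixels to blend in
    uint32_t background;                // destination (read, blended, written back)
    uint16_t foregroundOffset;          // source line offset in pixels
    uint16_t backgroundOffset;          // destination line offset in pixels
    uint16_t pixelsPerLine;             // pixels per line
    uint16_t numberOfLines;             // number of lines
    uint8_t  opacity;                   // constant alpha applied to the foreground
} dma2d__blend_t;

void
dma2d__blend(UNUSED uint8_t dma2d, dma2d__blend_t *options, dma2d__interrupts_t *interrupts)
{
    // wait for any previous transfer to finish
    while (BIT__TEST(DMA2D->CR, DMA2D_CR_START_Pos)) {
    }

    DMA2D->CR = 0x20000;                        // memory-to-memory with blending
    DMA2D->FGMAR = options->foreground;
    DMA2D->FGOR = options->foregroundOffset;
    DMA2D->BGMAR = options->background;
    DMA2D->BGOR = options->backgroundOffset;
    DMA2D->OMAR = options->background;          // write the result back over the background
    DMA2D->OOR = options->backgroundOffset;
    // RGB565 in and out, foreground alpha replaced by the constant opacity
    DMA2D->FGPFCCR = ((uint32_t)options->opacity << DMA2D_FGPFCCR_ALPHA_Pos)
        | (0x1U << DMA2D_FGPFCCR_AM_Pos)
        | 0x2U;
    DMA2D->BGPFCCR = 0x2;
    DMA2D->OPFCCR = 0x2;
    DMA2D->NLR = (options->pixelsPerLine << DMA2D_NLR_PL_Pos)
        | (options->numberOfLines << DMA2D_NLR_NL_Pos);

    // start transfer
    BIT__SET(DMA2D->CR, DMA2D_CR_START_Pos);
}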

I’m using the fill function for GPU fills in LVGL like this:

void
lvgl__gpuFill(
    UNUSED lv_disp_drv_t *drv,
    lv_color_t *destinationBuffer,
    lv_coord_t destinationWidth,
    const lv_area_t *fillArea,
    lv_color_t color)
{
    lv_color_t *destination = destinationBuffer;
    destination += (destinationWidth * fillArea->y1) + fillArea->x1;   // advance to the first pixel of the fill area
    uint16_t fillWidth = lv_area_get_width(fillArea);
    uint16_t fillHeight = lv_area_get_height(fillArea);

    dma2d__fill_t fillOptions = {
        .colour = (uint32_t)color.full,
        .destination = (uint32_t)destination,
        .destinationOffset = destinationWidth - fillWidth,
        .pixelsPerLine = fillWidth,
        .numberOfLines = fillHeight,
    };

    dma2d__fill(0, &fillOptions, 0);
}
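For completeness, it’s hooked in through the normal v6 display driver fields; roughly like this (the buffer size and my flush function lvgl__flush are just from my setup):

static lv_disp_buf_t disp_buf;
static lv_color_t buf[LV_HOR_RES_MAX * 40];      // partial draw buffer, size just an example

lv_disp_buf_init(&disp_buf, buf, NULL, LV_HOR_RES_MAX * 40);

lv_disp_drv_t disp_drv;
lv_disp_drv_init(&disp_drv);
disp_drv.buffer = &disp_buf;
disp_drv.flush_cb = lvgl__flush;                 // my flush function
disp_drv.gpu_fill_cb = lvgl__gpuFill;            // needs LV_USE_GPU 1 in lv_conf.h
lv_disp_drv_register(&disp_drv);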

I was thinking more about this and I was wondering if putting the GPU code directly into the library is the right thing to do? It feels like it would be better to bring all the new hooks out to function pointers so any GPU can use them, then provide the ST HAL DMA2D code as a driver?

That way the core library is not polluted with third party code and developers are free to use all the potential GPU hooks.

Maybe this isn’t the right place to mention this, but I thought it was worth mentioning.

This is how we used to do things, but we found that there was a substantial performance boost when the functions were integrated directly.

The external callbacks still exist and work (gpu_fill_cb and gpu_blend_cb).

Setting the bits directly instead of using HAL calls would indeed be faster, but less clear. What do you think the performance impact of using the HAL is?

We decided to add GPU support directly to LVGL because too many callbacks would be required to support every possible case. From a software architecture point of view it is really not that good, but it makes it possible to use the full power of the GPUs.

Performance impact probably isn’t huge, but the code would be far more readable, in my opinion. You only need to set up the registers that need changing.

I still think the better idea is not to include it in the library at all. First, there could be licensing complications with integrating so tightly with your own MIT-licensed library. Second, I only count four ST DMA2D-specific function calls, which seems very reasonable to move to function pointers, alongside the two that are already there.

Of course these are only my opinions and won’t stop me from using the library!

You also save space. ST’s HAL is useful if you want to start on your project right away, but it is one of the largest manufacturer libraries I have seen.

Might it also conflict with users’ own imports of the ST HAL library?

Hmm. I’m not sure how there can be conflicts, as we don’t bundle the HAL library; we assume that it already exists in your project (if you enable the STM32 GPU in lv_conf.h).

Sorry, missed that, I assumed you bundled a known working version.

In my opinion, the tight integration gives more flexibility (at the cost of readability, of course). Imagine that there is a GPU that can draw rounded rectangles with a gradient. To utilize it, we’d need a callback for it. If it supports text rendering, a callback would be required for that too. In the end we would have a bunch of complex callbacks with 10-15 parameters. And the user would still need to add their GPU functions or copy/paste them from example projects.

I agree to skip the HAL library. @microwavesafe Would you like to contribute with that?

I can understand your reasoning about the tight integration. It does give maximum flexibility. It shouldn’t take too much effort to rewrite the HAL calls as register writes. I need to upgrade to v7 of the library first, but then I will be able to take a look at it.

I did see that you wait for the DMA2D to finish at the end of each function, just after starting the transfer. In my own code I wait at the beginning, on the assumption that the CPU has other work to be doing while the DMA2D is working. I would probably change that, if that’s OK?

I was also wondering if we could check whether the DMA2D is busy and, if it is, use the CPU to perform the next fill or blend, so the processing runs in parallel rather than having the CPU poll the DMA2D start bit and call the wait callback?

That’s a good idea!

I was thinking a bit more about how this would work. If the GPU is performing a blend then we can’t allow the CPU to blend in the same area, as the result may depend on the result of the GPU. An out-of-sequence fill would also cause problems. So I think the logic would go something like this:

// area the GPU was last asked to work on (needs to persist between calls)
static lv_area_t gpu_area;

// check start bit to determine if busy
if (!gpu_busy()) {
    // set line watermark to 50% to give us some information as to the progress of the GPU
    gpu_set_watermark_50_percent(area);
    gpu_area = area;
    gpu_run(parameters, area);
}
else {
    lv_area_t gpu_working_area;

    // work out which part of gpu_area is still to be written;
    // the watermark only tells us which half the GPU has reached
    if (gpu_watermark_reached()) {
        gpu_working_area = second_half(gpu_area);
    }
    else {
        // not past the half-way line yet, so the whole area may still be written
        gpu_working_area = gpu_area;
    }

    // if the area still being written and the next area to write collide then wait for the GPU to finish
    if (area_collision(gpu_working_area, area)) {
        gpu_wait();
        gpu_set_watermark_50_percent(area);
        gpu_area = area;
        gpu_run(parameters, area);
    }
    else {
        // no overlap: let the CPU do this one while the GPU carries on
        cpu_run(parameters, area);
    }
}

This may be more overhead than the speed up gives us, but it might be worth exploring.
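The helper functions would be fairly small too, something like this (untested; gpu_busy()/gpu_wait() would just be the START-bit test and spin already used in the fill function):

// true if two areas overlap
static bool area_collision(lv_area_t a, lv_area_t b)
{
    return (a.x1 <= b.x2) && (b.x1 <= a.x2) && (a.y1 <= b.y2) && (b.y1 <= a.y2);
}

// lower half (by line) of an area, i.e. what the GPU still has left once the watermark has been passed
static lv_area_t second_half(lv_area_t a)
{
    lv_coord_t height = lv_area_get_height(&a);
    a.y1 = a.y1 + (height / 2);
    return a;
}

// program the DMA2D line watermark at half the lines of the next transfer
static void gpu_set_watermark_50_percent(lv_area_t a)
{
    DMA2D->IFCR = DMA2D_IFCR_CTWIF;             // clear any previous watermark event
    DMA2D->LWR = lv_area_get_height(&a) / 2;
}

static bool gpu_watermark_reached(void)
{
    return (DMA2D->ISR & DMA2D_ISR_TWIF) != 0;
}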