Lv_micropython crashing on rp2040 Pico board

Campbell · February 20, 2022, 10:52am

Hi,

I have lvgl running successfully on the RP2040 Pico board in a C++ build using FreeRTOS, under the Eclipse IDE on a MacBook M1 Pro host.

I want to experiment with duplicating the application using micropython, but I’m having some trouble with the built image.

To avoid porting the build to Eclipse for the moment, I perform a CLI build as follows (note necessary slight modifications from the lv_micropython readme):

git clone https://github.com/lvgl/lv_micropython.git lv_micropython
cd lv_micropython
git submodule update --init --recursive lib/lv_bindings
make -C mpy-cross
make -C ports/rp2 submodules
cd ports/rp2
make USER_C_MODULES=../../lib/lv_bindings/bindings.cmake

[100%] Linking CXX executable firmware.elf
   text    data     bss      dec     hex filename
 798248      96  204316  1002660   f4ca4 /Users/campbell/Projects/pico/lv_micropython-1.18-3/lv_micropython-1.18-4/ports/rp2/build-PICO/firmware.elf
[100%] Built target firmware

This produces a binary which runs micropython fine, I’m able to drive Neopixels for example, except that lvgl isn’t happy. Thonny shell output:

MicroPython v1.18-599-gbf62dfc78 on 2022-02-20; Raspberry Pi Pico with RP2040
Type "help()" for more information.
>>> help('modules')
__main__          gc                uasyncio/core     uio
_boot             lodepng           uasyncio/event    ujson
_onewire          lvgl              uasyncio/funcs    uos
_rp2              machine           uasyncio/lock     urandom
_thread           math              uasyncio/stream   ure
_uasyncio         micropython       ubinascii         uselect
builtins          neopixel          ucollections      ustruct
cmath             onewire           uctypes           usys
dht               rp2               uerrno            utime
ds18x20           uarray            uhashlib          uzlib
framebuf          uasyncio/__init__ uheapq
Plus any modules on the filesystem
>>> import lvgl
>>> help(lvgl)
object <module 'lvgl'> is of type module
  __name__ -- lvgl
  obj

… the shell crashes in the middle of outputting the help text. No REPL follows. Board is subsequently unresponsive.

The build is exactly as above; I have not added my custom ili9486 display driver at this stage, so this is clean off GitHub.

My next steps will be to port the build to Eclipse so that I can single step the issue using a picoprobe, but initially I’d be very grateful if someone could confirm that the build steps I followed above are correct.

amirgon · February 20, 2022, 8:26pm

Hi @Campbell !
You can see the build steps on GitHub Workflow:

github.com

lvgl/lv_micropython/blob/master/.github/workflows/rp2_port.yml

name: rp2 port

on:
  push:
  pull_request:

jobs:
  build:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
        board: [PICO]
    steps:
    - uses: actions/checkout@v2
    - name: arm-none-eabi-gcc
      uses: carlosperate/arm-none-eabi-gcc-action@v1.3.0
      with:
        release: '9-2019-q4' # The arm-none-eabi-gcc release to use.
    - name: Initialize lv_bindings submodule
      run: git submodule update --init --recursive lib/lv_bindings

There are slight differences compared to your build steps; I’m not sure whether they are important or not.
For example, USER_C_MODULES is passed on every make invocation, and the BOARD parameter is passed to make.

Apart from that, I don’t see anything unusual in your build steps.
Perhaps @eudoxos who contributed the RP2 integration could comment.

In any case - please let us know your findings after debugging this!

Campbell · February 20, 2022, 9:37pm

Hi amirgon and thank you for looking at this.

I repeated the build with steps based closely on the CI yml:

git clone https://github.com/lvgl/lv_micropython.git lv_micropython-1.18
cd lv_micropython-1.18
git submodule update --init --recursive lib/lv_bindings
make -C ports/rp2 BOARD=PICO USER_C_MODULES=../../lib/lv_bindings/bindings.cmake submodules
make -C mpy-cross
make -C ports/rp2 BOARD=PICO USER_C_MODULES=../../lib/lv_bindings/bindings.cmake

The resulting binary shows the same behaviour as previously reported.

I noted a comment from you about a crash on object creation if there is no display driver registered, which is one of the behaviours I’ve seen (lvgl.init() currently succeeds and does not crash). I will add my display driver and see if I can register it, then try again to create an object. If that works maybe the issue here is just with the help function.

Building with Eclipse now succeeds but shows a different set of problems: hardfaults during execution of _boot.py while the file system is being initialised, or panics and assertion failures when Thonny attempts to connect. This must be due to differences between the CLI and Eclipse builds which I have not yet tracked down. The bottom line is that I am not yet able to step through the symptoms observed above with picoprobe.

I will let you know how I get on. In the meantime, if someone is in a position to try help(lvgl) with a good pico build it would be good to eliminate that.

thanks again,

Campbell

Campbell · February 20, 2022, 10:59pm

So I added my display driver.

It can be loaded and initialised:

import ili9486

display = ili9486.display(wr=14, rd=12, rst=13, cs=27, dc=28, d0=15, backlight=11)
display.init()

The display shows its test pattern.

LVGL can be loaded, initialised and the display driver registered:

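import time
import lvgl as lv

lv.init()

# Draw buffer: a single bytearray, no second buffer
draw_buf = lv.disp_draw_buf_t()
buf1_1 = bytearray(480*10)
draw_buf.init(buf1_1, None, len(buf1_1)//4)  # last argument is a size in pixels

# Display driver: attach the buffer and the flush callback, then register
disp_drv = lv.disp_drv_t()
disp_drv.init()
disp_drv.draw_buf = draw_buf
disp_drv.flush_cb = display.flush
disp_drv.hor_res = 480
disp_drv.ver_res = 320
disp_drv.register()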

but, as before, creating a screen object causes a crash:

scr = lv.obj()

… no REPL, non responsive.

Looks like I need to get that picoprobe going
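
As a side note on the buffer setup above: as far as I know, the size passed to draw_buf.init() in LVGL v8 is a pixel count rather than a byte count. A minimal sketch of that arithmetic in plain Python (the helper name buffer_pixels is hypothetical and not part of lvgl):

```python
def buffer_pixels(buf_len_bytes, color_depth_bits):
    """Return how many whole pixels fit in a buffer of the given byte size."""
    bytes_per_pixel = (color_depth_bits + 7) // 8
    return buf_len_bytes // bytes_per_pixel


buf_len = 480 * 10                    # same size as buf1_1 above
px_32 = buffer_pixels(buf_len, 32)    # pixels that fit at 32-bit colour
px_16 = buffer_pixels(buf_len, 16)    # pixels that fit at 16-bit colour
rows_16 = px_16 // 480                # full 480-pixel rows at 16-bit colour
```

Under that assumption, len(buf1_1)//4 stays inside the buffer for any colour depth up to 32 bits; at a 16-bit depth it simply uses half of it.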

Campbell · March 5, 2022, 2:00pm

Sorry, haven’t had much time to progress this.

I’ve made some Eclipse builds of lv_micropython. However, these are showing other problems, throwing panics and hard faults without getting as far as the lvgl module. Problems start as soon as Thonny is connected.

So I backtracked to building vanilla micropython in Eclipse. I set up this build the same way I built lv_micropython. Strangely, this build runs and single steps just fine.

Thonny isn’t impressed if I single step while it is executing a command, but this is only to be expected with a front end that relies on timeouts a lot. If you resume from single step mode in Eclipse, Thonny recovers with a Ctrl-F2, so all good.

I’m thinking I should now port lv_binding_micropython into my working build environment. Does this sound like a reasonable way forward? The landing page for lv_binding_micropython looks pretty good, are there any other pages you would recommend I look at before I attempt this?

TIA

amirgon · March 5, 2022, 8:33pm

Thonny is a Python IDE right?
I may be missing something here, but… are you trying to debug C or Python?
AFAIK Micropython does not offer a debugger (yet?)

My recommendation is to use lv_micropython.
I don’t see much benefit in an effort to port lv_binding_micropython unless you have to (for example, to support CircuitPython or some other Micropython fork).
If you still want to use lv_binding_micropython directly, I recommend using git to find the differences between lv_micropython and upstream Micropython (v1.18) so you could see all the changes that need to be done in Micropython to support LVGL.

Campbell · March 6, 2022, 10:31am

Hi amirgon,

Thank you for your suggestion to look at the git diff. I’ll do that. To answer your questions:

“are you trying to debug C or Python?”
Definitely C; i.e. the C based micropython runtime and specifically my build of the lvgl binding. I’m only using Thonny to communicate with and command the built runtime.

“My recommendation is to use lv_micropython.”
Yep, that’s where I started, please see posts 1 and 3 for what isn’t working. My attempts to use Eclipse with 2 wire debugging via a picoprobe to discover why have been frustrated by other strange runtime behaviour. Hence, I went back to vanilla micropython to see if these problems were due to my build. Vanilla micropython builds, runs and debugs fine.

My thought was that by porting lv_binding_micropython into my working vanilla micropython build, I might either a) succeed or b) discover something that needs to be set up for RP2040/Pico that I haven’t done.

One example of this benefit is my discovery of partitions.csv which is defined for esp32 but not for other ports. Is this needed for the pico port? My current thought is no, but it would be good to be sure.

Many thanks for your ongoing help.

amirgon · March 7, 2022, 7:35pm

partitions.csv is specific to esp32, I’m not sure how flash partitioning works on rp2.

Campbell · March 18, 2022, 6:11pm

amirgon: “I’m not sure how flash partitioning works on rp2.”
_boot.py mounts a single root filesystem.

So I’ve gone back and rebuilt lv_micropython under Eclipse and now get a little further.
help(lvgl) no longer crashes, but some of the output is corrupted. I’m attaching an example in case it looks familiar (output prior to this looks normal):

theme_mono_is_inited --
theme_basic_init --
theme_basic_is_inited --
extra_init --
DPI -- <class 'ENUM_LV_DPI'>
�� -- <class '��'>
�� -- <class 'ENUM_LV_LOG_LEVEL'>
ANIM_REPEAT -- <class 'ENUM_LV_ANIM_REPEAT'>

output varies between normal and corrupted until:

imgbtn_class -- struct lv_obj_class_t
spangroup_class -- struct lv_obj_class_t
�� -- <class '��'>
scr = lvgl.obj()

This last command still hangs. I’ll provide an update shortly if picoprobe debugging is working any better.

amirgon · March 18, 2022, 8:14pm

Doesn’t look familiar to me…

Campbell · March 19, 2022, 3:10pm

Some more progress…

Single stepping now gives some useful results, and I can tell that we get a hardfault because the line ‘scr = lvgl.obj()’ returns a null rather than valid memory. LVGL does this because there is no display registered. This could reasonably be expected, so it’s time to integrate the display driver and register it.

Once integrated, the driver initialises to its test display as before, but now scr = lvgl.obj() succeeds. Hurrah!

Adding a test button gives me something fairly recognisable on the display. It’s ugly, but it’s a button. Twice hurrah!

Now back to the C code to set the colour depth correctly for the display and …

… we’re back to square 1. We fail an assert while running some python environment boot code as soon as Thonny connects via usb. The assertion failure is in py/parse.c:

1130: assert((arg & RULE_ARG_KIND_MASK) == RULE_ARG_RULE);

So the difference between working-but-ugly and failing an assertion as soon as the front end connects is the addition of:

#define LV_COLOR_DEPTH 16
#define LV_COLOR_16_SWAP 1

to lv_conf.h, which obviously should have no bearing on what micropython does at connection startup, before any lvgl related code has been run.
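
For context, these two settings only describe how pixel data is laid out: 16-bit colour is RGB565, and the swap option exchanges the two bytes of each pixel for displays that expect the high byte first. A small sketch of that transformation in plain Python (the function names are illustrative, not LVGL API):

```python
def rgb565(r, g, b):
    """Pack 8-bit r, g, b components into one 16-bit RGB565 value."""
    return ((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3)


def swap16(value):
    """Exchange the high and low bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | (value >> 8)


red = rgb565(255, 0, 0)
red_swapped = swap16(red)
green = rgb565(0, 255, 0)
green_swapped = swap16(green)
```

Nothing in this transformation touches the parser or the REPL, which is why a parse-time assertion looks more like a side effect of the image changing size or layout than of the colour settings themselves.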

I suspect I have some kind of memory alignment issue. Are you aware of any sensitivities in the micropython virtual machine (or in frozen python code or anywhere else) to memory alignment?

The pattern of corrupted module help output noted in the previous post also seems to vary with small changes to the build. Further, good operations in the help output can be invoked, while corrupted ones cannot.

Insights gratefully received.

thanks again,

Campbell

amirgon · March 19, 2022, 9:32pm

Good progress!

Really weird, I’ve never hit that assert, or in fact any assert during mp_parse.
I don’t see how setting LV_COLOR_DEPTH / LV_COLOR_16_SWAP could result in anything like this.
Did you try to clean and rebuild everything after setting these macros? Are all git submodules updated?

I’m not sure about the RP2, I’m really only using the ESP32 and unix ports.
At least on ESP32 I’ve never noticed any such alignment issues anywhere.

Another option, other than alignment issues, is simply a memory corruption. Some array overflow or a write to a random address, dereferencing invalid pointer etc. could also potentially cause behaviors like this.

One thing worth trying is disabling Garbage Collection. I have seen crashes in cases where some gc variables were not correctly rooted.
With gc disabled you’d run out of memory pretty quickly, but it could be interesting to see whether the problems you are seeing are reproducible before that happens (which would eliminate gc as the cause).
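
A minimal sketch of that experiment, assuming it is placed as early as possible (for example at the top of _boot.py). gc.disable() and gc.isenabled() exist in both MicroPython and desktop Python; gc.mem_free() is MicroPython-only, so it is guarded here:

```python
import gc

gc.disable()                  # stop automatic collection
enabled = gc.isenabled()      # expected to be False from here on

# gc.mem_free() is only available on MicroPython; guard it so the same
# snippet also runs on desktop Python.
free_bytes = gc.mem_free() if hasattr(gc, "mem_free") else None
```

Watching free_bytes shrink on the board gives a rough idea of how long the test can run before memory is exhausted.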

Campbell · March 20, 2022, 8:49pm

Thank you for the suggestions. I think I eliminated the premature garbage collection possibility by inserting gc.disable() into _boot.py. Typing gc.isenabled() at the Thonny REPL returns False on first connection, so that change turned gc off as early as possible. However, the symptoms did not change, so I’ve backed that out.

You’re right in that other causes are possible. I’ll keep digging, now that single stepping is more stable I have a better chance of spotting the issue.

One other difference imposed by the host OS is that mpy-cross is built with clang whereas in most hosts it is gcc. clang is complaining (rightly, IMHO) about some odd array indexing in py/vm.c but my initial conclusion is that this is not the cause of what I’m seeing. I may try to force mpy-cross to build with gcc to see if this makes any difference anyway.

EDIT: I keep forgetting that I have already built a vanilla micropython cloned from their website that seems to work perfectly. So my tools would seem to be Ok. The conclusion would seem to be that the lvgl fork introduces a problem which is only evident when building on MacOS.

Campbell · March 21, 2022, 10:06am

Here is an alternative way forward. What changes to the build scripts are required to temporarily back out the lvgl part of the build? i.e. leave all the LVGL source in place but not execute any of the build process?

I could figure this out, but Amirgon, you may already have a definitive list which would save me some time?

EDIT: So it didn’t take long to figure it out:

#define MICROPY_PY_LVGL (0)
#define MICROPY_PY_LODEPNG (0)

plus removing:

USER_C_MODULES=../../lib/lv_bindings/bindings.cmake

from the root make call effectively turns off all lvgl extensions…
… and the built image runs as expected. My conclusion from that is that there is very definitely an lvgl build config issue that is presumably exposed only by MacOS.

Campbell · March 21, 2022, 4:58pm

Here’s a smokin’ gun…

Looking at the output of ‘help(machine)’ in a faulty build:

Signal -- <class 'Signal'> ��GESTYLE_FLEX_GROSTYLE_FLEX_GROW

Signal is a property of the ‘machine’ module, but the last 30-odd characters above are from an enumeration in the lvgl module, right? The lvgl module data is in places overwriting the machine module data. What could cause this?

amirgon · March 21, 2022, 8:17pm

A good way to get an idea is to use git diff between lv_micropython and Micropython v1.18.
rp2 specific changes would be, naturally, under ports/rp2.

Correct.

rp2 is the only port that uses the external C module technique to integrate LVGL with Micropython. It’s not enough to add LVGL as external module, as some additional changes are required (in mpconfigport.h for example).
On other ports (such as esp32, stm32, unix) we changed the Micropython make script instead to add LVGL. I don’t know if that’s related to the problem you are seeing or not.

Is it a MacOS specific problem? I’m not sure. It could be a general rp2 issue.
Could you try building rp2 with Linux instead? (Perhaps with a VM in your Mac?)

Another option is to try a prebuilt image.
The CI, apparently, generates an artifact for rp2, perhaps you could try it on your board? (although no display driver is included there)

I really don’t know…

Campbell · March 24, 2022, 12:45pm

It may not be. I only thought that because I had assumed the rp2 port was more mature than I now think it actually is. Mac support is often a bit behind Linux and Windows due to smaller market share etc.

I don’t think I can try the CI image because the target board and infrastructure don’t support hex files. I could figure out how to convert it to elf or uf2, but if it doesn’t work that might just tell us that I don’t know how to convert hex files correctly.

I might try building in a Linux VM, but now that I’m able to single step, I’m able to find out a lot more with the current builds, contrasting the working plain mpy build with the non working mpy + lvgl build which I can now easily switch between in a single build context.

I’m now able to say for sure that the corrupted help file output posted previously in the thread above is due to the micropython string interning infrastructure being broken in the bad build.

The micropython code here declares linked pools of strings to save space and time. Inspecting memory at mp_qstr_special_const_pool in a bad build, the data looks good, but at mp_qstr_const_pool the data is all 0xff. In the case where a long trail of ‘?’ symbols is output, the token id is correct for the expected output, but that id vectors into memory which is all 0xff, producing the output seen above.
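
To make the symptom concrete, here is a simplified model of how a qstr id is resolved through chained pools. It is a sketch in plain Python, not the real C structures: the Pool class, the sample strings and the ids are all made up for illustration, and the "bad" pool simply stands in for string data that reads back as 0xff.

```python
class Pool:
    """Toy model of a qstr pool chained to an earlier pool."""

    def __init__(self, prev, strings):
        self.prev = prev
        # Number of qstrs held by all earlier pools in the chain.
        self.total_prev_len = 0 if prev is None else prev.total_prev_len + len(prev.strings)
        self.strings = strings


def qstr_str(last_pool, qid):
    """Walk back through the chain until the pool owning qid is found."""
    pool = last_pool
    while qid < pool.total_prev_len:
        pool = pool.prev
    return pool.strings[qid - pool.total_prev_len]


special = Pool(None, [b"", b"__name__", b"__init__"])
good_const = Pool(special, [b"DPI", b"LOG_LEVEL", b"obj"])
bad_const = Pool(special, [b"\xff\xff"] * 3)   # same ids, data reads as 0xff

good = qstr_str(good_const, 4)
bad = qstr_str(bad_const, 4)
shown = bad.decode("utf-8", errors="replace")   # what a terminal would display
```

In this model the id is still valid and ids that land in the first pool still resolve correctly, which would be consistent with some names printing normally while others come out as replacement characters.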

This data is in the .ro_data memory group which is in flash, so memory corruption is not a factor.

I’m still working to figure out why one QSTR pool is good and the other is not, and also why the observed help output is good for some strings and not for others.

Learning way more about micropython internals than I anticipated going into this project.

Campbell · March 24, 2022, 1:03pm

Actually, I just realised that this statement may be incorrect. Sure, it’s a read-only group, but if the filesystem is pointing to the wrong part of the flash it could still be overwritten.
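
One way to test that idea is to compare the firmware size with where the filesystem starts. The sketch below rests on several assumptions that are not confirmed anywhere in this thread: that the image starts at flash offset 0, that the filesystem occupies the last STORAGE_BYTES of the flash, and that STORAGE_BYTES is 1408 KiB (a placeholder to be checked against the board's configuration header in the tree being built). The 2 MiB flash size is the Pico's on-board flash, and the firmware size is text + data from the size output in the first post.

```python
FLASH_BYTES = 2 * 1024 * 1024     # Pico on-board flash
STORAGE_BYTES = 1408 * 1024       # ASSUMPTION: filesystem size, verify in board config
FIRMWARE_BYTES = 798248 + 96      # text + data from the size output in the first post


def overlap_bytes(flash_bytes, storage_bytes, firmware_bytes):
    """Bytes of firmware that would fall inside the filesystem region (0 if none)."""
    fs_start = flash_bytes - storage_bytes   # assumes filesystem sits at the end of flash
    return max(0, firmware_bytes - fs_start)


fs_start = FLASH_BYTES - STORAGE_BYTES
overlap = overlap_bytes(FLASH_BYTES, STORAGE_BYTES, FIRMWARE_BYTES)
```

If the real numbers give a non-zero overlap, formatting or writing the filesystem would erase the tail of the image, and erased flash reads back as 0xff. If the overlap is zero, this explanation can be ruled out.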

embeddedt · March 24, 2022, 6:08pm

It is an ELF file; the artifact is just named incorrectly for some reason. It should be a ZIP that has the ELF file inside.
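
A quick way to check what a downloaded artifact really contains is to look at its first few bytes rather than its extension. A small sketch in plain Python; the function name sniff_format is made up for illustration:

```python
def sniff_format(data):
    """Guess a firmware file format from its leading bytes."""
    if data[:4] == b"\x7fELF":
        return "elf"
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:4] == b"UF2\n":
        return "uf2"
    if data[:1] == b":":
        return "intel-hex"     # Intel HEX records start with a colon
    return "unknown"


# Usage (the file name is hypothetical):
#   with open("firmware.hex", "rb") as f:
#       print(sniff_format(f.read(4)))
```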

amirgon · March 24, 2022, 8:56pm

If you are suspecting string interning - you should know (if you haven’t noticed that already) that lv_micropython implements an optimization which is not (yet) part of upstream Micropython.

github.com/lvgl/lv_micropython

py: Faster qstr search.

committed 12:14AM - 27 Feb 21 UTC

amirgon

+96 -16

Pending upstream PR: https://github.com/micropython/micropython/pull/6896

This optimization is important because LVGL creates thousands of qstrs and 'import' can take seconds instead of milliseconds without better qstr search. Currently this PR is pending. Once it is merged or some other optimization that solves this problem is merged, this change could be reverted and replaced by the upstream solution.

Today the qstr implementation scans strings sequentially. In cases where there are many strings this can become very inefficient. This change improves qstr search performance by using binary search in sorted qstr pools, when possible.

This change introduces an option to create a sorted string pool, which is then searched by a binary search instead of sequential search. A qstr pool can be either "sorted" or "unsorted", whereas the unsorted is searched sequentially as today. Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools. Currently runtime strings are unsorted.

The constant string pool is split into two and a new pool is introduced, "special_const_pool". This is required because the first sequence of strings already requires special ordering and is therefore created unsorted, while the rest of the constants are generated sorted.

qstr_find_strn searches strings in each pool. If the pool is sorted and larger than a threshold, it will be searched using binary search instead of sequential search, significantly improving performance.

(cherry picked from commit 8532c21c663e927d1a9974b75155e57b489a0d1e)

I’ve been trying to push this to upstream Micropython and it has received some attention, but not much progress lately

github.com/micropython/micropython

py: Faster qstr search

micropython:master ← amirgon:qstr_bsearch

opened 10:54PM - 13 Feb 21 UTC

amirgon

+237 -69

Motivation: qstr search scales poorly - see https://forum.micropython.org/viewtopic.php?f=2&t=9748

In short, when there are many qstrs, performance is poor because of the way they are searched. This affects, for example, load time of mpy modules since strings are replaced by qstrs during import. On ESP32 it takes many seconds loading few mpy files in the presence of large native libraries ([LVGL](https://github.com/lvgl/lvgl) for example).

---

Today the qstr implementation scans strings sequentially. In cases where there are many strings this can become very inefficient. This change improves qstr search performance by using binary search in sorted qstr pools, when possible.

This change introduces an option to create a sorted string pool, which is then searched by a binary search instead of sequential search. A qstr pool can be either "sorted" or "unsorted", whereas the unsorted is searched sequentially as today. Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools. Currently runtime strings are unsorted.

The constant string pool is split into two and a new pool is introduced, "special_const_pool". This is required because the first sequence of strings already requires special ordering and is therefore created unsorted, while the rest of the constants are generated sorted.

qstr_find_strn searches strings in each pool. If the pool is sorted and larger than a threshold, it will be searched using binary search instead of sequential search, significantly improving performance.

Anyway, I’m not aware of any bug over there. Just pointing out that lv_micropython and upstream Micropython differ exactly there, so… you can try to revert this optimization and see if you are able to reproduce the problem you are seeing without it.

That sounds bad.
Are these pools generated corrupted? Or get corrupted during execution?

Have a look in your build directory at genhdr/qstrdefs.generated.h. Does the data there make sense?
If it does, maybe have a look in the ELF image.
If you stop at the first instruction of your program, is the data corrupted?
Try to set a watchpoint to catch the moment it gets corrupted. I’m not sure if RP2 supports hardware watchpoints (ESP32 does!) but they can also be emulated by software (slowly!).

genhdr/qstrdefs.generated.h is generated by py/makeqstrdata.py.
If mp_qstr_const_pool is corrupted then you should be looking at the QDEF1 entries. These come from q1_values, i.e. qstrs with q[0] >= 0 (the “regular” qstrs that are not in the special list), taken from the input files provided to makeqstrdata.py (mostly $(HEADER_BUILD)/qstrdefs.preprocessed.h).

Just for the background, here is the reason for splitting the constant pool in the first place:
This optimization creates sorted qstr pools that can be searched with binary search instead of serially. This significantly improves cases where there are lots of qstrs (such as LVGL with its huge API).
The problem was that Micropython assumes some special qstrs must come first and cannot be sorted alphabetically (so that the qstr integer value fits into a byte, for example).
So the trick is to hold an unsorted pool with special qstrs (mp_qstr_special_const_pool) followed by a much larger sorted pool for the rest of the static qstrs known at compile time (mp_qstr_const_pool). These are followed by qstr pools generated at runtime, which are never sorted (for now).
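
The layout described above can be sketched in plain Python as follows. This is a toy model rather than the actual C implementation: the Pool class, the threshold value and the sample strings are all made up for illustration, and only the idea (sequential search for unsorted pools, binary search for sufficiently large sorted pools) is taken from the description.

```python
from bisect import bisect_left

BSEARCH_THRESHOLD = 10   # hypothetical value; the real threshold is not stated here


class Pool:
    """Toy qstr pool: a list of strings chained to an earlier pool."""

    def __init__(self, prev, strings, is_sorted):
        self.prev = prev
        self.total_prev_len = 0 if prev is None else prev.total_prev_len + len(prev.strings)
        self.strings = strings
        self.is_sorted = is_sorted


def find_in_pool(pool, s):
    """Return the qstr id of s in this pool, or None if it is absent."""
    if pool.is_sorted and len(pool.strings) > BSEARCH_THRESHOLD:
        i = bisect_left(pool.strings, s)             # binary search
        if i < len(pool.strings) and pool.strings[i] == s:
            return pool.total_prev_len + i
        return None
    for i, candidate in enumerate(pool.strings):     # sequential search
        if candidate == s:
            return pool.total_prev_len + i
    return None


def qstr_find(last_pool, s):
    """Search every pool in the chain, newest first."""
    pool = last_pool
    while pool is not None:
        qid = find_in_pool(pool, s)
        if qid is not None:
            return qid
        pool = pool.prev
    return None


special = Pool(None, ["", "__name__", "__init__"], is_sorted=False)
const = Pool(special, sorted("name_%03d" % i for i in range(50)), is_sorted=True)
runtime = Pool(const, ["zeta", "alpha"], is_sorted=False)

id_special = qstr_find(runtime, "__init__")
id_const = qstr_find(runtime, "name_007")
id_runtime = qstr_find(runtime, "alpha")
id_missing = qstr_find(runtime, "missing")
```

The point of the split is visible in the model: the special pool keeps its fixed, non-alphabetical order and small ids, while the large constant pool can be sorted at build time and searched in logarithmic time.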