How to determine if app crashed

wiklod · February 11, 2022, 1:48pm

Hi!

I have an app running on the unix port on an RPi Zero, and I would like to keep it running all the time (restarting automatically when it crashes). To make that happen I decided to use Supervisor. I prepared a demo in which I throw exceptions artificially, but I quickly realized that Supervisor cannot detect the app crash (the thrown exception). I am quite confident about my Supervisor configuration, so I assume the problem is related more to micropython itself, or maybe to the way lvgl handles exceptions in other threads (I assume there are some other threads for timers etc.). I also observed that the stack trace of my exceptions goes to stdout and not to stderr, as I would expect, so that might be a small clue.
My goal is to make my app crash when an unhandled exception is thrown, so that Supervisor can detect it and restart the UI. Right now the stack trace is printed to stdout and the status of the “crashed” app remains “RUNNING” in Supervisor, since the process does not exit.

EDIT:
I forgot to mention that I am using a framebuffer driver.
I was experimenting with python3 threads to reproduce similar behavior, and I think the problem could be that child threads are not “passing” exceptions to the main thread, so it keeps running. At least this is something I was able to reproduce with python. Sadly I don’t know the architecture of lvgl and the micropython binding well enough to determine where the other threads are started and how to modify them. It was easy in python, but in the C/micropython mix it looks more complicated.

amirgon · February 11, 2022, 9:48pm

When saying “I throw Exceptions artificially” do you mean Python exceptions?
If you do, then keep in mind that Python exceptions are an internal mechanism of the language implementation (Micropython). External Linux tools are not aware of them and usually cannot handle them. You would need to catch exceptions yourself in your Python code and signal them to the outside world (with sys.exit for example, specifying the exit code, or in any other way you want).
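
A minimal sketch of that idea in plain Python 3, not code from this thread: wrap the entry point, print the traceback to stderr, and turn the failure into a non-zero exit status that a process manager such as Supervisor can see. The names `run_and_report` and `main` are hypothetical. On Micropython, `sys.print_exception(exc, sys.stderr)` (used later in this thread) would take the place of the `traceback` call.

```python
import sys
import traceback


def run_and_report(entry_point):
    """Run entry_point and translate an unhandled exception into an exit code."""
    try:
        entry_point()
    except Exception:
        # Print the traceback to stderr rather than stdout.
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0


def main():
    # Stand-in for the real application; it fails on purpose.
    raise Exception("Test Supervisord")


# In a real program the last line would be:
#     sys.exit(run_and_report(main))
exit_code = run_and_report(main)
```

Note that this only covers exceptions raised in the main context; exceptions raised inside event handlers need the separate treatment discussed further down.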

Note that when using LVGL+Micropython in the wrong way, other bad things can happen such as Segmentation Faults or even an application that hangs. So you would probably also want some “watchdog” handler to make sure your application is alive if you are implementing an “always on” application.

As a very simple example you can have a look at our CI tests driver script where we use the Linux utilities timeout and catchsegv for that purpose.
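
For illustration only, here is a rough Python 3 sketch of what such a watchdog wrapper checks. It is not the CI script mentioned above. `subprocess.run` with a `timeout` plays the role of the `timeout` utility, and a negative return code means the child was terminated by a signal. The function name `run_with_watchdog` and the child command are made up for the example.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd and classify how it ended: normal exit, signal, or hang."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child was still running after timeout_s seconds and was killed.
        return "hung"
    if proc.returncode < 0:
        # On POSIX a negative return code is the negated signal number.
        return "killed by signal %d" % -proc.returncode
    return "exited with %d" % proc.returncode


# Example: a child process that exits with status 3.
status = run_with_watchdog([sys.executable, "-c", "import sys; sys.exit(3)"], 10)
```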

In fact, there are no “other threads”. Micropython+LVGL works by default as a single-threaded application!
If you are using the FB driver on linux without an explicit event loop, there is another thread, but it is only used to schedule Micropython callbacks that run on the main thread, so no exceptions are ever thrown on that thread.

In Micropython+LVGL, exceptions are thrown in two possible contexts: the main context and event handlers context. You can see our CI for an example where we catch exceptions both on the main thread and on event handlers by setting an exception sink of the event loop.
Apparently we can’t (or shouldn’t) call sys.exit from an event handler, so instead we record the exception and rethrow it later on the main context.

You can do the same thing with the FB driver but you need to set auto_refresh to false on FB init, create an explicit event loop and set its exception_sink argument to handle exceptions thrown in events the way you want.
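
The record-and-rethrow approach described above can be sketched without any LVGL calls. This is an illustration under assumptions, not the CI code: `ExceptionRecorder`, `dispatch_event` and `failing_handler` are hypothetical names, and `dispatch_event` merely stands in for an event loop that forwards handler exceptions to its `exception_sink`.

```python
class ExceptionRecorder:
    """Remembers the first exception reported from an event handler."""

    def __init__(self):
        self.pending = None

    def sink(self, exc):
        # Called in event context: only record the exception, do not exit here.
        if self.pending is None:
            self.pending = exc

    def rethrow_pending(self):
        # Called from the main context, where raising or exiting is safe.
        if self.pending is not None:
            exc, self.pending = self.pending, None
            raise exc


def dispatch_event(handler, exception_sink):
    # Hypothetical stand-in for the event loop invoking an event handler.
    try:
        handler()
    except Exception as exc:
        exception_sink(exc)


def failing_handler():
    raise Exception("Test Supervisord")
```

The main loop would call `rethrow_pending()` periodically, so the exception surfaces in the main context, where a top-level handler can call `sys.exit` with a non-zero code.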

wiklod · February 14, 2022, 12:12pm

Well, I have to admit that until now I thought that all unhandled exceptions in Python end up crashing the program, but now I see in the Python documentation that “exceptions are not unconditionally fatal”. I have never experienced such a case, but that is probably just me.
When I tried to simulate some unhandled exception I just put:

raise Exception("Test Supervisord")

between some lv_obj definitions of one of my “screens” (they are not screens in the LVGL definition). When I was requesting this screen through my UI it printed:

Traceback (most recent call last):
  File "/home/pi/display/panels/main_panel.py", line 165, in <lambda>
  File "/home/pi/display/functions.py", line 47, in get_panel
  File "/home/pi/display/panels/battery_panel.py", line 11, in __init__
Exception: Test Supervisord

on stdout (!) and the whole application just hung; it didn’t exit. For testing purposes I tried surrounding the code with a try: except Exception: (I know, lame) and then calling sys.exit(1), but it didn’t catch it.
After your answer I understand why: the “screen” is being built within an event handler (of some button press event), so it is in a different context.
Before that, I tried to reproduce the behavior of my Micropython + LVGL app with a test Python3 app, and it behaved very similarly when I raised exceptions in a child thread (I got the traceback printed to stdout, but no exit signal; the app just hung). I solved it with a quite clever Thread subclass I found somewhere:

import threading


class PropagatingThread(threading.Thread):
    def run(self):
        # Store the outcome so join() can report it to the calling thread.
        self.exc = None
        self.ret = None
        try:
            if hasattr(self, '_Thread__target'):
                # Thread uses name mangling prior to Python 3.
                self.ret = self._Thread__target(*self._Thread__args, **self._Thread__kwargs)
            else:
                self.ret = self._target(*self._args, **self._kwargs)
        except BaseException as e:
            self.exc = e

    def join(self, timeout=None):
        super(PropagatingThread, self).join(timeout)
        # Re-raise the child's exception in the thread that calls join().
        if self.exc:
            raise self.exc
        return self.ret

And basically that’s why I assumed it could be similar issue.

That’s really helpful, I will look into this!

So I assume that if I want these exceptions from event handlers raised in the main context, then I need to create an explicit event loop and make exception_sink just reraise them? Can I use the one you provided? And will using a custom loop affect performance in general?

amirgon · February 14, 2022, 7:13pm

Yes, just note that they should be reraised on the main context. For some reason, “sys.exit” doesn’t behave well on event/timer context.
The exception_sink is only used to record or report the exception, you can’t call “exit” there.

I don’t think so.
Actually the explicit event loop is recommended because it provides more control and allows customizations of the event loop behavior.

wiklod · February 16, 2022, 1:45pm

I think I got it working as I wanted, but since I had some problems along the way I will document it here in case somebody is interested.

At first I had to install machine from upip. The Timer() defined there (/home/pi/.micropython/lib/machine/timer.py) is a little bit different from what lv_utils.py expects. It looks like this:

import ffilib
import uctypes
import array
import uos
import os
import utime
from signal import *

libc = ffilib.libc()
librt = ffilib.open("librt")

CLOCK_REALTIME = 0
CLOCK_MONOTONIC = 1
SIGEV_SIGNAL = 0

sigval_t = {
    "sival_int": uctypes.INT32 | 0,
    "sival_ptr": (uctypes.PTR | 0, uctypes.UINT8),
}

sigevent_t = {
    "sigev_value": (0, sigval_t),
    "sigev_signo": uctypes.INT32 | 8,
    "sigev_notify": uctypes.INT32 | 12,
}

timespec_t = {
    "tv_sec": uctypes.INT32 | 0,
    "tv_nsec": uctypes.INT64 | 8,
}

itimerspec_t = {
    "it_interval": (0, timespec_t),
    "it_value": (16, timespec_t),
}


__libc_current_sigrtmin = libc.func("i", "__libc_current_sigrtmin", "")
SIGRTMIN = __libc_current_sigrtmin()

timer_create_ = librt.func("i", "timer_create", "ipp")
timer_settime_ = librt.func("i", "timer_settime", "PiPp")

def new(sdesc):
    buf = bytearray(uctypes.sizeof(sdesc))
    s = uctypes.struct(uctypes.addressof(buf), sdesc, uctypes.NATIVE)
    return s

def timer_create(sig_id):
    sev = new(sigevent_t)
    #print(sev)
    sev.sigev_notify = SIGEV_SIGNAL
    sev.sigev_signo = SIGRTMIN + sig_id
    timerid = array.array('P', [0])
    r = timer_create_(CLOCK_MONOTONIC, sev, timerid)
    os.check_error(r)
    #print("timerid", hex(timerid[0]))
    return timerid[0]

def timer_settime(tid, hz):
    period = 1000000000 // hz
    new_val = new(itimerspec_t)
    new_val.it_value.tv_nsec = period
    new_val.it_interval.tv_nsec = period
    #print("new_val:", bytes(new_val))
    old_val = new(itimerspec_t)
    #print(new_val, old_val)
    r = timer_settime_(tid, 0, new_val, old_val)
    os.check_error(r)
    #print("old_val:", bytes(old_val))
    #print("timer_settime", r)

class Timer:

    def __init__(self, id, freq):
        self.id = id
        self.tid = timer_create(id)
        self.freq = freq

    def callback(self, cb):
        self.cb = cb
        timer_settime(self.tid, self.freq)
        org_sig = signal(SIGRTMIN + self.id, self.handler)
        #print("Sig %d: %s" % (SIGRTMIN + self.id, org_sig))

    def handler(self, signum):
        #print('Signal handler called with signal', signum)
        self.cb(self)

So basically it takes two arguments during initialization (frequency besides id) and it does not have a Timer.init() method. I tried to adjust lv_utils.py to this version of the timer, but it crashes while initializing anyway:

Traceback (most recent call last):
  File "main.py", line 81, in <module>
  File "main.py", line 46, in main
  File "/home/pi/display/lv_utils.py", line 89, in __init__
  File "/home/pi/.micropython/lib/machine/timer.py", line 78, in __init__
  File "/home/pi/.micropython/lib/machine/timer.py", line 56, in timer_create
  File "/home/pi/.micropython/lib/os/__init__.py", line 68, in check_error
OSError: [Errno 22] EINVAL

I didn’t want to dig into this because what’s being done there is far beyond my knowledge, so I tried a uasyncio-driven loop instead.
I installed uasyncio with upip, but:

>>> import uasyncio
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: incompatible .mpy file

I found out in the Micropython forums that there is a problem with upip version of uasyncio for unix, so I just copied uasyncio from lv_micropython/extmod to /home/pi/.micropython/lib and it works fine.

So in my main context I added simple function:

def handle_exceptions(e):
	sys.print_exception(e, sys.stderr)
	sys.exit(1)

and defined my event loop like that:

lv_utils.event_loop(asynchronous=True, exception_sink=handle_exceptions)

Then I realized that it still doesn’t work, because exceptions are raised inside lv_utils.async_refresh() at the line with lv.task_handler(), and they are neither caught nor passed to exception_sink.

So I changed it to:

async def async_refresh(self):
    while True:
        await self.refresh_event.wait()
        self.refresh_event.clear()

        try:
            lv.task_handler()
        except Exception as e:
            if self.exception_sink:
                self.exception_sink(e)

        if self.refresh_cb: self.refresh_cb()

to match what is done inside lv_utils.task_handler().

It does what I want it to do. Basically, if an unhandled exception occurs in the event handler context, the traceback is printed to stderr and the program exits (so it can be restarted by Supervisor).
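
Since amirgon noted earlier in the thread that sys.exit does not behave well in event/timer context, a variant in which the sink only records the exception and the main coroutine decides to exit may be worth considering. The sketch below is an assumption-laden illustration: it uses CPython's `asyncio` as a stand-in for uasyncio and a fake refresh coroutine instead of lv_utils, and `CrashSignal`, `fake_refresh`, `supervise` and `run` are hypothetical names.

```python
import asyncio
import sys


class CrashSignal:
    """Records the first exception and wakes up whoever is waiting for it."""

    def __init__(self):
        self.exc = None
        self.event = asyncio.Event()

    def sink(self, exc):
        # Only record and signal; exiting is left to the main coroutine.
        if self.exc is None:
            self.exc = exc
        self.event.set()


async def fake_refresh(exception_sink):
    # Stand-in for a refresh coroutine that forwards handler exceptions.
    await asyncio.sleep(0)
    try:
        raise Exception("Test Supervisord")
    except Exception as exc:
        exception_sink(exc)


async def supervise():
    crash = CrashSignal()
    task = asyncio.create_task(fake_refresh(crash.sink))
    await crash.event.wait()  # the main context sleeps until something fails
    await task
    return crash.exc


def run():
    exc = asyncio.run(supervise())
    print("fatal:", repr(exc), file=sys.stderr)
    return 1  # a real program would pass this value to sys.exit()
```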

As always, thank you @amirgon for your invaluable help! I would be glad if you could comment on my solution, as I expect that there is probably a more elegant and safer way to do things.

amirgon · February 16, 2022, 9:08pm

I don’t see why you needed to install machine from upip.

I don’t recommend replacing lv_timer.py.
The Timer module from micropython lib (which you installed with upip) is buggy.
lv_timer is based on it, attempting to fix some bugs.
For more details about that, see this PR.

If you use uasyncio, you don’t need the Timer module at all.
But you need to be aware that working with uasyncio is very different. You need to know how to use coroutines correctly (async and await). If you know what you are doing, it’s very convenient.

Well, you don’t need to copy it - instead you should freeze it in your unix variant.
By default it’s frozen only on the dev variant so you can either just build that (make -j -C ports/unix VARIANT=dev) , or freeze it on the default variant as well.

github.com

lvgl/lv_micropython/blob/bf62dfc78497d47ced3b0931a270e553d4d2552b/ports/unix/variants/dev/manifest.py#L3

include("$(PORT_DIR)/variants/manifest.py")

include("$(MPY_DIR)/extmod/uasyncio/manifest.py")

That part is really missing from lv_utils.py, but very simple to fix, like you did.
Would you like to open a Pull Request on lv_binding_micropython to add it?

What’s still not clear to me, is what was the problem you had with the regular (non-async) event loop? Why couldn’t you work with the event loop as is, which uses the lv_timer.py from lv_binding_micropython?
All you need to do is to record or signal the exceptions using the exception_sink.

wiklod · February 17, 2022, 10:19am

I just assumed that machine.Timer is the default. At least this is what the current implementation of lv_utils.py suggests, as far as I understand it. The lv_timer.Timer will be used only if there is no machine.Timer (and an ImportError is raised). But maybe it is there because on platforms which have a proper machine.Timer it is already preinstalled, and on the unix port it is assumed that it won’t be?

In my case, when I first included lv_utils.py in my app it didn’t load lv_timer because … well, to be honest I don’t really understand how to make frozen modules work in my app. They work fine in the REPL, but when I try to import them in a .py file Micropython can’t find them. So as a workaround I am just copying them into my app directory. I didn’t have lv_timer.py copied, so lv_utils.py raised RuntimeError. But maybe it would be better if I open a new thread on this topic?

Anyway lv_timer.Timer is not working for me as well.
I got:

Traceback (most recent call last):
  File "/home/pi/display/main.py", line 71, in <module>
  File "/home/pi/display/main.py", line 41, in main
  File "/home/pi/display/lib/lv_utils.py", line 89, in __init__
  File "/home/pi/display/lib/lv_timer.py", line 138, in init
  File "/home/pi/display/lib/lv_timer.py", line 99, in timer_create
AttributeError: 'module' object has no attribute 'errno'

I think that this is some oversight as os does not have errno()?

Anyway timer_create_() on line 97 in lv_timer.py returns -1 so there has to be another problem.

Is it different from a LVGL app perspective? It seems to work just fine and I only changed to asynchronous loop from default framebuffer one. Apart from the one module which retrieves and sends data over sockets and is being scheduled by lv_timer() everything else in my app is pure Micropython+LVGL.

Sure

amirgon · February 18, 2022, 10:10am

It is for many architectures (esp32, stm32), but it’s missing from the unix port.
micropython-lib provides a bogus Timer which should be avoided, but any Timer provided by Micropython core would probably work well.

That’s probably related to MicroPython Path. Alternatively you can set sys.path.
As explained there, you should probably add ".frozen" entry to sys.path.
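
A small sketch of that suggestion, assuming nothing beyond the standard `sys.path` list: add the entry only when it is missing. The helper name `ensure_path_entry` is hypothetical, and the ".frozen" entry is only meaningful to Micropython builds that use it; under CPython the same code runs but has no practical effect.

```python
import sys


def ensure_path_entry(entry):
    """Add entry to the front of sys.path unless it is already present."""
    if entry not in sys.path:
        sys.path.insert(0, entry)


# ".frozen" is the sys.path entry mentioned above.
ensure_path_entry(".frozen")
```

This would need to run before the first import of a frozen module.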

Strange - on the unix port os module has errno function, I just verified it.
Maybe you replaced the os module by something else? From micropython-lib?
Possibly, you are building some port other than the unix port…?

Anyway, if you reached errno call that means there was some os error, and timer_create failed. This is very unusual and is related to the Linux OS you are using.

It’s different when you have blocking operations such as IO, network or simply “sleep”.
In that case you need to use async/await operations to yield and pass control to other coroutines.
If you don’t, you would block everything including LVGL event loop (so everything would just “freeze” during the blocking operation).
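
To illustrate the difference, here is a sketch using CPython's `asyncio` as a stand-in for uasyncio (an assumption; no LVGL involved). `ui_refresh` plays the role of the LVGL refresh coroutine and `fetch_data` the role of a slow network operation; both names are hypothetical. Because `fetch_data` waits with `await`, the refresh coroutine keeps ticking in the meantime. Replacing that await with a blocking `time.sleep` would stall the ticks for the whole duration.

```python
import asyncio


async def ui_refresh(ticks, stop):
    # Stand-in for the periodic LVGL refresh: records one tick per iteration.
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0.01)


async def fetch_data():
    # Cooperative wait: control returns to the event loop while "waiting".
    await asyncio.sleep(0.1)
    return "data"


async def main():
    ticks = []
    stop = asyncio.Event()
    ui_task = asyncio.create_task(ui_refresh(ticks, stop))
    result = await fetch_data()
    stop.set()
    await ui_task
    # The refresh coroutine kept running while fetch_data was waiting.
    return result, len(ticks)
```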

wiklod · February 18, 2022, 1:14pm

I think that I caused a lot of ambiguity myself.
I indeed had some version of os in /.micropython/lib, don’t remember why I installed it.
At first I was working with lv_micropython from late December (so v1.17) and with this version adding ".frozen" at the beginning of the path didn’t help, but adding "" there solved the issue. Then I realized that I don’t have the correct version of lv_utils.py frozen so decided to move to the newest version of lv_micropython and there ".frozen" is always present in the sys.path and frozen modules work as expected without any “fixes”.
Now I got the errno from timer_create_:

Traceback (most recent call last):
  File "/home/pi/display/main.py", line 73, in <module>
  File "/home/pi/display/main.py", line 43, in main
  File "lv_utils.py", line 91, in __init__
  File "lv_timer.py", line 138, in init
  File "lv_timer.py", line 99, in timer_create
RuntimeError: timer_create_ error: -1 (errno = 22)

So I think that in my OS it expects some different format of timer_id. I am running Raspbian GNU/Linux 11 (bullseye) (Linux raspberrypi 5.10.63-v7+ #1459 SMP Wed Oct 6 16:41:10 BST 2021 armv7l GNU/Linux from uname -a).

So in my case it shouldn’t be very problematic, as I would only have to adjust my socket client implementation a little. If I can’t get lv_timer to work then I’ll switch to uasyncio.

amirgon · February 18, 2022, 3:33pm

errno = 22 means Invalid Argument:

       EINVAL Clock ID, sigev_notify, sigev_signo, or
              sigev_notify_thread_id is invalid.

In our case:

Clock ID == CLOCK_MONOTONIC
sigev_notify == SIGEV_SIGNAL
sigev_signo == SIGRTMIN + sig_id (sig_id is 0 by default, but can be set by timer_id argument of event_loop)
sigev_notify_thread_id is not used.

Maybe one of these doesn’t work on Raspbian. You can try asking in the Raspbian forum, and if you find the problem we’ll fix lv_timer.py.
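
One thing that may be worth checking, offered as an assumption that this thread does not confirm: the sigevent_t layout in the machine/timer.py listing quoted earlier hard-codes sigev_signo at offset 8 and sigev_notify at offset 12, which matches a platform where the sigval union is 8 bytes wide (64-bit pointers). On a 32-bit userland such as armv7l the pointer in that union is 4 bytes, so a C compiler would place those fields at offsets 4 and 8 instead. Whether lv_timer.py uses the same constants is not established here. The CPython `ctypes` sketch below only reports the offsets for the machine it runs on; `SigVal` and `SigEventHead` are hypothetical names that model just the first three fields of struct sigevent.

```python
import ctypes


class SigVal(ctypes.Union):
    # Mirrors union sigval: an int overlaid with a pointer.
    _fields_ = [
        ("sival_int", ctypes.c_int),
        ("sival_ptr", ctypes.c_void_p),
    ]


class SigEventHead(ctypes.Structure):
    # Only the leading fields of struct sigevent are modelled here.
    _fields_ = [
        ("sigev_value", SigVal),
        ("sigev_signo", ctypes.c_int),
        ("sigev_notify", ctypes.c_int),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
signo_offset = SigEventHead.sigev_signo.offset
notify_offset = SigEventHead.sigev_notify.offset

print("pointer size:", pointer_size)
print("sigev_signo offset:", signo_offset)
print("sigev_notify offset:", notify_offset)
```

If the offsets printed on the target differ from the hard-coded ones, the values written for sigev_signo and sigev_notify would land in the wrong fields, which could plausibly produce EINVAL.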