255 migrate the so3 build system to infrabase and move to the new so3 logo #256
Re-sync the build system from the edgemtech Infrabase tree (without
torizon and e1c), nest the SO3 sources under so3/ to match the Infrabase
per-OS layout, and build so3/usr-so3/rootfs-so3/avz in-tree.
- build/: Infrabase meta-layers re-synced from edgemtech; torizon, e1c
and verdin removed; new meta-toolchain layer (musl-cross-make recipe
building the aarch64/arm musl user-space toolchain into build/tmp)
- SO3 sources nested under so3/{so3,usr,rootfs,target}; recipe paths and
.gitignore updated for the new layout (artifacts re-ignored)
- in-tree recipes: so3 (6.2.0), usr-so3, rootfs-so3, avz (no github
fetch); u-boot fetched+patched (2022.04, aligned with edgemtech)
- deploy via unprivileged bitbake + sudo -n (meta-filesystem)
- bsp-so3 builds, deploys and boots to so3% standalone (virt64) and as
an AVZ guest (virt64_avz_so3 ITS, EL2)
The old manual qemu/ mechanism (fetch.sh + qemu.patch) is superseded by
the meta-qemu recipe: it fetches the same QEMU 8.2.2 and applies the same
hw/arm/virt.{c,h} patches (CLCD/KMI/PS2). Verified that build.sh -x qemu
rebuilds an equivalent qemu-system-aarch64. qemu/ stays gitignored and is
regenerated on demand.
Revert the avz recipe to fetching SO3 from upstream at a pinned SRCREV and building the hypervisor (EL2) from it, instead of the in-tree so3/ sources. AVZ is decoupled from the in-tree SO3, which is the guest/ capsule (EL1) under development. Verified: bitbake avz fetches, attaches into avz/, configures virt64_avz_defconfig and builds avz/so3.bin.
The do_build make invocation relied on a CROSS_COMPILE inherited from the
caller's shell, which broke virt32 (arm) builds when the shell had an
aarch64 CROSS_COMPILE set (cc1: unknown value 'generic-armv7-a' for
-mtune). Pass CROSS_COMPILE=${IB_TOOLCHAIN}- explicitly so virt32 uses
arm-linux-gnueabihf- and virt64 uses aarch64-none-linux-gnu-, matching
atf.bbclass.
Drop the usr/lib/lvgl git submodule (.gitmodules removed) and go back to the original meta-usr strategy: lvgl is fetched at build time by the meta-usr lvgl bbappend, gated on the :lvgl OVERRIDE. usr-so3 re-enables do_fetch/unpack/attach so the bbappend pulls lvgl into usr/lib/lvgl (do_patch stays noexec — the slv/lvgl integration patches are already baked into the in-tree usr/). The lvgl bbappend now mkdir's lib/lvgl (no longer pre-created by the submodule). usr/lib/lvgl is gitignored; meta-usr otherwise realigned with edgemtech.
The bbclass selects the current platform's target (QEMU_TARGET: arm-softmmu for virt32, aarch64-softmmu for virt64) and, when reconfiguring, appends any other arch already built under qemu/build so meson does not drop it. Thus building arm-softmmu then aarch64-softmmu (or vice-versa, e.g. switching IB_PLATFORM between so3 standalone/avz/capsule) keeps both qemu-system-* binaries instead of wiping the previous one. do_configure is nostamp so the accumulation re-evaluates each build.
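A minimal sketch of the accumulation rule described above, written in Python for illustration only (the real logic lives in the bbclass). The function name and the `already_built` argument are hypothetical; how the bbclass discovers existing targets under qemu/build is not shown in this message.

```python
# Platform -> QEMU target, as stated in the commit message.
PLATFORM_TARGETS = {"virt32": "arm-softmmu", "virt64": "aarch64-softmmu"}

def accumulate_targets(platform, already_built):
    """Return the target list for a (re)configure.

    The current platform's target comes first; every other target that was
    already built is appended so a reconfigure does not drop it.
    """
    targets = [PLATFORM_TARGETS[platform]]
    for name in already_built:
        if name not in targets:
            targets.append(name)
    return ",".join(targets)

# Building for virt64 after a virt32 build keeps both targets.
print(accumulate_targets("virt64", ["arm-softmmu"]))
```

The returned string is the kind of comma-separated value that would be handed to the QEMU configure step as its target list.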
The SO3 kernel is built in place, so switching IB_PLATFORM between virt64 and virt32 (aarch64<->arm) leaves a stale .config and object files behind, producing a wrong-arch kernel. Track the last built arch in a .ib_last_arch marker and run 'make distclean' only when it changes, keeping same-arch rebuilds incremental.
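The marker logic can be sketched as follows. This is an illustrative Python version, not the recipe code: the helper name is hypothetical, and treating a tree without a marker as clean is an assumption.

```python
import tempfile
from pathlib import Path

MARKER = ".ib_last_arch"  # marker name taken from the commit message

def needs_distclean(tree, arch):
    """Update the marker and report whether the previous build used another arch."""
    marker = Path(tree) / MARKER
    previous = marker.read_text().strip() if marker.exists() else None
    marker.write_text(arch + "\n")
    # Assumption: a tree with no marker yet is treated as clean (first build).
    return previous is not None and previous != arch

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tree:
        print(needs_distclean(tree, "aarch64"))  # False: first build
        print(needs_distclean(tree, "aarch64"))  # False: same arch, stay incremental
        print(needs_distclean(tree, "arm"))      # True: arch switched
```

Only a `True` result would trigger 'make distclean', which is what keeps same-arch rebuilds incremental.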
Two arch-switch bugs surfaced when building SO3 for virt32 (arm) after
virt64 (aarch64):
1. 'OVERRIDES += ":so3"' inserts a leading space, so OVERRIDES became
"...:arm :so3" and the CPU token parsed as "arm " (trailing space).
:<cpu> overrides such as IB_MUSL_TARGET:arm then never collapsed, so
the user-space cmake build got a literal ${IB_MUSL_TARGET} on PATH and
could not find arm-linux-musleabihf-gcc. Switch to OVERRIDES:append in
all five SO3 recipes (no inserted space).
2. The usr-so3 cmake build dir caches the toolchain in CMakeCache.txt, so
switching arch kept emitting aarch64 binaries (an aarch64 init.elf on a
32-bit kernel -> prefetch abort at boot). Wipe so3/usr/build when the
arch changes, tracked via a .ib_last_arch marker at the usr/ root.
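The first bug is easy to reproduce outside BitBake. The sketch below models the two operators as plain string operations (`+=` joins with a single space, `:append` concatenates as-is) and splits the result on ':' the way the override list is tokenised. The starting value "virt32:arm" is a shortened, hypothetical OVERRIDES string.

```python
def plus_equals(value, extra):
    # Models 'VAR += "x"': the new text is joined with a single space.
    return value + " " + extra

def colon_append(value, extra):
    # Models 'VAR:append = "x"': the new text is concatenated unchanged.
    return value + extra

def override_tokens(overrides):
    # The override list is colon-separated.
    return overrides.split(":")

base = "virt32:arm"  # hypothetical, shortened OVERRIDES value

print(override_tokens(plus_equals(base, ":so3")))   # ['virt32', 'arm ', 'so3']
print(override_tokens(colon_append(base, ":so3")))  # ['virt32', 'arm', 'so3']
```

With the trailing space, the token "arm " never matches an `:arm` override, which is why IB_MUSL_TARGET:arm stayed unexpanded.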
Both QEMU launch scripts only handled virt64, so with IB_PLATFORM=virt32 they printed the MAC/GDB lines and exited without starting QEMU. Select QEMU_BIN per platform (qemu-system-arm for virt32) and add a virt32 branch booting U-Boot directly (-M virt -cpu cortex-a15 -kernel u-boot/u-boot, sdcard.img.virt32). stg.sh keeps the virtio GPU/keyboard/mouse + SDL window; the virt64-only guard is widened to accept virt32.
u-boot is built from the meta-uboot recipe (github 2022.04 @ pinned SRCREV + the SO3 patch set), which fetches and attaches it, backing any prior copy up to u-boot.back. The committed in-tree u-boot/ was therefore obsolete and was clobbered on every build, producing a huge spurious diff. Remove all 18k files from tracking and gitignore /u-boot/, matching how qemu/ and avz/ are already handled.
The patch set was inherited wholesale from the edgemtech recipe and had
never been regenerated by do_updiff in this repo. It carried two classes
of cruft:
* duplicate chains — the same source file patched twice (e.g. board.c
in 0004 and 0077, setexpr.c in 0008/0081, the tools/boot/*.c and the
defconfigs each appearing in two generations with ./ vs b/ labels),
the residue of repeated append-only updiff runs across a label-format
change;
* build artifacts frozen as patches — hello_world.srec, autoconf.mk,
autoconf.mk.dep, include/config/uboot.release, include/generated/*
(dt.h, *_autogenerated.h), lib/efi_selftest/efi_miniapp_*.h.
Regenerated from scratch: diff the pristine fetch against the working
tree (do_diffcompose), drop the old numbered set, promote the staged
one-patch-per-file result (do_updiff). 64 messy patches -> 54 clean,
consolidated, git-labelled patches. e1c_boot.c is kept (compiled but
unused) per decision. Verified: a clean fetch+unpack+patch+build applies
all 54 and produces a working u-boot.
Also completed the do_diffcompose artifact exclude-list in patch.bbclass
(autoconf.mk, autoconf.mk.dep, *.srec, efi_miniapp_*.h) so future updiff
runs stay clean.
ls sets CLOEXEC via fcntl(). arm64 musl issues this as fcntl (NR 25), which SO3 handles; arm32 (virt32) musl issues the same call as fcntl64 (NR 221), which syscall.tbl never registered -> 'unhandled syscall: 221' warning and a silently-failing -ENOSYS. Map fcntl64 to the existing __sys_fcntl handler so virt32 behaves like virt64.
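As an illustration of the fix, the sketch below models a syscall table as a Python dictionary. It is not the SO3 dispatcher: `sys_fcntl` is a stand-in for __sys_fcntl, and the tables contain only the entry discussed here.

```python
import errno

NR_FCNTL64_ARM32 = 221  # syscall number quoted in the commit message

def sys_fcntl(fd, cmd, arg):
    # Stand-in for the kernel's __sys_fcntl handler.
    return 0

def dispatch(table, nr, *args):
    """Call the registered handler, or fail with -ENOSYS when none exists."""
    handler = table.get(nr)
    if handler is None:
        return -errno.ENOSYS
    return handler(*args)

table_before = {}                            # fcntl64 never registered
table_after = {NR_FCNTL64_ARM32: sys_fcntl}  # fcntl64 mapped to the fcntl handler

print(dispatch(table_before, NR_FCNTL64_ARM32, 3, 2, 1))
print(dispatch(table_after, NR_FCNTL64_ARM32, 3, 2, 1))
```

Reusing the existing handler is enough here because both numbers name the same operation from user space's point of view.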
Killing a process whose spawned thread was blocked in the kernel hit 'BUG in kernel/thread.c:105' (discard_tcb_in_pcb: WAITING 'not handled yet'). A sleeping thread sits in __sleep() with a struct timer on its own kernel stack, so it cannot just be freed — the pending timer would dangle and later fire on freed memory. Handle it cooperatively: add a tcb->killed flag; discard_tcb_in_pcb() flags+wakes WAITING threads (instead of BUG()) and waits for them via the existing threads_active completion, reaping them afterwards. A woken thread resumes in __sleep(), stops its own timer, sees killed and self-terminates with thread_exit() — entirely in kernel, never returning to the (already-released) user pages. READY threads are still force-freed (they must not resume into freed user space). Verified: Ctrl-C of lvgl_demo stress (whose slv tick thread loops in usleep) no longer panics. Limitation: only the __sleep() wait is instrumented. A thread killed while blocked on a futex/mutex would not yet self-terminate; that needs the same killed-check added to those wait paths.
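The cooperative teardown can be pictured with a user-space analogy. The Python sketch below is only that: the class and function names are hypothetical, a `threading.Event` stands in for the kernel wake-up, and the timer is implicit in the `wait()` timeout. It shows the ordering the message describes: flag, wake, let the sleeper clean up and exit by itself, then wait for it.

```python
import threading

class SleepingTask:
    """Analogy for a thread blocked in __sleep()."""

    def __init__(self):
        self.killed = False
        self._wake = threading.Event()
        self.exited = threading.Event()
        self.returned_to_user = False

    def sleep(self, timeout):
        # The wait (and its timeout) belongs to this call, so only the
        # sleeper itself can tear it down safely.
        self._wake.wait(timeout)
        return not self.killed

    def run(self):
        if self.sleep(60.0):
            # Normal wake-up: go back to "user space".
            self.returned_to_user = True
        # Killed or not, signal that the thread is done.
        self.exited.set()

def kill(task):
    """Flag the task, wake it, then wait until it has terminated itself."""
    task.killed = True
    task._wake.set()
    task.exited.wait()

task = SleepingTask()
worker = threading.Thread(target=task.run)
worker.start()
kill(task)
worker.join()
print(task.returned_to_user)  # False: the killed task never resumed user code
```

The stated limitation carries over to the analogy: any other blocking call would need the same check after it returns.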
The 128 KB lvgl heap is too small to build lv_demo_widgets (lv_conf.h's own note flags this), so the widget tree failed to allocate, nothing rendered, and the main thread spun in lv_timer_handler() without reaching a syscall boundary — making Ctrl-C undeliverable. 4 MB fits the demo comfortably; it is BSS (zero-init) so the .elf on disk is unchanged.
A diagnostic that bypasses LVGL: opens /dev/fb, queries geometry via the same ioctls slv uses, mmap()s the VRAM and draws colour bars + an animated square straight into it. Lets us tell apart a broken display pipeline (PL111 CLCD -> QEMU SDL) from an LVGL-side problem. Ctrl-C to quit.
fb_mmap() mapped the CLCD VRAM cacheable, which is wrong for a framebuffer: on real hardware the CPU writes linger in the data cache and never reach the scanout buffer. Map it non-cacheable (nocache=true). (Under QEMU/TCG it is cosmetic since the cache is not modelled, but it is required on real targets.)
SO3 drives the PL111 CLCD + PL050 keyboard/mouse that the so3 QEMU patch wires unconditionally into '-M virt'; it has no virtio-gpu driver, so the virtio-gpu/keyboard/mouse devices only added a competing blank console. More importantly the SDL backend did not present the PL111 console's surface at all (verified: pl110 renders the framebuffer into the surface - monitor 'screendump' shows it - yet the SDL window stayed black). Switching to '-display gtk' shows the panel correctly (and its View menu lists every console). Drop the virtio-gpu/keyboard/mouse devices.
Paint the colour-bar background once, then per frame only restore the square's previous rows and redraw it, instead of memcpy-ing the whole 3 MB framebuffer every frame.
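A small Python sketch of the partial redraw, using byte arrays in place of the mmap()ed VRAM. The function name, parameters and the tiny 8x8 geometry are hypothetical; the actual test program is C.

```python
def redraw_square(fb, background, stride, prev_top, top, left, size, pixel):
    """Restore the rows the square covered last frame, then draw it again."""
    # Copy back only the previously covered rows from the saved background.
    start, end = prev_top * stride, (prev_top + size) * stride
    fb[start:end] = background[start:end]
    # Draw the square at its new position, one row segment at a time.
    for row in range(top, top + size):
        offset = row * stride + left * len(pixel)
        fb[offset:offset + size * len(pixel)] = pixel * size

# 8x8 framebuffer, 1 byte per pixel, background painted once.
background = bytes([1]) * 64
fb = bytearray(background)
redraw_square(fb, background, 8, 0, 0, 1, 2, b"\xff")  # first frame
redraw_square(fb, background, 8, 0, 4, 1, 2, b"\xff")  # square moved down
```

Per frame this touches `2 * size` rows at most instead of the whole buffer.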
The serial IRQ delivered SIGINT to current() - whatever thread happened to be running when the Ctrl-C key arrived. A foreground app asleep in a syscall (e.g. usleep) is not the running thread (the idle thread is, with pcb==NULL), so Ctrl-C was silently dropped; it only worked for CPU-busy apps. And at the shell prompt the prompt was never reprinted. Two parts:
1. Track the foreground console process. Add a global fg_pcb, set by sys_do_wait4() to the child a process blocks waiting on (the shell's foreground job) and restored to the waiter when it exits. The serial IRQ now targets fg_pcb (fallback: current()), so SIGINT reaches the foreground app even while it sleeps.
2. Cancel the line at the prompt instead of signalling the shell. When a console read is in progress (read_lock held), the IRQ sets serial_intr; pl011_get_byte returns ETX and console_getc discards the typed line and returns an empty line, so the shell's fgets returns and it reprints the prompt once. This avoids musl's sticky-EOF on a 0-byte read and a siglongjmp-through-fgets file-lock leak. Matches the driver's existing read_lock design comment.
Relies on the cooperative WAITING-thread teardown for the kill path.
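The foreground-tracking part can be sketched as a tiny state machine. This Python model is illustrative only: the class and method names are hypothetical, and processes are represented by plain strings.

```python
class Console:
    """Tracks which process should receive SIGINT from the serial IRQ."""

    def __init__(self):
        self.fg_pcb = None  # no foreground process known yet

    def wait_on(self, waiter, child):
        # Models sys_do_wait4(): the child being waited on becomes foreground.
        self.fg_pcb = child

    def child_exited(self, waiter):
        # The waiter (e.g. the shell) becomes foreground again.
        self.fg_pcb = waiter

    def sigint_target(self, current):
        # Prefer the foreground process; fall back to the running one.
        return self.fg_pcb if self.fg_pcb is not None else current

console = Console()
console.wait_on("shell", "lvgl_demo")
print(console.sigint_target(current="idle"))  # lvgl_demo, even though idle is running
console.child_exited("shell")
print(console.sigint_target(current="idle"))  # shell
```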
Mirror the virt32 graphical fix onto the virt64 branch: SO3 drives the same PL111 CLCD + PL050 (virt64.dts has clcd@08800000 / pl050 nodes), has no virtio-gpu driver, and the SDL backend does not present the PL111 console. Switch to '-display gtk' and drop the virtio-gpu/keyboard/mouse devices. The flash0.img AVZ-vs-U-Boot boot heuristic is unchanged. Untested (no virt64 graphical run this session) but the framebuffer path is identical to virt32; the kernel-side fixes (non-cacheable fb, Ctrl-C) are arch-shared.
An interrupted task (e.g. Ctrl-C during a clean) can leave a recipe WORKDIR that exists but lacks its temp/ subdir. bitbake then cannot create that task's fifo and fails with do_clean: [Errno 2] No such file or directory: .../temp/fifo.NNNN (hit on 'build.sh -ca bsp-so3'). Before clean/build, scan tmp/work and remove any workdir missing temp/ (it holds nothing useful) so bitbake recreates it cleanly.
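A sketch of the pre-flight scan in Python. The helper name is hypothetical, and the caller is assumed to pass the candidate workdirs explicitly, since the directory depth under tmp/work is not spelled out here.

```python
import shutil
import tempfile
from pathlib import Path

def purge_stale_workdirs(workdirs):
    """Remove every existing workdir that lacks a temp/ subdirectory."""
    removed = []
    for workdir in map(Path, workdirs):
        if workdir.is_dir() and not (workdir / "temp").is_dir():
            shutil.rmtree(workdir)
            removed.append(workdir)
    return removed

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        good = Path(root) / "so3"
        (good / "temp").mkdir(parents=True)
        stale = Path(root) / "usr-so3"
        stale.mkdir()
        print([p.name for p in purge_stale_workdirs([good, stale])])  # ['usr-so3']
```

Workdirs that still contain temp/ are left untouched, so the scan does not interfere with intact build state.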
'build.sh -ca bsp-so3' failed with
usr-so3 do_clean: [Errno 2] No such file or directory: .../temp/fifo.NNNN
Root cause: the lvgl bbappend's shell do_clean:append ran
'rm -rf ${WORKDIR}/*', which deleted the running clean task's own temp/
(holding its fifo + run script) mid-execution, leaving an empty workdir.
The next clean then could not create its fifo there and failed.
Fix: make do_clean a Python task (usr.bbclass) plus Python do_clean:append
in the usr-so3 recipe and the lvgl bbappend. Python tasks create their
temp dir themselves and use no fifo, so they are robust when the workdir
is fresh/empty. The lvgl append no longer touches WORKDIR (bitbake owns
it); it only purges the fetched lvgl tree (in-tree usr/lib/lvgl, src/lib,
${S}/lib/lvgl). Verified: fresh clean, repeat clean, full 'bsp-so3 -c
clean', and clean->rebuild (aarch64) all succeed.
…2 entries Remove the IB_PLATFORM:so3 override: SO3 now always builds for the main IB_PLATFORM. That override was referenced only here and resolved via OVERRIDES, so a value diverging from IB_PLATFORM silently built SO3 for the wrong arch. The standalone / AVZ-guest / capsule contexts are not distinct platforms - they differ only by IB_CONFIG:so3 / IB_TARGET_ITS:so3 (e.g. capsule = virt64_capsule_defconfig + virt64_capsule), which are independent of the platform variable. Also add the virt32 counterparts that were missing (PREFERRED_VERSION_so3, IB_CONFIG:so3, IB_TARGET_ITS:so3, IB_STORAGE_MODE) and default IB_PLATFORM to virt64.
AVZ is an EL2 hypervisor. The virt64 launcher only enabled EL2 (virtualization=on) when filesystem/flash0.img was present (ATF chain); booting AVZ via the ITS without ATF used plain -M virt,gic-version=2 (EL1), so AVZ faulted on its first EL2 system-register write (Synchronous Abort -> reset). Detect the selected so3 ITS from local.conf and, when it is an avz ITS, add virtualization=on with -kernel u-boot (virt64_defconfig is EL2-aware). Verified: AVZ now boots.
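The detection step could look roughly like the Python sketch below. It is an approximation: the function names are hypothetical, the "avz" substring match is an assumption about how an AVZ ITS is recognised, and only the IB_TARGET_ITS:so3 variable and the two machine strings come from this PR.

```python
import re

ITS_LINE = re.compile(r'\s*IB_TARGET_ITS:so3\s*=\s*"([^"]*)"')

def selected_its(conf_text):
    """Return the last uncommented IB_TARGET_ITS:so3 value, or None."""
    value = None
    for line in conf_text.splitlines():
        match = ITS_LINE.match(line)  # commented lines start with '#', so no match
        if match:
            value = match.group(1)
    return value

def machine_option(its):
    """Build the -M argument, enabling EL2 only for an AVZ ITS."""
    base = "virt,gic-version=2"
    if its and "avz" in its:  # assumption: AVZ ITS names contain "avz"
        return base + ",virtualization=on"
    return base

conf = '#IB_TARGET_ITS:so3 = "virt64"\nIB_TARGET_ITS:so3 = "virt64_avz_so3"\n'
print(machine_option(selected_its(conf)))
```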
do_build compiles gcc 12.4.0 with in-tree mpfr/gmp/mpc. When the source tree has inconsistent timestamps (configure.ac newer than configure, as on a fresh copy that doesn't preserve mtimes), make tries to regenerate the autotools files and invokes automake-1.17/autoconf. The so3-env CI image ships no autotools (and Ubuntu 24.04 has automake 1.16, not the 1.17 mpfr wants), so do_build died with 'automake-1.17: command not found'. --disable-maintainer-mode turns the regen rules into no-ops, making the toolchain build environment- and timestamp-independent. Verified by reproducing the exact failure in the so3-env container and confirming the flag builds gcc past the mpfr stage.
Temporary diagnostic: the toolchain build fails only on GitHub-hosted runners (passes locally and on a self-hosted box with the same image and commit), and the inner log.do_build is never shown in the CI console. Print nproc/df/free and tail the failing toolchain log so we can see the actual error. To be reverted once diagnosed.
The CI failure on 32852c3 was transient: the same build logic passed on re-run (and passes locally + on a self-hosted box). The runner had ample disk (85G) and RAM (16G), so the cause was a flaky mirror download: musl-cross-make fetches tarballs from ftpmirror.gnu.org during do_build with a no-retry 'wget -c -O'. Override DL_CMD with --tries/--waitretry/--timeout so a single bad mirror recovers instead of failing the toolchain build. Also revert the temporary build.yml diagnostic (cedbc42) now that the root cause is understood; the workflow is back to its clean form.
Reproduces .github/workflows/build.yml without pushing: exports the git-tracked tree into a throwaway dir under $HOME (snap/rootless Docker cannot bind-mount /tmp) and runs the exact 'build.sh -k so3' + 'build.sh -x usr-so3' in the so3-env image, per platform. Mounting only tracked files means untracked-but-referenced sources fail locally exactly as in CI, and build/tmp is excluded so the toolchain builds from scratch. Use -r <ref> for an exact committed state.
The toolchain build failed intermittently in CI (FAIL/FAIL/PASS/PASS/FAIL across runs), always at musl-toolchain do_build, very early and with no build output — i.e. a download failure. musl-cross-make's default GNU_SITE is ftpmirror.gnu.org, which 302-redirects to a random mirror; incomplete mirrors 404 and wget --tries just re-hits the same redirect. Pin GNU_SITE to the canonical https://ftp.gnu.org/gnu (complete, no random mirror); keep the wget retries as a safety net. Also keep a minimal on-failure dump of the toolchain do_build log in CI so any residual download flake is diagnosable without a separate commit.
The Check Code Style workflow was red (pre-existing): after the Infrabase migration the SO3 sources moved under so3/, so check-path 'so3' swept in vendored code (micropython, libxml2) and check-path 'usr/src' (no longer a real dir) silently fell back to scanning the whole repo. clang-format also flagged genuinely-misformatted first-party files.
- Point check-paths at the real nested dirs: so3/so3 (kernel) and so3/usr (user space); exclude vendored trees (micropython, libxml2, usr/lib/linux, lvgl).
- Reformat the 13 tracked first-party files that violated the repo's own .clang-format (5 kernel, 7 usr/lib/slv, fb_test.c).
Verified by replicating the action's exact logic (find + exclude regex, clang-format 19, --style=file) over the tracked tree: both jobs report 0 failures.
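For illustration, the path selection can be approximated in Python as below. The regex and the sample paths are hypothetical reconstructions from the directory names listed above, not the workflow's actual configuration.

```python
import re

ROOTS = ("so3/so3/", "so3/usr/")  # the two check-paths named above
# Assumed exclude pattern covering the vendored trees.
EXCLUDE = re.compile(r"/(micropython|libxml2|lvgl)/|/usr/lib/linux/")

def files_to_check(tracked):
    """Keep first-party C sources under the check-paths, minus vendored trees."""
    return [
        path for path in tracked
        if path.startswith(ROOTS)
        and path.endswith((".c", ".h"))
        and not EXCLUDE.search(path)
    ]

sample = [
    "so3/so3/kernel/thread.c",
    "so3/so3/micropython/py/obj.c",
    "so3/usr/lib/lvgl/src/lv_obj.c",
    "so3/usr/lib/linux/types.h",
    "so3/usr/lib/slv/slv.c",
    "README.md",
]
print(files_to_check(sample))
```

Anchoring the roots with a trailing slash avoids the silent fall-back to the whole repository that the old 'usr/src' path caused.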
Make the generic build/ files byte-identical to edgem1 where they should be (meld-minimal), while keeping the torizon/e1c separation intact:
- restore the EDGEMTech copyright headers on the generic layer files (meta-so3/meta-qemu/meta-rootfs/meta-filesystem/meta-uboot layer.conf, avz/so3 bbclass, bsp-so3, rootfs-so3, so3_6.2.0)
- drop the dead utils_restore_user_ownership() call in usr-so3 (undefined, error-path only)
- drop a stray whitespace line in rootfs-linux
4934926 to 97585d5
@AndreCostaaa @clemdiep You can proceed with the review :-) Thanks.
Rewrite the landing README around the three build modes (standalone / AVZ / SO3 capsule), supported targets, and a clear pointer to the published documentation as the source of truth. Remove the no-longer-current discourse.heig-vd.ch discussion-forum link and the obsolete in-tree CI-patch and ./st/./stv/./stg run notes (all covered by doc/ now).
The discourse.heig-vd.ch forum no longer exists. Remove the 'Discussion forum' section from the index (keeping the sponsor acknowledgement and the HEIG-VD/REDS logo) and the forum link from the LVGL page; questions now go through GitHub issues / the maintainer (see the README).
Add a proper 'Welcome to SO3' opening and a dedicated section explaining SO3's defining trait — polymorphism: one source tree built into a standalone OS (EL1), the AVZ hypervisor (EL2), or an SO3 capsule (S3C) on top of AVZ beside a Linux agency.
The source IB_TARGET/fs is the rootfs image loop-mounted as root, and the ext4 rootfs partition needs ownership/perms/symlinks preserved. Replace the unprivileged non-preserving `cp -rv` (which aborts on root-owned files) with `sudo cp -av`. Keeps this recipe identical to the edgem1 tree.
-k so3 -> -x so3, -f -> -x filesystem (the -k/-b/-r/-f options were removed when build.sh/deploy.sh were reduced to -a/-x).
clemdiep
left a comment
This seems OK to me; I didn't find any regression in the sources.
I haven't had time to test it, however.
Just one small comment that I want addressed before approving.
# Physical device used when IB_STORAGE_MODE = "hard"
IB_STORAGE_DEVICE:rpi4 = "mmcblk0"
IB_STORAGE_DEVICE:rpi4_64 = "mmcblk0"
IB_STORAGE_DEVICE:verdin-imx8mp = "sda"
I am not a fan of having default values set for this.
For people who don't know the build system, or when you make a clean clone and then deploy mechanically without thinking about it, this can lead to an unrecoverable drive overwrite (at least for sda), as those drives aren't the default for everyone.
I suggest commenting out those parameters and making sure an error is triggered when one is required but unset. This will make it safer.
Fully agreed. I commented them out and added a proper error message.
The guard tested IB_STORAGE_DEVICE == "", but a commented-out (unset)
variable is None in bitbake, not "", so it never fired — a mechanical
hard/target deploy could then write to a wrong/default device (e.g.
/dev/sda) and overwrite a host disk. Use 'not IB_STORAGE_DEVICE' (catches
None and empty) with an explicit message, and add the missing guard to the
verdin class (which formatted /dev/${device} unconditionally).
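The difference between the two guards is plain Python truthiness, which the sketch below demonstrates. The function names and the error text are hypothetical; the only facts relied on are that an unset variable reads back as None and that the old test compared against "".

```python
def old_guard_fires(device):
    # Previous test: only an explicitly empty string was caught.
    return device == ""

def new_guard_fires(device):
    # New test: catches both None (unset) and "" (empty).
    return not device

def check_storage_device(mode, device):
    """Refuse a 'hard' deploy when no device was configured."""
    if mode == "hard" and new_guard_fires(device):
        raise ValueError("IB_STORAGE_DEVICE must be set when IB_STORAGE_MODE is 'hard'")
    return device

print(old_guard_fires(None))  # False: the unset case slipped through
print(new_guard_fires(None))  # True
```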
A default device on a hard deploy (e.g. /dev/sda) could overwrite a host disk on a mechanical clean-clone deploy. Leave it unset; the do_fs_init_storage guard now fails explicitly when 'hard' mode needs it (commit 949090c).