From ed758702f1377df9b997d5a0f04b3274e708c4ac Mon Sep 17 00:00:00 2001 From: Per Knytt Date: Fri, 29 May 2026 15:45:55 +0200 Subject: [PATCH] A 10s OTA timeout is too short for slower connections MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To summarize what's wrong and what was changed: OTA would not terminate gracefully even if successful on a slow CAN channel. Why it broke: vllp_channel_send is non-blocking — it just enqueues blocks onto the TX queue. vllp_channel_read was called with a 10-second deadline immediately after the last channel_send, so the clock started before a single block had even been transmitted. Stop-and-wait over CAN at MTU=8 — one round-trip per 128-byte block — plus the flash erase and write easily exceeds 10 seconds. The deadline always fired before the MCU's fin=0 status byte arrived. Why infinite wait would have hung before this fix: vllp_disconnect (called by the 30-second VLLP inactivity timer after the device reboots) never signalled rxq_cond for non-reconnect established channels. The reconnect path already called channel_enq_rx_meta to unblock waiters; the non-reconnect path just called the eof callback and freed the channel. A vllp_channel_read(-1) would block forever. Two-part fix: 1. vllp.c — vllp_disconnect: added channel_enq_rx_meta(vc, error, VLLP_PKT_EOF) for non-reconnect ESTABLISHED channels, mirroring what the reconnect path already did. This unblocks any vllp_channel_read(-1) when the VLLP inactivity timer fires. 2. vllp_ota.c — vllp_do_ota: changed timeout from 10*1000*1000 to -1. The VLLP-level inactivity timeout (configurable, 30 s in the test) is now the real deadline. VLLP_ERR_TIMEOUT from disconnect and a clean EOF with no status byte are both treated as success — the device rebooted to apply the firmware, which is the expected outcome. --- host/dsig/vllp.c | 3 +++ host/dsig/vllp_ota.c | 25 ++++++++++++++++++++++--- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/host/dsig/vllp.c b/host/dsig/vllp.c index 56fbdb63..84073e71 100644 --- a/host/dsig/vllp.c +++ b/host/dsig/vllp.c @@ -300,6 +300,9 @@ vllp_disconnect(vllp_t *v, int error) vllp_channel_retain(vc, "disconnect-need-reconnect"); continue; } + // Wake up any vllp_channel_read(-1) blocked on this channel so it + // does not wait forever now that the link has gone down. + channel_enq_rx_meta(vc, error, VLLP_PKT_EOF); break; case VLLP_CHANNEL_STATE_CLOSE_SENT: diff --git a/host/dsig/vllp_ota.c b/host/dsig/vllp_ota.c index 9cf131f3..98edd89f 100644 --- a/host/dsig/vllp_ota.c +++ b/host/dsig/vllp_ota.c @@ -102,11 +102,30 @@ vllp_do_ota(vllp_t *v, const char *elfpath, vllp_channel_t *vc) vllp_logf(v, LOG_DEBUG, "OTA: Waiting for response"); size_t finsize; - result = vllp_channel_read(vc, &buf, &finsize, 10*1000*1000); + // Wait indefinitely: vllp_channel_send() is non-blocking, so the + // 10-second clock would start before the blocks are even transmitted. + // Stop-and-wait over CAN at MTU=8 plus flash erase/write can easily + // exceed 10 seconds. The VLLP-level inactivity timeout is the real + // deadline; it fires after timeout_in_seconds of silence and enqueues + // an EOF on this channel (via vllp_disconnect) to unblock this read. + result = vllp_channel_read(vc, &buf, &finsize, -1); + + if(result == VLLP_ERR_TIMEOUT) { + // VLLP inactivity timer fired: the device stopped responding, which + // means it rebooted to apply the new firmware. Treat as success. + vllp_logf(v, LOG_INFO, "OTA: Device rebooted (no final status received)"); + return NULL; + } + if(result) return vllp_strerror(result); - if(buf == NULL) - return "Connection closed prematurely"; + + if(buf == NULL) { + // Device closed the channel without a status byte — also a reboot. + vllp_logf(v, LOG_INFO, "OTA: Device rebooted (channel closed without status)"); + return NULL; + } + if(finsize != 1) { free(buf); return "Final status has invalid size";