MsgReader: missing config validation + retries forever on NATS BadRequest

## Summary

Two related shortcomings in `serverish.messenger.MsgReader`, surfaced while debugging an unrelated event-loop-starvation bug in `oca_monitor`:

1. **No upfront validation of `deliver_policy` ↔ start-marker consistency.** `MsgReader` accepts mutually-required-fields-missing combinations silently at construction. NATS rejects the consumer create on the first read attempt, but by then the reader is already wired into a caller's `async for` loop.
2. **Retry loop never gives up on configuration errors.** The recovery path in `_reopen_pull_subscribtion` rebuilds the same malformed `ConsumerConfig` and fires it at the server, which keeps replying 400. Backoff grows without bound; the reader stays silent forever and the iterator never raises.

## Repro

Against any reachable NATS with JetStream:

```python
import asyncio
from serverish.messenger import Messenger

async def go():
    msg = Messenger()
    await msg.open(host="nats.example", port=4222)
    rdr = msg.get_reader("telemetry.weather.davis", deliver_policy="by_start_time")
    #                                                              ^^^^^^^^^^^^^^^
    # opt_start_time intentionally omitted — should be a hard error
    async for data, meta in rdr:
        print(data); break

asyncio.run(go())
```

**Expected:** a `ValueError` (or similar) at construction time saying `opt_start_time` is required when `deliver_policy='by_start_time'`.

**Actual:** construction succeeds, then this line repeats with growing backoff forever:

```
[WARNING] msg_reader: (N)[PULL]MsgReader_xxx [...] read_next error,
  (retry in N.Ns): nats: BadRequestError: code=400 err_code=10094
  description='consumer delivery policy is deliver by start time,
              but optional start time is not set'
  elapsed: NN.Ns hist: ensure_open-err
```

The same shape applies to `deliver_policy='by_start_sequence'` without `opt_start_seq` (NATS replies with `err_code=10095`).

## Root cause

In `messenger/msg_reader.py`:

* `__init__` (≈line 75) trusts the param descriptors and never cross-validates `deliver_policy` against `opt_start_time` / `consumer_cfg`.
* `_create_consumer_cfg` (≈line 390) builds the config from `self.consumer_cfg.copy()` and only copies `opt_start_time` when set — silently producing a malformed config.
* `_reopen_pull_subscribtion` (≈line 330) calls `_create_consumer_cfg` again on every retry. There is no fatal-error exit.

## Suggested fix

### (a) Validate at construction

In `MsgReader.__init__`, after the param coercion, raise `ValueError` if any of:

* `deliver_policy == 'by_start_time'` and `opt_start_time is None`
* `deliver_policy == 'by_start_sequence'` and `consumer_cfg.get('opt_start_seq') is None`

Surface message: e.g. `"deliver_policy='by_start_time' requires opt_start_time"`.

### (b) Classify NATS API errors as fatal vs transient

`nats-py` already exposes the classification we need — its `nats.js.errors` hierarchy maps NATS' HTTP-style status codes:

```
APIError                              # base
├── BadRequestError          (400)    # config error → fatal
├── NotFoundError            (404)    # stream/consumer missing → fatal
├── ServerError              (500)    # server bug → transient
└── ServiceUnavailableError  (503)    # overload/disconnect → transient
```

Plus `nats.errors.TimeoutError` and `ConnectionClosedError` for transport-level transients.

Proposed rule, simple to implement and aligned with NATS docs:

* **Fatal** (stop retrying, mark reader stopped, raise from `__anext__`): `BadRequestError`, `NotFoundError`. These mean *the request itself is wrong* — the same payload will never succeed.
* **Transient** (keep current backoff retry): `ServerError`, `ServiceUnavailableError`, `TimeoutError`, `ConnectionClosedError`, generic `APIError` whose `code` is `5xx`.

The `err_code` field (e.g. `10094`, `10095`) gives even finer granularity if needed in the future, but the 4xx/5xx split is correct and matches the [NATS JetStream error reference](https://github.com/nats-io/nats-server/blob/main/server/jetstream_errors_generated.go).

For surfacing the fatal case to callers, a new exception type — e.g. `MessengerReaderConfigError(MessengerError)` — raised from `__anext__` would let `async for` consumers fail loudly rather than silently iterate forever.

## Why it matters

Today an oca-side mistake (forgetting `opt_start_time`) is invisible to ops — the reader just stays silent. Worse, the same fatal-vs-transient gap means a transient *configuration* problem (e.g. someone briefly deletes a stream and recreates it under a new name) can leave the reader stuck in a no-progress retry storm even after the underlying issue clears, because each retry resends an outdated config.

(Surfaced while chasing a separate `oca_monitor` issue where ~50 readers all timed out concurrently during a slow synchronous startup tick. The freeze itself recovers cleanly when the params are correct — but if any reader's params were bad the retry loop just hides it.)

I'm happy to follow up with a PR for (a) and (b) once we agree on the exception hierarchy / public name. Let me know if you'd prefer a single combined PR or two.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MsgReader: missing config validation + retries forever on NATS BadRequest #30

Summary

Repro

Root cause

Suggested fix

(a) Validate at construction

(b) Classify NATS API errors as fatal vs transient

Why it matters

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

MsgReader: missing config validation + retries forever on NATS BadRequest #30

Description

Summary

Repro

Root cause

Suggested fix

(a) Validate at construction

(b) Classify NATS API errors as fatal vs transient

Why it matters

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions