front: route to a pinned/owning engine that's busy, not a duplicate elsewhere

An engine mid-generation is GIL-blocked and fails the health poll, so it
reads as unhealthy. pick_engine required e.healthy at every step, so a
second request for a model pinned to that engine fell through to the
least-loaded engine, which loaded a DUPLICATE copy (and ignored the
model's configured n_ctx, e.g. 2048 vs 32000 → "exceeds context window").

Honour the pin (and the assigned owner) when the engine is alive but
transiently busy: route there so the request queues on its gen-lock and
the owner handles serialization/eviction. Only fall back to another engine
when the owner's process is actually dead.

Adds Engine.is_alive() (process liveness) and registry.engine_owning()
(health-agnostic owner lookup).
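
A minimal sketch of the routing rule described above, not the actual patch.
Only pick_engine, Engine.is_alive(), registry.engine_owning() and e.healthy
are named in this commit; the Registry class, the alive/load fields, the
pins/owners dicts and the pin-before-owner lookup order are assumptions made
for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Engine:
    name: str
    healthy: bool = True   # result of the last health poll (False while busy)
    alive: bool = True     # hypothetical stand-in for a real process check
    load: int = 0          # hypothetical load metric for the fallback path

    def is_alive(self) -> bool:
        # Process liveness only; says nothing about whether the poll passed.
        return self.alive


class Registry:
    """Hypothetical registry: maps models to pinned / owning engine names."""

    def __init__(self, engines, pins=None, owners=None):
        self.engines = list(engines)
        self.pins = dict(pins or {})      # model -> engine name (assumed)
        self.owners = dict(owners or {})  # model -> engine name (assumed)

    def engine_owning(self, model) -> Optional[Engine]:
        # Health-agnostic lookup: return the pinned or owning engine even
        # if its last health poll failed.
        name = self.pins.get(model) or self.owners.get(model)
        return next((e for e in self.engines if e.name == name), None)


def pick_engine(registry: Registry, model: str) -> Optional[Engine]:
    owner = registry.engine_owning(model)
    # Alive but busy: route to the owner so the request queues behind the
    # running generation instead of loading a duplicate copy elsewhere.
    if owner is not None and owner.is_alive():
        return owner
    # Owner missing or dead: fall back to the least-loaded healthy engine.
    candidates = [e for e in registry.engines if e.healthy and e.is_alive()]
    return min(candidates, key=lambda e: e.load, default=None)
```

With a model pinned to a busy engine (healthy=False, alive=True), this
pick_engine returns that engine; once the engine reports alive=False, the
request goes to the least-loaded healthy engine instead.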

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>