front: route to a pinned/owning engine that's busy, not a duplicate elsewhere

An engine mid-generation is GIL-blocked and fails the health poll, so it
reads as unhealthy. pick_engine required e.healthy at every step, so a
second request for a model pinned to that engine fell through to the
least-loaded engine — which loaded a DUPLICATE copy (and ignored the
model's configured n_ctx, e.g. 2048 vs 32000 → "exceeds context window").

Honour the pin (and the assigned owner) when the engine is alive but
transiently busy: route there so the request queues on its gen-lock and
the owner handles serialization/eviction. Only fall back to another engine
when the owner's process is actually dead. Adds Engine.is_alive() (process
liveness) and registry.engine_owning() (health-agnostic owner lookup).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent cf50ab84
@@ -83,6 +83,19 @@ class Engine:
     def can_serve(self, required_cap: Optional[str]) -> bool:
         return (not required_cap) or (required_cap in self.capabilities)
 
+    def is_alive(self) -> bool:
+        """Process is up (so 'unhealthy' means busy/GIL-blocked, not dead).
+
+        An engine mid-generation can't answer the health poll and reads as
+        unhealthy, but it's the right place to send a request pinned/assigned to
+        it — the request queues on its gen-lock instead of duplicating the model
+        elsewhere. A None proc means externally managed; assume alive."""
+        p = self.proc
+        try:
+            return p is None or p.poll() is None
+        except Exception:
+            return True
+
 
 class EngineRegistry:
     def __init__(self):
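The liveness check rests on `subprocess.Popen.poll()`, which returns `None` while the child is still running and its exit code once it has exited. The sketch below illustrates that behaviour under the assumption that `proc` is a `subprocess.Popen` handle (the diff does not show how `proc` is created). `FakeEngine` and `demo` are hypothetical names used only for illustration.

```python
import subprocess
import sys


class FakeEngine:
    """Hypothetical stand-in for Engine; only the `proc` attribute matters."""

    def __init__(self, proc=None):
        self.proc = proc

    def is_alive(self) -> bool:
        # Same body as the method added in this commit.
        p = self.proc
        try:
            return p is None or p.poll() is None
        except Exception:
            return True


def demo() -> dict:
    """Report is_alive() for an external, a running and a terminated engine."""
    external = FakeEngine(proc=None)  # externally managed: assumed alive
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    running = FakeEngine(proc=child)
    results = {
        "external": external.is_alive(),
        "running": running.is_alive(),
    }
    child.terminate()
    child.wait()  # reap the child so poll() reports its exit code
    results["terminated"] = running.is_alive()
    return results
```

Note that the check says nothing about whether the engine can answer requests. It only separates "process exited" from "process still running", which is the distinction the routing change needs.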
@@ -192,6 +205,24 @@ class EngineRegistry:
                         return e
         return None
 
+    def engine_owning(self, model_key: str) -> Optional[Engine]:
+        """The engine ASSIGNED this model, regardless of current health.
+
+        Like engine_for_assigned but without the healthy filter — so a request can
+        be routed to its owner while the owner is transiently busy (mid-generation,
+        failing health polls) and queue there, rather than spawning a duplicate on
+        another engine. Callers should gate on ``is_alive()``."""
+        if not model_key:
+            return None
+        short = _short_stem(model_key)
+        with self._lock:
+            for e in self._engines.values():
+                for k in e.assigned_models:
+                    if (k == model_key or _short_stem(k) == short
+                            or k.endswith(model_key) or model_key.endswith(k.split("/")[-1])):
+                        return e
+        return None
+
     def least_loaded(self, required_cap: Optional[str] = None) -> Optional[Engine]:
         """Pick a healthy, capability-compatible engine to load a new model on:
         fewest resident models, then most free VRAM."""
......
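The key comparison inside `engine_owning` accepts several spellings of the same model. The sketch below lifts that four-way test into a standalone function so the accepted forms are easier to see. `_short_stem` is not shown in this diff, so the version here is an assumption (file name without directory or extension), and `keys_match` is a hypothetical name.

```python
import os


def _short_stem(key: str) -> str:
    # ASSUMPTION: the real helper is not part of this diff. Here it is taken
    # to mean "file name without directory or extension".
    return os.path.splitext(key.split("/")[-1])[0]


def keys_match(assigned: str, model_key: str) -> bool:
    """The comparison used in engine_owning, lifted out for illustration."""
    return (assigned == model_key
            or _short_stem(assigned) == _short_stem(model_key)
            or assigned.endswith(model_key)
            or model_key.endswith(assigned.split("/")[-1]))


# An assigned path matches its bare file name and its stem, but not a
# different model.
same_file = keys_match("models/llama-3.gguf", "llama-3.gguf")
same_stem = keys_match("models/llama-3.gguf", "llama-3")
different = keys_match("models/llama-3.gguf", "mistral.gguf")
```

Under these assumptions the first two comparisons match and the third does not. The real behaviour depends on what `_short_stem` actually does.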
@@ -123,10 +123,17 @@ def pick_engine(registry: EngineRegistry, path: str, method: str,
     # 1. Per-model pin (models.json "engine") — only honoured if compatible.
     if pinned:
         e = registry.by_name(pinned)
-        if e and e.healthy and e.can_serve(cap):
-            return e
+        if e and e.can_serve(cap):
+            if e.healthy:
+                return e
+            # Pinned engine is busy (mid-generation → failing health polls) but
+            # its process is alive: route here anyway so the request queues on
+            # its gen-lock, instead of duplicating a pinned model on another
+            # engine (which also ignores its configured n_ctx etc.).
+            if e.is_alive():
+                return e
-        # Pin can't be honoured — say why (once per model+engine) instead of
-        # silently falling back, so a misconfiguration is visible in the logs.
+        # Pin can't be honoured (engine down or incompatible) — say why (once
+        # per model+engine) instead of silently falling back.
         _warn_bad_pin(model, pinned, cap, e)
 
     # 2. Engine that already has the model resident.
@@ -135,13 +142,20 @@ def pick_engine(registry: EngineRegistry, path: str, method: str,
         if e:
             return e
 
-    # 3. Configured default engine, when it can serve this request.
+    # 3. The assigned owner, busy-but-alive: prefer queueing on the engine that
+    # owns this model over loading a second copy on a different one.
+    if model:
+        owner = registry.engine_owning(model)
+        if owner is not None and owner.can_serve(cap) and owner.is_alive():
+            return owner
+
+    # 4. Configured default engine, when it can serve this request.
     if default_engine:
         e = registry.by_name(default_engine)
         if e and e.healthy and e.can_serve(cap):
             return e
 
-    # 4. Least-loaded compatible engine; then any engine rather than 503.
+    # 5. Least-loaded compatible engine; then any engine rather than 503.
     return (registry.least_loaded(cap)
             or registry.least_loaded(None)
             or registry.primary())
......
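The effect of the new ordering can be reproduced with a reduced model of the router. This is a sketch, not the project's `pick_engine`: it keeps only the pin, owner and fallback steps, leaves out capabilities, the resident-model lookup and the default engine, and uses hypothetical names (`StubEngine`, `pick`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubEngine:
    """Hypothetical stand-in for Engine with just the fields the sketch needs."""
    name: str
    healthy: bool = True
    alive: bool = True
    assigned_models: set = field(default_factory=set)

    def is_alive(self) -> bool:
        return self.alive


def pick(engines: List[StubEngine], model: str,
         pinned: Optional[str] = None) -> StubEngine:
    by_name = {e.name: e for e in engines}
    # Pin: honoured when the engine is healthy, or busy but still alive.
    e = by_name.get(pinned) if pinned else None
    if e and (e.healthy or e.is_alive()):
        return e
    # Owner: queue on the engine assigned this model while it is alive.
    owner = next((x for x in engines if model in x.assigned_models), None)
    if owner is not None and owner.is_alive():
        return owner
    # Fallback: first healthy engine, else any engine.
    return next((x for x in engines if x.healthy), engines[0])


busy = StubEngine("a", healthy=False, alive=True, assigned_models={"m"})
idle = StubEngine("b")
via_pin = pick([busy, idle], "m", pinned="a").name
via_owner = pick([busy, idle], "m").name
busy.alive = False  # the owning process has exited
after_death = pick([busy, idle], "m", pinned="a").name
```

In this reduced model a busy-but-alive engine keeps its requests through either the pin or the owner step, and the request moves to the other engine only once the process is gone.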