Stefy Lanza (nextime / spora ) authored
docker/backend: graceful llama-cpp load + additive GPU modes + libcuda mapping; admin GGUF batch/slots tuning

Backend robustness:
- vulkan.py catches Exception (not just ImportError) around the
  llama_cpp import: a CUDA-built llama-cpp missing libcuda.so.1 raised
  RuntimeError/OSError that crash-looped the whole server. Now it logs
  a warning and marks the Vulkan/GGUF backend unavailable; CUDA/CPU/ds4
  keep working.
- detect_available_backends() reads LLAMA_CPP_AVAILABLE instead of
  re-importing (which re-raised the same error).

Docker launcher (run_oci.sh):
- GPU backends are now additive: --nvidia --vulkan enables both (maps
  libcuda via --gpus all AND /dev/dri). Added --all and
  --with-libcuda[=PATH].
- --vulkan auto bind-mounts the host's libcuda.so.1 (the bundled
  llama-cpp is a CUDA build), so Vulkan GGUF loads without full
  --gpus all. Banner shows mode set and libcuda status.

Dist bundle:
- New uninstall.sh (removes runner + optional image), wired into
  make_dist_bundle.
- install.sh + uninstall.sh print what they'll do and confirm before
  proceeding, bypassable with --yes/-y.

Admin GGUF tuning:
- Expose n_batch / n_ubatch / n_seq_max (llama.cpp -b/-ub/-np) in the
  model config UI and apply them in the Vulkan backend to shrink VRAM
  at the ceiling; n_seq_max gated on llama-cpp-python support.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
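
The backend robustness change can be illustrated with a minimal sketch. It is not the code from vulkan.py. The helper `try_import`, the logger name, and the list returned by `detect_available_backends()` are assumptions made for illustration; only the names `llama_cpp`, `LLAMA_CPP_AVAILABLE` and `detect_available_backends` come from the commit message.

```python
import importlib
import logging

logger = logging.getLogger("backend.vulkan")  # hypothetical logger name


def try_import(module_name):
    """Return (module, True) on success, (None, False) on any import-time failure."""
    try:
        return importlib.import_module(module_name), True
    except Exception as exc:
        # A native extension can fail with RuntimeError or OSError (for
        # example when a shared library such as libcuda.so.1 is missing),
        # not only ImportError, so the broad catch is deliberate here.
        logger.warning("%s unavailable, backend disabled: %s", module_name, exc)
        return None, False


# Import once at module load and remember the outcome.
llama_cpp, LLAMA_CPP_AVAILABLE = try_import("llama_cpp")


def detect_available_backends():
    """Hypothetical return shape: read the cached flag, never re-import."""
    backends = ["cpu"]
    if LLAMA_CPP_AVAILABLE:
        backends.append("vulkan")
    return backends
```

Reading the cached flag matters because a second `import llama_cpp` after a failed first attempt would run the failing import again and raise the same error outside the guarded block.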
c077b7da
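
One possible way to gate an optional tuning parameter such as n_seq_max on library support is to pass only the keyword arguments that the loader's signature names explicitly. The commit message does not say how the gating is implemented, so this is a sketch under that assumption. `supported_kwargs` and `old_loader` are hypothetical names, and `old_loader` is a stand-in rather than the llama-cpp-python API.

```python
import inspect


def supported_kwargs(func, candidates):
    """Keep only the non-None candidates that func names as explicit parameters."""
    params = inspect.signature(func).parameters
    return {
        name: value
        for name, value in candidates.items()
        if value is not None and name in params
    }


# Hypothetical stand-in for an older loader that does not know n_seq_max.
def old_loader(model_path, n_batch=512, n_ubatch=512):
    return {"model_path": model_path, "n_batch": n_batch, "n_ubatch": n_ubatch}


# Values as they might arrive from the admin model config UI.
tuning = {"n_batch": 256, "n_ubatch": 128, "n_seq_max": 1}

kwargs = supported_kwargs(old_loader, tuning)
model = old_loader("model.gguf", **kwargs)
```

Checking for explicitly named parameters, rather than accepting anything a `**kwargs` catch-all would swallow, avoids reporting a setting as applied when the loader would silently ignore it.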