Commit c077b7da, authored by Stefy Lanza (nextime / spora)
    docker/backend: graceful llama-cpp load + additive GPU modes + libcuda mapping; admin GGUF batch/slots tuning
    
    Backend robustness:
    - vulkan.py catches Exception (not just ImportError) around the llama_cpp
      import: a CUDA-built llama-cpp missing libcuda.so.1 raised RuntimeError/OSError
      that crash-looped the whole server. Now it logs a warning and marks the
      Vulkan/GGUF backend unavailable; CUDA/CPU/ds4 keep working.
    - detect_available_backends() reads LLAMA_CPP_AVAILABLE instead of re-importing
      (which re-raised the same error).
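A minimal sketch of the guarded-import pattern described above, assuming the flag is computed once at module load. `try_import`, the logger name, and the body of `detect_available_backends()` are illustrative and not taken from vulkan.py; only `LLAMA_CPP_AVAILABLE` and the function name come from this commit message.

```python
import importlib
import logging

logger = logging.getLogger("backend.vulkan")  # hypothetical logger name


def try_import(module_name):
    """Return (module, True) on success, or (None, False) if the import fails for any reason."""
    try:
        return importlib.import_module(module_name), True
    except Exception as exc:
        # Deliberately broader than ImportError: a missing shared library
        # (e.g. libcuda.so.1) can surface as OSError or RuntimeError.
        logger.warning("%s unavailable, disabling Vulkan/GGUF backend: %s", module_name, exc)
        return None, False


# Evaluated once at module import time; everything else reads the flag.
llama_cpp, LLAMA_CPP_AVAILABLE = try_import("llama_cpp")


def detect_available_backends():
    """Hypothetical shape: consult the cached flag instead of importing again."""
    backends = ["cpu"]  # placeholder for the backends that do not need llama_cpp
    if LLAMA_CPP_AVAILABLE:
        backends.append("vulkan")
    return backends
```

Reading the cached flag means a failing import is attempted, and logged, exactly once, so a second call site cannot re-raise the same error.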
    
    Docker launcher (run_oci.sh):
    - GPU backends are now additive: --nvidia --vulkan enables both (libcuda comes in
      via --gpus all, and /dev/dri is mapped as well). Added --all and
      --with-libcuda[=PATH].
    - --vulkan auto bind-mounts the host's libcuda.so.1 (the bundled llama-cpp is a
      CUDA build), so Vulkan GGUF loads without full --gpus all. Banner shows mode set
      and libcuda status.
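The additive flag handling can be illustrated with a short Python sketch. run_oci.sh itself is a shell script, so this is only a model of the logic, not its code. The default libcuda path, the mount target inside the container, and the reading of --all as "every GPU mode" are assumptions.

```python
import argparse

# Assumed default host location; the real script may probe several paths.
DEFAULT_LIBCUDA = "/usr/lib/x86_64-linux-gnu/libcuda.so.1"


def gpu_docker_args(argv):
    """Translate launcher flags into extra `docker run` arguments (illustrative only)."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--nvidia", action="store_true")
    parser.add_argument("--vulkan", action="store_true")
    parser.add_argument("--all", action="store_true")
    # --with-libcuda alone uses the default path; --with-libcuda=PATH overrides it.
    parser.add_argument("--with-libcuda", nargs="?", const=DEFAULT_LIBCUDA, default=None)
    opts = parser.parse_args(argv)

    nvidia = opts.nvidia or opts.all  # assumption: --all turns on every GPU mode
    vulkan = opts.vulkan or opts.all

    args = []
    if nvidia:
        args += ["--gpus", "all"]
    if vulkan:
        args += ["--device", "/dev/dri"]

    libcuda = opts.with_libcuda
    if libcuda is None and vulkan and not nvidia:
        # Vulkan-only mode: bind-mount the host library so the CUDA-built
        # llama-cpp can load without --gpus all.
        libcuda = DEFAULT_LIBCUDA
    if libcuda is not None:
        args += ["-v", f"{libcuda}:{libcuda}:ro"]
    return args
```

Under these assumptions, `gpu_docker_args(["--nvidia", "--vulkan"])` yields both `--gpus all` and `--device /dev/dri`, while `gpu_docker_args(["--vulkan"])` yields the device mapping plus a read-only bind mount of libcuda.so.1.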
    
    Dist bundle:
    - New uninstall.sh (removes runner + optional image), wired into make_dist_bundle.
    - install.sh + uninstall.sh print what they'll do and confirm before proceeding,
      bypassable with --yes/-y.
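The print-then-confirm behaviour could look roughly like the following. install.sh and uninstall.sh are shell scripts, so this Python version only models the flow. The `confirm` helper, the prompt text, and the default-to-no answer are assumptions.

```python
def confirm(plan, argv, ask=input):
    """Print the planned actions, then return True if --yes/-y was given or the user agrees."""
    print("This will:")
    for step in plan:
        print(f"  - {step}")
    if "--yes" in argv or "-y" in argv:
        return True  # non-interactive bypass
    answer = ask("Proceed? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

Passing `ask` as a parameter keeps the helper testable without a terminal, for example `confirm(["remove runner"], [], ask=lambda _: "n")`.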
    
    Admin GGUF tuning:
    - Expose n_batch / n_ubatch / n_seq_max (llama.cpp -b/-ub/-np) in the model config
      UI and apply them in the Vulkan backend so VRAM use can be reduced when a model
      sits at the VRAM ceiling; n_seq_max is applied only when the installed
      llama-cpp-python supports it.
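One way the tuning values might be applied is sketched below. The commit does not say how support is detected. Checking the constructor signature for explicitly named parameters is an assumption, and `gguf_tuning_kwargs`, `old_ctor`, and `new_ctor` are hypothetical stand-ins rather than llama-cpp-python APIs.

```python
import inspect


def gguf_tuning_kwargs(model_cfg, llama_ctor):
    """Pick n_batch / n_ubatch / n_seq_max from the config, keeping only names the constructor declares."""
    params = inspect.signature(llama_ctor).parameters
    kwargs = {}
    for key in ("n_batch", "n_ubatch", "n_seq_max"):
        value = model_cfg.get(key)
        if value is None:
            continue  # unset in the UI: leave the library default
        if key in params:  # only explicitly named parameters count as supported
            kwargs[key] = int(value)
    return kwargs


# Stand-ins for an older and a newer constructor (not real library signatures).
def old_ctor(model_path, n_batch=512, n_ubatch=512):
    pass


def new_ctor(model_path, n_batch=512, n_ubatch=512, n_seq_max=1):
    pass


cfg = {"n_batch": "256", "n_ubatch": 128, "n_seq_max": 2}
old_kwargs = gguf_tuning_kwargs(cfg, old_ctor)  # n_seq_max is dropped
new_kwargs = gguf_tuning_kwargs(cfg, new_ctor)  # all three are kept
```

The check ignores a catch-all `**kwargs` on purpose: a constructor that silently swallows unknown keywords would otherwise look as if it supported every option.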
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>