- 15 Mar, 2026 40 commits
-
Your Name authored
  - Changed torch_dtype from float16 (when CUDA available) to float32
  - This prevents NaN/Infinity values in image output that cause black/corrupted images
  - FP16 can cause numerical overflow on some models like SDXL
-
Your Name authored
  - Add 'steps' parameter to ImageGenerationRequest (overrides quality-based default)
  - Add 'guidance_scale' parameter to ImageGenerationRequest (overrides CLI --image-cfg-scale)
  - Use request values in diffusers pipeline call
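The override order described above can be sketched as follows. This is only an illustration: the `QUALITY_STEPS` presets, the `quality` field, and the `resolve_sampling` helper are hypothetical and do not come from the project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical quality presets; the project's real defaults are not shown in the log.
QUALITY_STEPS = {"standard": 20, "hd": 40}


@dataclass
class ImageGenerationRequest:
    prompt: str
    quality: str = "standard"
    steps: Optional[int] = None              # overrides the quality-based default
    guidance_scale: Optional[float] = None   # overrides the CLI --image-cfg-scale value


def resolve_sampling(req, cli_cfg_scale):
    """Return (steps, guidance_scale), preferring values sent in the request."""
    steps = req.steps if req.steps is not None else QUALITY_STEPS.get(req.quality, 20)
    cfg = req.guidance_scale if req.guidance_scale is not None else cli_cfg_scale
    return steps, cfg
```

Comparing against `None` rather than relying on truthiness means an explicit `guidance_scale` of `0.0` is still honored.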
-
Your Name authored
  - Import time module inside try block with alias to avoid UnboundLocalError
  - This prevents Python's exception handling from affecting variable scope
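A minimal reproduction of this class of bug, assuming the original code used `time` before a later function-level `import time`. Python treats any name bound anywhere in a function body, including by `import`, as local for the whole function. The function names here are illustrative only.

```python
import time


def broken():
    # 'time' is local to this function because of the import further down,
    # so this line raises UnboundLocalError before the import has run.
    start = time.time()
    try:
        import time
    except ImportError:
        pass
    return start


def fixed():
    # The alias binds '_time' locally; 'time' still refers to the module global.
    start = time.time()
    try:
        import time as _time
        _time.sleep(0)
    except ImportError:
        pass
    return start
```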
-
Your Name authored
  - Remove duplicate 'image' entry in list_models()
  - Remove vision: alias (user doesn't want it)
  - Skip URLs in loaded models listing (they're download sources)
  - Add full traceback to diffusers error for debugging
-
Your Name authored
  - Recursively scan huggingface cache directory (hub/, xet/, etc.)
  - Also fix remove-model to search recursively in huggingface cache
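A recursive scan of this kind might look like the sketch below. The function name `find_cached_models` and the suffix list are assumptions made for illustration; the log does not say which file types the project matches.

```python
from pathlib import Path


def find_cached_models(cache_root, suffixes=(".gguf", ".safetensors")):
    """Walk cache_root recursively and return model-like files, sorted by path."""
    root = Path(cache_root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

`rglob("*")` descends into every subdirectory, so files under `hub/` and `xet/` are found without listing those names explicitly.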
-
Your Name authored
  - Add get_all_cache_dirs() to find GGUF, HuggingFace, and Diffusers caches
  - Update --list-cached-models to show all cache locations
  - Update --remove-all-models to clean all cache directories
  - Update --remove-model to search across all caches
  - Add better error handling for diffusers image extraction
-
Your Name authored
  - When --file-path is set and --url is 'auto', use the Host header from the request (what the client used to connect) instead of the client's IP address
  - This ensures the returned URL points to the correct server
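The idea can be sketched with a plain dictionary of request headers. The helper name `build_file_url`, the `scheme` and `fallback_host` parameters, and the URL layout are hypothetical; the real server's routes are not shown in the log.

```python
def build_file_url(headers, filename, scheme="http", fallback_host="localhost"):
    """Build a download URL from the Host header the client used to connect."""
    # The Host header carries the server address as the client sees it,
    # whereas the peer IP address identifies the client, not the server.
    host = headers.get("Host") or headers.get("host") or fallback_host
    return f"{scheme}://{host}/{filename.lstrip('/')}"
```

For example, `build_file_url({"Host": "example.local:8080"}, "images/a.png")` yields a URL on `example.local:8080`, which is an address the client has already shown it can reach.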
-
Your Name authored
  - Remove debug print for empty filtered chunks
  - Fix strip_tool_calls_from_content to preserve whitespace-only chunks ('\n\n', ' ')
  - These whitespace characters are essential for proper message composition
-
Your Name authored
  - Changed context arguments to use action='append' allowing multiple values
  - Added get_ctx_by_index() helper function for index-based context retrieval
  - Updated text, audio, and image model loading to use indexed context values
  - Users can now specify different context sizes per model
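A sketch of how repeated flags and index-based lookup could fit together. The flag name `--ctx`, the default of 4096, and the fallback to the last supplied value are assumptions; only `action='append'` and the helper name `get_ctx_by_index` come from the commit message.

```python
import argparse


def get_ctx_by_index(values, index, default=4096):
    """Return the context size for the model at `index`.

    Assumed fallback policy: reuse the last supplied value when fewer values
    than models were given, and use `default` when none were given.
    """
    if not values:
        return default
    if index < len(values):
        return values[index]
    return values[-1]


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of the flag into a list.
parser.add_argument("--ctx", type=int, action="append")
args = parser.parse_args(["--ctx", "8192", "--ctx", "2048"])
```

With these arguments `args.ctx` is `[8192, 2048]`, so the first model gets 8192 and the second 2048.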
-
Your Name authored
  - Cleanup image models before loading text models to prevent OOM errors
  - Applied to both text model loading paths in get_model_for_request
-
Your Name authored
  - Cleanup any existing models (text, audio, etc.) from VRAM before loading image models to prevent out of memory errors when switching between model types
  - Applied to both diffusers and stable-diffusion-cpp loading paths
-
Your Name authored
  - Add fallback to use configured --image-model when unknown model name is sent
  - Cache dynamically loaded StableDiffusion models for reuse across requests
  - Always check cache (not just in loadall mode) so ondemand mode reuses models
-
Your Name authored
  - Detect GGUF models and skip diffusers, use stable-diffusion-cpp instead
  - Add HuggingFace model ID resolution for GGUF files
  - Add support for VAE, LLM, T5XXL paths from CLI args
  - Add clip_on_cpu support for VRAM savings
  - Use all available CPU cores instead of hardcoded 4 threads
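One way to detect a GGUF file and pick a thread count is sketched below. GGUF files begin with the four ASCII bytes `GGUF`; whether the project checks the magic bytes or only the file extension is not stated, and both helper names are hypothetical.

```python
import os

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def is_gguf_file(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    try:
        with open(path, "rb") as fh:
            return fh.read(4) == GGUF_MAGIC
    except OSError:
        return False


def default_thread_count():
    """Use every available core; os.cpu_count() may return None."""
    return os.cpu_count() or 1
```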
-
Your Name authored
When loading image models dynamically (in ondemand mode), the code now:
  1. Checks if model URL is cached
  2. If not cached, downloads the model before loading
This fixes the 'Could not resolve sd.cpp model path' error when using image models without --loadall or --loadswap flags.
-
Your Name authored
  - Set llama-cpp-python verbose flag to match debug mode
  - Remove n_gpu_layers from stable_diffusion_cpp (not supported)
-
Your Name authored
Fix: verbose=True when debug flag set, fix stable_diffusion_cpp n_gpu_layers bug, remove capability pre-check for image generation
-
Your Name authored
The user will experiment with Vulkan environment variables from the launch script. Keep only the CUDA_VISIBLE_DEVICES setting for now.
-
Your Name authored
For stable-diffusion-cpp-python:
  - GGML_VK_VISIBLE_DEVICES=
  - GGML_VULKAN_DEVICE=
For llama-cpp-python (additional):
  - VK_ICD_FILENAMES=/dev/null
  - VK_DRIVER_FILES=/dev/null
  - VK_LOADER_DRIVERS_DISABLE=*
  - VK_LOADER_LAYERS_DISABLE=~all~
All variables are restored on cleanup for subsequent Vulkan models.
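The set-then-restore pattern described here can be sketched as a context manager. The name `temporary_env` is hypothetical, and whether the project uses a context manager or explicit save and restore calls is not stated.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(overrides):
    """Apply environment overrides and restore the previous values on exit."""
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)   # variable did not exist before
            else:
                os.environ[key] = old       # put the original value back
```

A caller might wrap a CUDA model load in `with temporary_env({"VK_ICD_FILENAMES": "/dev/null"}):`. Variables that a library reads only at import time have to be set before that import for the override to take effect.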
-
Your Name authored
  - Use VK_ICD_FILENAMES=/dev/null to disable Vulkan ICD and force CUDA
  - This is the correct variable for llama.cpp to disable Vulkan
  - Restore VK_ICD_FILENAMES on cleanup for subsequent Vulkan models
-
Your Name authored
  - Set GGML_DISABLE_VULKAN=1 and GGML_VULKAN_DEVICE='' before loading model
  - These must be set before llama_cpp import since it reads them at init
  - Restore Vulkan settings on cleanup so subsequent Vulkan models work
  - Addresses issue where GGUF models ran on CPU instead of CUDA with --backend nvidia
-
Your Name authored
-
Your Name authored
-
Your Name authored
  - Make cleanup() method handle StableDiffusion and other non-ModelManager objects
  - Add try-except in on-demand swap to handle cleanup failures gracefully
  - Check if cleanup method exists before calling
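A defensive cleanup along these lines could look like the following sketch. The helper name `safe_cleanup` and its boolean return value are assumptions made for illustration.

```python
import gc


def safe_cleanup(model):
    """Call model.cleanup() if it exists; never let a failure propagate.

    Returns True only when a cleanup method existed and ran without error.
    """
    cleanup = getattr(model, "cleanup", None)
    if callable(cleanup):
        try:
            cleanup()
            return True
        except Exception as exc:
            print(f"cleanup failed: {exc}")
    # Objects without a cleanup method (or with a failing one) are left
    # to the garbage collector once the caller drops its reference.
    gc.collect()
    return False
```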
-
Your Name authored
  - Add model_backend_types dict to track backend for each model
  - Update set_default_model to accept backend_type parameter
  - Modify get_model_for_request to swap models on-demand when in ondemand mode
  - Unload current model from VRAM and load new model when request arrives for different model
  - Respect --backend flag when loading models on-demand
  - Only activates when no --loadall or --loadswap flag is specified
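The on-demand swap can be pictured as a holder that keeps at most one model resident. This is a simplified sketch: the class `OnDemandModels` and its loader callables are hypothetical and stand in for the project's real loading code.

```python
class OnDemandModels:
    """Keep one model loaded; swap when a request names a different one."""

    def __init__(self, loaders):
        self.loaders = loaders        # model name -> zero-argument loader callable
        self.current_name = None
        self.current_model = None

    def get(self, name):
        if name == self.current_name:
            return self.current_model          # already resident, reuse it
        if self.current_model is not None:
            cleanup = getattr(self.current_model, "cleanup", None)
            if callable(cleanup):
                cleanup()                      # release the old model first
        self.current_model = self.loaders[name]()
        self.current_name = name
        return self.current_model
```

Unloading before loading keeps peak memory at roughly one model, at the cost of a reload whenever consecutive requests alternate between models.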
-
Your Name authored
-
Your Name authored
  - Add VK_ICD_FILENAMES=/dev/null to disable Vulkan during startup preload
  - Previously only set at request time, causing crash during --loadall
  - Added check in both startup preload locations (lines ~5254 and ~5811)
  - Checks --backend and --image-backend to determine CUDA usage
-
Your Name authored
When --backend nvidia is used, set VK_ICD_FILENAMES=/dev/null to completely disable Vulkan and force CUDA-only mode for sd.cpp
-
Your Name authored
-
Your Name authored
-
Your Name authored
Print model capabilities (text, image-to-text, image, etc.) after successful model loading in both NvidiaBackend and VulkanBackend
-
Your Name authored
-
Your Name authored
  - Add ModelCapabilities dataclass to represent model capabilities
  - Add detect_model_capabilities() function to detect:
    - text_generation (LLM)
    - vision (image understanding)
    - image_generation (Stable Diffusion)
    - speech_to_text (whisper)
    - text_to_speech (TTS)
  - Use capability detection for better error messages in image generation endpoint
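A dataclass with the five capability fields listed above might be shaped like this. The field names come from the commit message. The name-based heuristics inside `detect_model_capabilities` are purely illustrative assumptions, since the log does not describe how detection works.

```python
from dataclasses import dataclass


@dataclass
class ModelCapabilities:
    text_generation: bool = False
    vision: bool = False
    image_generation: bool = False
    speech_to_text: bool = False
    text_to_speech: bool = False


def detect_model_capabilities(model_name):
    """Guess capabilities from the model name (illustrative heuristics only)."""
    name = model_name.lower()
    caps = ModelCapabilities()
    if "whisper" in name:
        caps.speech_to_text = True
    elif "tts" in name:
        caps.text_to_speech = True
    elif "stable-diffusion" in name or "sdxl" in name:
        caps.image_generation = True
    else:
        caps.text_generation = True   # assume a plain LLM when nothing else matches
    return caps
```

An image endpoint could then check `caps.image_generation` and return a specific error instead of failing inside the pipeline.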
-
Your Name authored
  - Show 'cuda (via llama-cpp-python)' when force_cuda is enabled
  - Show original backend in GGUF detection message
-
Your Name authored
GGUF models are for text/LLM and cannot do image generation.
-
Your Name authored
-
Your Name authored
-
Your Name authored
diffusers is required for Stable Diffusion image generation
-
Your Name authored
The package is named 'stable_diffusion_cpp_python', not 'stable_diffusion_cpp'
-
Your Name authored
If --image-model is not specified, try to use the main --model as the image model fallback when requesting 'default' model.
-
Your Name authored
If loading a cached GGUF model fails with corruption indicators (invalid, corrupt, magic, header), delete the corrupted cache and re-download the model automatically.
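The recovery flow can be sketched as below. The four indicator strings come from the commit message. The helper names and the `load` and `download` callables are hypothetical stand-ins for the project's own functions.

```python
from pathlib import Path

CORRUPTION_HINTS = ("invalid", "corrupt", "magic", "header")


def looks_corrupted(error):
    """Return True if the error message contains a corruption indicator."""
    message = str(error).lower()
    return any(hint in message for hint in CORRUPTION_HINTS)


def load_with_redownload(path, load, download):
    """Try to load; on a corruption-like error, delete, re-download, retry once."""
    try:
        return load(path)
    except Exception as exc:
        if not looks_corrupted(exc):
            raise                          # unrelated failure: do not touch the cache
        Path(path).unlink(missing_ok=True)
        download(path)
        return load(path)                  # a second failure propagates to the caller
```

Retrying only once avoids a download loop when the remote file itself is bad.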
-