Commits · 6e794ae6dd6788d6702a7b1aa46e97c7608a02b8 · nexlab / coderai

16 Mar, 2026 3 commits

Add support for specifying chat template in --hf-chat-template · 6e794ae6

Your Name authored Mar 16, 2026

- A template can now be specified directly: --hf-chat-template "model:template"
- Updated check_hf_chat_template to return tuple (should_use, template_name)
- Updated _load_huggingface_tokenizer to accept template_name parameter
- Updated README with new syntax and template examples
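
The tuple return described above can be sketched as follows. This is a minimal illustration, not the project's code: the accepted spec forms ("auto", "model", "model:template") and the rule that a model-specific entry wins over "auto" are assumptions based on the commit messages.

```python
from typing import List, Optional, Tuple


def check_hf_chat_template(
    model_name: str, specs: Optional[List[str]]
) -> Tuple[bool, Optional[str]]:
    """Return (should_use, template_name) for model_name.

    Assumed spec forms (illustrative only):
      "auto"            -> use the HF template for every model
      "model"           -> use that model's own HF template
      "model:template"  -> use the named template for that model
    """
    use_auto = False
    for spec in specs or []:
        if spec == "auto":
            use_auto = True
            continue
        name, _, template = spec.partition(":")
        if name == model_name:
            # An empty template part means "use the model's built-in template".
            return True, (template or None)
    return use_auto, None
```

Returning None as the template name leaves the choice to the tokenizer, so callers only need to special-case an explicit name.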

Add auto-detect support for --hf-chat-template · e17bc553

Your Name authored Mar 16, 2026

- Added 'auto' as a valid value for --hf-chat-template
- With --hf-chat-template auto, the HF template is auto-detected and applied to all models
- Updated README with new syntax

Add debug output showing raw vs escaped content · 079bd8dc

Your Name authored Mar 16, 2026

15 Mar, 2026 37 commits

Make --hf-chat-template repeatable per model · 31b6480e

Your Name authored Mar 15, 2026

- Changed --hf-chat-template from boolean to action=append
- Added check_hf_chat_template() function for model-specific checking
- Updated _finalize_chat_template_detection to use new function
- Updated README with new syntax
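
Switching a boolean flag to action=append lets argparse collect one value per occurrence. A small sketch of that behaviour; the build_parser helper and the model names are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Each occurrence of the flag appends one value to a list.
    parser.add_argument(
        "--hf-chat-template", action="append", default=None, metavar="MODEL"
    )
    return parser


args = build_parser().parse_args(
    ["--hf-chat-template", "model-a", "--hf-chat-template", "model-b"]
)
print(args.hf_chat_template)  # ['model-a', 'model-b']
```

When the flag is never given, the attribute stays None, which makes "not requested" easy to tell apart from an empty list.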

Add --hf-chat-template option for HuggingFace apply_chat_template · 3121fb85

Your Name authored Mar 15, 2026

- Added --hf-chat-template CLI flag to use transformers apply_chat_template
- Added _load_huggingface_tokenizer() to load HF tokenizer for GGUF models
- Added _format_messages_hf() method for HF chat template formatting
- Updated generate_chat and generate_chat_stream to use HF tokenizer when available
- Updated format_messages to check for HF tokenizer first
- Added documentation in README.md

Add --reply-filters option for optional content filtering · 533f8fd5

Your Name authored Mar 15, 2026

- Added --reply-filters CLI flag to make content filtering optional
- Supports comma-separated values: --reply-filters malformed,tool_calls
- Supports model-specific filters: --reply-filters text:malformed --reply-filters image:tool_calls
- Supports specific model names: --reply-filters text:llama-3.1:malformed
- Added check_reply_filter() and check_single_filter() helper functions
- Updated stream_chat_response and generate_chat_response to use new filtering
- Updated ToolCallParser._filter_malformed_content for conditional filtering
- Added documentation in README.md
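
The three spec forms listed above could be matched roughly like this. The function names come from the commit message, but the signatures, the parsing rules, and the "no flag means no filtering" default are assumptions made for the sketch.

```python
from typing import List, Optional


def check_single_filter(
    spec: str, filter_name: str, model_type: str, model_name: str
) -> bool:
    """Return True if one --reply-filters value enables filter_name here."""
    parts = spec.split(":")
    filters = parts[-1].split(",")  # last segment: comma-separated filter names
    if filter_name not in filters:
        return False
    scope = parts[:-1]              # [], [type] or [type, model]
    if not scope:
        return True                 # unscoped spec applies to every model
    if scope[0] != model_type:
        return False
    return len(scope) == 1 or scope[1] == model_name


def check_reply_filter(
    specs: Optional[List[str]], filter_name: str, model_type: str, model_name: str
) -> bool:
    """Return True if any spec enables the filter (assumed default: disabled)."""
    return any(
        check_single_filter(spec, filter_name, model_type, model_name)
        for spec in specs or []
    )
```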

Increase VRAM cleanup delay to 3 seconds · f8d2481e

Your Name authored Mar 15, 2026

- Give more time for Vulkan memory to be freed after unloading image models

Fix: Force VRAM cleanup when switching from image to text model · a4a8c340

Your Name authored Mar 15, 2026

- Add garbage collection and torch.cuda.empty_cache() after unloading image models
- Add a small delay to allow VRAM to be freed before loading new model
- This should help prevent OOM errors when switching between image and text models
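
A best-effort version of this cleanup might look like the sketch below. The function name and default delay are hypothetical; gc.collect() and torch.cuda.empty_cache() are the calls named in the commit, and torch is imported lazily so the helper also runs on machines without it.

```python
import gc
import time


def free_vram(delay_seconds: float = 1.0) -> bool:
    """Release cached GPU memory; return True if the CUDA cache was emptied.

    The caller is assumed to have dropped every reference to the old model
    first, otherwise nothing can be reclaimed.
    """
    gc.collect()
    emptied = False
    try:
        import torch  # optional dependency in this sketch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            emptied = True
    except ImportError:
        pass
    # Give the driver a moment to actually return the memory.
    time.sleep(delay_seconds)
    return emptied
```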

Fix: Don't strip whitespace from model output · 13b56ea0

Your Name authored Mar 15, 2026

- Remove stripping in strip_tool_calls_from_content function
- Whitespace (spaces, newlines) is valid content and should be preserved
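
To illustrate the idea, here is a sketch that removes tool-call markup without trimming anything else. The <tool_call> tag format is an assumption; the project's real markup may differ.

```python
import re

# Hypothetical tool-call markup, used only for this example.
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def strip_tool_calls_from_content(content: str) -> str:
    """Remove tool-call blocks but leave every other character untouched."""
    return TOOL_CALL_RE.sub("", content)  # deliberately no .strip()
```

A streamed chunk that is only "\n\n" passes through unchanged, so paragraph breaks survive when the chunks are joined on the client.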

Fix: Use cached file path when reloading URL-based models · 7e7930f5

Your Name authored Mar 15, 2026

- When reloading a default model that was loaded from a URL,
  check for cached file path and use it instead of the URL

Fix: Handle NaN values in diffusers image output · dd924c1a

Your Name authored Mar 15, 2026

- Replace NaN and Inf values with valid values before saving
- Clip image values to valid range [0, 1] to prevent black images
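
With NumPy, the two steps above can be written in a couple of lines. This is a sketch assuming float pixel data in [0, 1]; the function name and the replacement values chosen for NaN and Inf are illustrative.

```python
import numpy as np


def sanitize_image(pixels: np.ndarray) -> np.ndarray:
    """Replace NaN/Inf with finite values, then clip to the range [0, 1]."""
    cleaned = np.nan_to_num(pixels, nan=0.0, posinf=1.0, neginf=0.0)
    return np.clip(cleaned, 0.0, 1.0)
```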

Fix: Reload default text model when switching from image to text · c517c947

Your Name authored Mar 15, 2026

- When 'default' model is requested but not loaded (was unloaded for image model),
  the code now tries to reload the default model
- Clean up image models first to free VRAM, then reload the text model

Add --image-cpu-offload option and fix sequential offload logic · c8f7c8d9

Your Name authored Mar 15, 2026

- Add --image-cpu-offload CLI flag for explicit sequential CPU offload
- Enable sequential CPU offload only on 3rd OOM retry or when --image-cpu-offload is set

Fix: Filter URLs from default model listing · ac005426

Your Name authored Mar 15, 2026

- Skip URLs when listing the default model in list_models()
- This prevents download URLs from appearing in available models list

Add OOM handling and sequential offload for diffusers · 096b75d2

Your Name authored Mar 15, 2026

- Enable sequential CPU offload if --offload-strategy or --offload-dir is specified
- Add retry logic: on OOM, retry with attention_slicing, then with sequential_offload
- Clear CUDA cache between retry attempts
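
The retry ladder can be sketched independently of diffusers. Everything here is hypothetical scaffolding: the helper name, the callback parameters, and detecting OOM by matching the error text are assumptions, not the project's implementation.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_oom_fallbacks(
    generate: Callable[[], T],
    fallbacks: Sequence[Callable[[], None]],
    clear_cache: Callable[[], None] = lambda: None,
    is_oom: Callable[[Exception], bool] = (
        lambda exc: "out of memory" in str(exc).lower()
    ),
) -> T:
    """Call generate(); on OOM apply the next fallback, clear the cache, retry."""
    remaining = list(fallbacks)
    while True:
        try:
            return generate()
        except Exception as exc:
            if not is_oom(exc) or not remaining:
                raise                # not an OOM, or nothing left to try
            remaining.pop(0)()       # apply the next memory-saving step
            clear_cache()            # e.g. torch.cuda.empty_cache()
```

With a diffusers pipeline, the fallbacks would typically be pipe.enable_attention_slicing followed by pipe.enable_sequential_cpu_offload, matching the order described in the commit.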

Add --image-precision option and VAE tiling support for diffusers · 782612ea

Your Name authored Mar 15, 2026

- Add --image-precision with choices: bf16, f32, f16, f8
- bf16 recommended for modern GPUs (RTX 30/40 series) to avoid NaN issues
- Enable VAE tiling for diffusers when --vae-tiling is specified

Fix diffusers NaN warning by using FP32 instead of FP16 · df8b4875

Your Name authored Mar 15, 2026

- Changed torch_dtype from float16 (when CUDA available) to float32
- This prevents NaN/Infinity values in image output that cause black/corrupted images
- FP16 can cause numerical overflow on some models like SDXL

Add steps and guidance_scale to image generation request · 55a39eeb

Your Name authored Mar 15, 2026

- Add 'steps' parameter to ImageGenerationRequest (overrides quality-based default)
- Add 'guidance_scale' parameter to ImageGenerationRequest (overrides CLI --image-cfg-scale)
- Use request values in diffusers pipeline call

Fix diffusers time variable scoping issue · 9a749ea4

Your Name authored Mar 15, 2026

- Import time module inside try block with alias to avoid UnboundLocalError
- This prevents Python's exception handling from affecting variable scope
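
One common way to hit this error is sketched below; whether it matches the exact situation in the project is an assumption. An import statement anywhere in a function body makes that name local to the whole function, so a code path that skips the import fails, and importing under an alias avoids the clash.

```python
import time


def broken(use_local_import: bool) -> float:
    if use_local_import:
        import time  # makes "time" a local name for the WHOLE function
    return time.time()  # UnboundLocalError if the import above did not run


def fixed(use_local_import: bool) -> float:
    if use_local_import:
        import time as _time  # alias leaves the module-level "time" visible
        return _time.time()
    return time.time()
```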

Fix model listing: remove duplicate 'image', remove vision: alias, filter URLs · 9f01de41

Your Name authored Mar 15, 2026

- Remove duplicate 'image' entry in list_models()
- Remove vision: alias (user doesn't want it)
- Skip URLs in loaded models listing (they're download sources)
- Add full traceback to diffusers error for debugging

Fix cache listing to include HuggingFace subdirectories · 57a7951b

Your Name authored Mar 15, 2026

- Recursively scan huggingface cache directory (hub/, xet/, etc.)
- Also fix remove-model to search recursively in huggingface cache
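
A recursive scan of this kind is short with pathlib. The helper name and the list of file suffixes below are assumptions for illustration; the project may identify model files differently.

```python
from pathlib import Path
from typing import List, Tuple


def list_cached_files(
    cache_dir: Path, suffixes: Tuple[str, ...] = (".gguf", ".safetensors")
) -> List[Path]:
    """Recursively collect model files under cache_dir (hub/, xet/, ...)."""
    if not cache_dir.is_dir():
        return []
    return sorted(
        p for p in cache_dir.rglob("*") if p.is_file() and p.suffix in suffixes
    )
```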

Add multi-cache support for cached model commands · f606b9a7

Your Name authored Mar 15, 2026

- Add get_all_cache_dirs() to find GGUF, HuggingFace, and Diffusers caches
- Update --list-cached-models to show all cache locations
- Update --remove-all-models to clean all cache directories
- Update --remove-model to search across all caches
- Add better error handling for diffusers image extraction

Fix auto URL to use server host from request headers · ab253a98

Your Name authored Mar 15, 2026

- When --file-path is set and --url is 'auto', use the Host header
  from the request (what the client used to connect) instead of
  the client's IP address
- This ensures the returned URL points to the correct server
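
A sketch of building the URL from the Host header follows. The function name, the URL layout, and the fallback host are hypothetical; only the use of the Host header is taken from the commit.

```python
from typing import Mapping


def build_file_url(
    headers: Mapping[str, str],
    filename: str,
    scheme: str = "http",
    fallback_host: str = "localhost",
) -> str:
    """Build a download URL from the Host header the client connected with."""
    # Header names are case-insensitive, so compare in lowercase.
    host = next(
        (value for key, value in headers.items() if key.lower() == "host"),
        fallback_host,
    )
    return f"{scheme}://{host}/{filename.lstrip('/')}"
```

For example, a request sent to gpu-box.local:8000 yields http://gpu-box.local:8000/images/out.png for the file images/out.png, regardless of the client's own address.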

Fix whitespace filtering in stream response · 12d19211

Your Name authored Mar 15, 2026

- Remove debug print for empty filtered chunks
- Fix strip_tool_calls_from_content to preserve whitespace-only chunks ('\n\n', ' ')
- These whitespace characters are essential for proper message composition

Add support for multiple context values (--n-ctx, --audio-ctx, --image-ctx) · fe8b5ea4

Your Name authored Mar 15, 2026

- Changed context arguments to use action='append' allowing multiple values
- Added get_ctx_by_index() helper function for index-based context retrieval
- Updated text, audio, and image model loading to use indexed context values
- Users can now specify different context sizes per model
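
An index-based lookup like get_ctx_by_index() might work as in this sketch. The signature, the default of 2048, and the rule of reusing the last value when fewer values than models are given are all assumptions.

```python
from typing import List, Optional


def get_ctx_by_index(
    values: Optional[List[int]], index: int, default: int = 2048
) -> int:
    """Return the context size for the index-th model.

    Assumed fallback rules: no values -> default; fewer values than
    models -> reuse the last value given.
    """
    if not values:
        return default
    return values[index] if index < len(values) else values[-1]
```

Under these assumptions, --n-ctx 4096 --n-ctx 8192 would give the first text model 4096 tokens and the second 8192.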

Add VRAM cleanup when loading text models to free memory from image models · ccd7cce5

Your Name authored Mar 15, 2026

- Clean up image models before loading text models to prevent OOM errors
- Applied to both text model loading paths in get_model_for_request

Add VRAM cleanup before loading image models · 08496f1f

Your Name authored Mar 15, 2026

- Clean up any existing models (text, audio, etc.) from VRAM before loading
  image models to prevent out of memory errors when switching between model types
- Applied to both diffusers and stable-diffusion-cpp loading paths

Fix image generation: add model caching and fallback for unknown models · 7cbe5355

Your Name authored Mar 15, 2026

- Add fallback to use configured --image-model when unknown model name is sent
- Cache dynamically loaded StableDiffusion models for reuse across requests
- Always check cache (not just in loadall mode) so ondemand mode reuses models

Add GGUF model support and extended stable-diffusion-cpp options · f498093e

Your Name authored Mar 15, 2026

- Detect GGUF models and load them with stable-diffusion-cpp instead of diffusers
- Add HuggingFace model ID resolution for GGUF files
- Add support for VAE, LLM, T5XXL paths from CLI args
- Add clip_on_cpu support for VRAM savings
- Use all available CPU cores instead of hardcoded 4 threads

Fix: Download image model on-demand if not cached · 2dfbce90

Your Name authored Mar 15, 2026

When loading image models dynamically (in ondemand mode), the code now:
1. Checks if model URL is cached
2. If not cached, downloads the model before loading

This fixes the 'Could not resolve sd.cpp model path' error when using
image models without --loadall or --loadswap flags.

Add verbose=True when debug mode is enabled · 0a3fd1ff

Your Name authored Mar 15, 2026

- Set llama-cpp-python verbose flag to match debug mode
- Remove n_gpu_layers from stable_diffusion_cpp (not supported)

Fix: verbose=True when debug flag set, fix stable_diffusion_cpp n_gpu_layers... · b8b465ac

Your Name authored Mar 15, 2026

Fix: verbose=True when debug flag set, fix stable_diffusion_cpp n_gpu_layers bug, remove capability pre-check for image generation

Revert: Remove Vulkan disable env vars for llama-cpp-python · c15e6ec6

Your Name authored Mar 15, 2026

The user will experiment with Vulkan environment variables from the launch script.
Keep only CUDA_VISIBLE_DEVICES setting for now.

Fix CUDA backend - use comprehensive Vulkan disable variables · e538d802

Your Name authored Mar 15, 2026

For stable-diffusion-cpp-python:
- GGML_VK_VISIBLE_DEVICES=
- GGML_VULKAN_DEVICE=

For llama-cpp-python (additional):
- VK_ICD_FILENAMES=/dev/null
- VK_DRIVER_FILES=/dev/null
- VK_LOADER_DRIVERS_DISABLE=*
- VK_LOADER_LAYERS_DISABLE=~all~

All variables are restored on cleanup for subsequent Vulkan models.
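
Setting variables for one model load and restoring them afterwards fits a context manager. The sketch below uses variable names from the list above; the temporary_env helper itself is hypothetical.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def temporary_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Apply environment overrides, then restore the previous values."""
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # was unset before: unset again
            else:
                os.environ[name] = old


# Usage: the variables are only in effect inside the with-block.
with temporary_env({"GGML_VK_VISIBLE_DEVICES": "", "VK_ICD_FILENAMES": "/dev/null"}):
    pass  # load the CUDA-backed model here
```

The overrides only help if they are in place before the native library reads them, so the model load (and possibly the import) has to happen inside the block.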

Fix CUDA backend for GGUF models - use VK_ICD_FILENAMES to disable Vulkan · 11bada84

Your Name authored Mar 15, 2026

- Use VK_ICD_FILENAMES=/dev/null to disable Vulkan ICD and force CUDA
- This is the correct variable for llama.cpp to disable Vulkan
- Restore VK_ICD_FILENAMES on cleanup for subsequent Vulkan models

Fix CUDA backend for GGUF models - force CUDA via environment variables · f77d34da

Your Name authored Mar 15, 2026

- Set GGML_DISABLE_VULKAN=1 and GGML_VULKAN_DEVICE='' before loading model
- These must be set before llama_cpp import since it reads them at init
- Restore Vulkan settings on cleanup so subsequent Vulkan models work
- Addresses issue where GGUF models ran on CPU instead of CUDA with --backend nvidia

Fix image model preloading - only preload when --loadall or --loadswap is set · c4af709f

Your Name authored Mar 15, 2026

Fix http_request bug in image generation and add http_request parameter to save_image_response · d24c7e18

Your Name authored Mar 15, 2026

Fix cleanup handling for different model types · 16ce04c5

Your Name authored Mar 15, 2026

- Make cleanup() method handle StableDiffusion and other non-ModelManager objects
- Add try-except in on-demand swap to handle cleanup failures gracefully
- Check if cleanup method exists before calling
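
The defensive pattern in these three points can be sketched as one helper. The name safe_cleanup and the boolean return value are illustrative, not taken from the project.

```python
def safe_cleanup(model: object) -> bool:
    """Call model.cleanup() if it exists; never let a failure propagate."""
    cleanup = getattr(model, "cleanup", None)
    if not callable(cleanup):
        return False  # e.g. an object that has no cleanup method at all
    try:
        cleanup()
        return True
    except Exception as exc:
        # Log and continue so a failed cleanup does not block the model swap.
        print(f"cleanup failed for {type(model).__name__}: {exc}")
        return False
```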

Implement on-demand model swapping for multiple models · 362b8452

Your Name authored Mar 15, 2026

- Add model_backend_types dict to track backend for each model
- Update set_default_model to accept backend_type parameter
- Modify get_model_for_request to swap models on-demand when in ondemand mode
- Unload current model from VRAM and load new model when request arrives for different model
- Respect --backend flag when loading models on-demand
- Only activates when no --loadall or --loadswap flag is specified
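
A stripped-down model of the swap logic is sketched here. model_backend_types and get_model_for_request are names from the commit; the class, the loader callback, and the "auto" default backend are assumptions made to keep the example self-contained.

```python
from typing import Callable, Dict, Optional


class OnDemandModels:
    """Keep at most one model loaded and swap when another is requested."""

    def __init__(
        self, loader: Callable[[str, str], object], default_backend: str = "auto"
    ) -> None:
        self.loader = loader
        self.default_backend = default_backend
        self.model_backend_types: Dict[str, str] = {}
        self.current_name: Optional[str] = None
        self.current_model: Optional[object] = None

    def get_model_for_request(self, name: str) -> object:
        if name != self.current_name:
            # Drop the old model first so its memory can be reclaimed.
            self.current_model = None
            self.current_name = None
            backend = self.model_backend_types.get(name, self.default_backend)
            self.current_model = self.loader(name, backend)
            self.current_name = name
        return self.current_model
```

Repeated requests for the same model reuse the loaded instance; only a request for a different name triggers an unload and a load.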
