- 20 Mar, 2026 9 commits
-
-
Your Name authored
- Add offload_strategy to kwargs in _load_default_model and _load_model_by_name
- Fix parameter name: ram -> manual_ram_gb to match backend expectation
- Also pass load_in_4bit, load_in_8bit, and max_gpu_percent
-
Your Name authored
- Add 'none' to --offload-strategy choices in cli.py
- In cuda.py backend:
  - _get_vram_percentages_for_strategy() returns None for 'none' strategy
  - _get_vram_percentages_for_gpu() skips VRAM detection for 'none'
  - load_model() loads directly on GPU without max_memory constraints
- Add startup status message in main.py for --offload-strategy none
-
Your Name authored
- Add --no-ram CLI option to force model loading without CPU RAM spilling
- Implement --no-ram behavior for:
  - llama-cpp-python: n_gpu_layers=-1, use_mmap=False, ignore --n-ctx
  - HuggingFace transformers: device_map='cuda:0', low_cpu_mem_usage=True
  - Diffusers: force full GPU loading
  - sd.cpp: maximize GPU usage
- Propagate flag through model manager
- Add startup banner message
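A minimal sketch of the per-backend --no-ram overrides listed above. The helper and its backend keys are hypothetical; the individual option names (`n_gpu_layers`, `use_mmap`, `device_map`, `low_cpu_mem_usage`) are the ones the commit message cites:

```python
def no_ram_overrides(backend: str) -> dict:
    """Per-backend option overrides for --no-ram (force full-GPU loading).

    A sketch of the policy, not the project's actual dispatch code.
    """
    if backend == "llama-cpp-python":
        # All layers on the GPU, no mmap (mmap would let pages live in
        # CPU RAM); a user-supplied --n-ctx is ignored in this mode.
        return {"n_gpu_layers": -1, "use_mmap": False}
    if backend == "transformers":
        # Pin everything to one CUDA device and avoid the CPU staging copy.
        return {"device_map": "cuda:0", "low_cpu_mem_usage": True}
    if backend in ("diffusers", "sd.cpp"):
        # Full GPU loading; no CPU offload hooks installed (assumed key).
        return {"cpu_offload": False}
    raise ValueError(f"unknown backend: {backend}")
```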
-
Your Name authored
- Add get_all_allowed_identifiers() to MultiModelManager returning all valid model identifiers (default model + short name + aliases, audio, tts, image, vision models, and custom aliases)
- Rewrite is_allowed_model() to check against the full allowed set with support for prefixed forms and short-name matching
- Add validation in request_model() that rejects unknown models with an error message listing all available models
- Fix get_model_for_request() to reject loading arbitrary models not in the allowed set
- Update all API endpoints (text, images, tts, transcriptions) to check for the error key and return HTTP 404 when a disallowed model is requested
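The allow-list logic above can be sketched like this. The class and method names mirror the commit message, but the identifier-expansion rules (prefixed form, short name) are simplified assumptions about how the real manager builds its set:

```python
class ModelRegistry:
    """Sketch of the allow-list validation described above."""

    def __init__(self, models):
        # models: mapping of role -> full model identifier,
        # e.g. {"default": "org/My-Model-GGUF", "tts": "org/tts-model"}
        self.models = models

    def get_all_allowed_identifiers(self):
        allowed = set()
        for role, name in self.models.items():
            allowed.add(name)                     # full identifier
            allowed.add(f"{role}:{name}")         # prefixed form (assumed syntax)
            allowed.add(name.rsplit("/", 1)[-1])  # short name
        return allowed

    def is_allowed_model(self, requested):
        return requested in self.get_all_allowed_identifiers()

    def request_model(self, requested):
        # Reject unknown models with an error listing every valid choice;
        # API endpoints turn this error key into an HTTP 404.
        if not self.is_allowed_model(requested):
            available = ", ".join(sorted(self.get_all_allowed_identifiers()))
            return {"error": f"unknown model '{requested}'; available: {available}"}
        return {"model": requested}
```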
-
Your Name authored
- Try GGUF pattern first for HuggingFace model IDs
- Fall back to snapshot_download for entire repo (transformers/diffusers models)
- Works for both GGUF models and full HuggingFace repos
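The GGUF-first control flow above amounts to a try/fallback. In the sketch below the two download functions are injected as callables standing in for `huggingface_hub`-style helpers (`hf_hub_download` with a GGUF pattern, `snapshot_download` for the whole repo); the exception type and signatures are illustrative, not the library's exact behavior:

```python
def resolve_hf_download(repo_id, try_gguf_download, snapshot_download):
    """GGUF-first download strategy: try to fetch a single GGUF file from
    the repo; if none exists, fall back to downloading the entire snapshot
    (transformers/diffusers layouts)."""
    try:
        # e.g. a hf_hub_download-style call restricted to '*.gguf'
        return try_gguf_download(repo_id)
    except FileNotFoundError:
        # No GGUF file in the repo: grab the full repository instead
        return snapshot_download(repo_id)
```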
-
Your Name authored
-
Your Name authored
- Remove auto-detection logic, just use download_model from cache
- User can specify --download-file-pattern for non-GGUF models
-
Your Name authored
- Scan HuggingFace repo to detect available file patterns
- Try multiple patterns (.gguf, .safetensors, .bin, .pt, .pth)
- Default to .gguf if nothing found
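The pattern detection above reduces to a preference-ordered scan of the repo's file listing. This sketch takes the listing as a plain list of names (in the real code it would come from a HuggingFace repo-files API call); the function name is illustrative:

```python
# Preference order for weight-file extensions when probing a repo listing.
PATTERNS = (".gguf", ".safetensors", ".bin", ".pt", ".pth")

def detect_file_pattern(repo_files):
    """Pick the first extension (in preference order) that actually occurs
    in the repo's file listing; default to .gguf if nothing matches."""
    for pattern in PATTERNS:
        if any(name.endswith(pattern) for name in repo_files):
            return pattern
    return ".gguf"
```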
-
Your Name authored
- Add --download-model argument to download a model (URL or HuggingFace ID) to cache
- Add --download-file-pattern argument to specify file pattern for HF downloads
- Use download_model from codai.models.cache module
- Model downloads to appropriate cache and exits without starting server
-
- 19 Mar, 2026 31 commits
-
-
Your Name authored
text.py's set_global_args() was only setting its local global_args but not calling state.set_global_args(). This meant _load_default_model() and _load_model_by_name() got None from get_global_args(), so CLI flags like --flash-attn, --n-gpu-layers, --ram were not passed to backends.
-
Your Name authored
- Add flash_attn extraction from global_args in _load_default_model()
- Add flash_attn extraction from global_args in _load_model_by_name()
- Now --flash-attn flag will properly enable Flash Attention 2 when loading models
-
Your Name authored
- ModelParserDispatcher: Only log parser selection when actually used for parsing
- ModelParserAdapter: Defer dispatcher creation until first use
- Fixes noisy 'model_name=None, selected parser: ApexBig50Parser' during initialization
-
Your Name authored
- loadall: pre-load image models into VRAM at startup (with OOM fallback)
- loadswap: pre-load image models into CPU RAM at startup (first model stays in VRAM)
- Audio and TTS models are cached at startup, loaded into memory on first request (they use specialized loading mechanisms via faster-whisper and kokoro)
-
Your Name authored
- Default mode changed to ondemand (pre-load first model, unload/load on switch)
- loadswap: load first model in VRAM, others in CPU RAM, swap on switch
- loadall: try to load all models in VRAM, offload to CPU RAM if OOM
- --nopreload: skip pre-loading in any mode, load on first request
- request_model() now properly handles all three modes
- Added _move_model_to_cpu() and _move_model_to_vram() for loadswap
- Fixed NameError: model_manager reference in request_model() (was using global singleton instead of self)
- Updated CLI help text for --loadall, --loadswap, --nopreload
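The startup pre-loading policy described above can be summarized as a small planning function. `preload_plan` is a hypothetical name; the per-mode behavior follows the commit message (note that loadall's OOM fallback happens at load time, so the plan simply targets VRAM for everything):

```python
def preload_plan(mode, models, nopreload=False):
    """Return (vram_models, cpu_models) to pre-load at startup.

    A sketch of the mode policy, not the project's actual code.
    """
    if nopreload or not models:
        return [], []                   # load everything on first request
    if mode == "ondemand":
        return [models[0]], []          # first model only; swap on switch
    if mode == "loadswap":
        return [models[0]], models[1:]  # rest parked in CPU RAM
    if mode == "loadall":
        return list(models), []         # all in VRAM; OOM -> CPU at load time
    raise ValueError(f"unknown mode: {mode}")
```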
-
Your Name authored
- Added request_model() method to MultiModelManager that handles:
  1. Alias resolution (image, audio, tts, vision, default, custom aliases)
  2. VRAM management (unloading previous models in ondemand mode)
  3. Checking if model is already loaded
- Simplified codai/api/images.py:
  - Uses request_model() for model resolution and VRAM management
  - Extracted helper functions: _is_gguf_model(), _load_diffusers_pipeline(), _generate_with_diffusers(), _generate_with_sdcpp(), _load_sdcpp_model()
  - Removed duplicated sd.cpp generation code
  - Fixed semaphore scope (all generation now inside semaphore block)
- Simplified codai/api/tts.py:
  - Uses request_model() instead of duplicated VRAM management code
  - Removed duplicate get_cached_model_path() and get_model_cache_dir() wrappers
- Simplified codai/api/transcriptions.py:
  - Uses request_model() instead of duplicated VRAM management code
- Simplified codai/api/text.py:
  - Both /v1/chat/completions and /v1/completions use request_model()
  - Removed duplicated VRAM management blocks
-
Your Name authored
-
Your Name authored
- **Model Manager**: Central coordinator for model lifecycle, alias resolution, loading/unloading
- **Cache Module**: Handles downloading, caching, and storage of models
- **API Modules**: Request models from Model Manager (not directly from cache)

Key changes:
- Removed resolve_and_load_model() from cache - moved logic to Model Manager
- Model Manager now downloads/caches models at startup when registered
- API modules use multi_model_manager.load_model() instead of cache functions
- Proper separation: Cache=storage, Manager=lifecycle coordination, APIs=requests

This fixes the incorrect direct API-to-cache coupling and establishes proper architectural boundaries.
-
Your Name authored
- Added resolve_and_load_model() function to codai.models.cache
- Simplified codai/api/images.py by removing 100+ lines of complex model resolution logic
- API modules now use single centralized function for all model loading
- Eliminates code duplication across API endpoints
- All model resolution logic now managed in one place
-
Your Name authored
- Added check in sd.cpp fallback to skip HF model IDs that are likely diffusers models
- Prevents sd.cpp from trying to download non-GGUF files like .gitattributes for diffusers models
- Tongyi-MAI/Z-Image-Turbo and similar diffusers models now handled correctly by diffusers library
- GGUF models still work with sd.cpp as before
-
Your Name authored
- Updated codai/api/images.py to use cache module functions directly
- Updated codai/api/tts.py to use centralized load_model() function
- Removed proxy method calls that were causing AttributeError
- All model loading/downloading now goes through codai.models.cache
-
Your Name authored
- Updated load_model() to handle three input types:
  1. Local files: Use directly without caching
  2. URLs: Download to cache if not cached, then use
  3. HF model IDs: Download via HF API if not cached, then use
- Updated get_cached_model_path() to validate local files
- Enhanced module documentation to reflect new capabilities
- All model types (text, image, audio, etc.) can now use any input type
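Classifying the three input types above is the first step of such a `load_model()`. This is a simplified sketch with an illustrative function name; the real code may use different heuristics for edge cases:

```python
import os
from urllib.parse import urlparse

def classify_model_source(spec):
    """Classify a model spec as 'local', 'url', or 'hf_id', mirroring the
    three input types load_model() handles."""
    if urlparse(spec).scheme in ("http", "https"):
        return "url"      # download to cache if not cached, then use
    if os.path.exists(spec):
        return "local"    # use directly, no caching
    return "hf_id"        # e.g. 'TheBloke/Llama-2-7B-GGUF'; fetch via HF API
```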
-
Your Name authored
- Updated remove_cached_model() to remove entire repo directories when matching by repo_id
- Previously only removed individual files, now removes complete repository cache
- Handles both files and directories in removal process
- More thorough cleanup of HuggingFace cached models
-
Your Name authored
- Added unified load_model() function as main entry point for model loading
- Updated WhisperServerManager to use centralized load_model() instead of inline logic
- Removed proxy methods from MultiModelManager - use cache module directly
- All cache functions now work seamlessly with both GGUF and HF model caches
- Improved separation of concerns: cache module handles all caching/downloading
-
Your Name authored
- Updated get_cached_model_path() to check both coderai and HF caches
- Updated download_model() to handle both URLs and HF model IDs automatically
- Made download_huggingface_model() consistent with unified API
- Updated module docstring to reflect unified cache functionality
- All cache functions now work seamlessly with both cache types
-
Your Name authored
- Updated remove_cached_model() to search by repo_id for HuggingFace models
- Moved cache management options (--list-cached-models, --remove-model, --remove-all-models) to run before heavy imports
- Improved cache operations to use centralized functions in codai.models.cache module
- Fixed model removal to work with full repo IDs like 'TheBloke/Llama-2-7B-GGUF'
-
Your Name authored
- Add list_cached_models_info() function to codai.models.cache module
- Move cache listing logic from main.py to the cache module
- Update main.py to use the centralized function early (before heavy imports)
- Improves code organization and avoids unnecessary imports for --list-cached-models
-
Your Name authored
- CoderAI cache: Shows individual GGUF files with sizes
- HuggingFace cache: Uses HF API (scan_cache_dir) to show model-level info, not individual files
- Shows model names, sizes, revision counts - not thousands of individual files
- Much more useful and readable output
-
Your Name authored
- Added code to print individual cached model files with sizes
- Previously only showed cache directory headers and summary
- Now shows each file with format: [cache_name] filename (size MB)
- Matches the format used by --remove-model command
-
Your Name authored
- Updated get_all_cache_dirs() to properly find HuggingFace hub directory
- Now checks for ~/.cache/huggingface/hub/ instead of just ~/.cache/huggingface/
- This fixes --list-cached-models not showing HuggingFace cached models
-
Your Name authored
- Removed the GGUF-only restriction on sd.cpp fallback
- Some HF models may be GGUF even without 'gguf' in the name
- Let sd.cpp attempt loading and fail gracefully if incompatible
- This allows sd.cpp to work as a proper fallback for any model type
-
Your Name authored
- Added check to only attempt sd.cpp fallback for GGUF models
- Tongyi-MAI/Z-Image-Turbo is a diffusers model, not GGUF, so sd.cpp should be skipped
- sd.cpp only supports GGUF models, diffusers models use the diffusers pipeline
- This prevents unnecessary sd.cpp resolution attempts for incompatible model types
-
Your Name authored
- Added proxy methods to MultiModelManager class for cache module functions
- These methods are called by images.py sd.cpp fallback path
- Fixes AttributeError: 'MultiModelManager' object has no attribute 'get_cached_model_path'
-
Your Name authored
- Enhanced the HF model resolution logic in images.py sd.cpp fallback path
- Now checks for ANY cached file from the repo first (not just GGUF files)
- Falls back to checking for cached GGUF files specifically
- Last resort: downloads the first file in the repo as fallback
- Better error handling and logging throughout the resolution process
- This should resolve models that are already cached even if the exact GGUF filename isn't known
-
Your Name authored
- Enhanced model resolution for sd.cpp fallback path
- Added multiple fallback strategies:
  1. Try HuggingFace GGUF resolution (existing)
  2. Fallback to direct file path check
  3. Fallback to cached model lookup
  4. Last resort: attempt download as URL
- Better error logging and handling
- Ensures model loading attempts all possible resolution paths before failing
-
Your Name authored
- Added model resolution and unload logic to /v1/audio/transcriptions
- Added model resolution and unload logic to /v1/audio/speech (TTS)
- Now ALL endpoints (text, image, audio, TTS) properly handle model switching
- In ondemand mode, ANY model type switch triggers unload first (e.g., text->audio, TTS->image, etc.)
-
Your Name authored
- Added resolve_model_name() to MultiModelManager to properly resolve model aliases
- Added get_currently_loaded_model_name() to track what's actually in VRAM
- Updated /v1/chat/completions, /v1/completions, and /v1/images/generations
- Now correctly compares resolved canonical names before deciding to unload
- Handles all aliases (default, image, audio, tts) and custom aliases
- Works across ALL model types: text->text2, image->image2, text->image, etc.
-
Your Name authored
- Added unload_all_models() to MultiModelManager that handles ALL model types: ModelManager, diffusers pipelines, sd.cpp StableDiffusion, and any other objects
- Text endpoints now properly unload image models before loading text models
- Image endpoints now properly unload text models before loading image models
- The rule: in ondemand mode, if the model in VRAM differs from the requested model (regardless of type), fully unload before loading the new one
- Includes gc.collect(), torch.cuda.empty_cache(), and 1s settle delay
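The cleanup sequence mentioned above (gc, CUDA cache release, settle delay) can be sketched as follows. The function name is illustrative; `torch` is imported lazily so the sketch also runs on machines without it:

```python
import gc
import time

def free_vram_after_unload(settle_seconds=1.0):
    """Post-unload cleanup: callers should drop all references to the
    model first, then this collects garbage, releases CUDA's cached
    allocations, and gives the driver a moment to settle."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # no torch installed; nothing CUDA-side to release
    time.sleep(settle_seconds)
    return True
```

The order matters: `empty_cache()` only returns memory whose tensors are already garbage-collected, so `gc.collect()` must run first.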
-
Your Name authored
- In ondemand mode (no --load-all or --loadswap specified), when a new model is requested, the current model in VRAM is now fully unloaded before loading the new one. This ensures clean model switching.
- Added cleanup logic to both /v1/chat/completions and /v1/completions endpoints
- Added same logic to image generation endpoints (diffusers and sd.cpp paths)
- Cleanup includes: model cleanup, gc.collect(), torch.cuda.empty_cache()
-
Your Name authored
Root cause: The refactored code was hardcoding torch.float16 for CUDA, ignoring the --image-precision bf16 CLI argument. The Z-Image-Turbo model requires bfloat16 precision - using float16 causes NaN values in the image processor, resulting in all-black images.

Also restored the original model loading logic with:
- GGUF model detection (skip diffusers for GGUF)
- OOM retry with progressive memory optimization
- use_safetensors=True
- Sequential CPU offload support
-
Your Name authored
- Changed default image size from 512x512 back to 1024x1024 to match original coderai
- Changed NaN handling from 0.5 to 0.0 to match original coderai
-