1. 20 Mar, 2026 9 commits
    • Fix offload-strategy parameter passing to CUDA backend · bf1d3f52
      - Add offload_strategy to kwargs in _load_default_model and _load_model_by_name
      - Fix parameter name: ram -> manual_ram_gb to match backend expectation
      - Also pass load_in_4bit, load_in_8bit, and max_gpu_percent
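The shape of this fix can be sketched as a small helper that assembles backend kwargs from the parsed CLI args. The function name and the exact args object are assumptions; the parameter names (`offload_strategy`, `ram` renamed to `manual_ram_gb`, `load_in_4bit`, `load_in_8bit`, `max_gpu_percent`) come from the commit message.

```python
def build_backend_kwargs(args):
    """Collect CUDA-backend kwargs from parsed CLI args (sketch).

    Mirrors the fix described above: offload_strategy is now forwarded,
    and the CLI's `ram` value is passed as `manual_ram_gb`, the name the
    backend actually expects.
    """
    kwargs = {}
    if getattr(args, "offload_strategy", None) is not None:
        kwargs["offload_strategy"] = args.offload_strategy
    if getattr(args, "ram", None) is not None:
        # Before the fix this was forwarded under the wrong key, `ram`.
        kwargs["manual_ram_gb"] = args.ram
    for name in ("load_in_4bit", "load_in_8bit", "max_gpu_percent"):
        value = getattr(args, name, None)
        if value is not None:
            kwargs[name] = value
    return kwargs
```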
    • Add --offload-strategy none to disable CPU offloading and VRAM auto-detection · beded066
      - Add 'none' to --offload-strategy choices in cli.py
      - In cuda.py backend:
        - _get_vram_percentages_for_strategy() returns None for 'none' strategy
        - _get_vram_percentages_for_gpu() skips VRAM detection for 'none'
        - load_model() loads directly on GPU without max_memory constraints
      - Add startup status message in main.py for --offload-strategy none
    • Add --no-ram option to maximize VRAM usage · b782a092
      - Add --no-ram CLI option to force model loading without CPU RAM spilling
      - Implement --no-ram behavior for:
        - llama-cpp-python: n_gpu_layers=-1, use_mmap=False, ignore --n-ctx
        - HuggingFace transformers: device_map='cuda:0', low_cpu_mem_usage=True
        - Diffusers: force full GPU loading
        - sd.cpp: maximize GPU usage
      - Propagate flag through model manager
      - Add startup banner message
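The per-backend behavior above can be sketched as a mapping from the flag to loader overrides. The function name and backend labels are assumptions; the kwargs for llama-cpp-python and transformers are the ones named in the commit.

```python
def no_ram_load_kwargs(backend, no_ram):
    """Per-backend loader overrides for --no-ram (sketch).

    With --no-ram set, each backend is forced to keep the model fully
    in VRAM instead of spilling to CPU RAM.
    """
    if not no_ram:
        return {}
    if backend == "llama-cpp":
        # All layers on GPU; mmap off so weights are not paged through RAM.
        return {"n_gpu_layers": -1, "use_mmap": False}
    if backend == "transformers":
        # Pin the whole model to the first CUDA device.
        return {"device_map": "cuda:0", "low_cpu_mem_usage": True}
    return {}  # diffusers / sd.cpp handled by their own loading paths
```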
    • API: validate requested models against CLI-registered models · ef949827
      - Add get_all_allowed_identifiers() to MultiModelManager returning all valid
        model identifiers (default model + short name + aliases, audio, tts, image,
        vision models, and custom aliases)
      - Rewrite is_allowed_model() to check against the full allowed set with
        support for prefixed forms and short-name matching
      - Add validation in request_model() that rejects unknown models with an error
        message listing all available models
      - Fix get_model_for_request() to reject loading arbitrary models not in the
        allowed set
      - Update all API endpoints (text, images, tts, transcriptions) to check for
        the error key and return HTTP 404 when a disallowed model is requested
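The validation logic can be sketched as follows. A plain set stands in for the identifiers gathered by `get_all_allowed_identifiers()`; the prefix and short-name conventions shown here are assumptions based on the commit's description.

```python
class AllowedModels:
    """Sketch of the API-side model validation described above."""

    def __init__(self, identifiers):
        # In the real MultiModelManager this set is built from the default
        # model, short names, aliases, audio/tts/image/vision models, etc.
        self._allowed = set(identifiers)

    def is_allowed_model(self, requested):
        if requested in self._allowed:
            return True
        # Accept a prefixed form such as "text:model-name".
        short = requested.split(":", 1)[-1]
        if short in self._allowed:
            return True
        # Accept a repo short name, e.g. "Llama-2-7B-GGUF" for
        # "TheBloke/Llama-2-7B-GGUF".
        return any(name.split("/")[-1] == requested for name in self._allowed)
```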
    • Fix --download-model for non-GGUF HuggingFace models · b0a633c7
      - Try GGUF pattern first for HuggingFace model IDs
      - Fall back to snapshot_download for entire repo (transformers/diffusers models)
      - Works for both GGUF models and full HuggingFace repos
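The try-then-fall-back flow can be sketched with the two download steps injected as callables, so the control flow is visible without depending on huggingface_hub. In the real cache module these would be the GGUF-pattern download and `snapshot_download`.

```python
def download_hf_model(model_id, try_gguf, snapshot):
    """Download a HuggingFace model for --download-model (sketch).

    `try_gguf` and `snapshot` stand in for the cache module's download
    calls: a GGUF-pattern fetch, and a full-repo snapshot fallback for
    transformers/diffusers repos that ship no .gguf files.
    """
    try:
        return try_gguf(model_id)      # GGUF pattern first
    except Exception:
        return snapshot(model_id)      # fall back to the whole repo
```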
    • Try to fix · aacd990a
    • Simplify --download-model: use cache module directly · fe7a30dc
      - Remove auto-detection logic, just use download_model from cache
      - User can specify --download-file-pattern for non-GGUF models
    • Improve --download-model auto-detection for non-GGUF HF models · 01bdfe14
      - Scan HuggingFace repo to detect available file patterns
      - Try multiple patterns (.gguf, .safetensors, .bin, .pt, .pth)
      - Default to .gguf if nothing found
    • Add --download-model CLI argument to download models to cache and exit · a49d1d88
      - Add --download-model argument to download a model (URL or HuggingFace ID) to cache
      - Add --download-file-pattern argument to specify file pattern for HF downloads
      - Use download_model from codai.models.cache module
      - Model downloads to appropriate cache and exits without starting server
  2. 19 Mar, 2026 31 commits
    • Fix global_args not propagated to state module · 8512c7db
      text.py's set_global_args() was only setting its local global_args but
      not calling state.set_global_args(). This meant _load_default_model()
      and _load_model_by_name() got None from get_global_args(), so CLI flags
      like --flash-attn, --n-gpu-layers, --ram were not passed to backends.
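The bug and fix can be sketched in a few lines, with module-level variables standing in for the real `codai` state module and text.py:

```python
# Stand-in for the shared state module that the backends read from.
_state_global_args = None

def state_set_global_args(args):
    global _state_global_args
    _state_global_args = args

def state_get_global_args():
    return _state_global_args

# Stand-in for text.py: it kept its own module-level copy of the CLI args
# but, before the fix, never handed them to the shared state module.
_local_global_args = None

def set_global_args(args):
    global _local_global_args
    _local_global_args = args
    state_set_global_args(args)  # the missing call: propagate to state
```

Without the final call, `state_get_global_args()` stays `None` and the backends see no CLI flags, which is exactly the failure described above.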
    • Fix --flash-attn CLI flag not being passed to model loading · 7e4ae96f
      - Add flash_attn extraction from global_args in _load_default_model()
      - Add flash_attn extraction from global_args in _load_model_by_name()
      - Now --flash-attn flag will properly enable Flash Attention 2 when loading models
    • Suppress spurious DEBUG model_parser output during startup · 49c01211
      - ModelParserDispatcher: Only log parser selection when actually used for parsing
      - ModelParserAdapter: Defer dispatcher creation until first use
      - Fixes noisy 'model_name=None, selected parser: ApexBig50Parser' during initialization
    • Pre-load all model types at startup for loadall/loadswap modes · bc2b1388
      - loadall: pre-load image models into VRAM at startup (with OOM fallback)
      - loadswap: pre-load image models into CPU RAM at startup (first model stays in VRAM)
      - Audio and TTS models are cached at startup, loaded into memory on first request
        (they use specialized loading mechanisms via faster-whisper and kokoro)
    • Implement proper loadswap/loadall/ondemand model management modes · c08a5b4f
      - Default mode changed to ondemand (pre-load first model, unload/load on switch)
      - loadswap: load first model in VRAM, others in CPU RAM, swap on switch
      - loadall: try to load all models in VRAM, offload to CPU RAM if OOM
      - --nopreload: skip pre-loading in any mode, load on first request
      - request_model() now properly handles all three modes
      - Added _move_model_to_cpu() and _move_model_to_vram() for loadswap
      - Fixed NameError: model_manager reference in request_model() (was using global singleton instead of self)
      - Updated CLI help text for --loadall, --loadswap, --nopreload
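The three modes can be sketched as a pure planning function that decides which actions a model switch triggers. The function and action names are assumptions; the per-mode behavior follows the commit's description, and the real `request_model()` would execute these actions on actual model objects.

```python
def plan_model_switch(mode, requested, loaded_in_vram, in_cpu_ram):
    """Plan load/unload actions when a new model is requested (sketch).

    mode: 'ondemand' | 'loadswap' | 'loadall'
    loaded_in_vram: canonical name of the model currently in VRAM, or None
    in_cpu_ram: set of models parked in CPU RAM (used by loadswap)
    """
    if requested == loaded_in_vram:
        return []                                   # already active
    if mode == "ondemand":
        # Fully unload the current model, then load the new one.
        actions = [("unload", loaded_in_vram)] if loaded_in_vram else []
        return actions + [("load_vram", requested)]
    if mode == "loadswap":
        # Park the current model in CPU RAM, then swap or load the new one.
        actions = [("move_to_cpu", loaded_in_vram)] if loaded_in_vram else []
        if requested in in_cpu_ram:
            return actions + [("move_to_vram", requested)]
        return actions + [("load_vram", requested)]
    if mode == "loadall":
        # Everything stays resident; OOM fallback is handled at load time.
        return [("load_vram", requested)]
    raise ValueError(f"unknown mode: {mode}")
```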
    • Centralize model resolution and VRAM management in MultiModelManager.request_model() · e004541a
      - Added request_model() method to MultiModelManager that handles:
        1. Alias resolution (image, audio, tts, vision, default, custom aliases)
        2. VRAM management (unloading previous models in ondemand mode)
        3. Checking if model is already loaded

      - Simplified codai/api/images.py:
        - Uses request_model() for model resolution and VRAM management
        - Extracted helper functions: _is_gguf_model(), _load_diffusers_pipeline(),
          _generate_with_diffusers(), _generate_with_sdcpp(), _load_sdcpp_model()
        - Removed duplicated sd.cpp generation code
        - Fixed semaphore scope (all generation now inside semaphore block)

      - Simplified codai/api/tts.py:
        - Uses request_model() instead of duplicated VRAM management code
        - Removed duplicate get_cached_model_path() and get_model_cache_dir() wrappers

      - Simplified codai/api/transcriptions.py:
        - Uses request_model() instead of duplicated VRAM management code

      - Simplified codai/api/text.py:
        - Both /v1/chat/completions and /v1/completions use request_model()
        - Removed duplicated VRAM management blocks
    • Fix architecture: Proper separation of Model Manager and Cache responsibilities · 7788ce85
      - **Model Manager**: Central coordinator for model lifecycle, alias resolution, loading/unloading
      - **Cache Module**: Handles downloading, caching, and storage of models
      - **API Modules**: Request models from Model Manager (not directly from cache)

      Key changes:
      - Removed resolve_and_load_model() from cache - moved logic to Model Manager
      - Model Manager now downloads/caches models at startup when registered
      - API modules use multi_model_manager.load_model() instead of cache functions
      - Proper separation: Cache=storage, Manager=lifecycle coordination, APIs=requests

      This fixes the incorrect direct API-to-cache coupling and establishes proper architectural boundaries.
    • Centralize model resolution logic in cache module · de4d544f
      - Added resolve_and_load_model() function to codai.models.cache
      - Simplified codai/api/images.py by removing 100+ lines of complex model resolution logic
      - API modules now use single centralized function for all model loading
      - Eliminates code duplication across API endpoints
      - All model resolution logic now managed in one place
    • Fix image generation to properly handle diffusers vs GGUF models · c535ca5f
      - Added check in sd.cpp fallback to skip HF model IDs that are likely diffusers models
      - Prevents sd.cpp from trying to download non-GGUF files like .gitattributes for diffusers models
      - Tongyi-MAI/Z-Image-Turbo and similar diffusers models now handled correctly by diffusers library
      - GGUF models still work with sd.cpp as before
    • Fix API modules to use centralized cache functions · 5e641ba2
      - Updated codai/api/images.py to use cache module functions directly
      - Updated codai/api/tts.py to use centralized load_model() function
      - Removed proxy method calls that were causing AttributeError
      - All model loading/downloading now goes through codai.models.cache
    • Implement intelligent model loading for local files, URLs, and HF IDs · bff24350
      - Updated load_model() to handle three input types:
        1. Local files: Use directly without caching
        2. URLs: Download to cache if not cached, then use
        3. HF model IDs: Download via HF API if not cached, then use
      - Updated get_cached_model_path() to validate local files
      - Enhanced module documentation to reflect new capabilities
      - All model types (text, image, audio, etc.) can now use any input type
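The three-way classification at the heart of this `load_model()` behavior can be sketched as (function name is an assumption):

```python
import os

def classify_model_source(spec):
    """Classify a model spec as described above (sketch)."""
    if os.path.isfile(spec):
        return "local"   # use directly, no caching
    if spec.startswith(("http://", "https://")):
        return "url"     # download to cache if not cached, then use
    return "hf"          # treat as a HuggingFace model ID
```

Note the ordering matters: an existing local path wins even if it superficially looks like an `org/name` HF ID.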
    • Fix --remove-model to remove entire HF repository directories · 3e3067a9
      - Updated remove_cached_model() to remove entire repo directories when matching by repo_id
      - Previously only removed individual files, now removes complete repository cache
      - Handles both files and directories in removal process
      - More thorough cleanup of HuggingFace cached models
    • Centralize all model loading/downloading logic in codai.models.cache · c93d4a6b
      - Added unified load_model() function as main entry point for model loading
      - Updated WhisperServerManager to use centralized load_model() instead of inline logic
      - Removed proxy methods from MultiModelManager - use cache module directly
      - All cache functions now work seamlessly with both GGUF and HF model caches
      - Improved separation of concerns: cache module handles all caching/downloading
    • Unify cache functions to work with both GGUF and HuggingFace caches · 82735770
      - Updated get_cached_model_path() to check both coderai and HF caches
      - Updated download_model() to handle both URLs and HF model IDs automatically
      - Made download_huggingface_model() consistent with unified API
      - Updated module docstring to reflect unified cache functionality
      - All cache functions now work seamlessly with both cache types
    • Fix --remove-model to work with HuggingFace repo IDs · 52eb402a
      - Updated remove_cached_model() to search by repo_id for HuggingFace models
      - Moved cache management options (--list-cached-models, --remove-model, --remove-all-models) to run before heavy imports
      - Improved cache operations to use centralized functions in codai.models.cache module
      - Fixed model removal to work with full repo IDs like 'TheBloke/Llama-2-7B-GGUF'
    • Refactor --list-cached-models to use centralized cache module function · e509279a
      - Add list_cached_models_info() function to codai.models.cache module
      - Move cache listing logic from main.py to the cache module
      - Update main.py to use the centralized function early (before heavy imports)
      - Improves code organization and avoids unnecessary imports for --list-cached-models
    • Fix: Properly implement --list-cached-models with model-level information · 4d9f9886
      - CoderAI cache: Shows individual GGUF files with sizes
      - HuggingFace cache: Uses HF API (scan_cache_dir) to show model-level info, not individual files
      - Shows model names, sizes, revision counts - not thousands of individual files
      - Much more useful and readable output
    • Fix: --list-cached-models now displays individual cached files · 07cf6c3f
      - Added code to print individual cached model files with sizes
      - Previously only showed cache directory headers and summary
      - Now shows each file with format: [cache_name] filename (size MB)
      - Matches the format used by --remove-model command
    • Fix: Correct HuggingFace cache directory detection · 73c81b2f
      - Updated get_all_cache_dirs() to properly find HuggingFace hub directory
      - Now checks for ~/.cache/huggingface/hub/ instead of just ~/.cache/huggingface/
      - This fixes --list-cached-models not showing HuggingFace cached models
    • Revert: Keep sd.cpp fallback available for all models when diffusers fails · 2cedd442
      - Removed the GGUF-only restriction on sd.cpp fallback
      - Some HF models may be GGUF even without 'gguf' in the name
      - Let sd.cpp attempt loading and fail gracefully if incompatible
      - This allows sd.cpp to work as a proper fallback for any model type
    • Fix: Skip sd.cpp fallback for non-GGUF models · f5b9d812
      - Added check to only attempt sd.cpp fallback for GGUF models
      - Tongyi-MAI/Z-Image-Turbo is a diffusers model, not GGUF, so sd.cpp should be skipped
      - sd.cpp only supports GGUF models, diffusers models use the diffusers pipeline
      - This prevents unnecessary sd.cpp resolution attempts for incompatible model types
    • Fix: Add missing get_cached_model_path and get_model_cache_dir methods to MultiModelManager · 392895da
      - Added proxy methods to MultiModelManager class for cache module functions
      - These methods are called by images.py sd.cpp fallback path
      - Fixes AttributeError: 'MultiModelManager' object has no attribute 'get_cached_model_path'
    • Fix: Improve HuggingFace model ID resolution for sd.cpp · ce75ec47
      - Enhanced the HF model resolution logic in images.py sd.cpp fallback path
      - Now checks for ANY cached file from the repo first (not just GGUF files)
      - Falls back to checking for cached GGUF files specifically
      - Last resort: downloads the first file in the repo as fallback
      - Better error handling and logging throughout the resolution process
      - This should resolve models that are already cached even if the exact GGUF filename isn't known
    • Fix: Improve sd.cpp model loading fallback logic · 7bb4eec1
      - Enhanced model resolution for sd.cpp fallback path
      - Added multiple fallback strategies:
        1. Try HuggingFace GGUF resolution (existing)
        2. Fallback to direct file path check
        3. Fallback to cached model lookup
        4. Last resort: attempt download as URL
      - Better error logging and handling
      - Ensures model loading attempts all possible resolution paths before failing
    • Complete fix: Add ondemand mode model switching to audio and TTS endpoints · 63460a13
      - Added model resolution and unload logic to /v1/audio/transcriptions
      - Added model resolution and unload logic to /v1/audio/speech (TTS)
      - Now ALL endpoints (text, image, audio, TTS) properly handle model switching
      - In ondemand mode, ANY model type switch triggers unload first (e.g., text->audio, TTS->image, etc.)
    • Fix: Proper model resolution for ondemand mode - unload when switching between ANY different models · a37085b4
      - Added resolve_model_name() to MultiModelManager to properly resolve model aliases
      - Added get_currently_loaded_model_name() to track what's actually in VRAM
      - Updated /v1/chat/completions, /v1/completions, and /v1/images/generations
      - Now correctly compares resolved canonical names before deciding to unload
      - Handles all aliases (default, image, audio, tts) and custom aliases
      - Works across ALL model types: text->text2, image->image2, text->image, etc.
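The resolve-then-compare rule above can be sketched as follows; a plain dict stands in for the manager's alias tables, and `needs_unload` is a hypothetical name for the decision the endpoints make:

```python
def resolve_model_name(requested, aliases):
    """Resolve an alias to its canonical model name (sketch).

    `aliases` maps names like 'default', 'image', 'audio', 'tts' (and
    custom aliases) to canonical IDs; unknown names resolve to themselves.
    """
    return aliases.get(requested, requested)

def needs_unload(requested, currently_loaded, aliases):
    """Unload only when the resolved name differs from what is in VRAM."""
    return (
        currently_loaded is not None
        and resolve_model_name(requested, aliases) != currently_loaded
    )
```

Comparing resolved canonical names is what makes the rule uniform across model types: `'default'` and its canonical ID compare equal, while any genuinely different model, whatever its type, triggers an unload first.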
    • Fix: Centralize model unloading - properly handle all model types in ondemand mode · 00775972
      - Added unload_all_models() to MultiModelManager that handles ALL model types:
        ModelManager, diffusers pipelines, sd.cpp StableDiffusion, and any other objects
      - Text endpoints now properly unload image models before loading text models
      - Image endpoints now properly unload text models before loading image models
      - The rule: in ondemand mode, if the model in VRAM differs from the requested
        model (regardless of type), fully unload before loading the new one
      - Includes gc.collect(), torch.cuda.empty_cache(), and 1s settle delay
    • Fix: In ondemand mode, fully unload current model before loading new one · 7d838962
      - In ondemand mode (no --load-all or --loadswap specified), when a new model
        is requested, the current model in VRAM is now fully unloaded before loading
        the new one. This ensures clean model switching.
      - Added cleanup logic to both /v1/chat/completions and /v1/completions endpoints
      - Added same logic to image generation endpoints (diffusers and sd.cpp paths)
      - Cleanup includes: model cleanup, gc.collect(), torch.cuda.empty_cache()
    • Fix black image: use --image-precision from CLI args instead of hardcoded float16 · 9b3126d7
      Root cause: The refactored code was hardcoding torch.float16 for CUDA,
      ignoring the --image-precision bf16 CLI argument. The Z-Image-Turbo model
      requires bfloat16 precision - using float16 causes NaN values in the
      image processor, resulting in all-black images.

      Also restored the original model loading logic with:
      - GGUF model detection (skip diffusers for GGUF)
      - OOM retry with progressive memory optimization
      - use_safetensors=True
      - Sequential CPU offload support
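The fix amounts to resolving the dtype from the CLI value rather than hardcoding `torch.float16`. A dependency-free sketch (the mapping keys are assumed CLI spellings; real code would do `getattr(torch, name)` on the resolved name):

```python
# Map --image-precision CLI values to torch dtype names (sketch).
DTYPE_NAMES = {
    "fp16": "float16",
    "bf16": "bfloat16",  # required by Z-Image-Turbo; float16 yields NaNs
    "fp32": "float32",
}

def resolve_image_dtype_name(precision, default="float16"):
    """Pick the dtype name for the diffusers pipeline from the CLI arg.

    Before the fix, the equivalent of 'float16' was always returned on
    CUDA, silently ignoring --image-precision bf16.
    """
    return DTYPE_NAMES.get(precision, default)
```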
    • Fix black image issue: restore original default size (1024x1024) and NaN handling · 553cdf07
      - Changed default image size from 512x512 back to 1024x1024 to match original coderai
      - Changed NaN handling from 0.5 to 0.0 to match original coderai