Commit 8e4cdf2f authored by Your Name

v0.4.0 - Configuration refactoring and autoselect enhancements

- Centralized API key storage in providers.json only
- Added support for provider-only rotation entries (auto-selects a random model)
- Added default settings hierarchy at provider and rotation levels
- Limited autoselect selection context to 10 messages or 8000 tokens
- Added support for direct provider models in autoselect (rotation/provider/model)
- Added 'internal' keyword for local HuggingFace model selection
- Updated requirements.txt with torch and transformers
parent b6bbf540
......@@ -331,6 +331,39 @@ When making changes:
2. Add to `PROVIDER_HANDLERS` dictionary
3. Add provider configuration to `config/providers.json`
### Configuration Architecture
**API Key Management:**
- API keys are stored centrally in `config/providers.json`
- Each provider definition includes an `api_key` field
- Rotation and autoselect configurations reference providers by name only
- The system automatically resolves API keys by matching provider names
**Priority Order for API Keys:**
1. API key from provider config (providers.json) - highest priority
2. API key from rotation config (rotations.json) - fallback for backward compatibility
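A minimal sketch of this resolution order, assuming dict-shaped configs (the function name and field access are illustrative, not the project's actual API):
```python
from typing import Optional

def resolve_api_key(provider_cfg: dict, rotation_entry: dict) -> Optional[str]:
    """Resolve the API key for a rotation entry, provider config first."""
    # 1. Key from the provider definition (providers.json) wins
    if provider_cfg.get("api_key"):
        return provider_cfg["api_key"]
    # 2. Legacy key embedded in the rotation entry (rotations.json)
    return rotation_entry.get("api_key")
```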
**Model Selection in Rotations:**
- If models are specified in the rotation config: uses those models
- If no models are specified: randomly selects from the provider's available models
- Models from the provider config are used with a default weight of 1
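One way to picture the fallback, as a simplified sketch (the real logic lives in `RotationHandler` further down in this diff):
```python
def models_for_entry(rotation_entry: dict, provider_cfg) -> list:
    """Use the rotation's model list if present, otherwise fall back to
    the provider's own models with a default weight of 1."""
    models = rotation_entry.get("models")
    if models:
        return models
    return [
        {"name": m.name, "weight": 1, "rate_limit": m.rate_limit}
        for m in (provider_cfg.models or [])
    ]
```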
**Default Settings Hierarchy:**
Settings can be specified at three levels with the following priority:
1. Model-specific settings (highest priority)
2. Rotation default settings
3. Provider default settings (lowest priority)
**Supported Default Fields:**
- `default_rate_limit`: Rate limiting between requests
- `default_max_request_tokens`: Maximum tokens per request
- `default_rate_limit_TPM`: Tokens per minute limit
- `default_rate_limit_TPH`: Tokens per hour limit
- `default_rate_limit_TPD`: Tokens per day limit
- `default_context_size`: Context window size
- `default_condense_context`: Context condensation threshold
- `default_condense_method`: Context condensation method
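A worked example of the precedence, with invented values:
```python
# Hypothetical configs, only to show which level wins:
provider_defaults = {"default_context_size": 1_000_000, "default_rate_limit": 0}
rotation_defaults = {"default_context_size": 100_000}
model_cfg = {"name": "gemini-2.0-flash"}  # sets neither field itself

# context_size: unset on the model -> rotation default (100_000) wins
# rate_limit:   unset on model and rotation -> provider default (0) applies
```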
### Kiro Gateway Integration
**Overview:**
......@@ -353,12 +386,33 @@ Kiro Gateway is a third-party proxy gateway that provides OpenAI and Anthropic-c
In `config/providers.json`:
```json
{
"gemini": {
"id": "gemini",
"name": "Google AI Studio",
"endpoint": "https://generativelanguage.googleapis.com/v1beta",
"type": "google",
"api_key_required": true,
"api_key": "YOUR_GEMINI_API_KEY",
"rate_limit": 0,
"default_rate_limit": 0,
"default_max_request_tokens": 1000000,
"default_context_size": 1000000,
"models": [
{
"name": "gemini-2.0-flash",
"rate_limit": 0,
"max_request_tokens": 1000000,
"context_size": 1000000
}
]
},
"kiro": {
"id": "kiro",
"name": "Kiro Gateway (Amazon Q Developer)",
"endpoint": "http://localhost:8000/v1",
"type": "kiro",
"api_key_required": true,
"api_key": "YOUR_KIRO_API_KEY",
"rate_limit": 0,
"models": [
{
......@@ -372,20 +426,39 @@ In `config/providers.json`:
}
```
In `config/rotations.json`:
In `config/rotations.json` (API keys are now referenced from providers.json):
```json
{
"coding": {
"model_name": "coding",
"notifyerrors": false,
"default_rate_limit": 0,
"default_context_size": 100000,
"providers": [
{
"provider_id": "gemini",
"models": [
{
"name": "gemini-2.0-flash",
"weight": 3,
"rate_limit": 0
}
]
},
{
"provider_id": "openai"
}
]
},
"kiro-claude": {
"model_name": "kiro-claude",
"providers": [
{
"provider_id": "kiro",
"api_key": "YOUR_KIRO_GATEWAY_API_KEY",
"models": [
{
"name": "claude-sonnet-4-5",
"weight": 3,
"rate_limit": 0
"weight": 3
}
]
}
......@@ -394,6 +467,8 @@ In `config/rotations.json`:
}
```
**Note:** In the example above, the "openai" provider entry has no models specified, so the system will randomly select from all models defined in the provider's configuration.
**Setup Requirements:**
1. Kiro Gateway must be running (typically on `http://localhost:8000`)
2. Kiro Gateway must be configured with valid Kiro credentials (IDE or CLI)
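A quick reachability check before wiring the gateway into `providers.json` may help; the `/v1/models` path is an assumption based on the gateway being OpenAI-compatible, and an auth error still proves the process is up:
```python
import urllib.error
import urllib.request

def kiro_gateway_up(base_url: str = "http://localhost:8000") -> bool:
    """Return True if the gateway answers HTTP on its models route."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2):
            return True
    except urllib.error.HTTPError:
        return True  # got an HTTP response (e.g. 401), so the gateway is running
    except OSError:
        return False  # connection refused or timed out
```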
......@@ -515,6 +590,20 @@ This AI.PROMPT file is automatically updated when significant changes are made t
### Recent Updates
**2026-03-22 - Configuration Refactoring**
- Centralized API key storage in providers.json
- API keys are now stored only in provider definitions, not in rotation/autoselect configs
- Rotation and autoselect configurations reference providers by name only
- Added support for provider-only entries in rotations (no models specified)
- When no models are specified, the system randomly selects from the provider's available models
- Added default settings support at provider level (default_rate_limit, default_max_request_tokens, etc.)
- Added default settings support at rotation level
- Settings priority: model-specific > rotation defaults > provider defaults
- Updated ProviderConfig with default setting fields
- Updated RotationConfig with default setting fields
- Added _apply_defaults_to_model() method in RotationHandler
- Updated configuration examples in AI.PROMPT
**2026-02-07 - Version 0.2.7**
- Added max_request_tokens support for automatic request splitting
- Updated ProviderConfig to include optional models field with max_request_tokens
......
......@@ -53,10 +53,28 @@ class ProviderConfig(BaseModel):
api_key: Optional[str] = None # Optional API key in provider config
models: Optional[List[ProviderModelConfig]] = None # Optional list of models with their configs
kiro_config: Optional[Dict] = None # Optional Kiro-specific configuration (credentials, region, etc.)
# Default settings for models in this provider
default_rate_limit: Optional[float] = None
default_max_request_tokens: Optional[int] = None
default_rate_limit_TPM: Optional[int] = None
default_rate_limit_TPH: Optional[int] = None
default_rate_limit_TPD: Optional[int] = None
default_context_size: Optional[int] = None
default_condense_context: Optional[int] = None
default_condense_method: Optional[Union[str, List[str]]] = None
class RotationConfig(BaseModel):
providers: List[Dict]
notifyerrors: bool = False
# Default settings for models in this rotation
default_rate_limit: Optional[float] = None
default_max_request_tokens: Optional[int] = None
default_rate_limit_TPM: Optional[int] = None
default_rate_limit_TPH: Optional[int] = None
default_rate_limit_TPD: Optional[int] = None
default_context_size: Optional[int] = None
default_condense_context: Optional[int] = None
default_condense_method: Optional[Union[str, List[str]]] = None
class AutoselectModelInfo(BaseModel):
model_id: str
......
......@@ -663,6 +663,50 @@ class RotationHandler:
return provider_config.type
return None
def _apply_defaults_to_model(self, model: Dict, provider_config, rotation_config) -> Dict:
"""
Apply default settings to a model configuration.
Priority order:
1. Model-specific settings (highest priority)
2. Rotation default settings
3. Provider default settings (lowest priority)
Args:
model: The model configuration dict
provider_config: The provider configuration
rotation_config: The rotation configuration
Returns:
Model dict with defaults applied
"""
# List of fields that can have defaults
default_fields = [
'rate_limit',
'max_request_tokens',
'rate_limit_TPM',
'rate_limit_TPH',
'rate_limit_TPD',
'context_size',
'condense_context',
'condense_method'
]
for field in default_fields:
# If field is not set in model, try rotation defaults, then provider defaults
if field not in model or model[field] is None:
# Try rotation defaults first
rotation_default = getattr(rotation_config, f'default_{field}', None)
if rotation_default is not None:
model[field] = rotation_default
else:
# Try provider defaults
provider_default = getattr(provider_config, f'default_{field}', None)
if provider_default is not None:
model[field] = provider_default
return model
async def _handle_chunked_rotation_request(
self,
handler,
......@@ -893,11 +937,40 @@ class RotationHandler:
logger.info(f" [AVAILABLE] Provider {provider_id} is active and ready")
models_in_provider = len(provider['models'])
# Check if models are specified in rotation config
# If not, use models from provider config
rotation_models = provider.get('models')
if not rotation_models:
logger.info(f" No models specified in rotation config for {provider_id}")
logger.info(f" Will use models from provider configuration")
# Get models from provider config
if provider_config.models:
# Use models from provider config with default weight of 1
rotation_models = []
for provider_model in provider_config.models:
model_dict = {
'name': provider_model.name,
'weight': 1, # Default weight
'rate_limit': provider_model.rate_limit,
'max_request_tokens': provider_model.max_request_tokens
}
rotation_models.append(model_dict)
logger.info(f" Loaded {len(rotation_models)} model(s) from provider config")
else:
logger.warning(f" No models defined in provider config for {provider_id}")
logger.warning(f" Skipping this provider")
skipped_providers.append(provider_id)
continue
models_in_provider = len(rotation_models)
total_models_considered += models_in_provider
logger.info(f" Found {models_in_provider} model(s) in this provider")
for model in provider['models']:
for model in rotation_models:
# Apply defaults: model-specific > rotation defaults > provider defaults
model = self._apply_defaults_to_model(model, provider_config, rotation_config)
model_name = model['name']
model_weight = model['weight']
model_rate_limit = model.get('rate_limit', 'N/A')
......@@ -1882,6 +1955,9 @@ class AutoselectHandler:
def __init__(self):
self.config = config
self._skill_file_content = None
self._internal_model = None
self._internal_tokenizer = None
self._internal_model_lock = None
def _get_skill_file_content(self) -> str:
"""Load the autoselect.md skill file content"""
......@@ -1911,6 +1987,110 @@ class AutoselectHandler:
return self._skill_file_content
def _initialize_internal_model(self):
"""Initialize the internal HuggingFace model for selection (lazy loading)"""
import logging
logger = logging.getLogger(__name__)
if self._internal_model is not None:
return # Already initialized
try:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import threading
logger.info("=== INITIALIZING INTERNAL SELECTION MODEL ===")
model_name = "huihui-ai/Qwen2.5-0.5B-Instruct-abliterated-v3"
logger.info(f"Model: {model_name}")
# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
logger.info(f"Device: {device}")
# Load tokenizer
logger.info("Loading tokenizer...")
self._internal_tokenizer = AutoTokenizer.from_pretrained(model_name)
logger.info("Tokenizer loaded")
# Load model
logger.info("Loading model...")
self._internal_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if device == "cuda" else torch.float32,
device_map="auto" if device == "cuda" else None
)
if device == "cpu":
self._internal_model = self._internal_model.to(device)
logger.info("Model loaded successfully")
# Initialize thread lock for model access
self._internal_model_lock = threading.Lock()
logger.info("=== INTERNAL SELECTION MODEL READY ===")
except ImportError as e:
logger.error(f"Failed to import required libraries for internal model: {e}")
logger.error("Please install: pip install torch transformers")
raise
except Exception as e:
logger.error(f"Failed to initialize internal model: {e}", exc_info=True)
raise
async def _run_internal_model_selection(self, prompt: str) -> str:
"""Run the internal model for selection in a separate thread"""
import logging
import asyncio
from concurrent.futures import ThreadPoolExecutor
logger = logging.getLogger(__name__)
# Initialize model if needed
if self._internal_model is None:
self._initialize_internal_model()
def run_inference():
"""Run inference in a separate thread"""
with self._internal_model_lock:
try:
import torch
# Tokenize input
inputs = self._internal_tokenizer(prompt, return_tensors="pt")
# Move to same device as model
device = next(self._internal_model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate response
with torch.no_grad():
outputs = self._internal_model.generate(
**inputs,
max_new_tokens=100,
temperature=0.1,
do_sample=True,
pad_token_id=self._internal_tokenizer.eos_token_id
)
# Decode response
response = self._internal_tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the generated part (remove the prompt)
if response.startswith(prompt):
response = response[len(prompt):].strip()
return response
except Exception as e:
logger.error(f"Error during internal model inference: {e}", exc_info=True)
return None
# Run in thread pool to avoid blocking
loop = asyncio.get_event_loop()
with ThreadPoolExecutor(max_workers=1) as executor:
result = await loop.run_in_executor(executor, run_inference)
return result
def _build_autoselect_prompt(self, user_prompt: str, autoselect_config) -> str:
"""Build the prompt for model selection"""
skill_content = self._get_skill_file_content()
......@@ -1943,11 +2123,7 @@ class AutoselectHandler:
import logging
logger = logging.getLogger(__name__)
logger.info(f"=== AUTOSELECT MODEL SELECTION START ===")
logger.info(f"Using '{autoselect_config.selection_model}' rotation for model selection")
# Use the first available provider/model for the selection
# This is a simple implementation - could be enhanced to use a specific selection model
rotation_handler = RotationHandler()
logger.info(f"Using '{autoselect_config.selection_model}' for model selection")
# Create a minimal request for model selection
selection_request = {
......@@ -1962,10 +2138,79 @@ class AutoselectHandler:
logger.info(f" Max tokens: 100 (short response expected)")
logger.info(f" Stream: False")
# Use the configured selection rotation for the selection
# Determine if selection_model is a rotation, provider, or special keyword
selection_model = autoselect_config.selection_model
try:
logger.info(f"Sending selection request to rotation handler...")
response = await rotation_handler.handle_rotation_request(autoselect_config.selection_model, selection_request)
# Check if it's the special "internal" keyword
if selection_model == "internal":
logger.info(f"Selection model is 'internal' - using local HuggingFace model")
response_content = await self._run_internal_model_selection(prompt)
if not response_content:
logger.error("Internal model returned no response")
return None
logger.info(f"Internal model response: {response_content[:200]}..." if len(response_content) > 200 else f"Internal model response: {response_content}")
# Extract model selection from response
model_id = self._extract_model_selection(response_content)
if model_id:
logger.info(f"=== AUTOSELECT MODEL SELECTION SUCCESS ===")
logger.info(f"Selected model ID: {model_id}")
else:
logger.warning(f"=== AUTOSELECT MODEL SELECTION FAILED ===")
logger.warning(f"Could not extract model ID from internal model response")
return model_id
# Check if it's a rotation
elif selection_model in self.config.rotations:
logger.info(f"Selection model '{selection_model}' is a rotation")
rotation_handler = RotationHandler()
response = await rotation_handler.handle_rotation_request(selection_model, selection_request)
# Check if it's a provider/model format (e.g., "gemini/gemini-pro")
elif '/' in selection_model:
provider_id, model_name = selection_model.split('/', 1)
logger.info(f"Selection model '{selection_model}' is a direct provider model")
logger.info(f" Provider: {provider_id}, Model: {model_name}")
if provider_id not in self.config.providers:
logger.error(f"Provider '{provider_id}' not found in configuration")
return None
# Use the direct provider handler
request_handler = RequestHandler()
selection_request['model'] = model_name
response = await request_handler.handle_chat_completion(
request=None, # No HTTP request object needed
provider_id=provider_id,
request_data=selection_request
)
# Check if it's just a provider ID (use any model from that provider)
elif selection_model in self.config.providers:
logger.info(f"Selection model '{selection_model}' is a provider (will use first available model)")
provider_config = self.config.get_provider(selection_model)
# Get first available model from provider
if provider_config.models and len(provider_config.models) > 0:
model_name = provider_config.models[0].name
logger.info(f" Using model: {model_name}")
request_handler = RequestHandler()
selection_request['model'] = model_name
response = await request_handler.handle_chat_completion(
request=None,
provider_id=selection_model,
request_data=selection_request
)
else:
logger.error(f"Provider '{selection_model}' has no models configured")
return None
else:
logger.error(f"Selection model '{selection_model}' not found in rotations or providers")
return None
logger.info(f"Selection response received")
content = response.get('choices', [{}])[0].get('message', {}).get('content', '')
......@@ -2017,16 +2262,38 @@ class AutoselectHandler:
logger.info(f"User messages count: {len(user_messages)}")
# Build a string representation of the user prompt
# Limit to last 10 messages or 8000 tokens, whichever comes first
MAX_SELECTION_MESSAGES = 10
MAX_SELECTION_TOKENS = 8000
# Take the last N messages
limited_messages = user_messages[-MAX_SELECTION_MESSAGES:] if len(user_messages) > MAX_SELECTION_MESSAGES else user_messages
logger.info(f"Limited to last {len(limited_messages)} messages for selection")
# Build prompt and check token count
user_prompt = ""
for msg in user_messages:
final_messages = []
for msg in limited_messages:
role = msg.get('role', 'user')
content = msg.get('content', '')
if isinstance(content, list):
# Handle complex content (e.g., with images)
content = str(content)
user_prompt += f"{role}: {content}\n"
logger.info(f"User prompt length: {len(user_prompt)} characters")
# Check if adding this message would exceed token limit
test_prompt = user_prompt + f"{role}: {content}\n"
# Use a simple token estimation (rough approximation: 1 token ≈ 4 chars)
estimated_tokens = len(test_prompt) // 4
if estimated_tokens > MAX_SELECTION_TOKENS:
logger.info(f"Reached token limit ({estimated_tokens} > {MAX_SELECTION_TOKENS}), stopping at {len(final_messages)} messages")
break
user_prompt = test_prompt
final_messages.append(msg)
logger.info(f"Final message count for selection: {len(final_messages)}")
logger.info(f"User prompt length: {len(user_prompt)} characters (est. {len(user_prompt) // 4} tokens)")
logger.info(f"User prompt preview: {user_prompt[:200]}..." if len(user_prompt) > 200 else f"User prompt: {user_prompt}")
# Build the autoselect prompt
......@@ -2099,15 +2366,37 @@ class AutoselectHandler:
logger.info(f"User messages count: {len(user_messages)}")
# Build a string representation of the user prompt
# Limit to last 10 messages or 8000 tokens, whichever comes first
MAX_SELECTION_MESSAGES = 10
MAX_SELECTION_TOKENS = 8000
# Take the last N messages
limited_messages = user_messages[-MAX_SELECTION_MESSAGES:] if len(user_messages) > MAX_SELECTION_MESSAGES else user_messages
logger.info(f"Limited to last {len(limited_messages)} messages for selection")
# Build prompt and check token count
user_prompt = ""
for msg in user_messages:
final_messages = []
for msg in limited_messages:
role = msg.get('role', 'user')
content = msg.get('content', '')
if isinstance(content, list):
content = str(content)
user_prompt += f"{role}: {content}\n"
logger.info(f"User prompt length: {len(user_prompt)} characters")
# Check if adding this message would exceed token limit
test_prompt = user_prompt + f"{role}: {content}\n"
# Use a simple token estimation (rough approximation: 1 token ≈ 4 chars)
estimated_tokens = len(test_prompt) // 4
if estimated_tokens > MAX_SELECTION_TOKENS:
logger.info(f"Reached token limit ({estimated_tokens} > {MAX_SELECTION_TOKENS}), stopping at {len(final_messages)} messages")
break
user_prompt = test_prompt
final_messages.append(msg)
logger.info(f"Final message count for selection: {len(final_messages)}")
logger.info(f"User prompt length: {len(user_prompt)} characters (est. {len(user_prompt) // 4} tokens)")
logger.info(f"User prompt preview: {user_prompt[:200]}..." if len(user_prompt) > 200 else f"User prompt: {user_prompt}")
# Build the autoselect prompt
......@@ -2140,21 +2429,63 @@ class AutoselectHandler:
logger.info(f"Selection method: {'AI-selected' if selected_model_id != autoselect_config.fallback else 'Fallback'}")
logger.info(f"Request mode: Streaming")
# Now proxy the actual streaming request to the selected rotation
# The rotation handler will return a StreamingResponse with proper handling
# based on the selected provider's type (google vs others)
# Proxy the streaming request to the selected model (rotation or direct provider)
try:
# Ensure stream is set to True
request_data['stream'] = True
# Check if it's a rotation first
if selected_model_id in self.config.rotations:
logger.info(f"Proxying streaming request to rotation: {selected_model_id}")
rotation_handler = RotationHandler()
# The rotation handler handles streaming internally and returns a StreamingResponse
response = await rotation_handler.handle_rotation_request(
selected_model_id,
{**request_data, "stream": True}
)
response = await rotation_handler.handle_rotation_request(selected_model_id, request_data)
# Check if it's a provider/model format (e.g., "gemini/gemini-pro")
elif '/' in selected_model_id:
provider_id, model_name = selected_model_id.split('/', 1)
logger.info(f"Proxying streaming request to direct provider model: {selected_model_id}")
logger.info(f" Provider: {provider_id}, Model: {model_name}")
if provider_id not in self.config.providers:
logger.error(f"Provider '{provider_id}' not found in configuration")
raise HTTPException(status_code=400, detail=f"Provider {provider_id} not found")
# Use the direct provider handler
request_handler = RequestHandler()
request_data['model'] = model_name
response = await request_handler.handle_streaming_chat_completion(
request=None,
provider_id=provider_id,
request_data=request_data
)
# Check if it's just a provider ID (use first available model)
elif selected_model_id in self.config.providers:
logger.info(f"Proxying streaming request to provider: {selected_model_id} (will use first available model)")
provider_config = self.config.get_provider(selected_model_id)
# Get first available model from provider
if provider_config.models and len(provider_config.models) > 0:
model_name = provider_config.models[0].name
logger.info(f" Using model: {model_name}")
request_handler = RequestHandler()
request_data['model'] = model_name
response = await request_handler.handle_streaming_chat_completion(
request=None,
provider_id=selected_model_id,
request_data=request_data
)
else:
logger.error(f"Provider '{selected_model_id}' has no models configured")
raise HTTPException(status_code=400, detail=f"Provider {selected_model_id} has no models configured")
else:
logger.error(f"Selected model '{selected_model_id}' not found in rotations or providers")
raise HTTPException(status_code=400, detail=f"Model {selected_model_id} not found")
logger.info(f"=== AUTOSELECT STREAMING REQUEST END ===")
# Return the StreamingResponse directly - rotation handler already handled the conversion
return response
except Exception as e:
logger.error(f"Error proxying to selected model: {str(e)}", exc_info=True)
raise
async def handle_autoselect_model_list(self, autoselect_id: str) -> List[Dict]:
"""List the available models for an autoselect endpoint"""
......
......@@ -11,6 +11,7 @@
"endpoint": "https://generativelanguage.googleapis.com/v1beta",
"type": "google",
"api_key_required": true,
"api_key": "YOUR_GEMINI_API_KEY",
"rate_limit": 0,
"models": [
{
......@@ -43,6 +44,7 @@
"endpoint": "https://api.openai.com/v1",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_OPENAI_API_KEY",
"rate_limit": 0
},
"anthropic": {
......@@ -51,6 +53,7 @@
"endpoint": "https://api.anthropic.com/v1",
"type": "anthropic",
"api_key_required": true,
"api_key": "YOUR_ANTHROPIC_API_KEY",
"rate_limit": 0
},
"ollama": {
......@@ -67,6 +70,7 @@
"endpoint": "https://your-azure-endpoint.openai.azure.com",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_AZURE_OPENAI_API_KEY",
"rate_limit": 0
},
"cohere": {
......@@ -75,6 +79,7 @@
"endpoint": "https://api.cohere.com/v1",
"type": "cohere",
"api_key_required": true,
"api_key": "YOUR_COHERE_API_KEY",
"rate_limit": 0
},
"huggingface": {
......@@ -83,6 +88,7 @@
"endpoint": "https://api-inference.huggingface.co",
"type": "huggingface",
"api_key_required": true,
"api_key": "YOUR_HUGGINGFACE_API_KEY",
"rate_limit": 0
},
"replicate": {
......@@ -91,6 +97,7 @@
"endpoint": "https://api.replicate.com/v1",
"type": "replicate",
"api_key_required": true,
"api_key": "YOUR_REPLICATE_API_KEY",
"rate_limit": 0
},
"togetherai": {
......@@ -99,6 +106,7 @@
"endpoint": "https://api.together.xyz/v1",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_TOGETHERAI_API_KEY",
"rate_limit": 0
},
"groq": {
......@@ -107,6 +115,7 @@
"endpoint": "https://api.groq.com/openai/v1",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_GROQ_API_KEY",
"rate_limit": 0
},
"mistralai": {
......@@ -115,6 +124,7 @@
"endpoint": "https://api.mistral.ai/v1",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_MISTRALAI_API_KEY",
"rate_limit": 0
},
"stabilityai": {
......@@ -123,6 +133,7 @@
"endpoint": "https://api.stability.ai/v2beta",
"type": "stabilityai",
"api_key_required": true,
"api_key": "YOUR_STABILITYAI_API_KEY",
"rate_limit": 0
},
"kilo": {
......@@ -131,6 +142,7 @@
"endpoint": "https://kilocode.ai/api/openrouter",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_KILO_API_KEY",
"rate_limit": 0
},
"perplexity": {
......@@ -139,6 +151,7 @@
"endpoint": "https://api.perplexity.ai",
"type": "openai",
"api_key_required": true,
"api_key": "YOUR_PERPLEXITY_API_KEY",
"rate_limit": 0
},
"poe": {
......@@ -147,6 +160,7 @@
"endpoint": "https://api.poe.com/v1",
"type": "poe",
"api_key_required": true,
"api_key": "YOUR_POE_API_KEY",
"rate_limit": 0
},
"lanai": {
......@@ -155,6 +169,7 @@
"endpoint": "https://api.lanai.ai/v1",
"type": "lanai",
"api_key_required": true,
"api_key": "YOUR_LANAI_API_KEY",
"rate_limit": 0
},
"amazon": {
......@@ -163,6 +178,7 @@
"endpoint": "https://api.amazon.com/bedrock/v1",
"type": "amazon",
"api_key_required": true,
"api_key": "YOUR_AMAZON_API_KEY",
"rate_limit": 0
},
"ibm": {
......@@ -171,6 +187,7 @@
"endpoint": "https://api.ibm.com/watson/v1",
"type": "ibm",
"api_key_required": true,
"api_key": "YOUR_IBM_API_KEY",
"rate_limit": 0
},
"microsoft": {
......@@ -179,6 +196,7 @@
"endpoint": "https://api.microsoft.com/v1",
"type": "microsoft",
"api_key_required": true,
"api_key": "YOUR_MICROSOFT_API_KEY",
"rate_limit": 0
},
"kiro": {
......
......@@ -7,7 +7,6 @@
"providers": [
{
"provider_id": "gemini",
"api_key": "YOUR_GEMINI_API_KEY",
"models": [
{
"name": "gemini-2.0-flash",
......@@ -37,7 +36,6 @@
},
{
"provider_id": "openai",
"api_key": "YOUR_OPENAI_API_KEY",
"models": [
{
"name": "gpt-4",
......@@ -61,7 +59,6 @@
},
{
"provider_id": "anthropic",
"api_key": "YOUR_ANTHROPIC_API_KEY",
"models": [
{
"name": "claude-3-5-sonnet-20241022",
......@@ -91,7 +88,6 @@
"providers": [
{
"provider_id": "gemini",
"api_key": "YOUR_GEMINI_API_KEY",
"models": [
{
"name": "gemini-1.5-pro",
......@@ -107,7 +103,6 @@
},
{
"provider_id": "openai",
"api_key": "YOUR_OPENAI_API_KEY",
"models": [
{
"name": "gpt-4",
......@@ -129,7 +124,6 @@
"providers": [
{
"provider_id": "gemini",
"api_key": "YOUR_GEMINI_API_KEY",
"models": [
{
"name": "gemini-2.0-flash",
......@@ -151,7 +145,6 @@
"providers": [
{
"provider_id": "kiro",
"api_key": "YOUR_KIRO_GATEWAY_API_KEY",
"models": [
{
"name": "claude-sonnet-4-5",
......
......@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "aisbf"
version = "0.3.3"
version = "0.4.0"
description = "AISBF - AI Service Broker Framework || AI Should Be Free - A modular proxy server for managing multiple AI provider integrations"
readme = "README.md"
license = "GPL-3.0-or-later"
......
......@@ -49,7 +49,7 @@ class InstallCommand(_install):
setup(
name="aisbf",
version="0.3.3",
version="0.4.0",
author="AISBF Contributors",
author_email="stefy@nexlab.net",
description="AISBF - AI Service Broker Framework || AI Should Be Free - A modular proxy server for managing multiple AI provider integrations",
......
kiro-gateway @ e6f23c22
Subproject commit e6f23c22fc5e9aa7a22e4c31af56cdc6f859afbd