Add context management feature with automatic condensation

- Add context_size, condense_context, and condense_method fields to Model class
- Create new context.py module with ContextManager and condensation methods
- Implement hierarchical, conversational, semantic, and algorithmic condensation
- Calculate and report effective_context for all requests
- Update handlers.py to apply context condensation when configured
- Update providers.json and rotations.json with example context configurations
- Update README.md and DOCUMENTATION.md with context management documentation
- Export context module and utilities in __init__.py
parent 8bad912b
@@ -264,6 +264,104 @@ When using autoselect models:
- **User Experience**: Provide optimal responses without manual model selection
- **Adaptive Selection**: Dynamically adjust model selection based on request characteristics
## Context Management
AISBF provides intelligent context management to handle large conversation histories and prevent exceeding model context limits:
### How Context Management Works
Context management automatically monitors and condenses conversation context:
1. **Effective Context Tracking**: Calculates and reports the total tokens used (`effective_context`) for every request
2. **Automatic Condensation**: When the context exceeds the configured percentage of the model's `context_size`, condensation is triggered
3. **Multiple Condensation Methods**: Supports hierarchical, conversational, semantic, and algorithmic condensation
4. **Method Chaining**: Multiple condensation methods can be applied in sequence for optimal results
### Context Configuration
Models can be configured with context management fields:
```json
{
"models": [
{
"name": "gemini-2.0-flash",
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
}
]
}
```
**Configuration Fields:**
- **`context_size`**: Maximum context size in tokens for the model
- **`condense_context`**: Percentage (0-100) at which to trigger condensation. 0 means disabled
- **`condense_method`**: String or list of strings specifying condensation method(s)
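With the example configuration above (`context_size` of 1,000,000 and `condense_context` of 80), condensation triggers once the prompt reaches 800,000 tokens. A minimal sketch of the trigger arithmetic (the helper name is illustrative):

```python
def should_condense(current_tokens: int, context_size: int, condense_pct: int) -> bool:
    """Return True when the context has reached the condensation threshold."""
    if condense_pct == 0:  # 0 disables condensation entirely
        return False
    threshold = int(context_size * (condense_pct / 100))
    return current_tokens >= threshold

# With the example configuration the threshold is 800,000 tokens
print(should_condense(799_999, 1_000_000, 80))  # False
print(should_condense(800_000, 1_000_000, 80))  # True
```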
### Condensation Methods
#### 1. Hierarchical Context Engineering
Separates context into persistent (long-term facts) and transient (immediate task) layers:
- **Persistent State**: Architecture, project state, core principles
- **Recent History**: Summarized conversation history
- **Active Code**: High-fidelity current code
- **Instruction**: Current task/goal
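The layering can be sketched over OpenAI-style message dicts; the helper name and the six-message recency window here are illustrative, not the exact implementation:

```python
from typing import Dict, List

def split_layers(messages: List[Dict], recent_count: int = 6) -> Dict[str, List[Dict]]:
    """Separate a chat history into hierarchical layers (illustrative sketch)."""
    persistent = [m for m in messages if m.get("role") == "system"]  # long-term facts
    transient = [m for m in messages if m.get("role") != "system"]
    return {
        "persistent": persistent,              # architecture, core principles
        "recent": transient[-recent_count:],   # high-fidelity active context
        "middle": transient[:-recent_count],   # candidates for summarization
    }

history = [{"role": "system", "content": "project rules"}] + [
    {"role": "user", "content": f"q{i}"} for i in range(8)
]
layers = split_layers(history)
print(len(layers["persistent"]), len(layers["middle"]), len(layers["recent"]))  # 1 2 6
```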
#### 2. Conversational Summarization (Memory Buffering)
Replaces old messages with high-density summaries:
- Uses a smaller model to summarize conversation progress
- Maintains continuity without hitting token caps
- Preserves key facts, decisions, and current goals
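A sketch of the buffering step, with `summarize` standing in for the call to a smaller model (the callable and the four-message window are assumptions for illustration):

```python
from typing import Callable, Dict, List

def buffer_memory(messages: List[Dict], summarize: Callable[[str], str],
                  keep_recent: int = 4) -> List[Dict]:
    """Replace old turns with one high-density summary message (sketch)."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to summarize
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = {"role": "system",
               "content": "[CONVERSATION SUMMARY]\n" + summarize(transcript)}
    return system + [summary] + recent
```

Storing the summary as a system message keeps the continuity visible to the model on every later turn.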
#### 3. Semantic Context Pruning (Observation Masking)
Removes irrelevant details based on current query:
- Uses a smaller "janitor" model to extract relevant facts
- Can reduce history by 50-80% without losing critical information
- Focuses on information relevant to the specific current request
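The prompt handed to the "janitor" model can be sketched as follows; the wording mirrors the implementation, while the helper name is illustrative:

```python
def build_prune_prompt(history_text: str, current_query: str) -> str:
    """Build the pruning prompt for the smaller 'janitor' model (sketch)."""
    return (
        f"Given the current query: '{current_query}'\n"
        "Extract ONLY the relevant facts from this conversation history. "
        "Ignore everything else that is not directly related to answering "
        "the current query.\n\n"
        f"Conversation History:\n{history_text}\n\n"
        "Provide only the relevant information in a concise format."
    )

prompt = build_prune_prompt("user: we chose PostgreSQL\nassistant: noted",
                            "Which database did we pick?")
```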
#### 4. Algorithmic Token Compression
Mathematical compression for technical data and logs:
- Similar to LLMLingua compression
- Achieves up to 20x compression for technical data
- Removes low-information tokens systematically
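A simplified version of this step collapses whitespace runs and drops exact consecutive repeats; real LLMLingua-style compressors go further and score token informativeness with a model:

```python
from typing import Dict, List

def compress_messages(messages: List[Dict]) -> List[Dict]:
    """Drop low-information content: redundant whitespace and exact repeats."""
    out: List[Dict] = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            content = " ".join(content.split())  # collapse whitespace runs
        if out and out[-1].get("role") == msg.get("role") \
                and out[-1].get("content") == content:
            continue  # skip exact consecutive duplicates from the same role
        out.append({"role": msg.get("role"), "content": content})
    return out

print(compress_messages([
    {"role": "user", "content": "hello   \n\n world"},
    {"role": "user", "content": "hello world"},
]))  # [{'role': 'user', 'content': 'hello world'}]
```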
### Effective Context Reporting
All responses include `effective_context` in the usage field:
**Non-streaming responses:**
```json
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500,
"effective_context": 1000
}
}
```
**Streaming responses:**
The final chunk includes effective_context:
```json
{
"usage": {
"prompt_tokens": null,
"completion_tokens": null,
"total_tokens": null,
"effective_context": 1000
}
}
```
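Client side, the final chunk can be recognized by its non-null `finish_reason`. A sketch of pulling `effective_context` out of an SSE stream (this assumes the standard `data: {...}` line convention shown above):

```python
import json
from typing import Iterable, Optional

def final_effective_context(sse_lines: Iterable[str]) -> Optional[int]:
    """Return effective_context from the chunk carrying finish_reason."""
    for line in sse_lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason") is not None:
            return chunk.get("usage", {}).get("effective_context")
    return None
```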
### Example Use Cases
- **Long Conversations**: Maintain context across extended conversations without hitting limits
- **Code Analysis**: Handle large codebases with intelligent context pruning
- **Document Processing**: Process large documents with automatic summarization
- **Multi-turn Tasks**: Maintain task context across multiple interactions
## Error Tracking and Rate Limiting
### Error Tracking
@@ -399,7 +497,7 @@ Stops running daemon and removes PID file.
- `Message` - Chat message structure
- `ChatCompletionRequest` - Request model
- `ChatCompletionResponse` - Response model
- `Model` - Model information (includes context_size, condense_context, condense_method fields)
- `Provider` - Provider information
- `ErrorTracking` - Error tracking data
@@ -411,9 +509,13 @@ Stops running daemon and removes PID file.
- `OllamaProviderHandler` - Ollama provider implementation
- `get_provider_handler()` - Factory function for provider handlers
### aisbf/context.py
- `ContextManager` - Context management class for automatic condensation
- `get_context_config_for_model()` - Retrieves context configuration from provider or rotation model config
### aisbf/handlers.py
- `RequestHandler` - Request handling logic with streaming support and context management
- `RotationHandler` - Rotation handling logic with streaming support and context management
- `AutoselectHandler` - AI-assisted model selection with streaming support
## Dependencies
@@ -426,6 +528,8 @@ Key dependencies from requirements.txt:
- google-genai - Google AI SDK
- openai - OpenAI SDK
- anthropic - Anthropic SDK
- langchain-text-splitters - Intelligent text splitting for request chunking
- tiktoken - Accurate token counting for context management
## Adding New Providers
...
@@ -13,6 +13,8 @@ A modular proxy server for managing multiple AI provider integrations with unified API
- **Request Splitting**: Automatic splitting of large requests when exceeding `max_request_tokens` limit
- **Token Rate Limiting**: Per-model token usage tracking with TPM (tokens per minute), TPH (tokens per hour), and TPD (tokens per day) limits
- **Automatic Provider Disabling**: Providers automatically disabled when token rate limits are exceeded
- **Context Management**: Automatic context condensation when approaching model limits with multiple condensation methods
- **Effective Context Tracking**: Reports total tokens used (effective_context) for every request
## Author
@@ -82,12 +84,24 @@ Models can be configured with the following optional fields:
- **`rate_limit_TPM`**: Maximum tokens allowed per minute (Tokens Per Minute)
- **`rate_limit_TPH`**: Maximum tokens allowed per hour (Tokens Per Hour)
- **`rate_limit_TPD`**: Maximum tokens allowed per day (Tokens Per Day)
- **`context_size`**: Maximum context size in tokens for the model. Used to determine when to trigger context condensation.
- **`condense_context`**: Percentage (0-100) at which to trigger context condensation. 0 means disabled, any other value triggers condensation when context reaches this percentage of context_size.
- **`condense_method`**: String or list of strings specifying condensation method(s). Supported values: "hierarchical", "conversational", "semantic", "algorithmic". Multiple methods can be chained together.
When token rate limits are exceeded, providers are automatically disabled:
- TPM limit exceeded: Provider disabled for 1 minute
- TPH limit exceeded: Provider disabled for 1 hour
- TPD limit exceeded: Provider disabled for 1 day
### Context Condensation Methods
When context exceeds the configured percentage of `context_size`, the system automatically condenses the prompt using one or more methods:
1. **Hierarchical**: Separates context into persistent (long-term facts) and transient (immediate task) layers
2. **Conversational**: Summarizes old messages using a smaller model to maintain conversation continuity
3. **Semantic**: Prunes irrelevant context based on current query using a smaller "janitor" model
4. **Algorithmic**: Uses mathematical compression for technical data and logs (similar to LLMLingua)
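For instance, a provider or rotation model entry chaining two methods might look like this (illustrative values):

```json
{
  "name": "gemini-2.0-flash",
  "context_size": 1000000,
  "condense_context": 80,
  "condense_method": ["conversational", "algorithmic"]
}
```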
See `config/providers.json` and `config/rotations.json` for configuration examples.
## API Endpoints
...
@@ -24,6 +24,7 @@ A modular proxy server for managing multiple AI provider integrations.
"""
from .config import config, Config, ProviderConfig, RotationConfig, AppConfig, AutoselectConfig, AutoselectModelInfo
from .context import ContextManager, get_context_config_for_model
from .models import (
Message,
ChatCompletionRequest,
@@ -42,6 +43,7 @@ from .providers import (
PROVIDER_HANDLERS
)
from .handlers import RequestHandler, RotationHandler, AutoselectHandler
from .utils import count_messages_tokens, split_messages_into_chunks, get_max_request_tokens_for_model
__version__ = "0.3.0"
__all__ = [
@@ -74,4 +76,11 @@ __all__ = [
"RequestHandler",
"RotationHandler",
"AutoselectHandler",
# Context
"ContextManager",
"get_context_config_for_model",
# Utils
"count_messages_tokens",
"split_messages_into_chunks",
"get_max_request_tokens_for_model",
]
"""
Copyleft (C) 2026 Stefy Lanza <stefy@nexlab.net>
AISBF - AI Service Broker Framework || AI Should Be Free
Context management and condensation for AISBF.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Why did the programmer quit his job? Because he didn't get arrays!
"""
import logging
from typing import Dict, List, Optional, Union, Any
from .utils import count_messages_tokens
class ContextManager:
"""
Manages context size and performs condensation when needed.
"""
def __init__(self, model_config: Dict, provider_handler=None):
"""
Initialize the context manager.
Args:
model_config: Model configuration dictionary containing context_size, condense_context, condense_method
provider_handler: Optional provider handler for making summarization requests
"""
self.context_size = model_config.get('context_size')
# Stored as condense_pct so the percentage does not shadow the condense_context() method below
self.condense_pct = model_config.get('condense_context', 0)
self.condense_method = model_config.get('condense_method')
self.provider_handler = provider_handler
# Clamp condense_pct to the 0-100 range
if self.condense_pct and self.condense_pct > 100:
self.condense_pct = 100
# Normalize condense_method to a list
if self.condense_method:
if isinstance(self.condense_method, str):
self.condense_method = [self.condense_method]
else:
self.condense_method = []
# Track conversation history for summarization
self.conversation_summary = None
self.summary_token_count = 0
logger = logging.getLogger(__name__)
logger.info("ContextManager initialized:")
logger.info(f"  context_size: {self.context_size}")
logger.info(f"  condense_context: {self.condense_pct}%")
logger.info(f"  condense_method: {self.condense_method}")
def should_condense(self, messages: List[Dict], model: str) -> bool:
"""
Check if context condensation is needed.
Args:
messages: List of messages to check
model: Model name for token counting
Returns:
True if condensation is needed, False otherwise
"""
if not self.context_size or not self.condense_pct:
return False
# Calculate current token count
current_tokens = count_messages_tokens(messages, model)
# Condense once usage reaches the configured percentage of context_size
threshold = int(self.context_size * (self.condense_pct / 100))
logger = logging.getLogger(__name__)
logger.info(f"Context check: {current_tokens} / {self.context_size} tokens (threshold: {threshold})")
return current_tokens >= threshold
async def condense_context(
self,
messages: List[Dict],
model: str,
current_query: Optional[str] = None
) -> List[Dict]:
"""
Condense the context using configured methods.
Args:
messages: List of messages to condense
model: Model name for token counting
current_query: Optional current query for semantic pruning
Returns:
Condensed list of messages
"""
logger = logging.getLogger(__name__)
logger.info(f"=== CONTEXT CONDENSATION START ===")
logger.info(f"Original messages count: {len(messages)}")
logger.info(f"Condensation methods: {self.condense_method}")
condensed_messages = messages.copy()
# Apply each condensation method in sequence
for method in self.condense_method:
logger.info(f"Applying method: {method}")
if method == "hierarchical":
condensed_messages = self._hierarchical_condense(condensed_messages, model)
elif method == "conversational":
condensed_messages = await self._conversational_condense(condensed_messages, model)
elif method == "semantic":
condensed_messages = await self._semantic_condense(condensed_messages, model, current_query)
elif method == "algorithmic":
condensed_messages = self._algorithmic_condense(condensed_messages, model)
else:
logger.warning(f"Unknown condensation method: {method}")
# Calculate token reduction
original_tokens = count_messages_tokens(messages, model)
condensed_tokens = count_messages_tokens(condensed_messages, model)
reduction = original_tokens - condensed_tokens
reduction_pct = (reduction / original_tokens * 100) if original_tokens > 0 else 0
logger.info(f"=== CONTEXT CONDENSATION END ===")
logger.info(f"Original tokens: {original_tokens}")
logger.info(f"Condensed tokens: {condensed_tokens}")
logger.info(f"Reduction: {reduction} tokens ({reduction_pct:.1f}%)")
logger.info(f"Final messages count: {len(condensed_messages)}")
return condensed_messages
def _hierarchical_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
HIERARCHICAL CONTEXT ENGINEERING
Separate context into 'Persistent' (long-term facts) and 'Transient' (immediate task).
Uses "Step-Back Prompting" to identify core principles before answering.
Structure:
- PERSISTENT STATE (Architecture): System messages and early context
- RECENT HISTORY (Summarized): Middle messages
- ACTIVE CODE (High Fidelity): Recent messages
- INSTRUCTION: Current task
"""
logger = logging.getLogger(__name__)
logger.info(f"Hierarchical condensation: {len(messages)} messages")
if len(messages) <= 2:
# Not enough messages to condense
return messages
# Separate messages into categories
system_messages = [m for m in messages if m.get('role') == 'system']
user_messages = [m for m in messages if m.get('role') == 'user']
assistant_messages = [m for m in messages if m.get('role') == 'assistant']
# Keep all system messages (persistent state)
persistent = system_messages.copy()
# Keep recent messages at high fidelity (roughly the last 3 exchanges)
recent_count = min(6, len(user_messages) + len(assistant_messages))
# Get the last few messages in their original order
all_messages_except_system = [m for m in messages if m.get('role') != 'system']
recent_messages = all_messages_except_system[-recent_count:]
# Middle messages form the transient layer; dropping them is what actually reduces the token count
# (chain the conversational or semantic methods to summarize them instead of dropping)
middle_messages = all_messages_except_system[:-recent_count]
condensed = persistent + recent_messages
logger.info(f"Hierarchical: {len(persistent)} persistent, {len(middle_messages)} dropped, {len(recent_messages)} recent")
return condensed
async def _conversational_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
CONVERSATIONAL SUMMARIZATION (MEMORY BUFFERING)
Replace old messages with a high-density summary.
Uses a maintenance prompt to summarize progress.
"""
logger = logging.getLogger(__name__)
logger.info(f"Conversational condensation: {len(messages)} messages")
if not self.provider_handler:
logger.warning("No provider handler available for conversational condensation, skipping")
return messages
if len(messages) <= 4:
# Not enough messages to condense
return messages
# Keep system messages
system_messages = [m for m in messages if m.get('role') == 'system']
# Keep last 2 exchanges (4 messages)
recent_messages = messages[-4:]
# Messages to summarize (everything between system and recent)
messages_to_summarize = messages[len(system_messages):-4]
if not messages_to_summarize:
return messages
# Build summary prompt
summary_prompt = "Summarize the following conversation history, including key facts, decisions, and the current goal. Keep it concise but comprehensive.\n\n"
for msg in messages_to_summarize:
role = msg.get('role', 'unknown')
content = msg.get('content', '')
if content:
summary_prompt += f"{role}: {content}\n"
try:
# Request summary from the model
summary_messages = [{"role": "user", "content": summary_prompt}]
summary_response = await self.provider_handler.handle_request(
model=model,
messages=summary_messages,
max_tokens=1000,
temperature=0.3,
stream=False
)
# Extract summary content
if isinstance(summary_response, dict):
summary_content = summary_response.get('choices', [{}])[0].get('message', {}).get('content', '')
if summary_content:
# Create summary message
summary_message = {
"role": "system",
"content": f"[CONVERSATION SUMMARY]\n{summary_content}"
}
# Build condensed messages: system + summary + recent
condensed = system_messages + [summary_message] + recent_messages
# Update stored summary
self.conversation_summary = summary_content
self.summary_token_count = count_messages_tokens([summary_message], model)
logger.info(f"Conversational: Created summary ({len(summary_content)} chars)")
return condensed
except Exception as e:
logger.error(f"Error during conversational condensation: {e}")
# Fallback: return original messages
return messages
async def _semantic_condense(
self,
messages: List[Dict],
model: str,
current_query: Optional[str] = None
) -> List[Dict]:
"""
SEMANTIC CONTEXT PRUNING (OBSERVATION MASKING)
Remove or hide old, non-critical details that are irrelevant to the current task.
Uses a smaller model as a "janitor" to extract only relevant info.
"""
logger = logging.getLogger(__name__)
logger.info(f"Semantic condensation: {len(messages)} messages")
if not self.provider_handler:
logger.warning("No provider handler available for semantic condensation, skipping")
return messages
if len(messages) <= 2:
return messages
# Keep system messages
system_messages = [m for m in messages if m.get('role') == 'system']
# Get conversation history (excluding system)
conversation = [m for m in messages if m.get('role') != 'system']
if not conversation:
return messages
# Build conversation text
conversation_text = ""
for msg in conversation:
role = msg.get('role', 'unknown')
content = msg.get('content', '')
if content:
conversation_text += f"{role}: {content}\n"
# Build pruning prompt
if current_query:
prune_prompt = f"""Given the current query: '{current_query}'
Extract ONLY the relevant facts from this conversation history. Ignore everything else that is not directly related to answering the current query.
Conversation History:
{conversation_text}
Provide only the relevant information in a concise format."""
else:
prune_prompt = f"""Extract the most important and relevant information from this conversation history. Focus on key facts, decisions, and context that would be needed for future queries.
Conversation History:
{conversation_text}
Provide only the relevant information in a concise format."""
try:
# Request pruned context from the model
prune_messages = [{"role": "user", "content": prune_prompt}]
prune_response = await self.provider_handler.handle_request(
model=model,
messages=prune_messages,
max_tokens=2000,
temperature=0.2,
stream=False
)
# Extract pruned content
if isinstance(prune_response, dict):
pruned_content = prune_response.get('choices', [{}])[0].get('message', {}).get('content', '')
if pruned_content:
# Create pruned context message
pruned_message = {
"role": "system",
"content": f"[RELEVANT CONTEXT]\n{pruned_content}"
}
# Build condensed messages: system + pruned + last user message
last_message = messages[-1] if messages else None
if last_message and last_message.get('role') != 'system':
condensed = system_messages + [pruned_message, last_message]
else:
condensed = system_messages + [pruned_message]
logger.info(f"Semantic: Pruned to relevant context ({len(pruned_content)} chars)")
return condensed
except Exception as e:
logger.error(f"Error during semantic condensation: {e}")
# Fallback: return original messages
return messages
def _algorithmic_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
ALGORITHMIC TOKEN COMPRESSION (LLMLingua-style)
Mathematically remove "low-information" tokens.
This is a simplified version that removes redundant content.
"""
logger = logging.getLogger(__name__)
logger.info(f"Algorithmic condensation: {len(messages)} messages")
condensed = []
for msg in messages:
role = msg.get('role')
content = msg.get('content')
if not content:
condensed.append(msg)
continue
# Skip very short messages (low information density); empty-content messages were already kept above
if len(str(content)) < 10:
continue
# Remove duplicate consecutive messages from same role
if condensed and condensed[-1].get('role') == role:
prev_content = str(condensed[-1].get('content', ''))
curr_content = str(content)
# If very similar, skip
if prev_content == curr_content:
logger.debug(f"Skipping duplicate message from {role}")
continue
# Remove excessive whitespace
if isinstance(content, str):
content = ' '.join(content.split())
condensed.append({
"role": role,
"content": content
})
logger.info(f"Algorithmic: Reduced from {len(messages)} to {len(condensed)} messages")
return condensed
def get_context_config_for_model(
model_name: str,
provider_config: Any = None,
rotation_model_config: Optional[Dict] = None
) -> Dict:
"""
Get context configuration for a specific model.
Args:
model_name: Name of the model
provider_config: Provider configuration (optional)
rotation_model_config: Rotation model configuration (optional)
Returns:
Dictionary with context_size, condense_context, and condense_method
"""
context_config = {
'context_size': None,
'condense_context': 0,
'condense_method': None
}
# Check rotation model config first (highest priority)
if rotation_model_config:
context_config['context_size'] = rotation_model_config.get('context_size')
context_config['condense_context'] = rotation_model_config.get('condense_context', 0)
context_config['condense_method'] = rotation_model_config.get('condense_method')
# Fall back to provider config
elif provider_config and hasattr(provider_config, 'models'):
for model in provider_config.models:
if model.get('name') == model_name:
context_config['context_size'] = model.get('context_size')
context_config['condense_context'] = model.get('condense_context', 0)
context_config['condense_method'] = model.get('condense_method')
break
return context_config
@@ -39,6 +39,7 @@ from .utils import (
split_messages_into_chunks,
get_max_request_tokens_for_model
)
from .context import ContextManager, get_context_config_for_model
def generate_system_fingerprint(provider_id: str, seed: Optional[int] = None) -> str:
@@ -241,6 +242,14 @@ class RequestHandler:
logger.info(f"Temperature: {request_data.get('temperature', 1.0)}")
logger.info(f"Stream: {request_data.get('stream', False)}")
# Get context configuration
context_config = get_context_config_for_model(
model_name=model,
provider_config=provider_config,
rotation_model_config=None
)
logger.info(f"Context config: {context_config}")
# Check for max_request_tokens in provider config
max_request_tokens = get_max_request_tokens_for_model(
model_name=model,
@@ -248,6 +257,19 @@ class RequestHandler:
rotation_model_config=None
)
# Calculate effective context (total tokens used)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Effective context: {effective_context} tokens")
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model):
logger.info("Context condensation triggered")
messages = await context_manager.condense_context(messages, model)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Condensed effective context: {effective_context} tokens")
if max_request_tokens:
# Count tokens in the request
request_tokens = count_messages_tokens(messages, model)
@@ -299,6 +321,11 @@ class RequestHandler:
logger.info(f"Response type: {type(response)}")
logger.info(f"Response: {response}")
# Add effective context to response for non-streaming
if isinstance(response, dict) and 'usage' in response:
response['usage']['effective_context'] = effective_context
logger.info(f"Added effective_context to response: {effective_context}")
# For OpenAI-compatible providers, the response is already a response object
# Just return it as-is without any parsing or modification
handler.record_success()
@@ -327,6 +354,32 @@ class RequestHandler:
# If seed is present in request, generate unique fingerprint per request
seed = request_data.get('seed')
system_fingerprint = generate_system_fingerprint(provider_id, seed)
# Get context configuration and calculate effective context
model = request_data.get('model')
messages = request_data.get('messages', [])
context_config = get_context_config_for_model(
model_name=model,
provider_config=provider_config,
rotation_model_config=None
)
effective_context = count_messages_tokens(messages, model)
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model):
import logging
logger = logging.getLogger(__name__)
logger.info("Context condensation triggered for streaming request")
messages = await context_manager.condense_context(messages, model)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Condensed effective context: {effective_context} tokens")
# Update request_data with condensed messages
request_data['messages'] = messages
async def stream_generator():
import logging
@@ -457,7 +510,8 @@ class RequestHandler:
"usage": {
"prompt_tokens": None,
"completion_tokens": None,
"total_tokens": None,
"effective_context": effective_context
},
"provider": provider_id,
"choices": [{
@@ -488,6 +542,16 @@ class RequestHandler:
# For OpenAI-compatible providers, just pass through the raw chunk
# Convert chunk to dict and serialize as JSON
chunk_dict = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
# Add effective_context to the last chunk (when finish_reason is present)
if isinstance(chunk_dict, dict):
choices = chunk_dict.get('choices', [])
if choices and choices[0].get('finish_reason') is not None:
# This is the last chunk, add effective_context
if 'usage' not in chunk_dict:
chunk_dict['usage'] = {}
chunk_dict['usage']['effective_context'] = effective_context
yield f"data: {json.dumps(chunk_dict)}\n\n".encode('utf-8')
except Exception as chunk_error:
# Handle errors during chunk serialization
@@ -904,6 +968,30 @@ class RotationHandler:
logger.info(f"Temperature: {request_data.get('temperature', 1.0)}")
logger.info(f"Stream: {request_data.get('stream', False)}")
# Get context configuration
context_config = get_context_config_for_model(
model_name=model_name,
provider_config=None,
rotation_model_config=current_model
)
logger.info(f"Context config: {context_config}")
# Calculate effective context
messages = request_data['messages']
effective_context = count_messages_tokens(messages, model_name)
logger.info(f"Effective context: {effective_context} tokens")
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model_name):
logger.info("Context condensation triggered")
messages = await context_manager.condense_context(messages, model_name)
effective_context = count_messages_tokens(messages, model_name)
logger.info(f"Condensed effective context: {effective_context} tokens")
# Update request_data with condensed messages
request_data['messages'] = messages
# Check for max_request_tokens in rotation model config # Check for max_request_tokens in rotation model config
max_request_tokens = current_model.get('max_request_tokens') max_request_tokens = current_model.get('max_request_tokens')
if max_request_tokens: if max_request_tokens:
...@@ -984,6 +1072,10 @@ class RotationHandler: ...@@ -984,6 +1072,10 @@ class RotationHandler:
if total_tokens > 0: if total_tokens > 0:
handler._record_token_usage(model_name, total_tokens) handler._record_token_usage(model_name, total_tokens)
logger.info(f"Recorded {total_tokens} tokens for model {model_name}") logger.info(f"Recorded {total_tokens} tokens for model {model_name}")
# Add effective context to response for non-streaming
usage['effective_context'] = effective_context
logger.info(f"Added effective_context to response: {effective_context}")
handler.record_success() handler.record_success()
@@ -1177,7 +1269,8 @@ class RotationHandler:
                 "usage": {
                     "prompt_tokens": None,
                     "completion_tokens": None,
-                    "total_tokens": None
+                    "total_tokens": None,
+                    "effective_context": effective_context
                 },
                 "provider": provider_id,
                 "choices": [{
@@ -1206,6 +1299,16 @@ class RotationHandler:
                     # For OpenAI-compatible providers, just pass through the raw chunk
                     chunk_dict = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
+                    # Add effective_context to the last chunk (when finish_reason is present)
+                    if isinstance(chunk_dict, dict):
+                        choices = chunk_dict.get('choices', [])
+                        if choices and choices[0].get('finish_reason') is not None:
+                            # This is the last chunk, add effective_context
+                            if 'usage' not in chunk_dict:
+                                chunk_dict['usage'] = {}
+                            chunk_dict['usage']['effective_context'] = effective_context
                     yield f"data: {json.dumps(chunk_dict)}\n\n".encode('utf-8')
                 except Exception as chunk_error:
                     error_msg = str(chunk_error)
@@ -1284,7 +1387,7 @@ class AutoselectHandler:
         # Build the complete prompt
         prompt = f"""{skill_content}
 <aisbf_user_prompt>{user_prompt}</aisbf_user_prompt>
 <aisbf_autoselect_list>
 {models_list}
@@ -1519,7 +1622,7 @@ class AutoselectHandler:
         return response
     async def handle_autoselect_model_list(self, autoselect_id: str) -> List[Dict]:
-        """List available models for an autoselect endpoint"""
+        """List the available models for an autoselect endpoint"""
         autoselect_config = self.config.get_autoselect(autoselect_id)
         if not autoselect_config:
             raise HTTPException(status_code=400, detail=f"Autoselect {autoselect_id} not found")
...
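The condensation trigger shown in the `RotationHandler` hunk above boils down to a percentage threshold check. A minimal, self-contained sketch of that check follows; `count_tokens` here is a crude hypothetical stand-in for the project's real `count_messages_tokens` helper, not its actual implementation.

```python
# Illustrative sketch of the condense_context threshold logic, not the
# actual context.py implementation.

def count_tokens(messages):
    # Hypothetical approximation: roughly 4 characters per token.
    return sum(len(m.get("content", "")) for m in messages) // 4

def should_condense(messages, context_size, condense_context):
    """Return True when usage exceeds condense_context percent of context_size."""
    if not context_size or not condense_context:
        return False
    threshold = context_size * condense_context / 100
    return count_tokens(messages) > threshold

messages = [{"role": "user", "content": "x" * 4000}]  # ~1000 tokens
print(should_condense(messages, context_size=1000, condense_context=80))  # True
```

With `context_size=1000` and `condense_context=80`, the threshold is 800 tokens, so a ~1000-token history triggers condensation; against a 1,000,000-token window the same history would pass through untouched.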
@@ -63,6 +63,9 @@ class Model(BaseModel):
     rate_limit_TPM: Optional[int] = None  # Max tokens per minute
     rate_limit_TPH: Optional[int] = None  # Max tokens per hour
     rate_limit_TPD: Optional[int] = None  # Max tokens per day
+    context_size: Optional[int] = None  # Max context size in tokens for the model
+    condense_context: Optional[int] = None  # Percentage (0-100) at which to condense context
+    condense_method: Optional[Union[str, List[str]]] = None  # Method(s) for condensation: "hierarchical", "conversational", "semantic", "algorithmic"

 class Provider(BaseModel):
     id: str
...
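Since `condense_method` accepts either a string or a list (enabling method chaining), consumers of the config need to normalize it, and `condense_context` only makes sense in 0-100. A stdlib-only sketch of that normalization and validation, illustrative rather than the project's actual code:

```python
# Hypothetical helper mirroring the three new Model fields; the real
# project uses Pydantic, this dataclass is just for illustration.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class ContextConfig:
    context_size: Optional[int] = None
    condense_context: Optional[int] = None  # percentage, 0-100
    condense_method: Optional[Union[str, List[str]]] = None

    def __post_init__(self):
        # Reject out-of-range percentages early.
        if self.condense_context is not None and not 0 <= self.condense_context <= 100:
            raise ValueError("condense_context must be between 0 and 100")

    def methods(self) -> List[str]:
        # Normalize str-or-list into a list so methods can be chained in order.
        if self.condense_method is None:
            return []
        if isinstance(self.condense_method, str):
            return [self.condense_method]
        return list(self.condense_method)

cfg = ContextConfig(context_size=1000000, condense_context=80,
                    condense_method=["hierarchical", "semantic"])
print(cfg.methods())  # ['hierarchical', 'semantic']
```

Normalizing to a list up front keeps the condensation pipeline simple: it can always iterate over `methods()` and apply each one in sequence.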
@@ -14,7 +14,10 @@
         "max_request_tokens": 1000000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 1000000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "gemini-1.5-pro",
@@ -22,7 +25,10 @@
         "max_request_tokens": 2000000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 2000000,
+        "condense_context": 85,
+        "condense_method": "conversational"
       }
     ]
   },
...
@@ -14,7 +14,10 @@
         "max_request_tokens": 100000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 1000000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "gemini-1.5-pro",
@@ -23,7 +26,10 @@
         "max_request_tokens": 100000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 2000000,
+        "condense_context": 85,
+        "condense_method": "conversational"
       }
     ]
   },
@@ -35,13 +41,19 @@
         "name": "gpt-4",
         "weight": 2,
         "rate_limit": 0,
-        "max_request_tokens": 128000
+        "max_request_tokens": 128000,
+        "context_size": 128000,
+        "condense_context": 75,
+        "condense_method": ["hierarchical", "conversational"]
       },
       {
         "name": "gpt-3.5-turbo",
         "weight": 1,
         "rate_limit": 0,
-        "max_request_tokens": 4000
+        "max_request_tokens": 4000,
+        "context_size": 16000,
+        "condense_context": 70,
+        "condense_method": "semantic"
       }
     ]
   },
@@ -53,13 +65,19 @@
         "name": "claude-3-5-sonnet-20241022",
         "weight": 2,
         "rate_limit": 0,
-        "max_request_tokens": 200000
+        "max_request_tokens": 200000,
+        "context_size": 200000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "claude-3-haiku-20240307",
         "weight": 1,
         "rate_limit": 0,
-        "max_request_tokens": 200000
+        "max_request_tokens": 200000,
+        "context_size": 200000,
+        "condense_context": 75,
+        "condense_method": "conversational"
       }
     ]
   }
...
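On the streaming side, both handlers attach `effective_context` only to the final SSE chunk, detected by a non-null `finish_reason` on the first choice. The core of that logic, extracted into a standalone function for clarity (the function name is illustrative; the diff inlines this in the chunk loop):

```python
# Mirrors the streaming logic added in handlers.py: effective_context is
# attached only to the last chunk, identified by a non-null finish_reason.
def tag_final_chunk(chunk_dict, effective_context):
    choices = chunk_dict.get("choices", [])
    if choices and choices[0].get("finish_reason") is not None:
        # setdefault covers providers that omit a usage block in chunks.
        chunk_dict.setdefault("usage", {})["effective_context"] = effective_context
    return chunk_dict

mid = tag_final_chunk({"choices": [{"finish_reason": None}]}, 1234)
last = tag_final_chunk({"choices": [{"finish_reason": "stop"}]}, 1234)
print("usage" in mid, last["usage"]["effective_context"])  # False 1234
```

Guarding on both `choices` being non-empty and `finish_reason` being non-null means intermediate delta chunks and empty keep-alive chunks pass through unmodified.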