Add context management feature with automatic condensation

- Add context_size, condense_context, and condense_method fields to Model class
- Create new context.py module with ContextManager and condensation methods
- Implement hierarchical, conversational, semantic, and algorithmic condensation
- Calculate and report effective_context for all requests
- Update handlers.py to apply context condensation when configured
- Update providers.json and rotations.json with example context configurations
- Update README.md and DOCUMENTATION.md with context management documentation
- Export context module and utilities in __init__.py
parent 8bad912b
......@@ -264,6 +264,104 @@ When using autoselect models:
- **User Experience**: Provide optimal responses without manual model selection
- **Adaptive Selection**: Dynamically adjust model selection based on request characteristics
## Context Management
AISBF provides intelligent context management to handle large conversation histories and prevent exceeding model context limits:
### How Context Management Works
Context management automatically monitors and condenses conversation context:
1. **Effective Context Tracking**: Calculates and reports the total tokens used (`effective_context`) for every request
2. **Automatic Condensation**: Triggers condensation when context exceeds the configured percentage of the model's `context_size`
3. **Multiple Condensation Methods**: Supports hierarchical, conversational, semantic, and algorithmic condensation
4. **Method Chaining**: Multiple condensation methods can be applied in sequence for optimal results
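The trigger-and-chain behavior described above can be sketched as follows. This is an illustrative assumption, not the actual `ContextManager` implementation; the function names and the toy condensers are hypothetical:

```python
# Hypothetical sketch of the condensation trigger and method chaining.
# Names and condenser behavior are illustrative, not the AISBF internals.

def should_condense(effective_context: int, context_size: int,
                    condense_context: int) -> bool:
    """Return True once usage reaches the configured percentage."""
    if not condense_context or not context_size:  # 0 or None disables
        return False
    return effective_context >= context_size * condense_context / 100

def apply_methods(messages, methods):
    """Chain condensation methods in order; each takes and returns messages."""
    if isinstance(methods, str):  # condense_method may be a string or a list
        methods = [methods]
    for method in methods or []:
        messages = CONDENSERS[method](messages)
    return messages

# Toy condensers so the chaining is runnable end to end.
CONDENSERS = {
    "hierarchical": lambda msgs: msgs[-4:],           # keep the recent layer
    "semantic": lambda msgs: [m for m in msgs if m],  # drop empty entries
}
```

For example, with `context_size=1000000` and `condense_context=80`, condensation fires once 800,000 tokens are in use.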
### Context Configuration
Models can be configured with context management fields:
```json
{
"models": [
{
"name": "gemini-2.0-flash",
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
}
]
}
```
**Configuration Fields:**
- **`context_size`**: Maximum context size in tokens for the model
- **`condense_context`**: Percentage (0-100) of `context_size` at which to trigger condensation; a value of 0 disables it
- **`condense_method`**: String or list of strings specifying condensation method(s)
### Condensation Methods
#### 1. Hierarchical Context Engineering
Separates context into persistent (long-term facts) and transient (immediate task) layers:
- **Persistent State**: Architecture, project state, core principles
- **Recent History**: Summarized conversation history
- **Active Code**: High-fidelity current code
- **Instruction**: Current task/goal
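The four layers above might be represented roughly like this. The data layout and `to_messages` helper are assumptions for illustration, not the module's real structures:

```python
# Illustrative sketch of hierarchical context layering. The layer names
# match the docs above, but this data layout is hypothetical.
from dataclasses import dataclass, field

@dataclass
class HierarchicalContext:
    persistent: list = field(default_factory=list)  # long-term facts
    recent_summary: str = ""                        # summarized history
    active_code: str = ""                           # high-fidelity code
    instruction: str = ""                           # current task/goal

    def to_messages(self):
        """Flatten layers into a message list, persistent facts first."""
        msgs = [{"role": "system", "content": fact} for fact in self.persistent]
        if self.recent_summary:
            msgs.append({"role": "system",
                         "content": f"Summary so far: {self.recent_summary}"})
        if self.active_code:
            msgs.append({"role": "user", "content": self.active_code})
        msgs.append({"role": "user", "content": self.instruction})
        return msgs
```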
#### 2. Conversational Summarization (Memory Buffering)
Replaces old messages with high-density summaries:
- Uses a smaller model to summarize conversation progress
- Maintains continuity without hitting token caps
- Preserves key facts, decisions, and current goals
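A minimal memory-buffer sketch of this idea, under the assumption that `summarize` stands in for a call to a smaller model (here it is a trivial placeholder, not a real summarizer):

```python
# Minimal memory-buffer sketch: replace old messages with one summary.

def summarize(messages):
    """Placeholder for a small-model summarization call."""
    return "Summary of %d earlier messages" % len(messages)

def buffer_condense(messages, keep_last=4):
    """Replace all but the last `keep_last` messages with one summary."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```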
#### 3. Semantic Context Pruning (Observation Masking)
Removes irrelevant details based on current query:
- Uses a smaller "janitor" model to extract relevant facts
- Can reduce history by 50-80% without losing critical information
- Focuses on information relevant to the specific current request
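As a toy illustration of relevance-based pruning: a real implementation would ask the small "janitor" model which messages matter for the current query; simple keyword overlap stands in for that judgment here:

```python
# Toy semantic-pruning sketch; keyword overlap stands in for a
# small-model relevance judgment.

def prune_by_relevance(messages, query, min_overlap=1):
    """Keep messages sharing at least `min_overlap` words with the query."""
    query_words = set(query.lower().split())
    kept = []
    for msg in messages:
        words = set(msg["content"].lower().split())
        if len(words & query_words) >= min_overlap:
            kept.append(msg)
    return kept
```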
#### 4. Algorithmic Token Compression
Mathematical compression for technical data and logs:
- Similar to LLMLingua compression
- Achieves up to 20x compression for technical data
- Removes low-information tokens systematically
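A crude sketch in the spirit of this approach: LLMLingua itself scores tokens by model perplexity, so a stopword list is only a stand-in to illustrate dropping low-information tokens:

```python
# Crude token-compression sketch: drop low-information tokens.
# Real LLMLingua uses model perplexity, not a stopword list.

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "that"}

def compress(text: str) -> str:
    """Remove low-information tokens, keeping content words in order."""
    kept = [tok for tok in text.split() if tok.lower() not in STOPWORDS]
    return " ".join(kept)
```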
### Effective Context Reporting
All responses include `effective_context` in the usage field:
**Non-streaming responses:**
```json
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500,
"effective_context": 1000
}
}
```
**Streaming responses:**
The final chunk includes effective_context:
```json
{
"usage": {
"prompt_tokens": null,
"completion_tokens": null,
"total_tokens": null,
"effective_context": 1000
}
}
```
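A client might read the field like this; the helper name is hypothetical, but the field names follow the JSON examples above:

```python
# Hedged sketch of reading `effective_context` from a response payload.

def get_effective_context(response):
    """Extract effective_context from the usage block, if present."""
    usage = response.get("usage") or {}
    return usage.get("effective_context")
```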
### Example Use Cases
- **Long Conversations**: Maintain context across extended conversations without hitting limits
- **Code Analysis**: Handle large codebases with intelligent context pruning
- **Document Processing**: Process large documents with automatic summarization
- **Multi-turn Tasks**: Maintain task context across multiple interactions
## Error Tracking and Rate Limiting
### Error Tracking
......@@ -399,7 +497,7 @@ Stops running daemon and removes PID file.
- `Message` - Chat message structure
- `ChatCompletionRequest` - Request model
- `ChatCompletionResponse` - Response model
- `Model` - Model information (includes context_size, condense_context, condense_method fields)
- `Provider` - Provider information
- `ErrorTracking` - Error tracking data
......@@ -411,9 +509,13 @@ Stops running daemon and removes PID file.
- `OllamaProviderHandler` - Ollama provider implementation
- `get_provider_handler()` - Factory function for provider handlers
### aisbf/context.py
- `ContextManager` - Context management class for automatic condensation
- `get_context_config_for_model()` - Retrieves context configuration from provider or rotation model config
### aisbf/handlers.py
- `RequestHandler` - Request handling logic with streaming support and context management
- `RotationHandler` - Rotation handling logic with streaming support and context management
- `AutoselectHandler` - AI-assisted model selection with streaming support
## Dependencies
......@@ -426,6 +528,8 @@ Key dependencies from requirements.txt:
- google-genai - Google AI SDK
- openai - OpenAI SDK
- anthropic - Anthropic SDK
- langchain-text-splitters - Intelligent text splitting for request chunking
- tiktoken - Accurate token counting for context management
## Adding New Providers
......
......@@ -13,6 +13,8 @@ A modular proxy server for managing multiple AI provider integrations with unifi
- **Request Splitting**: Automatic splitting of large requests when exceeding `max_request_tokens` limit
- **Token Rate Limiting**: Per-model token usage tracking with TPM (tokens per minute), TPH (tokens per hour), and TPD (tokens per day) limits
- **Automatic Provider Disabling**: Providers automatically disabled when token rate limits are exceeded
- **Context Management**: Automatic context condensation when approaching model limits with multiple condensation methods
- **Effective Context Tracking**: Reports total tokens used (effective_context) for every request
## Author
......@@ -82,12 +84,24 @@ Models can be configured with the following optional fields:
- **`rate_limit_TPM`**: Maximum tokens allowed per minute (Tokens Per Minute)
- **`rate_limit_TPH`**: Maximum tokens allowed per hour (Tokens Per Hour)
- **`rate_limit_TPD`**: Maximum tokens allowed per day (Tokens Per Day)
- **`context_size`**: Maximum context size in tokens for the model. Used to determine when to trigger context condensation.
- **`condense_context`**: Percentage (0-100) at which to trigger context condensation. A value of 0 disables condensation; any other value triggers it when context reaches that percentage of `context_size`.
- **`condense_method`**: String or list of strings specifying condensation method(s). Supported values: "hierarchical", "conversational", "semantic", "algorithmic". Multiple methods can be chained together.
When token rate limits are exceeded, providers are automatically disabled:
- TPM limit exceeded: Provider disabled for 1 minute
- TPH limit exceeded: Provider disabled for 1 hour
- TPD limit exceeded: Provider disabled for 1 day
### Context Condensation Methods
When context exceeds the configured percentage of `context_size`, the system automatically condenses the prompt using one or more methods:
1. **Hierarchical**: Separates context into persistent (long-term facts) and transient (immediate task) layers
2. **Conversational**: Summarizes old messages using a smaller model to maintain conversation continuity
3. **Semantic**: Prunes irrelevant context based on current query using a smaller "janitor" model
4. **Algorithmic**: Uses mathematical compression for technical data and logs (similar to LLMLingua)
See `config/providers.json` and `config/rotations.json` for configuration examples.
## API Endpoints
......
......@@ -24,6 +24,7 @@ A modular proxy server for managing multiple AI provider integrations.
"""
from .config import config, Config, ProviderConfig, RotationConfig, AppConfig, AutoselectConfig, AutoselectModelInfo
from .context import ContextManager, get_context_config_for_model
from .models import (
Message,
ChatCompletionRequest,
......@@ -42,6 +43,7 @@ from .providers import (
PROVIDER_HANDLERS
)
from .handlers import RequestHandler, RotationHandler, AutoselectHandler
from .utils import count_messages_tokens, split_messages_into_chunks, get_max_request_tokens_for_model
__version__ = "0.3.0"
__all__ = [
......@@ -74,4 +76,11 @@ __all__ = [
"RequestHandler",
"RotationHandler",
"AutoselectHandler",
# Context
"ContextManager",
"get_context_config_for_model",
# Utils
"count_messages_tokens",
"split_messages_into_chunks",
"get_max_request_tokens_for_model",
]
......@@ -63,6 +63,9 @@ class Model(BaseModel):
rate_limit_TPM: Optional[int] = None # Max tokens per minute
rate_limit_TPH: Optional[int] = None # Max tokens per hour
rate_limit_TPD: Optional[int] = None # Max tokens per day
context_size: Optional[int] = None # Max context size in tokens for the model
condense_context: Optional[int] = None # Percentage (0-100) at which to condense context
condense_method: Optional[Union[str, List[str]]] = None # Method(s) for condensation: "hierarchical", "conversational", "semantic", "algorithmic"
class Provider(BaseModel):
id: str
......
......@@ -14,7 +14,10 @@
"max_request_tokens": 1000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
......@@ -22,7 +25,10 @@
"max_request_tokens": 2000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
......
......@@ -14,7 +14,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
......@@ -23,7 +26,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
......@@ -35,13 +41,19 @@
"name": "gpt-4",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 128000
"max_request_tokens": 128000,
"context_size": 128000,
"condense_context": 75,
"condense_method": ["hierarchical", "conversational"]
},
{
"name": "gpt-3.5-turbo",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 4000
"max_request_tokens": 4000,
"context_size": 16000,
"condense_context": 70,
"condense_method": "semantic"
}
]
},
......@@ -53,13 +65,19 @@
"name": "claude-3-5-sonnet-20241022",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 200000
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "claude-3-haiku-20240307",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 200000
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 75,
"condense_method": "conversational"
}
]
}
......