Add context management feature with automatic condensation

- Add context_size, condense_context, and condense_method fields to Model class
- Create new context.py module with ContextManager and condensation methods
- Implement hierarchical, conversational, semantic, and algorithmic condensation
- Calculate and report effective_context for all requests
- Update handlers.py to apply context condensation when configured
- Update providers.json and rotations.json with example context configurations
- Update README.md and DOCUMENTATION.md with context management documentation
- Export context module and utilities in __init__.py
parent 8bad912b
...@@ -264,6 +264,104 @@ When using autoselect models:
- **User Experience**: Provide optimal responses without manual model selection
- **Adaptive Selection**: Dynamically adjust model selection based on request characteristics
## Context Management
AISBF provides intelligent context management to handle large conversation histories and prevent exceeding model context limits:
### How Context Management Works
Context management automatically monitors and condenses conversation context:
1. **Effective Context Tracking**: Calculates and reports total tokens used (effective_context) for every request
2. **Automatic Condensation**: When context exceeds configured percentage of model's context_size, triggers condensation
3. **Multiple Condensation Methods**: Supports hierarchical, conversational, semantic, and algorithmic condensation
4. **Method Chaining**: Multiple condensation methods can be applied in sequence for optimal results
### Context Configuration
Models can be configured with context management fields:
```json
{
"models": [
{
"name": "gemini-2.0-flash",
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
}
]
}
```
**Configuration Fields:**
- **`context_size`**: Maximum context size in tokens for the model
- **`condense_context`**: Percentage (0-100) at which to trigger condensation. 0 means disabled
- **`condense_method`**: String or list of strings specifying condensation method(s)
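The trigger logic implied by these fields can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual `ContextManager` code):

```python
# Hypothetical sketch: condensation fires once the prompt reaches
# condense_context percent of the model's context_size.
def should_condense(prompt_tokens: int, context_size: int, condense_context: int) -> bool:
    # A missing context_size or condense_context of 0 disables condensation.
    if not context_size or not condense_context:
        return False
    return prompt_tokens >= context_size * condense_context / 100

# With the example config above (context_size=1000000, condense_context=80),
# condensation triggers at 800,000 tokens.
```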
### Condensation Methods
#### 1. Hierarchical Context Engineering
Separates context into persistent (long-term facts) and transient (immediate task) layers:
- **Persistent State**: Architecture, project state, core principles
- **Recent History**: Summarized conversation history
- **Active Code**: High-fidelity current code
- **Instruction**: Current task/goal
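The layering above can be sketched as a prompt builder (function and argument names are hypothetical, for illustration only):

```python
# Illustrative sketch of hierarchical layering: each layer is kept at a
# different fidelity, with persistent facts first and the task last.
def build_hierarchical_context(persistent: str, history_summary: str,
                               active_code: str, instruction: str) -> list[dict]:
    return [
        # Persistent state: architecture, project state, core principles
        {"role": "system", "content": f"Persistent state:\n{persistent}"},
        # Recent history: summarized, low-fidelity
        {"role": "system", "content": f"Recent history (summarized):\n{history_summary}"},
        # Active code stays high-fidelity, followed by the current task
        {"role": "user", "content": f"Active code:\n{active_code}\n\nTask: {instruction}"},
    ]
```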
#### 2. Conversational Summarization (Memory Buffering)
Replaces old messages with high-density summaries:
- Uses a smaller model to summarize conversation progress
- Maintains continuity without hitting token caps
- Preserves key facts, decisions, and current goals
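A minimal sketch of this memory-buffering pattern (the function and its `summarize` callback are hypothetical; in practice the callback would call a smaller model):

```python
# Replace all but the most recent messages with one summary message.
def condense_conversational(messages, summarize, keep_recent=4):
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. a call to a smaller, cheaper model
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```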
#### 3. Semantic Context Pruning (Observation Masking)
Removes irrelevant details based on current query:
- Uses a smaller "janitor" model to extract relevant facts
- Can reduce history by 50-80% without losing critical information
- Focuses on information relevant to the specific current request
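As a toy illustration of the idea (the real approach asks a smaller "janitor" model for relevance; here word overlap with the query stands in for it, and the function name is hypothetical):

```python
# Keep only history messages that overlap with the current query,
# always preserving the current request itself.
def prune_semantic(messages, query):
    query_words = set(query.lower().split())
    kept = [m for m in messages[:-1]
            if query_words & set(m["content"].lower().split())]
    return kept + messages[-1:]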
#### 4. Algorithmic Token Compression
Mathematical compression for technical data and logs:
- Similar to LLMLingua compression
- Achieves up to 20x compression for technical data
- Removes low-information tokens systematically
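A crude stand-in for this kind of compression (LLMLingua scores tokens with a small language model; the stop-word filter below only illustrates the "drop low-information tokens" idea):

```python
# Hypothetical sketch: drop common low-information words from logs/text.
LOW_INFO = {"the", "a", "an", "of", "to", "and", "is", "are", "that", "this"}

def compress_text(text: str) -> str:
    kept = [w for w in text.split() if w.lower() not in LOW_INFO]
    return " ".join(kept)
```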
### Effective Context Reporting
All responses include `effective_context` in the usage field:
**Non-streaming responses:**
```json
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500,
"effective_context": 1000
}
}
```
**Streaming responses:**
The final chunk includes effective_context:
```json
{
"usage": {
"prompt_tokens": null,
"completion_tokens": null,
"total_tokens": null,
"effective_context": 1000
}
}
```
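Clients can read the field the same way for both response types; a minimal sketch (the helper name is hypothetical, the payload layout is taken from the examples above):

```python
from typing import Optional

# Extract effective_context from an AISBF response payload.
def get_effective_context(response: dict) -> Optional[int]:
    usage = response.get("usage") or {}
    return usage.get("effective_context")
```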
### Example Use Cases
- **Long Conversations**: Maintain context across extended conversations without hitting limits
- **Code Analysis**: Handle large codebases with intelligent context pruning
- **Document Processing**: Process large documents with automatic summarization
- **Multi-turn Tasks**: Maintain task context across multiple interactions
## Error Tracking and Rate Limiting
### Error Tracking
...@@ -399,7 +497,7 @@ Stops running daemon and removes PID file.
- `Message` - Chat message structure
- `ChatCompletionRequest` - Request model
- `ChatCompletionResponse` - Response model
- `Model` - Model information (includes context_size, condense_context, condense_method fields)
- `Provider` - Provider information
- `ErrorTracking` - Error tracking data
...@@ -411,9 +509,13 @@ Stops running daemon and removes PID file.
- `OllamaProviderHandler` - Ollama provider implementation
- `get_provider_handler()` - Factory function for provider handlers
### aisbf/context.py
- `ContextManager` - Context management class for automatic condensation
- `get_context_config_for_model()` - Retrieves context configuration from provider or rotation model config
### aisbf/handlers.py
- `RequestHandler` - Request handling logic with streaming support and context management
- `RotationHandler` - Rotation handling logic with streaming support and context management
- `AutoselectHandler` - AI-assisted model selection with streaming support
## Dependencies
...@@ -426,6 +528,8 @@ Key dependencies from requirements.txt:
- google-genai - Google AI SDK
- openai - OpenAI SDK
- anthropic - Anthropic SDK
- langchain-text-splitters - Intelligent text splitting for request chunking
- tiktoken - Accurate token counting for context management
## Adding New Providers
...
...@@ -13,6 +13,8 @@ A modular proxy server for managing multiple AI provider integrations with unifi
- **Request Splitting**: Automatic splitting of large requests when exceeding `max_request_tokens` limit
- **Token Rate Limiting**: Per-model token usage tracking with TPM (tokens per minute), TPH (tokens per hour), and TPD (tokens per day) limits
- **Automatic Provider Disabling**: Providers automatically disabled when token rate limits are exceeded
- **Context Management**: Automatic context condensation when approaching model limits with multiple condensation methods
- **Effective Context Tracking**: Reports total tokens used (effective_context) for every request
## Author
...@@ -82,12 +84,24 @@ Models can be configured with the following optional fields:
- **`rate_limit_TPM`**: Maximum tokens allowed per minute (Tokens Per Minute)
- **`rate_limit_TPH`**: Maximum tokens allowed per hour (Tokens Per Hour)
- **`rate_limit_TPD`**: Maximum tokens allowed per day (Tokens Per Day)
- **`context_size`**: Maximum context size in tokens for the model. Used to determine when to trigger context condensation.
- **`condense_context`**: Percentage (0-100) at which to trigger context condensation. 0 means disabled, any other value triggers condensation when context reaches this percentage of context_size.
- **`condense_method`**: String or list of strings specifying condensation method(s). Supported values: "hierarchical", "conversational", "semantic", "algorithmic". Multiple methods can be chained together.
When token rate limits are exceeded, providers are automatically disabled:
- TPM limit exceeded: Provider disabled for 1 minute
- TPH limit exceeded: Provider disabled for 1 hour
- TPD limit exceeded: Provider disabled for 1 day
### Context Condensation Methods
When context exceeds the configured percentage of `context_size`, the system automatically condenses the prompt using one or more methods:
1. **Hierarchical**: Separates context into persistent (long-term facts) and transient (immediate task) layers
2. **Conversational**: Summarizes old messages using a smaller model to maintain conversation continuity
3. **Semantic**: Prunes irrelevant context based on current query using a smaller "janitor" model
4. **Algorithmic**: Uses mathematical compression for technical data and logs (similar to LLMLingua)
See `config/providers.json` and `config/rotations.json` for configuration examples.
## API Endpoints
...
...@@ -24,6 +24,7 @@ A modular proxy server for managing multiple AI provider integrations.
"""
from .config import config, Config, ProviderConfig, RotationConfig, AppConfig, AutoselectConfig, AutoselectModelInfo
from .context import ContextManager, get_context_config_for_model
from .models import (
    Message,
    ChatCompletionRequest,
...@@ -42,6 +43,7 @@ from .providers import (
    PROVIDER_HANDLERS
)
from .handlers import RequestHandler, RotationHandler, AutoselectHandler
from .utils import count_messages_tokens, split_messages_into_chunks, get_max_request_tokens_for_model
__version__ = "0.3.0" __version__ = "0.3.0"
__all__ = [ __all__ = [
...@@ -74,4 +76,11 @@ __all__ = [ ...@@ -74,4 +76,11 @@ __all__ = [
"RequestHandler", "RequestHandler",
"RotationHandler", "RotationHandler",
"AutoselectHandler", "AutoselectHandler",
# Context
"ContextManager",
"get_context_config_for_model",
# Utils
"count_messages_tokens",
"split_messages_into_chunks",
"get_max_request_tokens_for_model",
]
This diff is collapsed.
This diff is collapsed.
...@@ -63,6 +63,9 @@ class Model(BaseModel):
    rate_limit_TPM: Optional[int] = None  # Max tokens per minute
    rate_limit_TPH: Optional[int] = None  # Max tokens per hour
    rate_limit_TPD: Optional[int] = None  # Max tokens per day
    context_size: Optional[int] = None  # Max context size in tokens for the model
    condense_context: Optional[int] = None  # Percentage (0-100) at which to condense context
    condense_method: Optional[Union[str, List[str]]] = None  # Method(s) for condensation: "hierarchical", "conversational", "semantic", "algorithmic"
class Provider(BaseModel):
    id: str
......
...@@ -14,7 +14,10 @@
"max_request_tokens": 1000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
...@@ -22,7 +25,10 @@
"max_request_tokens": 2000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
...
...@@ -14,7 +14,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
...@@ -23,7 +26,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
...@@ -35,13 +41,19 @@
"name": "gpt-4",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 128000,
"context_size": 128000,
"condense_context": 75,
"condense_method": ["hierarchical", "conversational"]
},
{
"name": "gpt-3.5-turbo",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 4000,
"context_size": 16000,
"condense_context": 70,
"condense_method": "semantic"
}
]
},
...@@ -53,13 +65,19 @@
"name": "claude-3-5-sonnet-20241022",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "claude-3-haiku-20240307",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 75,
"condense_method": "conversational"
}
]
}
...