Add context management feature with automatic condensation

- Add context_size, condense_context, and condense_method fields to Model class
- Create new context.py module with ContextManager and condensation methods
- Implement hierarchical, conversational, semantic, and algorithmic condensation
- Calculate and report effective_context for all requests
- Update handlers.py to apply context condensation when configured
- Update providers.json and rotations.json with example context configurations
- Update README.md and DOCUMENTATION.md with context management documentation
- Export context module and utilities in __init__.py
parent 8bad912b
......@@ -264,6 +264,104 @@ When using autoselect models:
- **User Experience**: Provide optimal responses without manual model selection
- **Adaptive Selection**: Dynamically adjust model selection based on request characteristics
## Context Management
AISBF provides intelligent context management to handle large conversation histories and prevent exceeding model context limits:
### How Context Management Works
Context management automatically monitors and condenses conversation context:
1. **Effective Context Tracking**: Calculates and reports the total tokens used (`effective_context`) for every request
2. **Automatic Condensation**: Triggers condensation when context exceeds the configured percentage of the model's `context_size`
3. **Multiple Condensation Methods**: Supports hierarchical, conversational, semantic, and algorithmic condensation
4. **Method Chaining**: Multiple condensation methods can be applied in sequence for optimal results
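The trigger-and-chain behavior described above can be sketched as follows. This is an illustrative assumption, not the actual `ContextManager` implementation; the function names and the toy condensers are hypothetical:

```python
# Hypothetical sketch of the condensation trigger and method chaining.
# Names and condenser behavior are illustrative, not the AISBF internals.

def should_condense(effective_context: int, context_size: int,
                    condense_context: int) -> bool:
    """Return True once usage reaches the configured percentage."""
    if not condense_context or not context_size:  # 0 or None disables
        return False
    return effective_context >= context_size * condense_context / 100

def apply_methods(messages, methods):
    """Chain condensation methods in order; each takes and returns messages."""
    if isinstance(methods, str):  # condense_method may be a string or a list
        methods = [methods]
    for method in methods or []:
        messages = CONDENSERS[method](messages)
    return messages

# Toy condensers so the chaining is runnable end to end.
CONDENSERS = {
    "hierarchical": lambda msgs: msgs[-4:],           # keep the recent layer
    "semantic": lambda msgs: [m for m in msgs if m],  # drop empty entries
}
```

For example, with `context_size=1000000` and `condense_context=80`, condensation fires once 800,000 tokens are in use.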
### Context Configuration
Models can be configured with context management fields:
```json
{
"models": [
{
"name": "gemini-2.0-flash",
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
}
]
}
```
**Configuration Fields:**
- **`context_size`**: Maximum context size in tokens for the model
- **`condense_context`**: Percentage (0-100) of `context_size` at which to trigger condensation; a value of 0 disables it
- **`condense_method`**: String or list of strings specifying condensation method(s)
### Condensation Methods
#### 1. Hierarchical Context Engineering
Separates context into persistent (long-term facts) and transient (immediate task) layers:
- **Persistent State**: Architecture, project state, core principles
- **Recent History**: Summarized conversation history
- **Active Code**: High-fidelity current code
- **Instruction**: Current task/goal
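The four layers above might be represented roughly like this. The data layout and `to_messages` helper are assumptions for illustration, not the module's real structures:

```python
# Illustrative sketch of hierarchical context layering. The layer names
# match the docs above, but this data layout is hypothetical.
from dataclasses import dataclass, field

@dataclass
class HierarchicalContext:
    persistent: list = field(default_factory=list)  # long-term facts
    recent_summary: str = ""                        # summarized history
    active_code: str = ""                           # high-fidelity code
    instruction: str = ""                           # current task/goal

    def to_messages(self):
        """Flatten layers into a message list, persistent facts first."""
        msgs = [{"role": "system", "content": fact} for fact in self.persistent]
        if self.recent_summary:
            msgs.append({"role": "system",
                         "content": f"Summary so far: {self.recent_summary}"})
        if self.active_code:
            msgs.append({"role": "user", "content": self.active_code})
        msgs.append({"role": "user", "content": self.instruction})
        return msgs
```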
#### 2. Conversational Summarization (Memory Buffering)
Replaces old messages with high-density summaries:
- Uses a smaller model to summarize conversation progress
- Maintains continuity without hitting token caps
- Preserves key facts, decisions, and current goals
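A minimal memory-buffer sketch of this idea, under the assumption that `summarize` stands in for a call to a smaller model (here it is a trivial placeholder, not a real summarizer):

```python
# Minimal memory-buffer sketch: replace old messages with one summary.

def summarize(messages):
    """Placeholder for a small-model summarization call."""
    return "Summary of %d earlier messages" % len(messages)

def buffer_condense(messages, keep_last=4):
    """Replace all but the last `keep_last` messages with one summary."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```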
#### 3. Semantic Context Pruning (Observation Masking)
Removes irrelevant details based on current query:
- Uses a smaller "janitor" model to extract relevant facts
- Can reduce history by 50-80% without losing critical information
- Focuses on information relevant to the specific current request
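As a toy illustration of relevance-based pruning: a real implementation would ask the small "janitor" model which messages matter for the current query; simple keyword overlap stands in for that judgment here:

```python
# Toy semantic-pruning sketch; keyword overlap stands in for a
# small-model relevance judgment.

def prune_by_relevance(messages, query, min_overlap=1):
    """Keep messages sharing at least `min_overlap` words with the query."""
    query_words = set(query.lower().split())
    kept = []
    for msg in messages:
        words = set(msg["content"].lower().split())
        if len(words & query_words) >= min_overlap:
            kept.append(msg)
    return kept
```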
#### 4. Algorithmic Token Compression
Mathematical compression for technical data and logs:
- Similar to LLMLingua compression
- Achieves up to 20x compression for technical data
- Removes low-information tokens systematically
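A crude sketch in the spirit of this approach: LLMLingua itself scores tokens by model perplexity, so a stopword list is only a stand-in to illustrate dropping low-information tokens:

```python
# Crude token-compression sketch: drop low-information tokens.
# Real LLMLingua uses model perplexity, not a stopword list.

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "that"}

def compress(text: str) -> str:
    """Remove low-information tokens, keeping content words in order."""
    kept = [tok for tok in text.split() if tok.lower() not in STOPWORDS]
    return " ".join(kept)
```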
### Effective Context Reporting
All responses include `effective_context` in the usage field:
**Non-streaming responses:**
```json
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500,
"effective_context": 1000
}
}
```
**Streaming responses:**
The final chunk includes effective_context:
```json
{
"usage": {
"prompt_tokens": null,
"completion_tokens": null,
"total_tokens": null,
"effective_context": 1000
}
}
```
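A client might read the field like this; the helper name is hypothetical, but the field names follow the JSON examples above:

```python
# Hedged sketch of reading `effective_context` from a response payload.

def get_effective_context(response):
    """Extract effective_context from the usage block, if present."""
    usage = response.get("usage") or {}
    return usage.get("effective_context")
```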
### Example Use Cases
- **Long Conversations**: Maintain context across extended conversations without hitting limits
- **Code Analysis**: Handle large codebases with intelligent context pruning
- **Document Processing**: Process large documents with automatic summarization
- **Multi-turn Tasks**: Maintain task context across multiple interactions
## Error Tracking and Rate Limiting
### Error Tracking
......@@ -399,7 +497,7 @@ Stops running daemon and removes PID file.
- `Message` - Chat message structure
- `ChatCompletionRequest` - Request model
- `ChatCompletionResponse` - Response model
- `Model` - Model information (includes context_size, condense_context, condense_method fields)
- `Provider` - Provider information
- `ErrorTracking` - Error tracking data
......@@ -411,9 +509,13 @@ Stops running daemon and removes PID file.
- `OllamaProviderHandler` - Ollama provider implementation
- `get_provider_handler()` - Factory function for provider handlers
### aisbf/context.py
- `ContextManager` - Context management class for automatic condensation
- `get_context_config_for_model()` - Retrieves context configuration from provider or rotation model config
### aisbf/handlers.py
- `RequestHandler` - Request handling logic with streaming support and context management
- `RotationHandler` - Rotation handling logic with streaming support and context management
- `AutoselectHandler` - AI-assisted model selection with streaming support
## Dependencies
......@@ -426,6 +528,8 @@ Key dependencies from requirements.txt:
- google-genai - Google AI SDK
- openai - OpenAI SDK
- anthropic - Anthropic SDK
- langchain-text-splitters - Intelligent text splitting for request chunking
- tiktoken - Accurate token counting for context management
## Adding New Providers
......
......@@ -13,6 +13,8 @@ A modular proxy server for managing multiple AI provider integrations with unifi
- **Request Splitting**: Automatic splitting of large requests when exceeding `max_request_tokens` limit
- **Token Rate Limiting**: Per-model token usage tracking with TPM (tokens per minute), TPH (tokens per hour), and TPD (tokens per day) limits
- **Automatic Provider Disabling**: Providers automatically disabled when token rate limits are exceeded
- **Context Management**: Automatic context condensation when approaching model limits with multiple condensation methods
- **Effective Context Tracking**: Reports total tokens used (effective_context) for every request
## Author
......@@ -82,12 +84,24 @@ Models can be configured with the following optional fields:
- **`rate_limit_TPM`**: Maximum tokens allowed per minute (Tokens Per Minute)
- **`rate_limit_TPH`**: Maximum tokens allowed per hour (Tokens Per Hour)
- **`rate_limit_TPD`**: Maximum tokens allowed per day (Tokens Per Day)
- **`context_size`**: Maximum context size in tokens for the model. Used to determine when to trigger context condensation.
- **`condense_context`**: Percentage (0-100) at which to trigger context condensation. A value of 0 disables condensation; any other value triggers it when context reaches that percentage of `context_size`.
- **`condense_method`**: String or list of strings specifying condensation method(s). Supported values: "hierarchical", "conversational", "semantic", "algorithmic". Multiple methods can be chained together.
When token rate limits are exceeded, providers are automatically disabled:
- TPM limit exceeded: Provider disabled for 1 minute
- TPH limit exceeded: Provider disabled for 1 hour
- TPD limit exceeded: Provider disabled for 1 day
### Context Condensation Methods
When context exceeds the configured percentage of `context_size`, the system automatically condenses the prompt using one or more methods:
1. **Hierarchical**: Separates context into persistent (long-term facts) and transient (immediate task) layers
2. **Conversational**: Summarizes old messages using a smaller model to maintain conversation continuity
3. **Semantic**: Prunes irrelevant context based on current query using a smaller "janitor" model
4. **Algorithmic**: Uses mathematical compression for technical data and logs (similar to LLMLingua)
See `config/providers.json` and `config/rotations.json` for configuration examples.
## API Endpoints
......
......@@ -24,6 +24,7 @@ A modular proxy server for managing multiple AI provider integrations.
"""
from .config import config, Config, ProviderConfig, RotationConfig, AppConfig, AutoselectConfig, AutoselectModelInfo
from .context import ContextManager, get_context_config_for_model
from .models import (
Message,
ChatCompletionRequest,
......@@ -42,6 +43,7 @@ from .providers import (
PROVIDER_HANDLERS
)
from .handlers import RequestHandler, RotationHandler, AutoselectHandler
from .utils import count_messages_tokens, split_messages_into_chunks, get_max_request_tokens_for_model
__version__ = "0.3.0"
__all__ = [
......@@ -74,4 +76,11 @@ __all__ = [
"RequestHandler",
"RotationHandler",
"AutoselectHandler",
# Context
"ContextManager",
"get_context_config_for_model",
# Utils
"count_messages_tokens",
"split_messages_into_chunks",
"get_max_request_tokens_for_model",
]
......@@ -63,6 +63,9 @@ class Model(BaseModel):
rate_limit_TPM: Optional[int] = None # Max tokens per minute
rate_limit_TPH: Optional[int] = None # Max tokens per hour
rate_limit_TPD: Optional[int] = None # Max tokens per day
context_size: Optional[int] = None # Max context size in tokens for the model
condense_context: Optional[int] = None # Percentage (0-100) at which to condense context
condense_method: Optional[Union[str, List[str]]] = None # Method(s) for condensation: "hierarchical", "conversational", "semantic", "algorithmic"
class Provider(BaseModel):
id: str
......
......@@ -14,7 +14,10 @@
"max_request_tokens": 1000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
......@@ -22,7 +25,10 @@
"max_request_tokens": 2000000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
......
......@@ -14,7 +14,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "gemini-1.5-pro",
......@@ -23,7 +26,10 @@
"max_request_tokens": 100000,
"rate_limit_TPM": 15000,
"rate_limit_TPH": 100000,
"rate_limit_TPD": 1000000
"rate_limit_TPD": 1000000,
"context_size": 2000000,
"condense_context": 85,
"condense_method": "conversational"
}
]
},
......@@ -35,13 +41,19 @@
"name": "gpt-4",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 128000
"max_request_tokens": 128000,
"context_size": 128000,
"condense_context": 75,
"condense_method": ["hierarchical", "conversational"]
},
{
"name": "gpt-3.5-turbo",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 4000
"max_request_tokens": 4000,
"context_size": 16000,
"condense_context": 70,
"condense_method": "semantic"
}
]
},
......@@ -53,13 +65,19 @@
"name": "claude-3-5-sonnet-20241022",
"weight": 2,
"rate_limit": 0,
"max_request_tokens": 200000
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
},
{
"name": "claude-3-haiku-20240307",
"weight": 1,
"rate_limit": 0,
"max_request_tokens": 200000
"max_request_tokens": 200000,
"context_size": 200000,
"condense_context": 75,
"condense_method": "conversational"
}
]
}
......