Add context management feature with automatic condensation

- Add context_size, condense_context, and condense_method fields to Model class
- Create new context.py module with ContextManager and condensation methods
- Implement hierarchical, conversational, semantic, and algorithmic condensation
- Calculate and report effective_context for all requests
- Update handlers.py to apply context condensation when configured
- Update providers.json and rotations.json with example context configurations
- Update README.md and DOCUMENTATION.md with context management documentation
- Export context module and utilities in __init__.py
parent 8bad912b
@@ -264,6 +264,104 @@ When using autoselect models:
- **User Experience**: Provide optimal responses without manual model selection
- **Adaptive Selection**: Dynamically adjust model selection based on request characteristics
## Context Management
AISBF provides intelligent context management to handle large conversation histories and prevent exceeding model context limits:
### How Context Management Works
Context management automatically monitors and condenses conversation context:
1. **Effective Context Tracking**: Calculates and reports the total tokens used (`effective_context`) for every request
2. **Automatic Condensation**: When the context exceeds the configured percentage of the model's `context_size`, condensation is triggered
3. **Multiple Condensation Methods**: Supports hierarchical, conversational, semantic, and algorithmic condensation
4. **Method Chaining**: Multiple condensation methods can be applied in sequence for optimal results
### Context Configuration
Models can be configured with context management fields:
```json
{
"models": [
{
"name": "gemini-2.0-flash",
"context_size": 1000000,
"condense_context": 80,
"condense_method": ["hierarchical", "semantic"]
}
]
}
```
**Configuration Fields:**
- **`context_size`**: Maximum context size in tokens for the model
- **`condense_context`**: Percentage (0-100) at which to trigger condensation. 0 means disabled
- **`condense_method`**: String or list of strings specifying condensation method(s)
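With the example configuration above (`context_size` of 1,000,000 and `condense_context` of 80), condensation triggers once the prompt reaches 800,000 tokens. A minimal sketch of the trigger arithmetic (the helper name is illustrative):

```python
def should_condense(current_tokens: int, context_size: int, condense_pct: int) -> bool:
    """Return True when the context has reached the condensation threshold."""
    if condense_pct == 0:  # 0 disables condensation entirely
        return False
    threshold = int(context_size * (condense_pct / 100))
    return current_tokens >= threshold

# With the example configuration the threshold is 800,000 tokens
print(should_condense(799_999, 1_000_000, 80))  # False
print(should_condense(800_000, 1_000_000, 80))  # True
```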
### Condensation Methods
#### 1. Hierarchical Context Engineering
Separates context into persistent (long-term facts) and transient (immediate task) layers:
- **Persistent State**: Architecture, project state, core principles
- **Recent History**: Summarized conversation history
- **Active Code**: High-fidelity current code
- **Instruction**: Current task/goal
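The layering can be sketched over OpenAI-style message dicts; the helper name and the six-message recency window here are illustrative, not the exact implementation:

```python
from typing import Dict, List

def split_layers(messages: List[Dict], recent_count: int = 6) -> Dict[str, List[Dict]]:
    """Separate a chat history into hierarchical layers (illustrative sketch)."""
    persistent = [m for m in messages if m.get("role") == "system"]  # long-term facts
    transient = [m for m in messages if m.get("role") != "system"]
    return {
        "persistent": persistent,              # architecture, core principles
        "recent": transient[-recent_count:],   # high-fidelity active context
        "middle": transient[:-recent_count],   # candidates for summarization
    }

history = [{"role": "system", "content": "project rules"}] + [
    {"role": "user", "content": f"q{i}"} for i in range(8)
]
layers = split_layers(history)
print(len(layers["persistent"]), len(layers["middle"]), len(layers["recent"]))  # 1 2 6
```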
#### 2. Conversational Summarization (Memory Buffering)
Replaces old messages with high-density summaries:
- Uses a smaller model to summarize conversation progress
- Maintains continuity without hitting token caps
- Preserves key facts, decisions, and current goals
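A sketch of the buffering step, with `summarize` standing in for the call to a smaller model (the callable and the four-message window are assumptions for illustration):

```python
from typing import Callable, Dict, List

def buffer_memory(messages: List[Dict], summarize: Callable[[str], str],
                  keep_recent: int = 4) -> List[Dict]:
    """Replace old turns with one high-density summary message (sketch)."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to summarize
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = {"role": "system",
               "content": "[CONVERSATION SUMMARY]\n" + summarize(transcript)}
    return system + [summary] + recent
```

Storing the summary as a system message keeps the continuity visible to the model on every later turn.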
#### 3. Semantic Context Pruning (Observation Masking)
Removes irrelevant details based on current query:
- Uses a smaller "janitor" model to extract relevant facts
- Can reduce history by 50-80% without losing critical information
- Focuses on information relevant to the specific current request
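The prompt handed to the "janitor" model can be sketched as follows; the wording mirrors the implementation, while the helper name is illustrative:

```python
def build_prune_prompt(history_text: str, current_query: str) -> str:
    """Build the pruning prompt for the smaller 'janitor' model (sketch)."""
    return (
        f"Given the current query: '{current_query}'\n"
        "Extract ONLY the relevant facts from this conversation history. "
        "Ignore everything else that is not directly related to answering "
        "the current query.\n\n"
        f"Conversation History:\n{history_text}\n\n"
        "Provide only the relevant information in a concise format."
    )

prompt = build_prune_prompt("user: we chose PostgreSQL\nassistant: noted",
                            "Which database did we pick?")
```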
#### 4. Algorithmic Token Compression
Mathematical compression for technical data and logs:
- Similar to LLMLingua compression
- Achieves up to 20x compression for technical data
- Removes low-information tokens systematically
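A simplified version of this step collapses whitespace runs and drops exact consecutive repeats; real LLMLingua-style compressors go further and score token informativeness with a model:

```python
from typing import Dict, List

def compress_messages(messages: List[Dict]) -> List[Dict]:
    """Drop low-information content: redundant whitespace and exact repeats."""
    out: List[Dict] = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            content = " ".join(content.split())  # collapse whitespace runs
        if out and out[-1].get("role") == msg.get("role") \
                and out[-1].get("content") == content:
            continue  # skip exact consecutive duplicates from the same role
        out.append({"role": msg.get("role"), "content": content})
    return out

print(compress_messages([
    {"role": "user", "content": "hello   \n\n world"},
    {"role": "user", "content": "hello world"},
]))  # [{'role': 'user', 'content': 'hello world'}]
```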
### Effective Context Reporting
All responses include `effective_context` in the usage field:
**Non-streaming responses:**
```json
{
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500,
"effective_context": 1000
}
}
```
**Streaming responses:**
The final chunk includes effective_context:
```json
{
"usage": {
"prompt_tokens": null,
"completion_tokens": null,
"total_tokens": null,
"effective_context": 1000
}
}
```
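Client side, the final chunk can be recognized by its non-null `finish_reason`. A sketch of pulling `effective_context` out of an SSE stream (this assumes the standard `data: {...}` line convention shown above):

```python
import json
from typing import Iterable, Optional

def final_effective_context(sse_lines: Iterable[str]) -> Optional[int]:
    """Return effective_context from the chunk carrying finish_reason."""
    for line in sse_lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason") is not None:
            return chunk.get("usage", {}).get("effective_context")
    return None
```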
### Example Use Cases
- **Long Conversations**: Maintain context across extended conversations without hitting limits
- **Code Analysis**: Handle large codebases with intelligent context pruning
- **Document Processing**: Process large documents with automatic summarization
- **Multi-turn Tasks**: Maintain task context across multiple interactions
## Error Tracking and Rate Limiting
### Error Tracking
@@ -399,7 +497,7 @@ Stops running daemon and removes PID file.
- `Message` - Chat message structure
- `ChatCompletionRequest` - Request model
- `ChatCompletionResponse` - Response model
- `Model` - Model information (includes context_size, condense_context, condense_method fields)
- `Provider` - Provider information
- `ErrorTracking` - Error tracking data
@@ -411,9 +509,13 @@ Stops running daemon and removes PID file.
- `OllamaProviderHandler` - Ollama provider implementation
- `get_provider_handler()` - Factory function for provider handlers
### aisbf/context.py
- `ContextManager` - Context management class for automatic condensation
- `get_context_config_for_model()` - Retrieves context configuration from provider or rotation model config
### aisbf/handlers.py
- `RequestHandler` - Request handling logic with streaming support and context management
- `RotationHandler` - Rotation handling logic with streaming support and context management
- `AutoselectHandler` - AI-assisted model selection with streaming support
## Dependencies
@@ -426,6 +528,8 @@ Key dependencies from requirements.txt:
- google-genai - Google AI SDK
- openai - OpenAI SDK
- anthropic - Anthropic SDK
- langchain-text-splitters - Intelligent text splitting for request chunking
- tiktoken - Accurate token counting for context management
## Adding New Providers
...
@@ -13,6 +13,8 @@ A modular proxy server for managing multiple AI provider integrations with unified API
- **Request Splitting**: Automatic splitting of large requests when exceeding `max_request_tokens` limit
- **Token Rate Limiting**: Per-model token usage tracking with TPM (tokens per minute), TPH (tokens per hour), and TPD (tokens per day) limits
- **Automatic Provider Disabling**: Providers automatically disabled when token rate limits are exceeded
- **Context Management**: Automatic context condensation when approaching model limits with multiple condensation methods
- **Effective Context Tracking**: Reports total tokens used (effective_context) for every request
## Author
@@ -82,12 +84,24 @@ Models can be configured with the following optional fields:
- **`rate_limit_TPM`**: Maximum tokens allowed per minute (Tokens Per Minute)
- **`rate_limit_TPH`**: Maximum tokens allowed per hour (Tokens Per Hour)
- **`rate_limit_TPD`**: Maximum tokens allowed per day (Tokens Per Day)
- **`context_size`**: Maximum context size in tokens for the model. Used to determine when to trigger context condensation.
- **`condense_context`**: Percentage (0-100) at which to trigger context condensation. 0 means disabled, any other value triggers condensation when context reaches this percentage of context_size.
- **`condense_method`**: String or list of strings specifying condensation method(s). Supported values: "hierarchical", "conversational", "semantic", "algorithmic". Multiple methods can be chained together.
When token rate limits are exceeded, providers are automatically disabled:
- TPM limit exceeded: Provider disabled for 1 minute
- TPH limit exceeded: Provider disabled for 1 hour
- TPD limit exceeded: Provider disabled for 1 day
### Context Condensation Methods
When context exceeds the configured percentage of `context_size`, the system automatically condenses the prompt using one or more methods:
1. **Hierarchical**: Separates context into persistent (long-term facts) and transient (immediate task) layers
2. **Conversational**: Summarizes old messages using a smaller model to maintain conversation continuity
3. **Semantic**: Prunes irrelevant context based on current query using a smaller "janitor" model
4. **Algorithmic**: Uses mathematical compression for technical data and logs (similar to LLMLingua)
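For instance, a provider or rotation model entry chaining two methods might look like this (illustrative values):

```json
{
  "name": "gemini-2.0-flash",
  "context_size": 1000000,
  "condense_context": 80,
  "condense_method": ["conversational", "algorithmic"]
}
```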
See `config/providers.json` and `config/rotations.json` for configuration examples.
## API Endpoints
...
@@ -24,6 +24,7 @@ A modular proxy server for managing multiple AI provider integrations.
"""
from .config import config, Config, ProviderConfig, RotationConfig, AppConfig, AutoselectConfig, AutoselectModelInfo
from .context import ContextManager, get_context_config_for_model
from .models import (
Message,
ChatCompletionRequest,
@@ -42,6 +43,7 @@ from .providers import (
PROVIDER_HANDLERS
)
from .handlers import RequestHandler, RotationHandler, AutoselectHandler
from .utils import count_messages_tokens, split_messages_into_chunks, get_max_request_tokens_for_model
__version__ = "0.3.0"
__all__ = [
@@ -74,4 +76,11 @@ __all__ = [
"RequestHandler",
"RotationHandler",
"AutoselectHandler",
# Context
"ContextManager",
"get_context_config_for_model",
# Utils
"count_messages_tokens",
"split_messages_into_chunks",
"get_max_request_tokens_for_model",
]
"""
Copyleft (C) 2026 Stefy Lanza <stefy@nexlab.net>
AISBF - AI Service Broker Framework || AI Should Be Free
Context management and condensation for AISBF.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Why did the programmer quit his job? Because he didn't get arrays!
"""
import logging
from typing import Dict, List, Optional, Union, Any
from .utils import count_messages_tokens
class ContextManager:
"""
Manages context size and performs condensation when needed.
"""
def __init__(self, model_config: Dict, provider_handler=None):
"""
Initialize the context manager.
Args:
model_config: Model configuration dictionary containing context_size, condense_context, condense_method
provider_handler: Optional provider handler for making summarization requests
"""
self.context_size = model_config.get('context_size')
# Stored as condense_pct so the percentage does not shadow the condense_context() method below
self.condense_pct = model_config.get('condense_context', 0)
self.condense_method = model_config.get('condense_method')
self.provider_handler = provider_handler
# Clamp condense_pct to the 0-100 range
if self.condense_pct and self.condense_pct > 100:
self.condense_pct = 100
# Normalize condense_method to a list
if self.condense_method:
if isinstance(self.condense_method, str):
self.condense_method = [self.condense_method]
else:
self.condense_method = []
# Track conversation history for summarization
self.conversation_summary = None
self.summary_token_count = 0
logger = logging.getLogger(__name__)
logger.info("ContextManager initialized:")
logger.info(f"  context_size: {self.context_size}")
logger.info(f"  condense_context: {self.condense_pct}%")
logger.info(f"  condense_method: {self.condense_method}")
def should_condense(self, messages: List[Dict], model: str) -> bool:
"""
Check if context condensation is needed.
Args:
messages: List of messages to check
model: Model name for token counting
Returns:
True if condensation is needed, False otherwise
"""
if not self.context_size or not self.condense_pct:
return False
# Calculate current token count
current_tokens = count_messages_tokens(messages, model)
# Condense once usage reaches the configured percentage of context_size
threshold = int(self.context_size * (self.condense_pct / 100))
logger = logging.getLogger(__name__)
logger.info(f"Context check: {current_tokens} / {self.context_size} tokens (threshold: {threshold})")
return current_tokens >= threshold
async def condense_context(
self,
messages: List[Dict],
model: str,
current_query: Optional[str] = None
) -> List[Dict]:
"""
Condense the context using configured methods.
Args:
messages: List of messages to condense
model: Model name for token counting
current_query: Optional current query for semantic pruning
Returns:
Condensed list of messages
"""
logger = logging.getLogger(__name__)
logger.info(f"=== CONTEXT CONDENSATION START ===")
logger.info(f"Original messages count: {len(messages)}")
logger.info(f"Condensation methods: {self.condense_method}")
condensed_messages = messages.copy()
# Apply each condensation method in sequence
for method in self.condense_method:
logger.info(f"Applying method: {method}")
if method == "hierarchical":
condensed_messages = self._hierarchical_condense(condensed_messages, model)
elif method == "conversational":
condensed_messages = await self._conversational_condense(condensed_messages, model)
elif method == "semantic":
condensed_messages = await self._semantic_condense(condensed_messages, model, current_query)
elif method == "algorithmic":
condensed_messages = self._algorithmic_condense(condensed_messages, model)
else:
logger.warning(f"Unknown condensation method: {method}")
# Calculate token reduction
original_tokens = count_messages_tokens(messages, model)
condensed_tokens = count_messages_tokens(condensed_messages, model)
reduction = original_tokens - condensed_tokens
reduction_pct = (reduction / original_tokens * 100) if original_tokens > 0 else 0
logger.info(f"=== CONTEXT CONDENSATION END ===")
logger.info(f"Original tokens: {original_tokens}")
logger.info(f"Condensed tokens: {condensed_tokens}")
logger.info(f"Reduction: {reduction} tokens ({reduction_pct:.1f}%)")
logger.info(f"Final messages count: {len(condensed_messages)}")
return condensed_messages
def _hierarchical_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
HIERARCHICAL CONTEXT ENGINEERING
Separate context into 'Persistent' (long-term facts) and 'Transient' (immediate task).
Uses "Step-Back Prompting" to identify core principles before answering.
Structure:
- PERSISTENT STATE (Architecture): System messages and early context
- RECENT HISTORY (Summarized): Middle messages
- ACTIVE CODE (High Fidelity): Recent messages
- INSTRUCTION: Current task
"""
logger = logging.getLogger(__name__)
logger.info(f"Hierarchical condensation: {len(messages)} messages")
if len(messages) <= 2:
# Not enough messages to condense
return messages
# Separate messages into categories
system_messages = [m for m in messages if m.get('role') == 'system']
user_messages = [m for m in messages if m.get('role') == 'user']
assistant_messages = [m for m in messages if m.get('role') == 'assistant']
# Keep all system messages (persistent state)
persistent = system_messages.copy()
# Keep recent messages at high fidelity (roughly the last 3 exchanges)
recent_count = min(6, len(user_messages) + len(assistant_messages))
# Get the last few messages in their original order
all_messages_except_system = [m for m in messages if m.get('role') != 'system']
recent_messages = all_messages_except_system[-recent_count:]
# Middle messages form the transient layer; dropping them is what actually reduces the token count
# (chain the conversational or semantic methods to summarize them instead of dropping)
middle_messages = all_messages_except_system[:-recent_count]
condensed = persistent + recent_messages
logger.info(f"Hierarchical: {len(persistent)} persistent, {len(middle_messages)} dropped, {len(recent_messages)} recent")
return condensed
async def _conversational_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
CONVERSATIONAL SUMMARIZATION (MEMORY BUFFERING)
Replace old messages with a high-density summary.
Uses a maintenance prompt to summarize progress.
"""
logger = logging.getLogger(__name__)
logger.info(f"Conversational condensation: {len(messages)} messages")
if not self.provider_handler:
logger.warning("No provider handler available for conversational condensation, skipping")
return messages
if len(messages) <= 4:
# Not enough messages to condense
return messages
# Keep system messages
system_messages = [m for m in messages if m.get('role') == 'system']
# Keep last 2 exchanges (4 messages)
recent_messages = messages[-4:]
# Messages to summarize (everything between system and recent)
messages_to_summarize = messages[len(system_messages):-4]
if not messages_to_summarize:
return messages
# Build summary prompt
summary_prompt = "Summarize the following conversation history, including key facts, decisions, and the current goal. Keep it concise but comprehensive.\n\n"
for msg in messages_to_summarize:
role = msg.get('role', 'unknown')
content = msg.get('content', '')
if content:
summary_prompt += f"{role}: {content}\n"
try:
# Request summary from the model
summary_messages = [{"role": "user", "content": summary_prompt}]
summary_response = await self.provider_handler.handle_request(
model=model,
messages=summary_messages,
max_tokens=1000,
temperature=0.3,
stream=False
)
# Extract summary content
if isinstance(summary_response, dict):
summary_content = summary_response.get('choices', [{}])[0].get('message', {}).get('content', '')
if summary_content:
# Create summary message
summary_message = {
"role": "system",
"content": f"[CONVERSATION SUMMARY]\n{summary_content}"
}
# Build condensed messages: system + summary + recent
condensed = system_messages + [summary_message] + recent_messages
# Update stored summary
self.conversation_summary = summary_content
self.summary_token_count = count_messages_tokens([summary_message], model)
logger.info(f"Conversational: Created summary ({len(summary_content)} chars)")
return condensed
except Exception as e:
logger.error(f"Error during conversational condensation: {e}")
# Fallback: return original messages
return messages
async def _semantic_condense(
self,
messages: List[Dict],
model: str,
current_query: Optional[str] = None
) -> List[Dict]:
"""
SEMANTIC CONTEXT PRUNING (OBSERVATION MASKING)
Remove or hide old, non-critical details that are irrelevant to the current task.
Uses a smaller model as a "janitor" to extract only relevant info.
"""
logger = logging.getLogger(__name__)
logger.info(f"Semantic condensation: {len(messages)} messages")
if not self.provider_handler:
logger.warning("No provider handler available for semantic condensation, skipping")
return messages
if len(messages) <= 2:
return messages
# Keep system messages
system_messages = [m for m in messages if m.get('role') == 'system']
# Get conversation history (excluding system)
conversation = [m for m in messages if m.get('role') != 'system']
if not conversation:
return messages
# Build conversation text
conversation_text = ""
for msg in conversation:
role = msg.get('role', 'unknown')
content = msg.get('content', '')
if content:
conversation_text += f"{role}: {content}\n"
# Build pruning prompt
if current_query:
prune_prompt = f"""Given the current query: '{current_query}'
Extract ONLY the relevant facts from this conversation history. Ignore everything else that is not directly related to answering the current query.
Conversation History:
{conversation_text}
Provide only the relevant information in a concise format."""
else:
prune_prompt = f"""Extract the most important and relevant information from this conversation history. Focus on key facts, decisions, and context that would be needed for future queries.
Conversation History:
{conversation_text}
Provide only the relevant information in a concise format."""
try:
# Request pruned context from the model
prune_messages = [{"role": "user", "content": prune_prompt}]
prune_response = await self.provider_handler.handle_request(
model=model,
messages=prune_messages,
max_tokens=2000,
temperature=0.2,
stream=False
)
# Extract pruned content
if isinstance(prune_response, dict):
pruned_content = prune_response.get('choices', [{}])[0].get('message', {}).get('content', '')
if pruned_content:
# Create pruned context message
pruned_message = {
"role": "system",
"content": f"[RELEVANT CONTEXT]\n{pruned_content}"
}
# Build condensed messages: system + pruned + last user message
last_message = messages[-1] if messages else None
if last_message and last_message.get('role') != 'system':
condensed = system_messages + [pruned_message, last_message]
else:
condensed = system_messages + [pruned_message]
logger.info(f"Semantic: Pruned to relevant context ({len(pruned_content)} chars)")
return condensed
except Exception as e:
logger.error(f"Error during semantic condensation: {e}")
# Fallback: return original messages
return messages
def _algorithmic_condense(self, messages: List[Dict], model: str) -> List[Dict]:
"""
ALGORITHMIC TOKEN COMPRESSION (LLMLingua-style)
Mathematically remove "low-information" tokens.
This is a simplified version that removes redundant content.
"""
logger = logging.getLogger(__name__)
logger.info(f"Algorithmic condensation: {len(messages)} messages")
condensed = []
for msg in messages:
role = msg.get('role')
content = msg.get('content')
if not content:
condensed.append(msg)
continue
# Skip very short messages (low information density); empty-content messages were already kept above
if len(str(content)) < 10:
continue
# Remove duplicate consecutive messages from same role
if condensed and condensed[-1].get('role') == role:
prev_content = str(condensed[-1].get('content', ''))
curr_content = str(content)
# If very similar, skip
if prev_content == curr_content:
logger.debug(f"Skipping duplicate message from {role}")
continue
# Remove excessive whitespace
if isinstance(content, str):
content = ' '.join(content.split())
condensed.append({
"role": role,
"content": content
})
logger.info(f"Algorithmic: Reduced from {len(messages)} to {len(condensed)} messages")
return condensed
def get_context_config_for_model(
model_name: str,
provider_config: Any = None,
rotation_model_config: Optional[Dict] = None
) -> Dict:
"""
Get context configuration for a specific model.
Args:
model_name: Name of the model
provider_config: Provider configuration (optional)
rotation_model_config: Rotation model configuration (optional)
Returns:
Dictionary with context_size, condense_context, and condense_method
"""
context_config = {
'context_size': None,
'condense_context': 0,
'condense_method': None
}
# Check rotation model config first (highest priority)
if rotation_model_config:
context_config['context_size'] = rotation_model_config.get('context_size')
context_config['condense_context'] = rotation_model_config.get('condense_context', 0)
context_config['condense_method'] = rotation_model_config.get('condense_method')
# Fall back to provider config
elif provider_config and hasattr(provider_config, 'models'):
for model in provider_config.models:
if model.get('name') == model_name:
context_config['context_size'] = model.get('context_size')
context_config['condense_context'] = model.get('condense_context', 0)
context_config['condense_method'] = model.get('condense_method')
break
return context_config
@@ -39,6 +39,7 @@ from .utils import (
split_messages_into_chunks,
get_max_request_tokens_for_model
)
from .context import ContextManager, get_context_config_for_model
def generate_system_fingerprint(provider_id: str, seed: Optional[int] = None) -> str:
@@ -241,6 +242,14 @@ class RequestHandler:
logger.info(f"Temperature: {request_data.get('temperature', 1.0)}")
logger.info(f"Stream: {request_data.get('stream', False)}")
# Get context configuration
context_config = get_context_config_for_model(
model_name=model,
provider_config=provider_config,
rotation_model_config=None
)
logger.info(f"Context config: {context_config}")
# Check for max_request_tokens in provider config
max_request_tokens = get_max_request_tokens_for_model(
model_name=model,
@@ -248,6 +257,19 @@ class RequestHandler:
rotation_model_config=None
)
# Calculate effective context (total tokens used)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Effective context: {effective_context} tokens")
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model):
logger.info("Context condensation triggered")
messages = await context_manager.condense_context(messages, model)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Condensed effective context: {effective_context} tokens")
if max_request_tokens:
# Count tokens in the request
request_tokens = count_messages_tokens(messages, model)
@@ -299,6 +321,11 @@ class RequestHandler:
logger.info(f"Response type: {type(response)}")
logger.info(f"Response: {response}")
# Add effective context to response for non-streaming
if isinstance(response, dict) and 'usage' in response:
response['usage']['effective_context'] = effective_context
logger.info(f"Added effective_context to response: {effective_context}")
# For OpenAI-compatible providers, the response is already a response object
# Just return it as-is without any parsing or modification
handler.record_success()
@@ -327,6 +354,32 @@ class RequestHandler:
# If seed is present in request, generate unique fingerprint per request
seed = request_data.get('seed')
system_fingerprint = generate_system_fingerprint(provider_id, seed)
# Get context configuration and calculate effective context
model = request_data.get('model')
messages = request_data.get('messages', [])
context_config = get_context_config_for_model(
model_name=model,
provider_config=provider_config,
rotation_model_config=None
)
effective_context = count_messages_tokens(messages, model)
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model):
import logging
logger = logging.getLogger(__name__)
logger.info("Context condensation triggered for streaming request")
messages = await context_manager.condense_context(messages, model)
effective_context = count_messages_tokens(messages, model)
logger.info(f"Condensed effective context: {effective_context} tokens")
# Update request_data with condensed messages
request_data['messages'] = messages
async def stream_generator():
import logging
@@ -457,7 +510,8 @@ class RequestHandler:
"usage": {
"prompt_tokens": None,
"completion_tokens": None,
"total_tokens": None,
"effective_context": effective_context
},
"provider": provider_id,
"choices": [{
@@ -488,6 +542,16 @@ class RequestHandler:
# For OpenAI-compatible providers, just pass through the raw chunk
# Convert chunk to dict and serialize as JSON
chunk_dict = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
# Add effective_context to the last chunk (when finish_reason is present)
if isinstance(chunk_dict, dict):
choices = chunk_dict.get('choices', [])
if choices and choices[0].get('finish_reason') is not None:
# This is the last chunk, add effective_context
if 'usage' not in chunk_dict:
chunk_dict['usage'] = {}
chunk_dict['usage']['effective_context'] = effective_context
yield f"data: {json.dumps(chunk_dict)}\n\n".encode('utf-8')
except Exception as chunk_error:
# Handle errors during chunk serialization
@@ -904,6 +968,30 @@ class RotationHandler:
logger.info(f"Temperature: {request_data.get('temperature', 1.0)}")
logger.info(f"Stream: {request_data.get('stream', False)}")
# Get context configuration
context_config = get_context_config_for_model(
model_name=model_name,
provider_config=None,
rotation_model_config=current_model
)
logger.info(f"Context config: {context_config}")
# Calculate effective context
messages = request_data['messages']
effective_context = count_messages_tokens(messages, model_name)
logger.info(f"Effective context: {effective_context} tokens")
# Apply context condensation if needed
if context_config.get('condense_context', 0) > 0:
context_manager = ContextManager(context_config, handler)
if context_manager.should_condense(messages, model_name):
logger.info("Context condensation triggered")
messages = await context_manager.condense_context(messages, model_name)
effective_context = count_messages_tokens(messages, model_name)
logger.info(f"Condensed effective context: {effective_context} tokens")
# Update request_data with condensed messages
request_data['messages'] = messages
# Check for max_request_tokens in rotation model config # Check for max_request_tokens in rotation model config
max_request_tokens = current_model.get('max_request_tokens') max_request_tokens = current_model.get('max_request_tokens')
if max_request_tokens: if max_request_tokens:
...@@ -984,6 +1072,10 @@ class RotationHandler: ...@@ -984,6 +1072,10 @@ class RotationHandler:
if total_tokens > 0: if total_tokens > 0:
handler._record_token_usage(model_name, total_tokens) handler._record_token_usage(model_name, total_tokens)
logger.info(f"Recorded {total_tokens} tokens for model {model_name}") logger.info(f"Recorded {total_tokens} tokens for model {model_name}")
# Add effective context to response for non-streaming
usage['effective_context'] = effective_context
logger.info(f"Added effective_context to response: {effective_context}")
handler.record_success() handler.record_success()
@@ -1177,7 +1269,8 @@ class RotationHandler:
                 "usage": {
                     "prompt_tokens": None,
                     "completion_tokens": None,
-                    "total_tokens": None
+                    "total_tokens": None,
+                    "effective_context": effective_context
                 },
                 "provider": provider_id,
                 "choices": [{
@@ -1206,6 +1299,16 @@ class RotationHandler:
                     # For OpenAI-compatible providers, just pass through the raw chunk
                     chunk_dict = chunk.model_dump() if hasattr(chunk, 'model_dump') else chunk
+                    # Add effective_context to the last chunk (when finish_reason is present)
+                    if isinstance(chunk_dict, dict):
+                        choices = chunk_dict.get('choices', [])
+                        if choices and choices[0].get('finish_reason') is not None:
+                            # This is the last chunk, add effective_context
+                            if 'usage' not in chunk_dict:
+                                chunk_dict['usage'] = {}
+                            chunk_dict['usage']['effective_context'] = effective_context
                     yield f"data: {json.dumps(chunk_dict)}\n\n".encode('utf-8')
                 except Exception as chunk_error:
                     error_msg = str(chunk_error)
@@ -1284,7 +1387,7 @@ class AutoselectHandler:
         # Build the complete prompt
         prompt = f"""{skill_content}
 <aisbf_user_prompt>{user_prompt}</aisbf_user_prompt>
 <aisbf_autoselect_list>
 {models_list}
@@ -1519,7 +1622,7 @@ class AutoselectHandler:
         return response
     async def handle_autoselect_model_list(self, autoselect_id: str) -> List[Dict]:
-        """List available models for an autoselect endpoint"""
+        """List the available models for an autoselect endpoint"""
         autoselect_config = self.config.get_autoselect(autoselect_id)
         if not autoselect_config:
             raise HTTPException(status_code=400, detail=f"Autoselect {autoselect_id} not found")
...
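The condensation trigger shown in the `RotationHandler` hunk above boils down to a percentage threshold check. A minimal, self-contained sketch of that check follows; `count_tokens` here is a crude hypothetical stand-in for the project's real `count_messages_tokens` helper, not its actual implementation.

```python
# Illustrative sketch of the condense_context threshold logic, not the
# actual context.py implementation.

def count_tokens(messages):
    # Hypothetical approximation: roughly 4 characters per token.
    return sum(len(m.get("content", "")) for m in messages) // 4

def should_condense(messages, context_size, condense_context):
    """Return True when usage exceeds condense_context percent of context_size."""
    if not context_size or not condense_context:
        return False
    threshold = context_size * condense_context / 100
    return count_tokens(messages) > threshold

messages = [{"role": "user", "content": "x" * 4000}]  # ~1000 tokens
print(should_condense(messages, context_size=1000, condense_context=80))  # True
```

With `context_size=1000` and `condense_context=80`, the threshold is 800 tokens, so a ~1000-token history triggers condensation; against a 1,000,000-token window the same history would pass through untouched.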
@@ -63,6 +63,9 @@ class Model(BaseModel):
     rate_limit_TPM: Optional[int] = None  # Max tokens per minute
     rate_limit_TPH: Optional[int] = None  # Max tokens per hour
     rate_limit_TPD: Optional[int] = None  # Max tokens per day
+    context_size: Optional[int] = None  # Max context size in tokens for the model
+    condense_context: Optional[int] = None  # Percentage (0-100) at which to condense context
+    condense_method: Optional[Union[str, List[str]]] = None  # Method(s) for condensation: "hierarchical", "conversational", "semantic", "algorithmic"

 class Provider(BaseModel):
     id: str
...
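Since `condense_method` accepts either a string or a list (enabling method chaining), consumers of the config need to normalize it, and `condense_context` only makes sense in 0-100. A stdlib-only sketch of that normalization and validation, illustrative rather than the project's actual code:

```python
# Hypothetical helper mirroring the three new Model fields; the real
# project uses Pydantic, this dataclass is just for illustration.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class ContextConfig:
    context_size: Optional[int] = None
    condense_context: Optional[int] = None  # percentage, 0-100
    condense_method: Optional[Union[str, List[str]]] = None

    def __post_init__(self):
        # Reject out-of-range percentages early.
        if self.condense_context is not None and not 0 <= self.condense_context <= 100:
            raise ValueError("condense_context must be between 0 and 100")

    def methods(self) -> List[str]:
        # Normalize str-or-list into a list so methods can be chained in order.
        if self.condense_method is None:
            return []
        if isinstance(self.condense_method, str):
            return [self.condense_method]
        return list(self.condense_method)

cfg = ContextConfig(context_size=1000000, condense_context=80,
                    condense_method=["hierarchical", "semantic"])
print(cfg.methods())  # ['hierarchical', 'semantic']
```

Normalizing to a list up front keeps the condensation pipeline simple: it can always iterate over `methods()` and apply each one in sequence.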
@@ -14,7 +14,10 @@
         "max_request_tokens": 1000000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 1000000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "gemini-1.5-pro",
@@ -22,7 +25,10 @@
         "max_request_tokens": 2000000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 2000000,
+        "condense_context": 85,
+        "condense_method": "conversational"
       }
     ]
   },
...
@@ -14,7 +14,10 @@
         "max_request_tokens": 100000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 1000000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "gemini-1.5-pro",
@@ -23,7 +26,10 @@
         "max_request_tokens": 100000,
         "rate_limit_TPM": 15000,
         "rate_limit_TPH": 100000,
-        "rate_limit_TPD": 1000000
+        "rate_limit_TPD": 1000000,
+        "context_size": 2000000,
+        "condense_context": 85,
+        "condense_method": "conversational"
       }
     ]
   },
@@ -35,13 +41,19 @@
         "name": "gpt-4",
         "weight": 2,
         "rate_limit": 0,
-        "max_request_tokens": 128000
+        "max_request_tokens": 128000,
+        "context_size": 128000,
+        "condense_context": 75,
+        "condense_method": ["hierarchical", "conversational"]
       },
       {
         "name": "gpt-3.5-turbo",
         "weight": 1,
         "rate_limit": 0,
-        "max_request_tokens": 4000
+        "max_request_tokens": 4000,
+        "context_size": 16000,
+        "condense_context": 70,
+        "condense_method": "semantic"
       }
     ]
   },
@@ -53,13 +65,19 @@
         "name": "claude-3-5-sonnet-20241022",
         "weight": 2,
         "rate_limit": 0,
-        "max_request_tokens": 200000
+        "max_request_tokens": 200000,
+        "context_size": 200000,
+        "condense_context": 80,
+        "condense_method": ["hierarchical", "semantic"]
       },
       {
         "name": "claude-3-haiku-20240307",
         "weight": 1,
         "rate_limit": 0,
-        "max_request_tokens": 200000
+        "max_request_tokens": 200000,
+        "context_size": 200000,
+        "condense_context": 75,
+        "condense_method": "conversational"
       }
     ]
   }
...
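On the streaming side, both handlers attach `effective_context` only to the final SSE chunk, detected by a non-null `finish_reason` on the first choice. The core of that logic, extracted into a standalone function for clarity (the function name is illustrative; the diff inlines this in the chunk loop):

```python
# Mirrors the streaming logic added in handlers.py: effective_context is
# attached only to the last chunk, identified by a non-null finish_reason.
def tag_final_chunk(chunk_dict, effective_context):
    choices = chunk_dict.get("choices", [])
    if choices and choices[0].get("finish_reason") is not None:
        # setdefault covers providers that omit a usage block in chunks.
        chunk_dict.setdefault("usage", {})["effective_context"] = effective_context
    return chunk_dict

mid = tag_final_chunk({"choices": [{"finish_reason": None}]}, 1234)
last = tag_final_chunk({"choices": [{"finish_reason": "stop"}]}, 1234)
print("usage" in mid, last["usage"]["effective_context"])  # False 1234
```

Guarding on both `choices` being non-empty and `finish_reason` being non-null means intermediate delta chunks and empty keep-alive chunks pass through unmodified.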