Error Handling#
📚 What You’ll Learn
Key Concepts:
How the framework provides centralized error handling with LLM-generated responses
Error classification systems and retry policies for different failure types
Recovery strategy implementation and graceful degradation patterns
Best practices for building resilient error experiences
Prerequisites: Understanding of Classification and Routing and Message Generation
Time Investment: 20 minutes for complete understanding
Core Problem#
Agentic systems present a unique opportunity to handle errors more intelligently than traditional software applications. With capable language models at our disposal and users who are often domain experts rather than developers, we can move beyond simply throwing stack traces at users. Doing so requires addressing several considerations:
Error Context Complexity: Errors occur at different levels with varying context
User Communication: Domain experts need actionable explanations, not technical stack traces
Recovery Strategy Selection: Different error types require different recovery approaches
State Consistency: Errors must be handled without corrupting execution flow
LLM Capabilities: Language models excel at interpreting code errors and generating user-friendly explanations
Framework Solution: Centralized error processing with LLM-generated responses and classification-based recovery.
🚧 Development Note
The LLM-powered error analysis system is actively being optimized. Current focus areas include refining context provision to language models and improving response quality. Users should expect that generated error messages may not always be as informative as desired while these improvements are being implemented.
Architecture Overview#
The error handling system combines intelligent classification, automated recovery, and user-friendly communication:
@infrastructure_node
class ErrorNode(BaseInfrastructureNode):
    name = "error"
    description = "Error Response Generation"

    @staticmethod
    def classify_error(exc: Exception, context: dict):
        # FATAL classification prevents infinite loops
        return ErrorClassification(
            severity=ErrorSeverity.FATAL,
            user_message="Error node failed during error handling"
        )

    @staticmethod
    async def execute(state: AgentState, **kwargs):
        try:
            # Create error context from state
            error_context = _create_error_context_from_state(state)
            # Generate LLM response with context
            response = await _generate_error_response(error_context)
            return {"messages": [AIMessage(content=response)]}
        except Exception as e:
            # Fallback response if error handling fails
            return {"messages": [AIMessage(content=_create_fallback_response(e, state))]}
Key Principles:
1. Centralized Processing: Single error node handles all scenarios
2. Context Preservation: Errors don’t lose conversation or execution context
3. Graceful Degradation: System continues functioning when components fail
4. User-Friendly Communication: Technical errors become actionable explanations
Error Classification System#
Different error types trigger appropriate recovery strategies using the ErrorSeverity enum:
from enum import Enum

class ErrorSeverity(Enum):
    CRITICAL = "critical"                  # End execution
    RETRIABLE = "retriable"                # Retry execution step
    REPLANNING = "replanning"              # Replan the execution plan
    RECLASSIFICATION = "reclassification"  # Reclassify task capabilities
    FATAL = "fatal"                        # System-level failure - raise exception immediately
Error Classification Example:
# Example capability error classification
@staticmethod
def classify_error(exc: Exception, context: dict):
    # Retry network timeouts
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network timeout, retrying...",
            metadata={"technical_details": str(exc)}
        )
    # Don't retry validation errors
    if isinstance(exc, (ValueError, TypeError)):
        return ErrorClassification(
            severity=ErrorSeverity.CRITICAL,
            user_message="Configuration error",
            metadata={
                "technical_details": str(exc),
                "safety_abort_reason": f"Configuration validation failed: {exc}",
                "suggestions": ["Check configuration parameters", "Verify required fields"]
            }
        )
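To see how a classifier of this shape behaves on concrete exceptions, here is a minimal self-contained sketch. The `ErrorSeverity` and `ErrorClassification` definitions are simplified stand-ins written for this example (the framework's real classes may carry more fields), and returning `None` for unrecognized exceptions simply mirrors the fall-through in the excerpt above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    CRITICAL = "critical"
    RETRIABLE = "retriable"


@dataclass
class ErrorClassification:
    # Simplified stand-in for the framework class (assumption).
    severity: ErrorSeverity
    user_message: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def classify_error(exc: Exception, context: dict) -> Optional[ErrorClassification]:
    # Transient network problems are worth retrying.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network timeout, retrying...",
            metadata={"technical_details": str(exc)},
        )
    # Validation problems will not fix themselves, so do not retry.
    if isinstance(exc, (ValueError, TypeError)):
        return ErrorClassification(
            severity=ErrorSeverity.CRITICAL,
            user_message="Configuration error",
            metadata={"technical_details": str(exc)},
        )
    # Anything else falls through unclassified in this sketch.
    return None


timeout_result = classify_error(TimeoutError("read timed out"), {})
value_result = classify_error(ValueError("bad port"), {})
other_result = classify_error(KeyError("missing"), {})
```

In practice a capability would probably end with an explicit default classification rather than `None`, as the `MyCapability` example under Integration Patterns does.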
💡 Metadata Best Practices
Enhanced User Experience Through Structured Metadata
The framework supports flexible metadata structures, but adopting these established patterns enhances error analysis capabilities and improves user communication quality. All metadata content is forwarded to the LLM for dynamic error response generation.
Recommended Metadata Patterns:
# Basic technical details
metadata = {
    "technical_details": "Connection timeout after 30 seconds"
}

# Enhanced error context with user guidance
metadata = {
    "technical_details": "Missing required parameter 'api_key'",
    "safety_abort_reason": "Agent execution aborted due to missing critical parameter",
    "suggestions": ["Check configuration", "Check environment variables", "Verify credentials"],
}

# Replanning context for orchestrator
metadata = {
    "technical_details": "Step expected 'SENSOR_DATA' but found None",
    "replanning_reason": "Missing required input data",
    "required_context": ["SENSOR_DATA", "TIME_RANGE"],
    # Kept as a list for consistency with the other patterns
    "suggestions": ["Ask the user a clarifying question if required context and capabilities are not available"]
}
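Because all metadata is forwarded to the LLM, it can help to keep field shapes uniform before serialization. The helper below is a hypothetical sketch and not part of the framework: `normalize_metadata` and `LIST_FIELDS` are names invented here, and the only assumption is that `suggestions` and `required_context` are most useful as lists of strings.

```python
import json
from typing import Any, Dict

# Keys assumed (for this sketch) to hold lists of strings.
LIST_FIELDS = ("suggestions", "required_context")


def normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy in which list-valued fields are always lists of strings."""
    normalized = dict(metadata)
    for key in LIST_FIELDS:
        if key in normalized:
            value = normalized[key]
            if isinstance(value, str):
                # Wrap a bare string so it is not iterated character by character.
                value = [value]
            normalized[key] = [str(item) for item in value]
    return normalized


raw = {
    "technical_details": "Step expected 'SENSOR_DATA' but found None",
    "suggestions": "Ask the user a clarifying question",
}
clean = normalize_metadata(raw)
serialized = json.dumps(clean, indent=2, ensure_ascii=False)
```

The input dictionary is left untouched, so the original classification metadata can still be logged as it was produced.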
Error Context Generation#
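The system automatically creates a comprehensive error context:
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ErrorContext:
    error_type: ErrorType
    error_message: str
    failed_operation: str
    current_task: str
    capability_name: Optional[str] = None
    error_metadata: Optional[Dict[str, Any]] = None
    total_operations: int = 0
    execution_time: Optional[float] = None
    retry_count: Optional[int] = None
    # default_factory gives each instance its own list; a None default cannot be appended to
    successful_steps: List[str] = field(default_factory=list)
    failed_steps: List[str] = field(default_factory=list)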
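List-valued dataclass fields such as `successful_steps` and `failed_steps` deserve some care: a `None` default has to be replaced before anything can be appended, and a bare `[]` default is rejected by `dataclasses`. A minimal sketch of the `field(default_factory=list)` idiom follows, using a trimmed stand-in class (`StepLog` is hypothetical and not a framework type).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepLog:
    # Each instance gets its own fresh list.
    successful_steps: List[str] = field(default_factory=list)
    failed_steps: List[str] = field(default_factory=list)


first = StepLog()
second = StepLog()
first.successful_steps.append("Step 1: load data")
```

Appending to `first` leaves `second` untouched, which is the property an error context needs when several failures are processed in one session.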
Context Creation:
def _create_error_context_from_state(state: AgentState) -> ErrorContext:
    # Get current task from task state
    current_task = state.get("task_current_task", "Unknown task")

    # Get error information that was set by the router
    error_info = state.get('control_error_info')
    if not isinstance(error_info, dict):
        error_info = {}

    # Extract error details from router-provided error info
    capability_name = error_info.get('capability_name') or error_info.get('node_name')
    original_error = error_info.get('original_error', 'Unknown error occurred')
    user_message = error_info.get('user_message', original_error)
    execution_time = error_info.get('execution_time', 0.0)

    # Use error classification directly
    error_classification = error_info.get('classification')
    if not error_classification:
        # Create fallback classification if none provided
        error_classification = ErrorClassification(
            severity=ErrorSeverity.CRITICAL,
            user_message=original_error,
            metadata={"technical_details": original_error}
        )

    # Extract metadata from error classification
    error_metadata = None
    if error_classification and hasattr(error_classification, 'metadata'):
        error_metadata = error_classification.metadata

    return ErrorContext(
        error_classification=error_classification,
        current_task=current_task,
        failed_operation=capability_name or "Unknown operation",
        execution_time=execution_time,
        retry_count=state.get('control_retry_count', 0),
        total_operations=StateManager.get_current_step_index(state) + 1
    )
LLM-Generated Error Responses#
Error responses combine structured reports with LLM analysis:
async def _generate_error_response(error_context: ErrorContext) -> str:
    # Build structured error report
    error_report_sections = _build_structured_error_report(error_context)
    # Generate LLM explanation in a worker thread so the event loop is not blocked
    llm_explanation = await asyncio.to_thread(_generate_llm_explanation, error_context)
    return f"{error_report_sections}\n\n{llm_explanation}"
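`asyncio.to_thread` (Python 3.9+) runs a blocking function in a worker thread, so the event loop stays responsive while the LLM call is in flight. The sketch below illustrates the pattern with a stand-in `slow_explanation` function that sleeps instead of calling a model; both function names are made up for this example.

```python
import asyncio
import time


def slow_explanation(error_message: str) -> str:
    # Stand-in for a blocking LLM call (assumption: the real call is synchronous).
    time.sleep(0.05)
    return f"**Analysis:** {error_message}"


async def generate_response(report: str, error_message: str) -> str:
    # Offload the blocking call; the event loop can serve other tasks meanwhile.
    explanation = await asyncio.to_thread(slow_explanation, error_message)
    return f"{report}\n\n{explanation}"


result = asyncio.run(generate_response("**ERROR REPORT**", "timeout"))
```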
Structured Report Components:
def _build_structured_error_report(error_context: ErrorContext) -> str:
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    report_sections = [
        f"⚠️ **ERROR REPORT** - {timestamp}",
        f"**Error Type:** {error_context.error_type.value.upper()}",
        f"**Task:** {error_context.current_task}",
        f"**Failed Operation:** {error_context.failed_operation}",
        f"**Error Message:** {error_context.error_message}"
    ]

    # Add capability information if available
    if error_context.capability_name:
        report_sections.append(f"**Capability:** {error_context.capability_name}")

    # Add complete metadata if available (displayed as JSON)
    if error_context.error_metadata:
        import json
        metadata_str = json.dumps(error_context.error_metadata, indent=2, ensure_ascii=False)
        report_sections.append(f"**Error Metadata:**\n```json\n{metadata_str}\n```")

    # Add execution statistics
    stats_parts = []
    if error_context.total_operations > 0:
        stats_parts.append(f"Total operations: {error_context.total_operations}")
    if error_context.execution_time is not None:
        stats_parts.append(f"Execution time: {error_context.execution_time:.1f}s")
    if error_context.retry_count is not None and error_context.retry_count > 0:
        stats_parts.append(f"Retry attempts: {error_context.retry_count}")
    if stats_parts:
        report_sections.append(f"**Execution Stats:** {', '.join(stats_parts)}")

    return "\n".join(report_sections)
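The execution-stats line is only emitted when at least one statistic is available. The following sketch isolates that logic in a hypothetical `format_stats` helper (not a framework function) so the conditions can be exercised on their own.

```python
from typing import Optional


def format_stats(total_operations: int,
                 execution_time: Optional[float],
                 retry_count: Optional[int]) -> Optional[str]:
    parts = []
    if total_operations > 0:
        parts.append(f"Total operations: {total_operations}")
    if execution_time is not None:
        parts.append(f"Execution time: {execution_time:.1f}s")
    # A retry count of zero is not worth reporting.
    if retry_count is not None and retry_count > 0:
        parts.append(f"Retry attempts: {retry_count}")
    if not parts:
        return None
    return f"**Execution Stats:** {', '.join(parts)}"


full = format_stats(3, 12.345, 2)
empty = format_stats(0, None, 0)
```

Here `full` contains all three statistics, while `empty` is `None` because nothing met the reporting conditions.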
LLM Analysis Generation:
The LLM receives the complete error context including all metadata for intelligent analysis:
def _generate_llm_explanation(error_context: ErrorContext) -> str:
    try:
        capabilities_overview = get_registry().get_capabilities_overview()
        prompt_provider = get_framework_prompts()
        error_builder = prompt_provider.get_error_analysis_prompt_builder()

        # Error context (including complete metadata) is passed to LLM
        prompt = error_builder.get_system_instructions(
            capabilities_overview=capabilities_overview,
            error_context=error_context  # Complete ErrorContext with metadata
        )

        explanation = get_chat_completion(
            model_config=get_model_config("framework", "response"),
            message=prompt,
            max_tokens=500
        )
        return f"**Analysis:** {explanation.strip()}"
    except Exception:
        # Fall back to the structured report if the LLM call fails
        return "**Analysis:** Error details are provided in the structured report above."
Note
The LLM can leverage rich metadata fields like suggestions, safety_abort_reason, and domain-specific context to provide more informed error analysis and recovery guidance to the user.
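The explanation step degrades gracefully: if the model call fails, the user still receives the structured report plus a fixed fallback line. A minimal sketch of that pattern, with the LLM call injected as a plain callable, is shown here. `explain_with_fallback`, `working_llm`, and `broken_llm` are illustrative names and not framework APIs.

```python
from typing import Callable

FALLBACK = "**Analysis:** Error details are provided in the structured report above."


def explain_with_fallback(call_llm: Callable[[str], str], prompt: str) -> str:
    try:
        return f"**Analysis:** {call_llm(prompt).strip()}"
    except Exception:
        # Never let the explanation step raise inside the error path.
        return FALLBACK


def working_llm(prompt: str) -> str:
    return "  The sensor service was unreachable.  "


def broken_llm(prompt: str) -> str:
    raise RuntimeError("model unavailable")


ok = explain_with_fallback(working_llm, "explain")
degraded = explain_with_fallback(broken_llm, "explain")
```

Injecting the callable also makes the fallback path easy to test without a live model.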
Retry Policy Framework#
The router handles retries inside its conditional edge function:
def router_conditional_edge(state: AgentState) -> str:
    # Check for errors first
    if state.get('control_has_error', False):
        error_info = state.get('control_error_info', {})
        error_classification = error_info.get('classification')
        capability_name = error_info.get('capability_name')

        if error_classification and capability_name:
            retry_count = state.get('control_retry_count', 0)
            retry_policy = error_info.get('retry_policy', {})
            max_retries = retry_policy.get('max_attempts', 3)

            if error_classification.severity == ErrorSeverity.RETRIABLE:
                if retry_count < max_retries:
                    # Apply exponential backoff
                    delay = retry_policy.get('delay_seconds', 1.0)
                    backoff = retry_policy.get('backoff_factor', 1.5)
                    actual_delay = delay * (backoff ** retry_count)
                    time.sleep(actual_delay)
                    state['control_retry_count'] = retry_count + 1
                    return capability_name  # Retry same capability
                else:
                    return "error"  # Route to error node
            elif error_classification.severity == ErrorSeverity.REPLANNING:
                return "orchestrator"  # Route for re-planning
            elif error_classification.severity == ErrorSeverity.RECLASSIFICATION:
                return "classifier"  # Route for re-classification
            else:
                return "error"  # Route to error node

    # Normal routing logic continues...
Recovery Strategies:
- RETRIABLE: Automatic retry with exponential backoff
- REPLANNING: Route to orchestrator for new execution plan
- RECLASSIFICATION: Route to classifier for new capability selection
- CRITICAL: Route to error node for user communication
- FATAL: Terminate execution immediately
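The routing decision and the backoff formula (`delay_seconds * backoff_factor ** retry_count`) can be separated from the sleeping and state updates, which makes them easy to reason about. The sketch below is a simplified, side-effect-free restatement of the router logic. Severities are plain strings and `next_route` is a hypothetical helper, so treat it as an illustration rather than the framework's implementation.

```python
from typing import Dict, Optional, Tuple


def backoff_delay(retry_count: int, policy: Dict[str, float]) -> float:
    """Delay before the next retry: delay_seconds * backoff_factor ** retry_count."""
    delay = policy.get("delay_seconds", 1.0)
    factor = policy.get("backoff_factor", 1.5)
    return delay * (factor ** retry_count)


def next_route(severity: str, capability_name: str, retry_count: int,
               policy: Dict[str, float]) -> Tuple[str, Optional[float]]:
    """Return (next node, delay in seconds or None) without touching state."""
    if severity == "retriable":
        if retry_count < policy.get("max_attempts", 3):
            return capability_name, backoff_delay(retry_count, policy)
        return "error", None
    if severity == "replanning":
        return "orchestrator", None
    if severity == "reclassification":
        return "classifier", None
    return "error", None  # critical, fatal, or unknown severities


delays = [backoff_delay(n, {}) for n in range(3)]
first_retry = next_route("retriable", "fetch_data", 0, {})
exhausted = next_route("retriable", "fetch_data", 3, {})
replan = next_route("replanning", "fetch_data", 0, {})
```

With the default policy values used by the router (3 attempts, 1.0 s delay, factor 1.5), the three retries wait 1.0 s, 1.5 s, and 2.25 s before the error node takes over.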
Error Context Enhancement#
The system automatically enhances the error context with execution history:
def _populate_error_context(error_context: ErrorContext, state: AgentState):
    # Make sure the step lists exist before appending to them
    if error_context.successful_steps is None:
        error_context.successful_steps = []
    if error_context.failed_steps is None:
        error_context.failed_steps = []

    # Generate execution summary from execution_step_results (ordered by step_index)
    step_results = state.get("execution_step_results", {})
    if step_results:
        # Sort by step_index for proper ordering
        ordered_results = sorted(step_results.items(), key=lambda x: x[1].get('step_index', 0))
        for step_key, result in ordered_results:
            step_index = result.get('step_index', 0)
            capability_name = result.get('capability', 'unknown')
            task_objective = result.get('task_objective', capability_name)
            if result.get('success', False):
                error_context.successful_steps.append(f"Step {step_index + 1}: {task_objective}")
            else:
                error_context.failed_steps.append(f"Step {step_index + 1}: {task_objective} - Failed")
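To make the ordering and labeling behavior concrete, here is a small standalone version of the same summary logic applied to sample data. `summarize_steps` is a hypothetical helper, and the sample step results are invented for illustration. Only the key names (`step_index`, `capability`, `task_objective`, `success`) come from the excerpt above.

```python
from typing import Any, Dict, List, Tuple


def summarize_steps(step_results: Dict[str, Dict[str, Any]]) -> Tuple[List[str], List[str]]:
    successful: List[str] = []
    failed: List[str] = []
    # Order by step_index so the summary reads in execution order.
    ordered = sorted(step_results.values(), key=lambda r: r.get("step_index", 0))
    for result in ordered:
        number = result.get("step_index", 0) + 1  # 1-based for display
        objective = result.get("task_objective", result.get("capability", "unknown"))
        if result.get("success", False):
            successful.append(f"Step {number}: {objective}")
        else:
            failed.append(f"Step {number}: {objective} - Failed")
    return successful, failed


sample = {
    "b": {"step_index": 1, "capability": "plot", "success": False},
    "a": {"step_index": 0, "capability": "fetch",
          "task_objective": "Fetch sensor data", "success": True},
}
ok_steps, bad_steps = summarize_steps(sample)
```

Note that a step without a `task_objective` falls back to its capability name, which is why the failed step is labeled `plot`.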
Integration Patterns#
Capability Error Classification:
class MyCapability(BaseCapability):
    @staticmethod
    def classify_error(exc: Exception, context: dict):
        # Retry network timeouts
        if isinstance(exc, (ConnectionError, TimeoutError)):
            return ErrorClassification(
                severity=ErrorSeverity.RETRIABLE,
                user_message="Network issue detected, retrying...",
                metadata={"technical_details": str(exc)}
            )

        # Default to critical for unknown errors
        return ErrorClassification(
            severity=ErrorSeverity.CRITICAL,
            user_message=f"Unexpected error: {exc}",
            metadata={
                "technical_details": str(exc),
                "safety_abort_reason": f"Unhandled capability error: {exc}",
                "suggestions": ["Check system status", "Contact support if issue persists"]
            }
        )
Custom Retry Policies:
@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 5,      # More attempts for network operations
        "delay_seconds": 2.0,   # Longer delay for external services
        "backoff_factor": 2.0   # Exponential backoff
    }
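Before adopting a custom policy it is worth working out the delays it implies. The sketch below merges a custom policy over the defaults that the router falls back to (3 attempts, 1.0 s, factor 1.5) and computes the resulting schedule. `DEFAULT_POLICY` and `effective_policy` are names invented for this example.

```python
from typing import Any, Dict

# Defaults taken from the router's .get() fallbacks.
DEFAULT_POLICY = {"max_attempts": 3, "delay_seconds": 1.0, "backoff_factor": 1.5}


def effective_policy(custom: Dict[str, Any]) -> Dict[str, Any]:
    # Custom keys override the defaults; missing keys fall back.
    return {**DEFAULT_POLICY, **custom}


policy = effective_policy({"max_attempts": 5, "delay_seconds": 2.0, "backoff_factor": 2.0})
schedule = [policy["delay_seconds"] * policy["backoff_factor"] ** n
            for n in range(policy["max_attempts"])]
total_wait = sum(schedule)
partial = effective_policy({"max_attempts": 5})
```

For the policy shown above the retries wait 2, 4, 8, 16, and 32 seconds, just over a minute in total, which may or may not be acceptable for an interactive session.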
Best Practices#
Error Classification Guidelines:
- Use RETRIABLE for network/temporary issues
- Use CRITICAL for configuration/validation errors
- Use REPLANNING for execution plan strategy issues
- Use RECLASSIFICATION for capability selection issues
- Use FATAL only for error node failures
State Management:
- Use control_error_info for error details
- Use control_retry_count for tracking attempts
- Use control_has_error as the error state flag
User Communication:
- Provide structured error reports with timestamps
- Include execution context and recovery suggestions
- Use LLM-generated explanations for clarity
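The three state fields listed under State Management are easiest to keep in sync when they are always written together. The helpers below are a hypothetical sketch of that idea using plain dictionaries. The field names come from this page, but `record_error` and `clear_error` are not framework functions, and how the framework actually writes these fields is not shown here.

```python
from typing import Any, Dict


def record_error(state: Dict[str, Any], capability_name: str, exc: Exception) -> Dict[str, Any]:
    """Return a state update that flags an error (hypothetical helper)."""
    return {
        "control_has_error": True,
        "control_error_info": {
            "capability_name": capability_name,
            "original_error": str(exc),
        },
        # Preserve the attempt counter so retries keep counting up.
        "control_retry_count": state.get("control_retry_count", 0),
    }


def clear_error() -> Dict[str, Any]:
    """Return a state update that resets all three fields together."""
    return {"control_has_error": False, "control_error_info": None, "control_retry_count": 0}


update = record_error({"control_retry_count": 2}, "fetch_data", TimeoutError("no response"))
reset = clear_error()
```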
Troubleshooting#
Error Node Infinite Loops: ErrorNode uses FATAL classification to prevent loops when error handling fails.
Missing Error Context: System provides fallback responses when error context creation fails.
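Router Retry Issues: Retry handling is embedded in router_conditional_edge(). If retries do not behave as expected, check that control_has_error, control_error_info, and control_retry_count are consistent with each other.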
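When debugging retry problems, a quick consistency check over the state can narrow things down. The router excerpt only acts on an error when both `classification` and `capability_name` are present in `control_error_info`, so those are the conditions checked in this sketch. `check_error_state` is a hypothetical diagnostic helper, not part of the framework.

```python
from typing import Any, Dict, List


def check_error_state(state: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems: List[str] = []
    has_error = state.get("control_has_error", False)
    info = state.get("control_error_info")
    if has_error:
        if not isinstance(info, dict):
            problems.append("control_has_error is set but control_error_info is not a dict")
        else:
            # The router needs both keys before it retries or re-routes.
            for key in ("classification", "capability_name"):
                if not info.get(key):
                    problems.append(f"control_error_info has no '{key}'")
    if state.get("control_retry_count", 0) < 0:
        problems.append("control_retry_count is negative")
    return problems


consistent = check_error_state({
    "control_has_error": True,
    "control_error_info": {"classification": "retriable", "capability_name": "fetch_data"},
    "control_retry_count": 1,
})
broken = check_error_state({"control_has_error": True, "control_error_info": None})
missing = check_error_state({
    "control_has_error": True,
    "control_error_info": {"classification": "retriable"},
})
```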
See also
- Exception Reference
API reference for error classification and recovery systems
- State Management Architecture
Error classification systems and retry policies
- Message Generation
Error response generation and user communication patterns
Next Steps#
Human Approval - Error integration with approval systems
State Management Architecture - State management during errors
Message Generation - How error responses are formatted
Error Handling Infrastructure provides the resilience and user-friendly error communication that makes the Alpha Berkeley Framework production-ready, ensuring graceful failure handling while keeping users informed and engaged.