Error Handling

The Alpha Berkeley Framework provides an error handling and recovery system designed for production-grade agentic systems. It implements a three-layer error management architecture that covers error classification, automatic recovery coordination, and graceful degradation.

Architecture Overview

The Alpha Berkeley Framework implements Manual Retry Coordination with intelligent recovery strategies designed for domain experts who need fast debugging cycles without extensive stack traces. While traditional error traces remain valuable for developers, agentic systems are often used by scientists and researchers who need immediate, actionable feedback when experiments fail:

Traditional Approach:

Error → Generic Retry → Fail → Cryptic User Notification

Intelligent Recovery Approach:

Error → Classification → Targeted Recovery → Escalation → LLM-Powered Response

Benefits: User-friendly error messages, automatic recovery without intervention, rapid experiment iteration.

The Three Layers

🎯 Classification System: Intelligent Error Analysis

Severity-based classification with recovery strategy selection and context-aware analysis. See: Classification System.

📋 Exception Hierarchy: Comprehensive Error Catalog

Structured exception classes with domain-specific recovery hints and detailed categorization. See: Exception Reference.

🔄 Recovery Coordination: Automated Recovery Strategies

Router-based recovery coordination with retry policies, replanning, and graceful termination. See: Recovery Coordination.

Recovery Strategy Integration

The system coordinates recovery through a unified strategy hierarchy:

Error Classification

How errors are analyzed and categorized:

# Domain-specific error classification
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Transient network problems: retry the same capability
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)}
        )
    # Missing context data: ask the orchestrator for a new plan
    elif isinstance(exc, KeyError) and "context" in str(exc):
        return ErrorClassification(
            severity=ErrorSeverity.REPLANNING,
            user_message="Required data not available, trying different approach",
            metadata={
                "technical_details": f"Missing context data: {str(exc)}",
                "replanning_reason": f"Missing required context: {exc}",
                "suggestions": ["Verify data dependencies", "Check previous steps"]
            }
        )
    # Default from BaseCapability implementation
    capability_name = context.get('capability', 'unknown_capability')
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unhandled error in {capability_name}: {exc}",
        metadata={
            "technical_details": str(exc),
            "safety_abort_reason": f"Unhandled error in {capability_name}: {exc}",
            "suggestions": ["Check capability logs", "Contact support team"]
        }
    )
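
To experiment with this classification logic outside the framework, the classifier can be run against stand-in types. The sketch below is only an illustration: `ErrorSeverity` and `ErrorClassification` here are hypothetical stand-ins that mirror the fields used in the snippet above, not the framework's actual definitions (which may include more severities and fields), and `demo_capability` is an invented name.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class ErrorSeverity(Enum):
    # Stand-in: only the three severities used in this document
    RETRIABLE = "retriable"
    REPLANNING = "replanning"
    CRITICAL = "critical"


@dataclass
class ErrorClassification:
    # Stand-in mirroring the keyword arguments used in this document
    severity: ErrorSeverity
    user_message: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    """Condensed version of the classifier shown above."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)},
        )
    if isinstance(exc, KeyError) and "context" in str(exc):
        return ErrorClassification(
            severity=ErrorSeverity.REPLANNING,
            user_message="Required data not available, trying different approach",
            metadata={"replanning_reason": f"Missing required context: {exc}"},
        )
    capability_name = context.get("capability", "unknown_capability")
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unhandled error in {capability_name}: {exc}",
        metadata={"technical_details": str(exc)},
    )


# Each exception type maps to a different recovery severity
network = classify_error(TimeoutError("request timed out"), {})
missing = classify_error(KeyError("context_data"), {})
unknown = classify_error(ValueError("bad input"), {"capability": "demo_capability"})
print(network.severity.name, missing.severity.name, unknown.severity.name)
```

Note that the `KeyError` branch depends on the substring "context" appearing in the missing key, so a `KeyError` for an unrelated key falls through to the critical default.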

Recovery Coordination

Router-based automatic recovery strategies:

import time

# Router coordinates all recovery strategies
if error_classification.severity == ErrorSeverity.RETRIABLE:
    if retry_count < max_retries:
        # Calculate delay with backoff for this retry attempt
        actual_delay = delay_seconds * (backoff_factor ** (retry_count - 1)) if retry_count > 0 else 0

        # Apply delay if this is a retry (not the first attempt)
        if retry_count > 0 and actual_delay > 0:
            time.sleep(actual_delay)  # Simple sleep for now, could be async

        # Increment retry count in state before routing back
        state['control_retry_count'] = retry_count + 1
        return capability_name  # Retry same capability
    else:
        return "error"  # Retries exhausted → ErrorNode

elif error_classification.severity == ErrorSeverity.REPLANNING:
    # Check how many plans have been created by orchestrator
    current_plans_created = state.get('control_plans_created_count', 0)

    # Get max planning attempts from execution limits config
    limits = get_execution_limits()
    max_planning_attempts = limits.get('max_planning_attempts', 2)

    if current_plans_created < max_planning_attempts:
        return "orchestrator"  # Create new execution plan
    else:
        return "error"  # Planning attempts exhausted → ErrorNode

elif error_classification.severity == ErrorSeverity.CRITICAL:
    return "error"  # Immediate termination → ErrorNode
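
The routing branches above can be isolated into a pure function, which makes the decision table easy to unit test. This is a minimal sketch, not the framework's router: `route_after_error` and the capability name `data_fetch` are hypothetical, the `ErrorSeverity` enum is a stand-in, the `max_retries` default is an assumption, and the backoff sleep is deliberately left out.

```python
from enum import Enum
from typing import Any, Dict


class ErrorSeverity(Enum):
    # Stand-in for the framework's severity enum
    RETRIABLE = "retriable"
    REPLANNING = "replanning"
    CRITICAL = "critical"


def route_after_error(
    severity: ErrorSeverity,
    state: Dict[str, Any],
    capability_name: str,
    max_retries: int = 3,            # assumed default, not from the framework
    max_planning_attempts: int = 2,  # fallback value used in the snippet above
) -> str:
    """Return the next node name, mirroring the branches above (no sleeping)."""
    if severity is ErrorSeverity.RETRIABLE:
        retry_count = state.get("control_retry_count", 0)
        if retry_count < max_retries:
            # Record the attempt before routing back to the same capability
            state["control_retry_count"] = retry_count + 1
            return capability_name
        return "error"  # Retries exhausted
    if severity is ErrorSeverity.REPLANNING:
        if state.get("control_plans_created_count", 0) < max_planning_attempts:
            return "orchestrator"
        return "error"  # Planning attempts exhausted
    return "error"  # CRITICAL: immediate termination


# Three consecutive retriable failures with a budget of two retries
state: Dict[str, Any] = {}
hops = [
    route_after_error(ErrorSeverity.RETRIABLE, state, "data_fetch", max_retries=2)
    for _ in range(3)
]
print(hops)  # ['data_fetch', 'data_fetch', 'error']
```

Keeping the decision separate from the delay makes it possible to test routing without waiting, and to swap the blocking sleep for an asynchronous one later.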

Production Patterns

Real-world error handling patterns:

from typing import Any, Dict

# LLM-aware retry policy for infrastructure operations
@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 4,        # More attempts for LLM operations
        "delay_seconds": 2.0,     # Longer initial delay
        "backoff_factor": 2.0     # Aggressive backoff for rate limiting
    }

# Category-based Python executor error handling
try:
    result = await executor.execute_code(code)
except PythonExecutorException as e:
    if e.should_retry_execution():
        # Infrastructure error - retry same code
        logger.info("Infrastructure issue, retrying execution...")
        result = await retry_execution_with_backoff(code)
    elif e.should_retry_code_generation():
        # Code error - regenerate and retry
        logger.info("Code issue, regenerating with feedback...")
        improved_code = await regenerate_with_feedback(str(e))
        result = await executor.execute_code(improved_code)
    else:
        # Workflow error - requires intervention
        logger.error(f"Execution failed: {e.message}")
        await notify_user(f"Execution failed: {e.message}")

Note

The framework uses manual retry coordination rather than LangGraph’s native retry policies to ensure consistent behavior and sophisticated error classification across all components.