Classification System#

Core error classification and severity management system.

The classification system provides the foundation for intelligent error handling by enabling automatic recovery strategy selection based on error severity and context. This system integrates seamlessly with both capability execution and infrastructure operations.

Error Severity Levels#

ErrorSeverity#

class framework.base.errors.ErrorSeverity(value)[source]#

Bases: Enum

Enumeration of error severity levels with comprehensive recovery strategies.

This enum defines the complete spectrum of error severity classifications and their corresponding recovery strategies used throughout the Alpha Berkeley Framework. Each severity level triggers specific recovery behavior designed to maintain robust system operation while enabling intelligent error handling and graceful degradation.

The severity levels form a hierarchy of recovery strategies from simple retries to complete execution termination. The framework’s error handling system uses these classifications to coordinate recovery efforts between capabilities, infrastructure nodes, and the overall execution system.

Recovery Strategy Hierarchy:

1. Automatic Recovery: RETRIABLE errors with retry mechanisms
2. Strategy Adjustment: REPLANNING for execution plan adaptation
3. Capability Adjustment: RECLASSIFICATION for capability selection adaptation
4. Execution Control: CRITICAL for graceful termination
5. System Protection: FATAL for immediate termination

Parameters:

CRITICAL (str) – End execution immediately - unrecoverable errors requiring termination
RETRIABLE (str) – Retry current execution step with same parameters - transient failures
REPLANNING (str) – Create new execution plan with different strategy - approach failures
RECLASSIFICATION (str) – Reclassify task to select different capabilities - selection failures
FATAL (str) – System-level failure requiring immediate termination - corruption prevention

Note

The framework uses manual retry coordination rather than automatic retries to ensure consistent behavior and sophisticated error analysis across all components.

Warning

FATAL errors immediately raise exceptions to terminate execution and prevent system corruption. Use FATAL only for errors that indicate serious system issues that could compromise framework integrity.

Examples

Network error classification:

if isinstance(exc, YourCustomConnectionError):
    return ErrorClassification(severity=ErrorSeverity.RETRIABLE, ...)
elif isinstance(exc, YourCustomAuthenticationError):
    return ErrorClassification(severity=ErrorSeverity.CRITICAL, ...)

Data validation error handling (example exception classes):

if isinstance(exc, ValidationError):
    return ErrorClassification(severity=ErrorSeverity.REPLANNING, ...)
elif isinstance(exc, YourCustomCapabilityMismatchError):
    return ErrorClassification(severity=ErrorSeverity.RECLASSIFICATION, ...)
elif isinstance(exc, YourCustomCorruptionError):
    return ErrorClassification(severity=ErrorSeverity.FATAL, ...)

Note

The custom exception classes in these examples (YourCustomConnectionError, YourCustomAuthenticationError, YourCustomCapabilityMismatchError, YourCustomCorruptionError) are not provided by the framework. They stand for domain-specific exceptions you might implement in your capabilities.

See also

ErrorClassification : Structured error analysis with severity
ExecutionError : Comprehensive error information container

Enumeration of error severity levels with recovery strategies:

RETRIABLE: Retry execution with exponential backoff
REPLANNING: Route to orchestrator for new execution plan
RECLASSIFICATION: Route to classifier for new capability selection
CRITICAL: Graceful termination with user notification
FATAL: Immediate system termination

Usage Pattern

if isinstance(exc, ConnectionError):
    return ErrorClassification(severity=ErrorSeverity.RETRIABLE, ...)
elif isinstance(exc, AuthenticationError):
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        metadata={"safety_abort_reason": "Authentication failed"}
    )
elif isinstance(exc, CapabilityMismatchError):
    return ErrorClassification(
        severity=ErrorSeverity.RECLASSIFICATION,
        user_message="Required capability not available",
        metadata={"reclassification_reason": "Capability mismatch detected"}
    )

CRITICAL = 'critical'#

RETRIABLE = 'retriable'#

REPLANNING = 'replanning'#

RECLASSIFICATION = 'reclassification'#

FATAL = 'fatal'#
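The severity-to-recovery mapping listed above can be sketched as a lookup table. This is a minimal sketch, not the framework's router: the `ErrorSeverity` enum below is a local stand-in that mirrors the documented member values, and the action labels `retry`, `terminate_gracefully`, and `terminate_immediately` are hypothetical names chosen for illustration.

```python
from enum import Enum


class ErrorSeverity(Enum):
    """Local stand-in mirroring the documented member values."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"
    REPLANNING = "replanning"
    RECLASSIFICATION = "reclassification"
    FATAL = "fatal"


# Hypothetical action labels; the real router's names may differ.
RECOVERY_ACTIONS = {
    ErrorSeverity.RETRIABLE: "retry",
    ErrorSeverity.REPLANNING: "orchestrator",
    ErrorSeverity.RECLASSIFICATION: "classifier",
    ErrorSeverity.CRITICAL: "terminate_gracefully",
    ErrorSeverity.FATAL: "terminate_immediately",
}


def recovery_action(severity: ErrorSeverity) -> str:
    """Return the recovery action label for a severity level."""
    return RECOVERY_ACTIONS[severity]


# Enum members can be looked up by their string value,
# for example when a severity has been serialized to a log.
severity = ErrorSeverity("replanning")
print(recovery_action(severity))  # orchestrator
```

Keeping the mapping in one table makes it easy to verify that every severity level has a defined recovery path.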

Classification Results#

ErrorClassification#

class framework.base.errors.ErrorClassification(severity, user_message=None, metadata=None)[source]#

Bases: object

Comprehensive error classification result with recovery strategy coordination.

This dataclass provides sophisticated error classification results that enable intelligent recovery strategy selection and coordination across the framework. It serves as the primary interface between error analysis and recovery systems, supporting both automated recovery mechanisms and human-guided error resolution.

ErrorClassification enables comprehensive error handling by providing:

1. Severity Assessment: Clear classification of error impact and recovery strategy
2. User Communication: Human-readable error descriptions for interfaces
3. Technical Context: Detailed debugging information for developers
4. Extensible Metadata: Additional context for capability-specific error handling

The classification system supports multiple recovery approaches including automatic retries, execution replanning, and graceful degradation patterns. The severity field determines the recovery strategy while user_message and metadata provide contextual information for logging, debugging, and recovery guidance.

Parameters:

severity (ErrorSeverity) – Error severity level determining recovery strategy
user_message (Optional[str]) – Human-readable error description for user interfaces and logs
metadata (Optional[Dict[str, Any]]) – Structured error context including technical details and recovery hints

Note

The framework uses this classification to coordinate recovery strategies across multiple system components. Different severity levels trigger different recovery workflows through the router system.

Warning

Ensure severity levels are chosen carefully as they directly impact system behavior and recovery strategies. Inappropriate classifications can lead to ineffective error handling.

Examples

Network timeout classification:

classification = ErrorClassification(
    severity=ErrorSeverity.RETRIABLE,
    user_message="Network connection timeout, retrying...",
    metadata={"technical_details": "HTTP request timeout after 30 seconds"}
)

Missing step input requiring replanning:

classification = ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Required data not available, need different approach",
    metadata={
        "technical_details": "Step expected 'SENSOR_DATA' context but found None",
        "replanning_reason": "Missing required input data"
    }
)

Wrong capability selected requiring reclassification:

classification = ErrorClassification(
    severity=ErrorSeverity.RECLASSIFICATION,
    user_message="This capability cannot handle this type of request",
    metadata={
        "technical_details": "Weather capability received machine operation request",
        "reclassification_reason": "Capability mismatch detected"
    }
)

Comprehensive error with rich metadata:

classification = ErrorClassification(
    severity=ErrorSeverity.CRITICAL,
    user_message="Invalid configuration detected",
    metadata={
        "technical_details": "Missing required parameter 'api_key' in capability config",
        "safety_abort_reason": "Security validation failed",
        "suggestions": ["Check configuration file", "Verify credentials"],
        "error_code": "CONFIG_MISSING_KEY",
        "retry_after": 30
    }
)

See also

ErrorSeverity : Severity levels and recovery strategies
ExecutionError : Complete error information container

Structured error analysis result that determines recovery strategy.

Basic Usage Pattern

classification = ErrorClassification(
    severity=ErrorSeverity.RETRIABLE,
    user_message="Network connection timeout, retrying...",
    metadata={"technical_details": "HTTP request timeout after 30 seconds"}
)

Advanced Usage with Rich Metadata

classification = ErrorClassification(
    severity=ErrorSeverity.CRITICAL,
    user_message="Service validation failed",
    metadata={
        "technical_details": "Authentication service returned 403",
        "safety_abort_reason": "Security validation failed",
        "retry_after": 30,
        "error_code": "AUTH_FAILED"
    }
)

severity: ErrorSeverity#

user_message: str | None = None#

metadata: Dict[str, Any] | None = None#

format_for_llm()[source]#

Format this error classification for LLM consumption during replanning.

Converts the error classification into a structured, human-readable format optimized for LLM understanding and processing. Follows the framework’s established format_for_llm() pattern for consistent formatting.

Returns: Formatted string optimized for LLM prompt inclusion
Return type: str

Examples

Basic error formatting:

classification = ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Data not available",
    metadata={"technical_details": "Missing sensor data"}
)
formatted = classification.format_for_llm()
# Returns:
# **Previous Execution Error:**
# - **Failed Operation:** unknown operation
# - **User Message:** Data not available
# - **Technical Details:** Missing sensor data

Note

This method formats error classification data independently of the error_info dictionary structure, making it suitable for direct error classification formatting.
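To make the documented output shape concrete, here is a minimal sketch of a formatter that reproduces the sample above. It is not the framework's implementation: `ErrorClassification` and `ErrorSeverity` are local stand-ins with the documented fields, `format_for_llm_sketch` is a hypothetical name, and the "Failed Operation" line is hardcoded because this reference does not say how the real method derives it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    """Local stand-in; only the member this sketch needs."""
    REPLANNING = "replanning"


@dataclass
class ErrorClassification:
    """Local stand-in with the documented fields."""
    severity: ErrorSeverity
    user_message: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def format_for_llm_sketch(classification: ErrorClassification) -> str:
    """Build a Markdown-style summary matching the documented sample."""
    lines = [
        "**Previous Execution Error:**",
        # Assumption: placeholder copied from the documented sample output.
        "- **Failed Operation:** unknown operation",
    ]
    if classification.user_message:
        lines.append(f"- **User Message:** {classification.user_message}")
    details = (classification.metadata or {}).get("technical_details")
    if details:
        lines.append(f"- **Technical Details:** {details}")
    return "\n".join(lines)


classification = ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Data not available",
    metadata={"technical_details": "Missing sensor data"},
)
print(format_for_llm_sketch(classification))
```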

__init__(severity, user_message=None, metadata=None)#

ExecutionError#

class framework.base.errors.ExecutionError(severity, message, capability_name=None, metadata=None)[source]#

Bases: object

Comprehensive execution error container with recovery coordination support.

This dataclass provides a complete representation of execution errors including severity classification, recovery suggestions, technical debugging information, and context for coordinating recovery strategies. It serves as the primary error data structure used throughout the framework for error handling, logging, and recovery coordination.

ExecutionError enables sophisticated error management by providing:

1. Error Classification: Severity-based recovery strategy determination
2. User Communication: Clear, actionable error messages for interfaces
3. Developer Support: Technical details and debugging context
4. System Integration: Context for automated recovery systems

The error structure supports both automated error handling workflows and human-guided error resolution processes. It integrates seamlessly with the framework’s classification system and retry mechanisms to provide comprehensive error management.

Parameters:

severity (ErrorSeverity) – Error severity classification for recovery strategy selection
message (str) – Clear, human-readable description of the error condition
capability_name (Optional[str]) – Name of the capability or component that generated this error
metadata (Optional[Dict[str, Any]]) – Structured error context including technical details and debugging information

Note

ExecutionError instances are typically created by error classification methods in capabilities and infrastructure nodes. The framework’s decorators automatically handle the creation and routing of these errors.

Warning

The severity field directly impacts system behavior through recovery strategy selection. Ensure appropriate severity classification to avoid ineffective error handling or unnecessary system termination.

Examples

Database connection error:

error = ExecutionError(
    severity=ErrorSeverity.RETRIABLE,
    message="Database connection failed",
    capability_name="database_query",
    metadata={"technical_details": "PostgreSQL connection timeout after 30 seconds"}
)

Data corruption requiring immediate attention:

error = ExecutionError(
    severity=ErrorSeverity.FATAL,
    message="Critical data corruption detected",
    capability_name="data_processor",
    metadata={
        "technical_details": "Checksum validation failed on primary data store",
        "safety_abort_reason": "Data integrity compromised",
        # ExecutionError has no separate suggestions parameter,
        # so recovery suggestions go in metadata
        "suggestions": [
            "Initiate emergency backup procedures",
            "Contact system administrator immediately",
            "Do not proceed with further operations"
        ]
    }
)

See also

ErrorSeverity : Severity levels and recovery strategies
ErrorClassification : Error analysis and classification system
ExecutionResult : Result containers with error integration

Comprehensive error container with recovery coordination support.

Usage Pattern

error = ExecutionError(
    severity=ErrorSeverity.RETRIABLE,
    message="Database connection failed",
    capability_name="database_query",
    metadata={"technical_details": "PostgreSQL connection timeout after 30 seconds"}
)

severity: ErrorSeverity#

message: str#

capability_name: str | None = None#

metadata: Dict[str, Any] | None = None#

__init__(severity, message, capability_name=None, metadata=None)#

Classification Methods#

Base Capability Classification#

static BaseCapability.classify_error(exc, context)[source]#

Classify errors for capability-specific error handling and recovery.

This method provides domain-specific error classification to determine appropriate recovery strategies. The default implementation treats all errors as critical, but capabilities should override this method to provide sophisticated error handling based on their specific failure modes.

The error classification determines how the framework responds to failures:

CRITICAL: End execution immediately
RETRIABLE: Retry with same parameters
REPLANNING: Create new execution plan
RECLASSIFICATION: Reclassify task capabilities
FATAL: System-level failure, terminate execution

Parameters:

exc (Exception) – The exception that occurred during capability execution
context (dict) – Error context including capability info and execution state

Returns:

Error classification with recovery strategy, or None to use default

Return type:

Optional[ErrorClassification]

Note

The context dictionary contains useful information including:

capability: capability name
current_step_index: step being executed
execution_time: time spent before failure
current_state: agent state at time of error

Examples

Network-aware error classification:

@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Retry network timeouts and connection errors
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)}
        )


    # Default to critical for unexpected errors
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error: {exc}",
        metadata={"technical_details": str(exc)}
    )

Missing input data requiring replanning:

@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    if isinstance(exc, KeyError) and "context" in str(exc):
        return ErrorClassification(
            severity=ErrorSeverity.REPLANNING,
            user_message="Required data not available, trying different approach",
            metadata={"technical_details": f"Missing context data: {str(exc)}"}
        )
    return BaseCapability.classify_error(exc, context)

See also

ErrorClassification : Error classification result structure
ErrorSeverity : Available severity levels and their meanings

Domain-specific error classification for capabilities. Override this method to provide sophisticated error handling based on specific failure modes.

Classification Strategy

@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Retry transient network failures
    if isinstance(exc, ConnectionError):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)}
        )
    # Everything else is treated as critical
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error: {exc}",
        metadata={
            "technical_details": str(exc),
            "safety_abort_reason": f"Unhandled capability error: {exc}",
            "suggestions": ["Check system logs", "Contact support if issue persists"]
        }
    )
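The documented return type is Optional[ErrorClassification], and the context dictionary carries values such as execution_time. The sketch below shows one way a capability might use both. It rests on assumptions: the classes are local stand-ins for the framework types, `SlowQueryCapability`, `classify_with_default`, and the 5-second threshold are hypothetical, and the fallback mirrors the documented "treat all errors as critical" default rather than the framework's actual code path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    """Local stand-in; only the members this sketch needs."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"
    REPLANNING = "replanning"


@dataclass
class ErrorClassification:
    """Local stand-in with the documented fields."""
    severity: ErrorSeverity
    user_message: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


class SlowQueryCapability:
    """Hypothetical capability that classifies timeouts by elapsed time."""

    @staticmethod
    def classify_error(exc: Exception, context: dict) -> Optional[ErrorClassification]:
        if isinstance(exc, TimeoutError):
            elapsed = context.get("execution_time", 0.0)
            # Hypothetical rule: quick timeouts are likely transient,
            # slow ones suggest the plan itself needs to change.
            severity = (
                ErrorSeverity.RETRIABLE if elapsed < 5.0 else ErrorSeverity.REPLANNING
            )
            return ErrorClassification(
                severity=severity,
                user_message="Query timed out",
                metadata={"technical_details": f"Timed out after {elapsed:.1f}s"},
            )
        return None  # Defer to the default classification


def classify_with_default(capability: Any, exc: Exception, context: dict) -> ErrorClassification:
    """Use the capability's classification, falling back to CRITICAL on None."""
    result = capability.classify_error(exc, context)
    if result is not None:
        return result
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error: {exc}",
        metadata={"technical_details": str(exc)},
    )


fast = classify_with_default(SlowQueryCapability, TimeoutError(), {"execution_time": 1.2})
slow = classify_with_default(SlowQueryCapability, TimeoutError(), {"execution_time": 42.0})
other = classify_with_default(SlowQueryCapability, ValueError("bad input"), {})
print(fast.severity.value, slow.severity.value, other.severity.value)
```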

Infrastructure Node Classification#

static BaseInfrastructureNode.classify_error(exc, context)[source]#

Classify errors for infrastructure-specific error handling and recovery.

This method provides default error classification for all infrastructure nodes with a conservative approach that treats most errors as critical. Infrastructure nodes handle system-critical functions like orchestration and routing, so failures typically require immediate attention rather than automatic retry attempts.

The default implementation prioritizes system stability by failing fast with clear error messages. Subclasses should override this method only when specific infrastructure components can benefit from retry logic (e.g., LLM-based orchestrators that may encounter temporary API issues).

Parameters:

exc (Exception) – The exception that occurred during infrastructure operation
context (dict) – Error context including node info, execution state, and timing

Returns:

Error classification with severity and recovery strategy

Return type:

ErrorClassification

Note

The context dictionary includes:

infrastructure_node: node name for identification
execution_time: time spent before failure
current_state: agent state at time of error

Example:

@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Retry network timeouts for LLM-based infrastructure
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network timeout, retrying...",
            metadata={"technical_details": str(exc)}
        )
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Infrastructure error: {exc}",
        metadata={"technical_details": str(exc)}
    )

Note

Infrastructure nodes should generally fail fast, so the default implementation treats most errors as critical. Override this method for infrastructure that can benefit from retries (e.g., LLM-based nodes).

Conservative error classification for infrastructure nodes. Infrastructure nodes handle system-critical functions, so failures typically require immediate attention.

Conservative Strategy

@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Infrastructure defaults to critical for fast failure
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Infrastructure error: {exc}",
        metadata={"technical_details": str(exc)}
    )

Retry Policy Configuration#

static BaseCapability.get_retry_policy()[source]#

Get retry policy configuration for failure recovery strategies.

This method provides retry configuration that the framework uses for manual retry handling when capabilities fail with RETRIABLE errors. The default policy provides reasonable defaults for most capabilities, but should be overridden for capabilities with specific timing or retry requirements.

The retry policy controls:

Maximum number of retry attempts before giving up
Initial delay between retry attempts
Backoff factor for exponential delay increase

Returns: Dictionary containing retry configuration parameters
Return type: Dict[str, Any]

Note

The framework uses manual retry handling rather than LangGraph’s native retry policies to ensure consistent behavior across all components and to enable sophisticated error classification.

Examples

Aggressive retry for network-dependent capability:

@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 5,      # More attempts for network issues
        "delay_seconds": 2.0,   # Longer delay for external services
        "backoff_factor": 2.0   # Exponential backoff
    }

Conservative retry for expensive operations:

@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 2,      # Minimal retries for expensive ops
        "delay_seconds": 0.1,   # Quick retry for transient issues
        "backoff_factor": 1.0   # No backoff for fast operations
    }

See also

classify_error() : Error classification that determines when to retry
ErrorSeverity : RETRIABLE severity triggers retry policy usage

Retry policy configuration for failure recovery strategies.

Default Policy:

{
    "max_attempts": 3,        # Total attempts including initial
    "delay_seconds": 0.5,     # Base delay before first retry
    "backoff_factor": 1.5     # Exponential backoff multiplier
}
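To see what a policy means in wall-clock terms, the following sketch computes the wait before each retry. It assumes the common exponential schedule, delay_seconds multiplied by backoff_factor raised to the retry index, and takes max_attempts to include the initial attempt as the comment above states. The exact formula the framework applies is not given in this reference, and `retry_delays` is a hypothetical helper.

```python
from typing import Any, Dict, List


def retry_delays(policy: Dict[str, Any]) -> List[float]:
    """Return the wait (in seconds) before each retry under the assumed schedule."""
    # max_attempts includes the initial attempt, so one fewer retry remains.
    retries = max(policy["max_attempts"] - 1, 0)
    return [
        policy["delay_seconds"] * policy["backoff_factor"] ** i
        for i in range(retries)
    ]


default_policy = {"max_attempts": 3, "delay_seconds": 0.5, "backoff_factor": 1.5}
aggressive_policy = {"max_attempts": 5, "delay_seconds": 2.0, "backoff_factor": 2.0}

print(retry_delays(default_policy))     # [0.5, 0.75]
print(retry_delays(aggressive_policy))  # [2.0, 4.0, 8.0, 16.0]
```

Under these assumptions the default policy waits at most 1.25 seconds in total, while the aggressive network policy can wait up to 30 seconds.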

static BaseInfrastructureNode.get_retry_policy()[source]#

Get conservative retry policy configuration for infrastructure operations.

This method provides retry configuration optimized for infrastructure nodes that handle system-critical functions. The default policy uses conservative settings with minimal retry attempts and fast failure detection to maintain system stability.

Infrastructure nodes should generally fail fast rather than retry extensively, since failures often indicate system-level issues that require immediate attention. Override this method only for specific infrastructure components that can benefit from retry logic.

Returns: Dictionary containing conservative retry configuration parameters
Return type: Dict[str, Any]

Note

Infrastructure default policy: 2 attempts, 0.2s delay, minimal backoff. This prioritizes fast failure detection over retry persistence.

Example:

@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 3,  # More retries for LLM-based infrastructure
        "delay_seconds": 1.0,  # Longer delay for external service calls
        "backoff_factor": 2.0  # Exponential backoff
    }

Note

The router uses this configuration to determine retry behavior. Infrastructure default: 2 attempts, 0.2s delay, minimal backoff.

Conservative retry policy for infrastructure operations.

Infrastructure Policy:

{
    "max_attempts": 2,        # Fast failure for infrastructure
    "delay_seconds": 0.2,     # Quick retry attempt
    "backoff_factor": 1.0     # No backoff
}

Integration Pattern#

Basic Error Handling#

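# Illustrative wrapper; run_step and retry_with_backoff are placeholder
# names, not part of the documented API
async def run_step(capability, state, context):
    try:
        result = await capability.execute(state)
        return result
    except Exception as exc:
        # Classify error for recovery strategy
        classification = capability.classify_error(exc, context)

        if classification.severity == ErrorSeverity.RETRIABLE:
            # Handle with retry policy
            policy = capability.get_retry_policy()
            return await retry_with_backoff(capability, state, policy)
        elif classification.severity == ErrorSeverity.REPLANNING:
            # Route to orchestrator for new execution plan
            return "orchestrator"
        elif classification.severity == ErrorSeverity.RECLASSIFICATION:
            # Route to classifier for new capability selection
            return "classifier"
        elif classification.severity == ErrorSeverity.CRITICAL:
            # End execution with clear error message. ExecutionError is
            # documented as a plain dataclass (Bases: object), so it cannot
            # be raised directly; RuntimeError is a placeholder exception
            error = ExecutionError(
                severity=ErrorSeverity.CRITICAL,
                message=classification.user_message,
                metadata=classification.metadata
            )
            raise RuntimeError(error.message) from exc
        # FATAL or any unhandled severity: propagate the original exception
        raise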
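The pattern above calls `retry_with_backoff` without defining it. A minimal sketch of such a helper follows, assuming the policy keys documented earlier, an async `execute(state)` method on the capability, and that max_attempts counts the initial attempt that already failed. `FlakyCapability` is a hypothetical test double. The real framework coordinates retries through its router, so treat this only as an illustration of the backoff loop.

```python
import asyncio
from typing import Any, Dict


async def retry_with_backoff(capability: Any, state: Any, policy: Dict[str, Any]) -> Any:
    """Re-run capability.execute(state) after a failed first attempt."""
    delay = policy["delay_seconds"]
    last_exc: Exception = RuntimeError("no retry attempts remaining")
    # Assumption: max_attempts includes the initial, already-failed attempt.
    for _ in range(policy["max_attempts"] - 1):
        await asyncio.sleep(delay)
        try:
            return await capability.execute(state)
        except Exception as exc:
            last_exc = exc
            delay *= policy["backoff_factor"]  # Grow the wait for the next retry
    raise last_exc


class FlakyCapability:
    """Hypothetical capability that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures

    async def execute(self, state: Any) -> str:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient failure")
        return f"done: {state}"


# Short delays keep the demonstration fast.
policy = {"max_attempts": 3, "delay_seconds": 0.01, "backoff_factor": 1.5}
result = asyncio.run(retry_with_backoff(FlakyCapability(failures=1), "step-1", policy))
print(result)  # done: step-1
```

A fuller version would likely re-classify each new exception, since a retry can fail for a different, non-retriable reason.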

Primary Error Context: Metadata Field#

The metadata field is the primary mechanism for providing structured error context in ErrorClassification.

Suggested Metadata Keys:

technical_details: Detailed technical information (replaces the former dedicated technical_details field)
safety_abort_reason: Explanation for critical/fatal errors requiring immediate termination
replanning_reason: Explanation for errors requiring new execution plan generation
reclassification_reason: Explanation for errors requiring new capability selection
suggestions: List of actionable recovery steps for users
error_code: Machine-readable error identifier for programmatic handling
retry_after: Suggested delay before retry attempts (in seconds)

Advanced Usage Patterns:

Structured Technical Details: Replace simple strings with nested objects
Contextual Information: Include relevant state and execution context
Recovery Guidance: Provide specific, actionable recovery steps
System Integration: Enable programmatic error handling and monitoring

# Example: Comprehensive metadata usage
return ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Data query scope too broad for current system limits",
    metadata={
        "technical_details": f"Query returned {result_count} results, limit is {max_limit}",
        "replanning_reason": "Query scope exceeded system performance limits",
        "suggestions": [
            "Reduce time range to last 24 hours",
            "Specify fewer measurement types",
            "Use data filtering parameters"
        ],
        "error_code": "QUERY_SCOPE_EXCEEDED",
        "retry_after": 10,
        "query_metrics": {
            "result_count": result_count,
            "max_limit": max_limit,
            "query_duration": query_time
        }
    }
)
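On the consuming side, the suggested metadata keys can be read defensively, since every key is optional and metadata itself may be None. The helper below is a minimal sketch of that idea. `summarize_metadata` is a hypothetical function, not part of the framework, and the output layout is an arbitrary choice for log readability.

```python
from typing import Any, Dict, Optional


def summarize_metadata(metadata: Optional[Dict[str, Any]]) -> str:
    """Render the suggested metadata keys as a short log-friendly summary."""
    meta = metadata or {}
    header = []
    if meta.get("error_code"):
        header.append(f"[{meta['error_code']}]")
    header.append(str(meta.get("technical_details", "no technical details")))
    if meta.get("retry_after") is not None:
        header.append(f"(retry after {meta['retry_after']}s)")
    lines = [" ".join(header)]
    # One indented line per actionable suggestion, if any were provided.
    lines.extend(f"  - {suggestion}" for suggestion in meta.get("suggestions", []))
    return "\n".join(lines)


summary = summarize_metadata({
    "technical_details": "Query returned 5000 results, limit is 1000",
    "error_code": "QUERY_SCOPE_EXCEEDED",
    "retry_after": 10,
    "suggestions": ["Reduce time range to last 24 hours"],
})
print(summary)
```

Reading keys with `.get()` keeps the consumer tolerant of capabilities that supply only some of the suggested keys.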

See also

Exception Reference: Complete catalog of framework exceptions
Recovery Coordination: Router coordination and recovery strategies