Classification System#
Core error classification and severity management system.
The classification system provides the foundation for intelligent error handling by enabling automatic recovery strategy selection based on error severity and context. This system integrates seamlessly with both capability execution and infrastructure operations.
Error Severity Levels#
ErrorSeverity#
- class framework.base.errors.ErrorSeverity(value)[source]#
Bases:
Enum
Enumeration of error severity levels with comprehensive recovery strategies.
This enum defines the complete spectrum of error severity classifications and their corresponding recovery strategies used throughout the ALS Expert framework. Each severity level triggers specific recovery behavior designed to maintain robust system operation while enabling intelligent error handling and graceful degradation.
The severity levels form a hierarchy of recovery strategies from simple retries to complete execution termination. The framework’s error handling system uses these classifications to coordinate recovery efforts between capabilities, infrastructure nodes, and the overall execution system.
Recovery Strategy Hierarchy:
1. Automatic Recovery: RETRIABLE errors with retry mechanisms
2. Strategy Adjustment: REPLANNING for execution plan adaptation
3. Capability Adjustment: RECLASSIFICATION for capability selection adaptation
4. Execution Control: CRITICAL for graceful termination
5. System Protection: FATAL for immediate termination
- Parameters:
CRITICAL (str) – End execution immediately - unrecoverable errors requiring termination
RETRIABLE (str) – Retry current execution step with same parameters - transient failures
REPLANNING (str) – Create new execution plan with different strategy - approach failures
RECLASSIFICATION (str) – Reclassify task to select different capabilities - selection failures
FATAL (str) – System-level failure requiring immediate termination - corruption prevention
Note
The framework uses manual retry coordination rather than automatic retries to ensure consistent behavior and sophisticated error analysis across all components.
Warning
FATAL errors immediately raise exceptions to terminate execution and prevent system corruption. Use FATAL only for errors that indicate serious system issues that could compromise framework integrity.
Examples
Network error classification:
if isinstance(exc, YourCustomConnectionError):
    return ErrorClassification(severity=ErrorSeverity.RETRIABLE, ...)
elif isinstance(exc, YourCustomAuthenticationError):
    return ErrorClassification(severity=ErrorSeverity.CRITICAL, ...)
Data validation error handling (example exception classes):
if isinstance(exc, ValidationError):
    return ErrorClassification(severity=ErrorSeverity.REPLANNING, ...)
elif isinstance(exc, YourCustomCapabilityMismatchError):
    return ErrorClassification(severity=ErrorSeverity.RECLASSIFICATION, ...)
elif isinstance(exc, YourCustomCorruptionError):
    return ErrorClassification(severity=ErrorSeverity.FATAL, ...)
Note
The exception classes in these examples (YourCustomCapabilityMismatchError, YourCustomCorruptionError) are not provided by the framework - they are examples of domain-specific exceptions you might implement in your capabilities.
See also
ErrorClassification: Structured error analysis with severity
ExecutionError: Comprehensive error information container

Enumeration of error severity levels with recovery strategies:
RETRIABLE: Retry execution with exponential backoff
REPLANNING: Route to orchestrator for new execution plan
RECLASSIFICATION: Route to classifier for new capability selection
CRITICAL: Graceful termination with user notification
FATAL: Immediate system termination
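The mapping from severity to recovery action listed above can be sketched as a lookup table. This is a minimal illustration, not the framework's router: the `ErrorSeverity` class below is a stand-in that mirrors the documented member values, and the action labels other than `"orchestrator"` and `"classifier"` are hypothetical names chosen for this example.

```python
from enum import Enum


class ErrorSeverity(Enum):
    """Stand-in mirroring the documented members; import the real enum in practice."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"
    REPLANNING = "replanning"
    RECLASSIFICATION = "reclassification"
    FATAL = "fatal"


# Hypothetical action labels: only "orchestrator" and "classifier" appear in the docs.
NEXT_ACTION = {
    ErrorSeverity.RETRIABLE: "retry",
    ErrorSeverity.REPLANNING: "orchestrator",
    ErrorSeverity.RECLASSIFICATION: "classifier",
    ErrorSeverity.CRITICAL: "terminate_gracefully",
    ErrorSeverity.FATAL: "terminate_immediately",
}


def next_action(severity: ErrorSeverity) -> str:
    """Return the recovery action label for a severity level."""
    return NEXT_ACTION[severity]
```

Keeping the table keyed by the enum makes it easy to check that every severity level has a defined action.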
Usage Pattern
if isinstance(exc, ConnectionError):
    return ErrorClassification(severity=ErrorSeverity.RETRIABLE, ...)
elif isinstance(exc, AuthenticationError):
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        metadata={"safety_abort_reason": "Authentication failed"}
    )
elif isinstance(exc, CapabilityMismatchError):
    return ErrorClassification(
        severity=ErrorSeverity.RECLASSIFICATION,
        user_message="Required capability not available",
        metadata={"reclassification_reason": "Capability mismatch detected"}
    )
- CRITICAL = 'critical'#
- RETRIABLE = 'retriable'#
- REPLANNING = 'replanning'#
- RECLASSIFICATION = 'reclassification'#
- FATAL = 'fatal'#
Classification Results#
ErrorClassification#
- class framework.base.errors.ErrorClassification(severity, user_message=None, metadata=None)[source]#
Bases:
object
Comprehensive error classification result with recovery strategy coordination.
This dataclass provides sophisticated error classification results that enable intelligent recovery strategy selection and coordination across the framework. It serves as the primary interface between error analysis and recovery systems, supporting both automated recovery mechanisms and human-guided error resolution.
ErrorClassification enables comprehensive error handling by providing:
1. Severity Assessment: Clear classification of error impact and recovery strategy
2. User Communication: Human-readable error descriptions for interfaces
3. Technical Context: Detailed debugging information for developers
4. Extensible Metadata: Additional context for capability-specific error handling
The classification system supports multiple recovery approaches including automatic retries, execution replanning, and graceful degradation patterns. The severity field determines the recovery strategy while user_message and metadata provide contextual information for logging, debugging, and recovery guidance.
- Parameters:
severity (ErrorSeverity) – Error severity level determining recovery strategy
user_message (Optional[str]) – Human-readable error description for user interfaces and logs
metadata (Optional[Dict[str, Any]]) – Structured error context including technical details and recovery hints
Note
The framework uses this classification to coordinate recovery strategies across multiple system components. Different severity levels trigger different recovery workflows through the router system.
Warning
Ensure severity levels are chosen carefully as they directly impact system behavior and recovery strategies. Inappropriate classifications can lead to ineffective error handling.
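Because `user_message` and `metadata` both default to `None`, code that consumes a classification should guard against missing values. The sketch below uses stand-in definitions that mirror the documented fields; the `summarize` helper and its fallback strings are hypothetical and not part of the framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    """Stand-in with two of the documented members."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"


@dataclass
class ErrorClassification:
    """Stand-in mirroring the documented fields and defaults."""
    severity: ErrorSeverity
    user_message: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def summarize(classification: ErrorClassification) -> str:
    """Build a one-line log summary, tolerating missing optional fields."""
    message = classification.user_message or "No user message provided"
    details = (classification.metadata or {}).get("technical_details", "n/a")
    return f"[{classification.severity.value}] {message} (details: {details})"
```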
Examples
Network timeout classification:
classification = ErrorClassification(
    severity=ErrorSeverity.RETRIABLE,
    user_message="Network connection timeout, retrying...",
    metadata={"technical_details": "HTTP request timeout after 30 seconds"}
)
Missing step input requiring replanning:
classification = ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Required data not available, need different approach",
    metadata={
        "technical_details": "Step expected 'SENSOR_DATA' context but found None",
        "replanning_reason": "Missing required input data"
    }
)
Wrong capability selected requiring reclassification:
classification = ErrorClassification(
    severity=ErrorSeverity.RECLASSIFICATION,
    user_message="This capability cannot handle this type of request",
    metadata={
        "technical_details": "Weather capability received machine operation request",
        "reclassification_reason": "Capability mismatch detected"
    }
)
Comprehensive error with rich metadata:
classification = ErrorClassification(
    severity=ErrorSeverity.CRITICAL,
    user_message="Invalid configuration detected",
    metadata={
        "technical_details": "Missing required parameter 'api_key' in capability config",
        "safety_abort_reason": "Security validation failed",
        "suggestions": ["Check configuration file", "Verify credentials"],
        "error_code": "CONFIG_MISSING_KEY",
        "retry_after": 30
    }
)
See also
ErrorSeverity: Severity levels and recovery strategies
ExecutionError: Complete error information container

Structured error analysis result that determines recovery strategy.
Basic Usage Pattern
classification = ErrorClassification(
    severity=ErrorSeverity.RETRIABLE,
    user_message="Network connection timeout, retrying...",
    metadata={"technical_details": "HTTP request timeout after 30 seconds"}
)
Advanced Usage with Rich Metadata
classification = ErrorClassification(
    severity=ErrorSeverity.CRITICAL,
    user_message="Service validation failed",
    metadata={
        "technical_details": "Authentication service returned 403",
        "safety_abort_reason": "Security validation failed",
        "retry_after": 30,
        "error_code": "AUTH_FAILED"
    }
)
- severity: ErrorSeverity#
- user_message: str | None = None#
- metadata: Dict[str, Any] | None = None#
- format_for_llm()[source]#
Format this error classification for LLM consumption during replanning.
Converts the error classification into a structured, human-readable format optimized for LLM understanding and processing. Follows the framework’s established format_for_llm() pattern for consistent formatting.
- Returns:
Formatted string optimized for LLM prompt inclusion
- Return type:
str
Examples
Basic error formatting:
classification = ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Data not available",
    metadata={"technical_details": "Missing sensor data"}
)
formatted = classification.format_for_llm()
# Returns:
# **Previous Execution Error:**
# - **Failed Operation:** unknown operation
# - **User Message:** Data not available
# - **Technical Details:** Missing sensor data
Note
This method formats error classification data independently of the error_info dictionary structure, making it suitable for direct error classification formatting.
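For readers who want to see how such a string could be assembled, here is a rough approximation that reproduces the sample output shown in the example. It is not the framework's implementation: the function name, the `operation` parameter, and the way metadata keys are turned into labels are all assumptions made for this sketch.

```python
from typing import Any, Dict, Optional


def format_for_llm_sketch(
    user_message: Optional[str],
    metadata: Optional[Dict[str, Any]],
    operation: str = "unknown operation",
) -> str:
    """Approximate the documented sample output (hypothetical helper)."""
    lines = [
        "**Previous Execution Error:**",
        f"- **Failed Operation:** {operation}",
    ]
    if user_message:
        lines.append(f"- **User Message:** {user_message}")
    # Assumption: each metadata key is rendered as a title-cased label.
    for key, value in (metadata or {}).items():
        label = key.replace("_", " ").title()
        lines.append(f"- **{label}:** {value}")
    return "\n".join(lines)
```

How the real method renders metadata keys other than `technical_details` is not stated in this reference, so treat the label logic as illustrative only.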
- __init__(severity, user_message=None, metadata=None)#
ExecutionError#
- class framework.base.errors.ExecutionError(severity, message, capability_name=None, metadata=None)[source]#
Bases:
object
Comprehensive execution error container with recovery coordination support.
This dataclass provides a complete representation of execution errors including severity classification, recovery suggestions, technical debugging information, and context for coordinating recovery strategies. It serves as the primary error data structure used throughout the framework for error handling, logging, and recovery coordination.
ExecutionError enables sophisticated error management by providing:
1. Error Classification: Severity-based recovery strategy determination
2. User Communication: Clear, actionable error messages for interfaces
3. Developer Support: Technical details and debugging context
4. System Integration: Context for automated recovery systems
The error structure supports both automated error handling workflows and human-guided error resolution processes. It integrates seamlessly with the framework’s classification system and retry mechanisms to provide comprehensive error management.
- Parameters:
severity (ErrorSeverity) – Error severity classification for recovery strategy selection
message (str) – Clear, human-readable description of the error condition
capability_name (Optional[str]) – Name of the capability or component that generated this error
metadata (Optional[Dict[str, Any]]) – Structured error context including technical details and debugging information
Note
ExecutionError instances are typically created by error classification methods in capabilities and infrastructure nodes. The framework’s decorators automatically handle the creation and routing of these errors.
Warning
The severity field directly impacts system behavior through recovery strategy selection. Ensure appropriate severity classification to avoid ineffective error handling or unnecessary system termination.
Examples
Database connection error:
error = ExecutionError(
    severity=ErrorSeverity.RETRIABLE,
    message="Database connection failed",
    capability_name="database_query",
    metadata={"technical_details": "PostgreSQL connection timeout after 30 seconds"}
)
Data corruption requiring immediate attention:
error = ExecutionError(
    severity=ErrorSeverity.FATAL,
    message="Critical data corruption detected",
    capability_name="data_processor",
    metadata={
        "technical_details": "Checksum validation failed on primary data store",
        "safety_abort_reason": "Data integrity compromised",
        # ExecutionError has no suggestions parameter, so recovery steps go in metadata
        "suggestions": [
            "Initiate emergency backup procedures",
            "Contact system administrator immediately",
            "Do not proceed with further operations"
        ]
    }
)
See also
ErrorSeverity: Severity levels and recovery strategies
ErrorClassification: Error analysis and classification system
ExecutionResult: Result containers with error integration

Comprehensive error container with recovery coordination support.
Usage Pattern
error = ExecutionError(
    severity=ErrorSeverity.RETRIABLE,
    message="Database connection failed",
    capability_name="database_query",
    metadata={"technical_details": "PostgreSQL connection timeout after 30 seconds"}
)
- severity: ErrorSeverity#
- message: str#
- capability_name: str | None = None#
- metadata: Dict[str, Any] | None = None#
- __init__(severity, message, capability_name=None, metadata=None)#
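The two containers share `severity` and `metadata`, so one plausible way to go from a classification to an execution error is a small conversion helper. The sketch below uses stand-in dataclasses that mirror the documented fields; `to_execution_error` is a hypothetical function, and the framework's decorators may build these objects differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    """Stand-in with two of the documented members."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"


@dataclass
class ErrorClassification:
    severity: ErrorSeverity
    user_message: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


@dataclass
class ExecutionError:
    severity: ErrorSeverity
    message: str
    capability_name: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def to_execution_error(
    classification: ErrorClassification,
    exc: Exception,
    capability_name: Optional[str] = None,
) -> ExecutionError:
    """Copy shared fields; fall back to str(exc) because message is required."""
    return ExecutionError(
        severity=classification.severity,
        message=classification.user_message or str(exc),
        capability_name=capability_name,
        metadata=classification.metadata,
    )
```

The fallback matters because `ExecutionError.message` is a required string while `ErrorClassification.user_message` may be `None`.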
Classification Methods#
Base Capability Classification#
- static BaseCapability.classify_error(exc, context)[source]#
Classify errors for capability-specific error handling and recovery.
This method provides domain-specific error classification to determine appropriate recovery strategies. The default implementation treats all errors as critical, but capabilities should override this method to provide sophisticated error handling based on their specific failure modes.
The error classification determines how the framework responds to failures:
- CRITICAL: End execution immediately
- RETRIABLE: Retry with same parameters
- REPLANNING: Create new execution plan
- RECLASSIFICATION: Reclassify task capabilities
- FATAL: System-level failure, terminate execution
- Parameters:
exc (Exception) – The exception that occurred during capability execution
context (dict) – Error context including capability info and execution state
- Returns:
Error classification with recovery strategy, or None to use default
- Return type:
Optional[ErrorClassification]
Note
The context dictionary contains useful information including:
- 'capability': capability name
- 'current_step_index': step being executed
- 'execution_time': time spent before failure
- 'current_state': agent state at time of error
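The sketch below shows one way a capability might use the context keys listed above, and how a caller could handle a `None` return by falling back to a critical classification. All class and function names here are hypothetical stand-ins, the 30-second threshold is arbitrary, and `execution_time` is assumed to be in seconds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class ErrorSeverity(Enum):
    """Stand-in with two of the documented members."""
    CRITICAL = "critical"
    RETRIABLE = "retriable"


@dataclass
class ErrorClassification:
    severity: ErrorSeverity
    user_message: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


class SlowServiceCapability:
    """Hypothetical capability used only for this illustration."""

    @staticmethod
    def classify_error(exc: Exception, context: dict) -> Optional[ErrorClassification]:
        # Retry timeouts only while the step has not already run for long.
        if isinstance(exc, TimeoutError) and context.get("execution_time", 0.0) < 30.0:
            return ErrorClassification(
                severity=ErrorSeverity.RETRIABLE,
                user_message="Service timed out, retrying...",
                metadata={
                    "technical_details": str(exc),
                    "step": context.get("current_step_index"),
                },
            )
        return None  # defer to the default handling


def classify_or_default(capability, exc: Exception, context: dict) -> ErrorClassification:
    """Use the capability's classification, or fall back to CRITICAL on None."""
    result = capability.classify_error(exc, context)
    if result is not None:
        return result
    name = context.get("capability", "unknown")
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error in {name}: {exc}",
        metadata={"technical_details": str(exc)},
    )
```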
Examples
Network-aware error classification:
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Retry network timeouts and connection errors
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)}
        )
    # Default to critical for unexpected errors
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error: {exc}",
        metadata={"technical_details": str(exc)}
    )
Missing input data requiring replanning:
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    if isinstance(exc, KeyError) and "context" in str(exc):
        return ErrorClassification(
            severity=ErrorSeverity.REPLANNING,
            user_message="Required data not available, trying different approach",
            metadata={"technical_details": f"Missing context data: {str(exc)}"}
        )
    return BaseCapability.classify_error(exc, context)
See also
ErrorClassification: Error classification result structure
ErrorSeverity: Available severity levels and their meanings

Domain-specific error classification for capabilities. Override this method to provide sophisticated error handling based on specific failure modes.
Classification Strategy
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    if isinstance(exc, ConnectionError):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network issue detected, retrying...",
            metadata={"technical_details": str(exc)}
        )
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Unexpected error: {exc}",
        metadata={
            "technical_details": str(exc),
            "safety_abort_reason": f"Unhandled capability error: {exc}",
            "suggestions": ["Check system logs", "Contact support if issue persists"]
        }
    )
Infrastructure Node Classification#
- static BaseInfrastructureNode.classify_error(exc, context)[source]#
Classify errors for infrastructure-specific error handling and recovery.
This method provides default error classification for all infrastructure nodes with a conservative approach that treats most errors as critical. Infrastructure nodes handle system-critical functions like orchestration and routing, so failures typically require immediate attention rather than automatic retry attempts.
The default implementation prioritizes system stability by failing fast with clear error messages. Subclasses should override this method only when specific infrastructure components can benefit from retry logic (e.g., LLM-based orchestrators that may encounter temporary API issues).
- Parameters:
exc (Exception) – The exception that occurred during infrastructure operation
context (dict) – Error context including node info, execution state, and timing
- Returns:
Error classification with severity and recovery strategy
- Return type:
ErrorClassification
Note
The context dictionary includes:
infrastructure_node: node name for identification
execution_time: time spent before failure
current_state: agent state at time of error
Example:
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Retry network timeouts for LLM-based infrastructure
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClassification(
            severity=ErrorSeverity.RETRIABLE,
            user_message="Network timeout, retrying...",
            metadata={"technical_details": str(exc)}
        )
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Infrastructure error: {exc}",
        metadata={"technical_details": str(exc)}
    )
Note
Infrastructure nodes should generally fail fast, so the default implementation treats most errors as critical. Override this method for infrastructure that can benefit from retries (e.g., LLM-based nodes).
Conservative error classification for infrastructure nodes. Infrastructure nodes handle system-critical functions, so failures typically require immediate attention.
Conservative Strategy
@staticmethod
def classify_error(exc: Exception, context: dict) -> ErrorClassification:
    # Infrastructure defaults to critical for fast failure
    return ErrorClassification(
        severity=ErrorSeverity.CRITICAL,
        user_message=f"Infrastructure error: {exc}",
        metadata={"technical_details": str(exc)}
    )
Retry Policy Configuration#
- static BaseCapability.get_retry_policy()[source]#
Get retry policy configuration for failure recovery strategies.
This method provides retry configuration that the framework uses for manual retry handling when capabilities fail with RETRIABLE errors. The default policy provides reasonable defaults for most capabilities, but should be overridden for capabilities with specific timing or retry requirements.
The retry policy controls:
- Maximum number of retry attempts before giving up
- Initial delay between retry attempts
- Backoff factor for exponential delay increase
- Returns:
Dictionary containing retry configuration parameters
- Return type:
Dict[str, Any]
Note
The framework uses manual retry handling rather than LangGraph’s native retry policies to ensure consistent behavior across all components and to enable sophisticated error classification.
Examples
Aggressive retry for network-dependent capability:
@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 5,      # More attempts for network issues
        "delay_seconds": 2.0,   # Longer delay for external services
        "backoff_factor": 2.0   # Exponential backoff
    }
Conservative retry for expensive operations:
@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 2,      # Minimal retries for expensive ops
        "delay_seconds": 0.1,   # Quick retry for transient issues
        "backoff_factor": 1.0   # No backoff for fast operations
    }
See also
classify_error(): Error classification that determines when to retry
ErrorSeverity: RETRIABLE severity triggers retry policy usage

Retry policy configuration for failure recovery strategies.
Default Policy:
{
    "max_attempts": 3,      # Total attempts including initial
    "delay_seconds": 0.5,   # Base delay before first retry
    "backoff_factor": 1.5   # Exponential backoff multiplier
}
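To make the policy numbers concrete, the sketch below computes the wait before each retry. It assumes the delay grows as `delay_seconds * backoff_factor ** n` and that `max_attempts` counts the initial attempt, as the comment in the default policy suggests; the exact formula the framework uses is not stated here, and `retry_delays` is a hypothetical helper.

```python
from typing import Any, Dict, List


def retry_delays(policy: Dict[str, Any]) -> List[float]:
    """Return the wait (in seconds) before each retry under the assumed formula."""
    # max_attempts includes the initial attempt, so there is one fewer retry.
    retries = max(policy["max_attempts"] - 1, 0)
    return [
        policy["delay_seconds"] * policy["backoff_factor"] ** n
        for n in range(retries)
    ]


default_policy = {"max_attempts": 3, "delay_seconds": 0.5, "backoff_factor": 1.5}
print(retry_delays(default_policy))  # [0.5, 0.75]
```

Under these assumptions the default capability policy waits 0.5 s and then 0.75 s, for a worst case of 1.25 s of added delay before the error is escalated.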
- static BaseInfrastructureNode.get_retry_policy()[source]#
Get conservative retry policy configuration for infrastructure operations.
This method provides retry configuration optimized for infrastructure nodes that handle system-critical functions. The default policy uses conservative settings with minimal retry attempts and fast failure detection to maintain system stability.
Infrastructure nodes should generally fail fast rather than retry extensively, since failures often indicate system-level issues that require immediate attention. Override this method only for specific infrastructure components that can benefit from retry logic.
- Returns:
Dictionary containing conservative retry configuration parameters
- Return type:
Dict[str, Any]
Note
Infrastructure default policy: 2 attempts, 0.2s delay, minimal backoff. This prioritizes fast failure detection over retry persistence.
Example:
@staticmethod
def get_retry_policy() -> Dict[str, Any]:
    return {
        "max_attempts": 3,      # More retries for LLM-based infrastructure
        "delay_seconds": 1.0,   # Longer delay for external service calls
        "backoff_factor": 2.0   # Exponential backoff
    }
Note
The router uses this configuration to determine retry behavior. Infrastructure default: 2 attempts, 0.2s delay, minimal backoff.
Conservative retry policy for infrastructure operations.
Infrastructure Policy:
{
    "max_attempts": 2,      # Fast failure for infrastructure
    "delay_seconds": 0.2,   # Quick retry attempt
    "backoff_factor": 1.0   # No backoff
}
Integration Pattern#
Basic Error Handling#
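from framework.base.errors import ErrorSeverity, ExecutionError

async def execute_with_recovery(capability, state, context):
    # Illustrative wrapper; the function name is not part of the framework.
    try:
        return await capability.execute(state)
    except Exception as exc:
        # Classify error for recovery strategy
        classification = capability.classify_error(exc, context)

        if classification.severity == ErrorSeverity.RETRIABLE:
            # Handle with retry policy (retry_with_backoff is a placeholder helper)
            policy = capability.get_retry_policy()
            return await retry_with_backoff(capability, state, policy)
        elif classification.severity == ErrorSeverity.REPLANNING:
            # Route to orchestrator for new execution plan
            return "orchestrator"
        elif classification.severity == ErrorSeverity.RECLASSIFICATION:
            # Route to classifier for new capability selection
            return "classifier"
        elif classification.severity == ErrorSeverity.CRITICAL:
            # End execution with clear error message. ExecutionError is documented
            # as a plain dataclass, so it is built here and its message is raised
            # through a standard exception rather than raising the object itself.
            error = ExecutionError(
                severity=ErrorSeverity.CRITICAL,
                message=classification.user_message or str(exc),
                metadata=classification.metadata
            )
            raise RuntimeError(error.message) from exc
        # FATAL (or unrecognized severity): propagate the original exception
        raise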
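The `retry_with_backoff` helper used in the pattern above is not defined in this reference. A minimal sketch of what it could look like follows, assuming the first attempt has already failed, that `max_attempts` counts that first attempt, and that delays are multiplied by `backoff_factor` after each failure. The `sleep` parameter is an addition for testability; a production version would presumably also re-classify each new failure.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


async def retry_with_backoff(
    capability: Any,
    state: Any,
    policy: Dict[str, Any],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Re-run capability.execute(state) after an initial failed attempt."""
    delay = policy["delay_seconds"]
    retries = policy["max_attempts"] - 1  # the first attempt already failed
    last_exc: Optional[BaseException] = None
    for _ in range(retries):
        await sleep(delay)  # wait before each retry
        try:
            return await capability.execute(state)
        except Exception as exc:
            last_exc = exc
            delay *= policy["backoff_factor"]  # grow the delay for the next retry
    raise RuntimeError("Retry attempts exhausted") from last_exc
```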
Primary Error Context: Metadata Field#
The metadata field is the primary mechanism for providing structured error context in ErrorClassification.
Suggested Metadata Keys:
technical_details: Detailed technical information (replaces old technical_details field)
safety_abort_reason: Explanation for critical/fatal errors requiring immediate termination
replanning_reason: Explanation for errors requiring new execution plan generation
reclassification_reason: Explanation for errors requiring new capability selection
suggestions: List of actionable recovery steps for users
error_code: Machine-readable error identifier for programmatic handling
retry_after: Suggested delay before retry attempts (in seconds)
Advanced Usage Patterns:
Structured Technical Details: Replace simple strings with nested objects
Contextual Information: Include relevant state and execution context
Recovery Guidance: Provide specific, actionable recovery steps
System Integration: Enable programmatic error handling and monitoring
# Example: Comprehensive metadata usage
return ErrorClassification(
    severity=ErrorSeverity.REPLANNING,
    user_message="Data query scope too broad for current system limits",
    metadata={
        "technical_details": f"Query returned {result_count} results, limit is {max_limit}",
        "replanning_reason": "Query scope exceeded system performance limits",
        "suggestions": [
            "Reduce time range to last 24 hours",
            "Specify fewer measurement types",
            "Use data filtering parameters"
        ],
        "error_code": "QUERY_SCOPE_EXCEEDED",
        "retry_after": 10,
        "query_metrics": {
            "result_count": result_count,
            "max_limit": max_limit,
            "query_duration": query_time
        }
    }
)
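On the consuming side, the suggested keys can be read defensively, since `metadata` may be `None` and every key is optional. The sketch below pairs each severity value with the reason key suggested for it in the list above; the `REASON_KEY` table and `recovery_reason` helper are hypothetical and only illustrate the lookup.

```python
from typing import Any, Dict, Optional

# Which suggested metadata key explains each severity (per the key list above).
REASON_KEY = {
    "critical": "safety_abort_reason",
    "fatal": "safety_abort_reason",
    "replanning": "replanning_reason",
    "reclassification": "reclassification_reason",
}


def recovery_reason(severity_value: str, metadata: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return the reason string for a severity, or None if it was not supplied."""
    key = REASON_KEY.get(severity_value)
    if key is None or not metadata:
        return None
    return metadata.get(key)
```

The function takes the enum's string value (for example `ErrorSeverity.REPLANNING.value`) so the sketch stays free of framework imports.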
See also
- Exception Reference
Complete catalog of framework exceptions
- Recovery Coordination
Router coordination and recovery strategies