8. Alert framework

Status: Stable
Version: 2.0.0
Last updated: 2026-01-31
Authors: OpenALBA Working Group

8.1 Alert tiers

Alerts are categorized into tiers based on risk score, with corresponding response SLAs:

Tier      Risk Score  Response SLA  Notification       Auto-Escalate
Critical  90-100      15 min        PagerDuty + Slack  30 min
High      70-89       1 hour        Slack + Email      2 hours
Medium    40-69       8 hours       Slack + Digest     24 hours
Low       20-39       24 hours      Email digest       None
Info      0-19        Weekly        Dashboard          None
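
As an illustration of the tier table, the following Python sketch maps a risk score to its tier. The names `Tier`, `TIERS`, and `tier_for_score` are hypothetical and not part of this specification. The sketch also assumes that a fractional score such as 89.5 falls into the lower of the two adjacent tiers, which the table does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    min_score: int
    max_score: int
    response_sla: str
    notification: str
    auto_escalate: Optional[str]  # None means no auto-escalation


# Rows copied from the tier table, ordered from highest to lowest.
TIERS = [
    Tier("Critical", 90, 100, "15 min", "PagerDuty + Slack", "30 min"),
    Tier("High", 70, 89, "1 hour", "Slack + Email", "2 hours"),
    Tier("Medium", 40, 69, "8 hours", "Slack + Digest", "24 hours"),
    Tier("Low", 20, 39, "24 hours", "Email digest", None),
    Tier("Info", 0, 19, "Weekly", "Dashboard", None),
]


def tier_for_score(risk_score: float) -> Tier:
    """Return the first tier whose lower bound the score reaches."""
    if not 0 <= risk_score <= 100:
        raise ValueError(f"risk score out of range: {risk_score}")
    for tier in TIERS:
        if risk_score >= tier.min_score:
            return tier
    raise AssertionError("unreachable: the Info tier starts at 0")


print(tier_for_score(85).name)          # High
print(tier_for_score(95).response_sla)  # 15 min
```

Keeping the table as data rather than as a chain of conditionals makes it easier to keep an implementation in step with the specification when tier boundaries change.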

8.2 Alert content structure

Alerts SHOULD include sufficient context for investigation without requiring additional queries:

Alert Content Structure (YAML):
header:
  alert_id: "ALBA-2024-02-15-001234"
  severity: "high"
  title: "Unusual data access for user john.doe@company.com"
  timestamp: "2024-02-15T14:32:00Z"

scores:
  anomaly_score: 78
  risk_scores:
    security: 85
    ops: 35
    engineering: 40
  confidence: 0.92

entity:
  type: "user"
  id: "john.doe@company.com"
  attributes:
    role: "engineer"
    department: "Product"
  criticality: 1.0

anomaly:
  type: "user_data_access_volume_anomaly"
  description: "User accessed 15,000 customer records in past hour"
  signals:
    - metric: "data_records_accessed"
      current: 15000
      baseline_mean: 120
      zscore: 330.67
  baseline_comparison: "125x normal volume"

context:
  timeline:
    - "14:00 - Normal login"
    - "14:05 - First bulk export endpoint access"
    - "14:10-14:30 - 150 export requests"
  related_entities:
    - "Endpoint: /api/customers/export"
  suppression_status: "None"
  change_windows: "None active"

investigation:
  suggested_queries:
    - "All requests by user in last 24h"
    - "All users who accessed bulk export today"
    - "User's HR status"
  dashboards: ["User Activity", "Data Access"]
  runbook: "https://wiki/runbooks/data-exfiltration"

actions: ["acknowledge", "suppress", "escalate", "resolve"]
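
The `signals` and `baseline_comparison` fields in this example can be reproduced with a short calculation. The Python sketch below is illustrative only: the function names are hypothetical, and the baseline standard deviation of 45 is an assumption. The example does not state it, but that value is consistent with the listed `zscore`.

```python
def zscore(current: float, baseline_mean: float, baseline_std: float) -> float:
    """Standard score: distance from the baseline mean in standard deviations."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return (current - baseline_mean) / baseline_std


def volume_ratio(current: float, baseline_mean: float) -> float:
    """How many times the baseline mean the current value represents."""
    if baseline_mean <= 0:
        raise ValueError("baseline_mean must be positive")
    return current / baseline_mean


current = 15000       # data_records_accessed in the example alert
baseline_mean = 120   # baseline_mean in the example alert
assumed_std = 45.0    # ASSUMPTION: not given in the example alert

print(round(zscore(current, baseline_mean, assumed_std), 2))         # 330.67
print(f"{volume_ratio(current, baseline_mean):.0f}x normal volume")  # 125x normal volume
```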

8.3 Alert routing

Alerts are routed based on anomaly type and risk score thresholds:

Alert Routing Rules (YAML):
security_team:
  conditions:
    - anomaly_type_in: [geographic, impossible_travel, credential_stuffing,
                        account_takeover, privilege_escalation, data_exfil,
                        new_external_connection]
    - risk_score.security >= 50
  channels:
    critical: [pagerduty:security, slack:#security-critical]
    high: [slack:#security-alerts]
    medium: [email:security@company.com]

sre_team:
  conditions:
    - anomaly_type_in: [error_rate, latency, dependency_failure,
                        traffic_anomaly, capacity, certificate]
    - risk_score.ops >= 50
  channels:
    critical: [pagerduty:sre, slack:#incidents]
    high: [slack:#sre-alerts]
    medium: [email:sre@company.com]

engineering:
  conditions:
    - anomaly_type_in: [new_exception, response_anomaly, api_violation]
    - risk_score.engineering >= 50
  channels:
    critical: [slack:#eng-oncall]
    high: [slack:#eng-alerts]
    medium: [email:eng-leads@company.com]
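
One possible way to evaluate these rules is sketched below in Python. The `ROUTES` layout and the `route` function are hypothetical. The sketch assumes that both conditions of a team must hold (logical AND) and that a severity with no configured channel produces no notification; the rules above do not state either point explicitly.

```python
# Routing rules transcribed from the YAML above into plain Python data.
ROUTES = {
    "security_team": {
        "anomaly_types": {
            "geographic", "impossible_travel", "credential_stuffing",
            "account_takeover", "privilege_escalation", "data_exfil",
            "new_external_connection",
        },
        "risk_key": "security",
        "min_risk": 50,
        "channels": {
            "critical": ["pagerduty:security", "slack:#security-critical"],
            "high": ["slack:#security-alerts"],
            "medium": ["email:security@company.com"],
        },
    },
    "sre_team": {
        "anomaly_types": {
            "error_rate", "latency", "dependency_failure",
            "traffic_anomaly", "capacity", "certificate",
        },
        "risk_key": "ops",
        "min_risk": 50,
        "channels": {
            "critical": ["pagerduty:sre", "slack:#incidents"],
            "high": ["slack:#sre-alerts"],
            "medium": ["email:sre@company.com"],
        },
    },
    "engineering": {
        "anomaly_types": {"new_exception", "response_anomaly", "api_violation"},
        "risk_key": "engineering",
        "min_risk": 50,
        "channels": {
            "critical": ["slack:#eng-oncall"],
            "high": ["slack:#eng-alerts"],
            "medium": ["email:eng-leads@company.com"],
        },
    },
}


def route(anomaly_type: str, risk_scores: dict, severity: str) -> list:
    """Return every channel whose team rule matches the alert."""
    targets = []
    for rule in ROUTES.values():
        if anomaly_type not in rule["anomaly_types"]:
            continue  # anomaly type is not owned by this team
        if risk_scores.get(rule["risk_key"], 0) < rule["min_risk"]:
            continue  # team-specific risk score is below the threshold
        targets.extend(rule["channels"].get(severity, []))
    return targets


print(route("data_exfil", {"security": 85, "ops": 35}, "high"))
# ['slack:#security-alerts']
```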

8.4 Aggregation and deduplication

Aggregation Rules (YAML):
aggregation:
  same_entity_type:
    group_by: [entity_type, entity_id, anomaly_type]
    window: 15 minutes
    action: "Merge into single alert"

  service_incident:
    group_by: [service.name]
    window: 30 minutes
    conditions:
      anomaly_types: [error_rate, latency, dependency_failure]
    action: "Create incident linking all anomalies"

  attack_campaign:
    group_by: [anomaly_type]
    window: 1 hour
    conditions:
      affected_entities: "> 10"
    action: "Create campaign alert"

deduplication:
  ongoing_anomaly:
    conditions: [same_entity, same_type, previous_alert_open]
    action: "Update existing, don't create new"

  flapping_prevention:
    conditions:
      - same_entity
      - same_type
      - alert_count > 3 in 1 hour
    action: "Suppress, create meta-alert"
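
The two deduplication rules can be sketched in Python as follows. The `Deduplicator` class and the decision strings it returns are hypothetical. The sketch assumes that the flapping counter counts every trigger not absorbed by an open alert, including triggers that were themselves suppressed; the rules above leave that detail open.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class Deduplicator:
    def __init__(self, flap_limit: int = 3,
                 flap_window: timedelta = timedelta(hours=1)):
        self.flap_limit = flap_limit
        self.flap_window = flap_window
        self.open_alerts = set()            # (entity_id, anomaly_type) keys
        self.history = defaultdict(deque)   # key -> recent trigger times

    def decide(self, entity_id: str, anomaly_type: str, now: datetime) -> str:
        key = (entity_id, anomaly_type)
        # ongoing_anomaly: same entity, same type, previous alert still open.
        if key in self.open_alerts:
            return "update_existing"
        # flapping_prevention: more than flap_limit triggers in the window.
        recent = self.history[key]
        while recent and now - recent[0] > self.flap_window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.flap_limit:
            return "suppress_and_meta_alert"
        self.open_alerts.add(key)
        return "create"

    def resolve(self, entity_id: str, anomaly_type: str) -> None:
        """Mark the alert for this entity and type as closed."""
        self.open_alerts.discard((entity_id, anomaly_type))


dedup = Deduplicator()
start = datetime(2024, 2, 15, 14, 0)
print(dedup.decide("john.doe@company.com", "data_exfil", start))
# create
print(dedup.decide("john.doe@company.com", "data_exfil",
                   start + timedelta(minutes=5)))
# update_existing
```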

8.5 Feedback loop

Alert feedback improves detection accuracy over time:

Feedback Loop (YAML):
false_positive:
  actions:
    - Log with context
    - Suggest suppression rule if pattern
    - Adjust baseline if appropriate
  threshold_adjustment:
    after_3_fps: "Widen threshold 10%"
    after_5_fps: "Flag for rule review"

true_positive:
  actions:
    - Link to incident ticket
    - Update risk multipliers (+5%)
    - Add to ML training data

metrics_targets:
  false_positive_rate: "< 5%"
  true_positive_rate: "> 80%"
  mean_time_to_acknowledge: "< response SLA for the alert's tier"
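
The `threshold_adjustment` rules might be applied per detection rule as in the Python sketch below. The `RuleFeedback` class is hypothetical. It assumes that "widen 10%" means multiplying the detection threshold by 1.10 once, when the third false positive is recorded, and that the review flag stays set from the fifth false positive onward.

```python
from dataclasses import dataclass


@dataclass
class RuleFeedback:
    threshold: float            # current detection threshold, e.g. a z-score
    false_positives: int = 0
    needs_review: bool = False

    def record_false_positive(self) -> None:
        self.false_positives += 1
        # after_3_fps: widen the threshold by 10% (applied once).
        if self.false_positives == 3:
            self.threshold *= 1.10
        # after_5_fps: flag the rule for human review.
        if self.false_positives >= 5:
            self.needs_review = True


rule = RuleFeedback(threshold=3.0)
for _ in range(5):
    rule.record_false_positive()

print(round(rule.threshold, 2))  # 3.3
print(rule.needs_review)         # True
```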

8.6 Conformance

Implementations:

  • MUST support at least three alert severity tiers
  • SHOULD include anomaly scores and context in alerts
  • SHOULD implement alert deduplication
  • MAY implement feedback loops for continuous improvement

Note: Section 9 (Implementation Guidance) covers deployment architecture and operational guidance.