9. Implementation guidance

Status: Stable
Version: 2.0.0
Last updated: 2026-01-31
Authors: OpenALBA Working Group

9.1 Data pipeline architecture

Reference Architecture
┌─────────────────────────────────────────────────────────────────────┐
│  DATA SOURCES                                                        │
│  Apps with OTel SDK → OTel Collector → Exporters                    │
└─────────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────────┐
│  STORAGE (ClickHouse)                                                │
│                                                                      │
│  Raw Tables (7-30d)    Aggregated (90-365d)    ALBA Tables          │
│  - traces_raw          - user_metrics_hourly   - baselines          │
│  - metrics_raw         - service_metrics_min   - anomaly_scores     │
│  - logs_raw            - endpoint_metrics      - risk_scores        │
└─────────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────────┐
│  ALBA PROCESSING (K8s CronJobs)                                      │
│                                                                      │
│  Every 5min: Aggregation → Read raw, calculate metrics              │
│  Every 6hr:  Baselines   → Update statistical models                │
│  Every 5min: Detection   → Calculate anomaly scores                 │
│  Every 5min: Risk        → Apply multipliers, decay, suppression    │
│  Every 1min: Alerting    → Evaluate thresholds, route               │
│  Weekly:     ML Update   → Retrain Isolation Forest, clusters       │
└─────────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────────┐
│  OUTPUTS                                                             │
│  PagerDuty | Slack | Email | Grafana Dashboards                     │
└─────────────────────────────────────────────────────────────────────┘
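
The ingestion edge of this architecture maps onto a standard OpenTelemetry Collector pipeline. The sketch below is non-normative: it assumes the clickhouse exporter from opentelemetry-collector-contrib, and the endpoint, database, and batching values are placeholders chosen to match the diagram above.

Example Collector Pipeline (non-normative)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192

exporters:
  clickhouse:
    endpoint: tcp://clickhouse.observability.svc:9000   # placeholder address
    database: alba                                      # placeholder database
    traces_table_name: traces_raw
    logs_table_name: logs_raw

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]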

9.2 Processing schedule

Job                 Frequency   Target Duration   Dependencies
Aggregation         5 min       < 2 min           Raw data
Baseline Update     6 hours     < 30 min          Aggregated metrics
Anomaly Detection   5 min       < 2 min           Baselines
Risk Scoring        5 min       < 1 min           Anomaly scores
Alert Evaluation    1 min       < 30 sec          Risk scores
ML Model Update     Weekly      < 2 hours         Historical data

Warning

Each job's duration MUST be shorter than its scheduling interval; otherwise runs overlap and a backlog accumulates. Monitor job durations and implement backpressure mechanisms.
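
On Kubernetes, overlap protection can be enforced declaratively. The manifest below is a minimal sketch for the aggregation job, using a hypothetical name and image: concurrencyPolicy: Forbid skips a scheduled run while the previous one is still active, and activeDeadlineSeconds enforces the target duration.

Example CronJob (non-normative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: alba-aggregation             # hypothetical name
spec:
  schedule: "*/5 * * * *"            # every 5 minutes
  concurrencyPolicy: Forbid          # skip a run if the previous one is still active
  startingDeadlineSeconds: 60
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120     # enforce the < 2 min target duration
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: aggregate
              image: example.com/alba/aggregator:2.0   # placeholder image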

9.3 Storage estimates

Storage Estimates
# Assumptions: 1000 services, 10000 users, 1M req/hour

raw_traces:
  retention: 7 days
  per_day: ~50 GB
  total: ~350 GB

aggregated_metrics:
  retention: 365 days
  per_day: ~500 MB
  total: ~180 GB

baselines:
  retention: current + 1 previous
  total: ~1 GB

anomaly_scores:
  retention: 90 days
  per_day: ~100 MB
  total: ~9 GB

total_estimated: ~600 GB
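
As a sanity check, the raw-trace figure is consistent with roughly 2 KB of trace data per request; that per-request size is an assumption of this sketch, not a measured value.

Derivation (non-normative)
# 1M req/hour * 24h = 24M requests/day
requests_per_day: 24000000
trace_bytes_per_request: 2048      # assumption: ~2 KB of spans per request
raw_per_day: "24000000 * 2048 B ≈ 49 GB/day, i.e. ~50 GB"
raw_total: "~50 GB/day * 7 days retention ≈ 350 GB"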

9.4 Cardinality management

Cardinality Management
high_cardinality_fields:
  - user.id
  - session.id
  - client.address
  - trace_id

strategies:
  pre_aggregation:
    "Count per user per hour, not store each request"

  tiered_retention:
    raw: 7 days
    hourly: 30 days
    daily: 365 days

  sampling:
    errors: 100%
    security_events: 100%
    slow_requests: 100%
    normal: 5%

  sketches:
    hyperloglog: "Unique counts, ~2% error"
    tdigest: "Percentiles"
    count_min_sketch: "Frequency"

  bucketing:
    client.address: "country + asn"
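
If sampling is applied at the collector rather than in downstream jobs, the sampling policy above maps onto the tail_sampling processor from opentelemetry-collector-contrib. The sketch below is non-normative: the latency cutoff defining a "slow" request is assumed, and the 100% retention of security events would need an additional attribute-based policy that is omitted here.

Example Tail Sampling Policy (non-normative)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors            # errors: 100%
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow              # slow_requests: 100%
        type: latency
        latency:
          threshold_ms: 2000         # assumed definition of "slow"
      - name: sample-normal          # normal: 5%
        type: probabilistic
        probabilistic:
          sampling_percentage: 5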

9.5 Scaling considerations

Scaling Strategies
horizontal:
  aggregation: "Partition by entity_type/id hash"
  anomaly_detection: "Partition by entity_id"
  baseline_calculation: "Partition by entity × metric"

vertical:
  clickhouse:
    cpu: "Query complexity"
    memory: "Aggregation window size"
    disk: "Retention period"

bottleneck_indicators:
  - job_duration > interval
  - query_timeout_rate > 1%
  - memory_pressure
  - disk > 80%
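
The horizontal partitioning strategies above can be realized with Kubernetes Indexed Jobs, where each completion index owns one hash shard of the entity space. The shard count, names, and image below are placeholders.

Example Sharded Aggregation Job (non-normative)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: alba-aggregation-sharded     # hypothetical name
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      completionMode: Indexed        # each pod receives JOB_COMPLETION_INDEX
      completions: 8
      parallelism: 8
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: aggregate
              image: example.com/alba/aggregator:2.0   # placeholder image
              env:
                - name: SHARD_COUNT
                  value: "8"
              # each worker handles entities where
              # hash(entity_id) % SHARD_COUNT == JOB_COMPLETION_INDEX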

9.6 Configuration reference

Global Configuration
alba:
  version: "2.0"

  processing:
    aggregation_interval: "5m"
    detection_interval: "5m"
    baseline_update_interval: "6h"
    alert_evaluation_interval: "1m"

  baselines:
    default_window_days: 14
    minimum_samples: 100
    confidence_threshold: 0.8
    seasonal_adjustment: true
    outlier_exclusion:
      method: "winsorized"
      percentile: 5

  anomaly_detection:
    default_method: "modified_zscore"
    zscore_threshold: 3.0
    min_score_to_store: 20
    component_weights:
      deviation: 0.40
      rarity: 0.25
      velocity: 0.20
      persistence: 0.15

  risk_scoring:
    max_score: 100
    time_decay:
      enabled: true
      default_lambda: 0.1

  cold_start:
    enable_population_fallback: true
    enable_peer_group_transfer: true
    enable_confidence_adjustment: true
    min_samples_entity: 100
    samples_full_confidence: 500
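
For intuition about default_lambda, assuming the decay is exponential of the form score × e^(−λ·Δt) with Δt measured in hours (the decay form and unit are assumptions here; the risk scoring section is normative), λ = 0.1 decays a score of 100 as follows.

Decay Example (non-normative)
# score(t) = 100 * exp(-0.1 * t), t in hours (assumed unit)
t=0h:  100.0
t=5h:  60.7      # 100 * exp(-0.5)
t=10h: 36.8      # 100 * exp(-1.0)
t=24h: 9.1       # 100 * exp(-2.4)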

9.7 Operational monitoring

ALBA deployments SHOULD monitor the following metrics; an illustrative alert-rule sketch follows the list:

  • Data freshness: Time since last successful ingestion
  • Job health: Success rate, duration, backlog
  • Detection quality: False positive rate, true positive rate
  • Alert volume: Alerts per tier, per team, over time
  • Storage utilization: Disk usage, query performance
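
A minimal sketch of alert rules for the first two bullets, in Prometheus rule format; the metric names (alba_job_*, alba_last_ingest_*) are hypothetical and would come from the deployment's own instrumentation.

Example Alert Rules (non-normative)
groups:
  - name: alba-pipeline
    rules:
      - alert: AlbaJobOverrun
        # hypothetical metrics: job duration vs. its scheduling interval
        expr: alba_job_duration_seconds > alba_job_interval_seconds
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ALBA job {{ $labels.job_name }} runs longer than its interval"
      - alert: AlbaDataStale
        # hypothetical metric: timestamp of last successful ingestion
        expr: time() - alba_last_ingest_timestamp_seconds > 900
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No successful ingestion for more than 15 minutes"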

9.8 Conformance

Implementations:

  • MUST document their processing schedule
  • SHOULD implement monitoring for job health and data freshness
  • SHOULD implement cardinality management strategies
  • MAY use different storage backends than ClickHouse

Tip

For implementation examples, see the Getting Started Guide or explore the examples directory on GitHub.