9. Implementation guidance
| Field | Value |
|---|---|
| Status | Stable |
| Version | 2.0.0 |
| Last updated | 2026-01-31 |
| Authors | OpenALBA Working Group |
9.1 Data pipeline architecture
Reference Architecture

```text
┌────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES                                                       │
│   Apps with OTel SDK → OTel Collector → Exporters                  │
└────────────────────────────────────────────────────────────────────┘
                                  ↓
┌────────────────────────────────────────────────────────────────────┐
│ STORAGE (ClickHouse)                                               │
│                                                                    │
│  Raw Tables (7-30d)      Aggregated (90-365d)      ALBA Tables     │
│  - traces_raw            - user_metrics_hourly     - baselines     │
│  - metrics_raw           - service_metrics_min     - anomaly_scores│
│  - logs_raw              - endpoint_metrics        - risk_scores   │
└────────────────────────────────────────────────────────────────────┘
                                  ↓
┌────────────────────────────────────────────────────────────────────┐
│ ALBA PROCESSING (K8s CronJobs)                                     │
│                                                                    │
│  Every 5min:  Aggregation → Read raw, calculate metrics            │
│  Every 6hr:   Baselines   → Update statistical models              │
│  Every 5min:  Detection   → Calculate anomaly scores               │
│  Every 5min:  Risk        → Apply multipliers, decay, suppression  │
│  Every 1min:  Alerting    → Evaluate thresholds, route             │
│  Weekly:      ML Update   → Retrain Isolation Forest, clusters     │
└────────────────────────────────────────────────────────────────────┘
                                  ↓
┌────────────────────────────────────────────────────────────────────┐
│ OUTPUTS                                                            │
│   PagerDuty | Slack | Email | Grafana Dashboards                   │
└────────────────────────────────────────────────────────────────────┘
```

9.2 Processing schedule
| Job | Frequency | Target Duration | Dependencies |
|---|---|---|---|
| Aggregation | 5 min | < 2 min | Raw data |
| Baseline Update | 6 hours | < 30 min | Aggregated metrics |
| Anomaly Detection | 5 min | < 2 min | Baselines |
| Risk Scoring | 5 min | < 1 min | Anomaly scores |
| Alert Evaluation | 1 min | < 30 sec | Risk scores |
| ML Model Update | Weekly | < 2 hours | Historical data |
> **Warning:** Job durations MUST be less than job frequency to avoid backlog accumulation. Monitor job duration and implement backpressure mechanisms.
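The sustainability rule above can be checked mechanically. A minimal sketch, with the schedule table hard-coded and an assumed 80% headroom factor (the factor is illustrative, not normative):

```python
# Verify that every job's target duration fits within its scheduling
# interval, with headroom so transient slowdowns do not snowball into
# a backlog. Values (in seconds) mirror the processing schedule table.

JOBS = {
    "aggregation":       {"frequency": 300,    "target_duration": 120},
    "baseline_update":   {"frequency": 21600,  "target_duration": 1800},
    "anomaly_detection": {"frequency": 300,    "target_duration": 120},
    "risk_scoring":      {"frequency": 300,    "target_duration": 60},
    "alert_evaluation":  {"frequency": 60,     "target_duration": 30},
    "ml_model_update":   {"frequency": 604800, "target_duration": 7200},
}

def sustainable(frequency_s: float, duration_s: float,
                headroom: float = 0.8) -> bool:
    """A job is sustainable if its duration stays under a fraction
    (headroom) of its scheduling interval."""
    return duration_s <= headroom * frequency_s

violations = [name for name, job in JOBS.items()
              if not sustainable(job["frequency"], job["target_duration"])]
```

An empty `violations` list confirms the reference schedule leaves headroom everywhere; in production, feed real measured durations into the same check.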
9.3 Storage estimates
```yaml
# Assumptions: 1000 services, 10000 users, 1M req/hour
raw_traces:
  retention: 7 days
  per_day: ~50 GB
  total: ~350 GB
aggregated_metrics:
  retention: 365 days
  per_day: ~500 MB
  total: ~180 GB
baselines:
  retention: current + 1 previous
  total: ~1 GB
anomaly_scores:
  retention: 90 days
  per_day: ~100 MB
  total: ~9 GB
total_estimated: ~600 GB
```

9.4 Cardinality management
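One of the strategies in this section is sampling: 100% retention for errors, security events, and slow requests, but only 5% of normal traffic. A minimal sketch of a deterministic head sampler; the event fields are illustrative assumptions, not a fixed ALBA schema:

```python
import hashlib

# Deterministic head sampling: always keep the 100%-retention classes;
# keep a stable ~5% of normal traffic by hashing trace_id, so every
# span of a given trace shares the same keep/drop decision.

NORMAL_SAMPLE_RATE = 0.05

def keep_event(event: dict) -> bool:
    if event.get("is_error") or event.get("is_security") or event.get("is_slow"):
        return True  # 100% retention classes
    digest = hashlib.sha256(event["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < NORMAL_SAMPLE_RATE
```

Hashing the trace ID (rather than calling a random number generator) makes the decision reproducible across collectors, which matters when spans of one trace arrive at different nodes.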
```yaml
high_cardinality_fields:
  - user.id
  - session.id
  - client.address
  - trace_id
strategies:
  pre_aggregation: "Count per user per hour instead of storing each request"
  tiered_retention:
    raw: 7 days
    hourly: 30 days
    daily: 365 days
  sampling:
    errors: 100%
    security_events: 100%
    slow_requests: 100%
    normal: 5%
  sketches:
    hyperloglog: "Unique counts, ~2% error"
    tdigest: "Percentiles"
    count_min_sketch: "Frequency"
  bucketing:
    client.address: "Bucket to country + ASN"
```

9.5 Scaling considerations
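The horizontal strategies in this section all partition work by a stable hash of the entity, so per-entity state (baselines, anomaly history) stays on one worker. A minimal sketch:

```python
import hashlib

# Stable assignment of entities to workers: the same entity_id always
# lands on the same partition. Uses sha256 rather than Python's built-in
# hash(), which is randomized per process and therefore not stable.

def partition_for(entity_id: str, num_partitions: int) -> int:
    digest = hashlib.sha256(entity_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note that resizing `num_partitions` reshuffles most entities; if rebalancing cost matters, a consistent-hashing scheme limits the churn.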
```yaml
horizontal:
  aggregation: "Partition by entity_type/id hash"
  anomaly_detection: "Partition by entity_id"
  baseline_calculation: "Partition by entity × metric"
vertical:
  clickhouse:
    cpu: "Query complexity"
    memory: "Aggregation window size"
    disk: "Retention period"
bottleneck_indicators:
  - job_duration > interval
  - query_timeout_rate > 1%
  - memory_pressure
  - disk > 80%
```

9.6 Configuration reference
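Two numeric settings in the configuration below are easiest to understand operationally: `component_weights` combine per-component scores into one anomaly score, and `default_lambda` drives time decay of risk. A sketch under the assumption that the combination is a plain weighted sum and the decay is exponential (`score × e^(-λt)`); the normative formulas in the scoring sections govern if they differ:

```python
import math

# Illustrative use of two settings from the ALBA configuration:
# component_weights (anomaly_detection) and default_lambda (risk_scoring).

COMPONENT_WEIGHTS = {"deviation": 0.40, "rarity": 0.25,
                     "velocity": 0.20, "persistence": 0.15}

def anomaly_score(components: dict) -> float:
    """Combine per-component scores (each 0-100) into one 0-100 score
    as a weighted sum; missing components contribute zero."""
    return sum(COMPONENT_WEIGHTS[k] * components.get(k, 0.0)
               for k in COMPONENT_WEIGHTS)

def decayed_risk(score: float, hours_elapsed: float,
                 lam: float = 0.1) -> float:
    """Exponential time decay: score * e^(-lambda * t)."""
    return score * math.exp(-lam * hours_elapsed)
```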
```yaml
alba:
  version: "2.0"
  processing:
    aggregation_interval: "5m"
    detection_interval: "5m"
    baseline_update_interval: "6h"
    alert_evaluation_interval: "1m"
  baselines:
    default_window_days: 14
    minimum_samples: 100
    confidence_threshold: 0.8
    seasonal_adjustment: true
    outlier_exclusion:
      method: "winsorized"
      percentile: 5
  anomaly_detection:
    default_method: "modified_zscore"
    zscore_threshold: 3.0
    min_score_to_store: 20
    component_weights:
      deviation: 0.40
      rarity: 0.25
      velocity: 0.20
      persistence: 0.15
  risk_scoring:
    max_score: 100
    time_decay:
      enabled: true
      default_lambda: 0.1
  cold_start:
    enable_population_fallback: true
    enable_peer_group_transfer: true
    enable_confidence_adjustment: true
    min_samples_entity: 100
    samples_full_confidence: 500
```

9.7 Operational monitoring
ALBA deployments SHOULD monitor the following metrics:
- Data freshness: Time since last successful ingestion
- Job health: Success rate, duration, backlog
- Detection quality: False positive rate, true positive rate
- Alert volume: Alerts per tier, per team, over time
- Storage utilization: Disk usage, query performance
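The first of these, data freshness, is the cheapest early-warning signal for a stalled pipeline. A minimal sketch; the 10-minute threshold is an illustrative assumption, not normative:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Data-freshness check: flag when the time since the last successful
# ingestion exceeds a threshold. Pair with an alert on the flag itself,
# since a stalled pipeline produces no anomaly alerts of its own.

FRESHNESS_THRESHOLD = timedelta(minutes=10)

def is_stale(last_ingest: datetime,
             now: Optional[datetime] = None) -> bool:
    """True if ingestion has fallen behind the freshness threshold."""
    now = now or datetime.now(timezone.utc)
    return (now - last_ingest) > FRESHNESS_THRESHOLD
```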
9.8 Conformance
Implementations:
- MUST document their processing schedule
- SHOULD implement monitoring for job health and data freshness
- SHOULD implement cardinality management strategies
- MAY use different storage backends than ClickHouse
> **Tip:** For implementation examples, see the Getting Started Guide or explore the examples directory on GitHub.