Cost-efficient audit logging on Amazon OpenSearch with UltraWarm
When a UK enterprise needed a resilient audit logging platform capable of year-long retention without breaking the budget, the traditional approach of keeping every index replicated on premium storage proved uneconomical. Our two-AZ Amazon OpenSearch Service solution delivered fast search on recent data whilst making extended compliance retention affordable through storage tiering.

Requirements at a glance
- Resilient, cost-efficient audit logging platform with predictable long retention
- Low-latency search capabilities for operational investigations
- Approximately 120GB daily ingestion, with headroom for 10-15% monthly growth
- Multi-AZ availability with RTO approximately 1 hour and RPO 24 hours
- Least-privilege SSO integration with comprehensive audit trails

Industry: Enterprise IT

Solution: Amazon OpenSearch Service with UltraWarm

Result: Predictable year-long retention at reduced cost

Key metric: 120GB daily ingestion | 346-day retention | Sub-second recent queries
Summary
A UK-based enterprise organisation required an audit logging platform that balanced regulatory compliance demands for extended retention with operational needs for rapid search capabilities. With approximately 120GB of audit data generated daily and growth projections of 10-15% monthly, the challenge demanded both performance and cost efficiency.
Working with our infrastructure specialists, the organisation implemented a two-AZ Amazon OpenSearch Service platform with sophisticated storage tiering. Through the strategic use of NVMe-backed hot nodes for recent data and UltraWarm storage for historical retention, we delivered a solution that provides predictable year-long retention while maintaining sub-second query performance on recent logs and acceptable performance for historical compliance searches.
The challenge
The organisation faced the classic audit logging dilemma: regulatory requirements demanded long-term retention whilst operational budgets couldn't sustain premium storage for rarely accessed historical data.
Technical challenges
- Balanced performance requirements: Need for sub-second queries on recent operational data, whilst maintaining acceptable search performance across year-long retention for compliance investigations
- Write-heavy workload optimisation: Consistent 120GB daily ingestion requiring high-throughput write performance without impacting query responsiveness
- Growth accommodation: Architecture needed to support 10-15% monthly growth without performance degradation or cost surprises
- Resilience constraints: Multi-AZ availability with specific recovery objectives (RTO approximately 1 hour, RPO 24 hours) whilst optimising storage costs
Operational requirements
The platform required least-privilege access controls integrated with existing SSO infrastructure, enabling secure self-service access for audit reviews whilst maintaining comprehensive audit trails of platform access itself.
Automated recovery became essential because the cost-saving design choices, notably running the hot tier without replicas, depended on robust backup and restore procedures. Manual intervention during recovery would put the one-hour RTO at risk.
Traditional approaches of maintaining replicas across all retention periods proved economically prohibitive at year-long retention scales, requiring innovative solutions to balance resilience with cost efficiency.
The solution
We designed a two-AZ Amazon OpenSearch Service architecture that separates performance concerns across storage tiers, optimises for write-heavy audit workloads, and automates resilience operations.
Core architecture components
The implementation utilised purpose-built node configurations optimised for distinct workload phases:
Performance tier:
- Hot nodes on i3.2xlarge.search instances leveraging local NVMe storage for high write throughput and fast queries on recent data
- Three m5.large.search master nodes providing a stable cluster quorum and coordination
Retention tier:
- Seven ultrawarm1.medium.search nodes delivering economical historical search capabilities for compliance investigations
- Strategic tier separation enabling year-long retention without premium storage costs
Intelligent lifecycle management
The main cost saving came from Index State Management (ISM) policies that automated data transitions between storage tiers:
ISM lifecycle strategy:
- Hot tier: 31 days on NVMe-backed nodes for operational investigations and real-time audit analysis
- UltraWarm tier: 315 days on cost-optimised storage for compliance retention and historical investigations
- Automated deletion: After 346 days total retention, meeting regulatory requirements whilst controlling storage growth
Shard optimisation: Daily index rollover with a target shard size of approximately 40GB and zero replicas in the hot tier, relying on automated snapshots for resilience. Dropping the replica copy roughly halved hot-tier storage compared with a one-replica layout, without compromising recovery capabilities.
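The lifecycle above can be sketched as an ISM policy body. This is a minimal illustration rather than the policy used on the project: the index pattern `audit-*` and the policy description are assumptions, and the ages are measured from index creation, which suits date-named daily indices.

```python
HOT_DAYS = 31                        # hot tier duration from the lifecycle above
WARM_DAYS = 315                      # UltraWarm duration from the lifecycle above
TOTAL_DAYS = HOT_DAYS + WARM_DAYS    # 346 days before deletion


def build_audit_ism_policy(index_pattern: str = "audit-*") -> dict:
    """Return an ISM policy body: hot -> UltraWarm -> delete.

    index_pattern is a hypothetical naming scheme, not taken from the project.
    """
    return {
        "policy": {
            "description": "Audit logs: hot, then UltraWarm, then delete",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm",
                         "conditions": {"min_index_age": f"{HOT_DAYS}d"}}
                    ],
                },
                {
                    "name": "warm",
                    # warm_migration moves the index to UltraWarm storage.
                    "actions": [{"warm_migration": {}}],
                    "transitions": [
                        {"state_name": "delete",
                         "conditions": {"min_index_age": f"{TOTAL_DAYS}d"}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }
```

Serialised with `json.dumps`, the returned dictionary could be sent as the body of `PUT _plugins/_ism/policies/<policy_id>`; the policy ID is left to the reader.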
Resilience engineering
Rather than maintaining expensive replicas across all retention tiers, we implemented comprehensive backup and recovery automation:
- Daily automated domain snapshots at 00:00 UTC with 14-day retention
- Lambda-based red index recovery workflow enabling automated remediation without manual intervention
- Multiple restore paths keeping RTO objectives achievable even during complex failure scenarios
- Serverless ingestion and remediation functions sized for concurrency and reliability
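The red index recovery workflow is not described in detail here, so the following is only a sketch of how such a remediation step might plan its work. It assumes the function receives the parsed output of `_cat/indices?format=json` and of a snapshot listing, and that a red index is replaced from the newest successful automated snapshot. The function names and the "escalate" fallback are hypothetical.

```python
from typing import Optional

# "cs-automated" is the repository Amazon OpenSearch Service uses for automated
# snapshots; domains with encryption at rest use "cs-automated-enc" instead.
REPOSITORY = "cs-automated"


def latest_snapshot_for(index: str, snapshots: list[dict]) -> Optional[str]:
    """Pick the newest successful snapshot that contains the index."""
    candidates = [
        s for s in snapshots
        if s.get("state") == "SUCCESS" and index in s.get("indices", [])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["end_time_in_millis"])["snapshot"]


def plan_red_index_recovery(cat_indices: list[dict], snapshots: list[dict],
                            repository: str = REPOSITORY) -> list[dict]:
    """Turn index health rows into delete-and-restore requests for red indices."""
    plan = []
    for row in cat_indices:
        if row.get("health") != "red":
            continue
        index = row["index"]
        snapshot = latest_snapshot_for(index, snapshots)
        if snapshot is None:
            # No usable snapshot: leave the index alone and flag it for a human.
            plan.append({"action": "escalate", "index": index})
            continue
        # An open index with the same name must be removed before restoring.
        plan.append({"action": "request", "method": "DELETE", "path": f"/{index}"})
        plan.append({
            "action": "request",
            "method": "POST",
            "path": f"/_snapshot/{repository}/{snapshot}/_restore",
            "body": {"indices": index},
        })
    return plan
```

Keeping the planning step free of network calls makes it easy to test, and the calling function can sign and send each request separately. Data written after the chosen snapshot is lost on restore, which is the trade-off the 24-hour RPO allows for.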
Security and access management
Identity-first security architecture simplified management whilst strengthening governance:
- Okta SAML SSO with multi-factor authentication for secure access
- Admin and Viewer roles mapped to organisational groups, with no local user accounts
- IAM federation and STS for short-lived access credentials, reducing credential management overhead
- Comprehensive CloudWatch logging of all platform activities, with alerts to Slack via Amazon SNS and EventBridge
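As an illustration of the group-to-role mapping, the sketch below builds role-mapping requests for the OpenSearch security plugin, where SAML group names arrive as backend roles. The Okta group names are hypothetical, and the use of the built-in `all_access` and `readall` roles for Admin and Viewer is an assumption; the project may well have used custom roles.

```python
from collections import defaultdict


def role_mapping_requests(group_to_role: dict[str, str]) -> list[dict]:
    """Build one role-mapping request per OpenSearch role."""
    groups_by_role = defaultdict(list)
    for group, role in group_to_role.items():
        groups_by_role[role].append(group)
    return [
        {
            "method": "PUT",
            "path": f"/_plugins/_security/api/rolesmapping/{role}",
            # Only SSO groups are mapped; no individual users are listed.
            "body": {"backend_roles": sorted(groups), "users": []},
        }
        for role, groups in sorted(groups_by_role.items())
    ]


# Hypothetical Okta group names mapped to built-in security plugin roles.
EXAMPLE_MAPPING = {
    "okta-audit-admins": "all_access",
    "okta-audit-viewers": "readall",
}
```

Calling `role_mapping_requests(EXAMPLE_MAPPING)` yields two requests, one per role, which keeps the mapping reviewable in version control alongside the rest of the platform configuration.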
The result and business impact
Three months after deployment, our implementation delivered the cost efficiency needed for year-long retention whilst maintaining the performance required for operational investigations.
Performance and cost achievements
- Predictable year-long retention at reduced cost through UltraWarm storage tiering
- Sub-second query performance for recent operational data across the 31-day hot tier
- Acceptable search performance for historical compliance investigations across 315-day UltraWarm retention
- Stable ingestion at approximately 120GB daily with proven headroom for 10-15% monthly growth
Operational excellence
Multiple restore paths and automated remediation workflows reduced mean time to recovery. The zero-replica strategy in the hot tier was validated through tested recovery procedures, eliminating replica storage costs without compromising resilience.
Comprehensive CloudWatch integration provided visibility into platform health and ingestion patterns, enabling proactive capacity management and performance optimisation.

Business value realisation
Compliance enablement: Year-long retention became economically viable through intelligent storage tiering, meeting regulatory requirements within budget.
Operational efficiency: Identity-first security with SSO and short-lived credentials simplified access management on public endpoints whilst strengthening audit trails.
Predictable costs: Tiered architecture enabled accurate capacity planning and predictable operational expenditure as data volumes grew, eliminating storage cost surprises.

Key learnings and best practices
Workload-driven configuration
Real workload metrics proved essential for ISM timing, shard sizing, and rollover configuration. Initial estimates required refinement based on actual ingestion patterns and query behaviours, with regular review cycles maintaining optimisation as workloads evolved.
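A small calculation shows why shard sizing needs regular review as volumes grow. The sketch below assumes that on-disk index size is roughly equal to ingested volume, which real workloads should confirm or correct; the function names are illustrative.

```python
import math


def projected_daily_gb(current_gb: float, monthly_growth: float, months: int) -> float:
    """Compound the daily ingestion volume month over month."""
    return current_gb * (1 + monthly_growth) ** months


def primary_shards(daily_gb: float, target_shard_gb: float = 40.0) -> int:
    """Primary shards needed so each shard stays at or below the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


def shard_plan(current_gb: float, monthly_growth: float,
               months: int) -> list[tuple[int, float, int]]:
    """(month, projected GB per day, primary shards) for each month in the horizon."""
    plan = []
    for month in range(months + 1):
        volume = projected_daily_gb(current_gb, monthly_growth, month)
        plan.append((month, round(volume, 1), primary_shards(volume)))
    return plan
```

At 120GB a day and a 40GB target, `primary_shards(120)` gives three primary shards per daily index. Running `shard_plan(120, 0.10, 12)` shows how soon that count would need to rise if growth stayed at the lower end of the projected range.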
NVMe and Auto-Tune synergy
NVMe-backed hot nodes, combined with Auto-Tune, stabilised write-heavy audit workloads, delivering consistent ingestion performance without manual tuning intervention. Local NVMe storage eliminated network bottlenecks during high-volume ingestion periods.
Zero replicas viability
Zero replicas in the hot tier proved viable when snapshot and recovery playbooks were comprehensive and tested. This approach required strong operational discipline but delivered substantial cost savings without compromising actual resilience capabilities.
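The scale of the saving can be estimated with simple arithmetic. The sketch below compares steady-state hot-tier storage with zero and one replica, assuming on-disk size roughly matches the 120GB daily ingestion figure and ignoring indexing overhead and compression.

```python
def hot_tier_storage_gb(daily_gb: float, hot_days: int, replicas: int) -> float:
    """Steady-state hot-tier footprint: primaries plus replica copies."""
    return daily_gb * hot_days * (1 + replicas)


def replica_saving_gb(daily_gb: float = 120.0, hot_days: int = 31) -> float:
    """Hot-tier storage avoided by running zero replicas instead of one."""
    return (hot_tier_storage_gb(daily_gb, hot_days, replicas=1)
            - hot_tier_storage_gb(daily_gb, hot_days, replicas=0))
```

Under these assumptions the 31-day hot tier holds about 3,720GB of primaries, and a single replica would add the same amount again on NVMe-backed nodes.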
Identity-first security benefits
SSO integration with short-lived credentials simplified management whilst improving security posture. Eliminating local user accounts reduced credential sprawl and improved auditability of platform access patterns.
Looking forward
This implementation demonstrates how strategic architecture decisions can make compliance-driven retention requirements economically sustainable. Purpose-built storage tiers enable the independent optimisation of performance and cost characteristics, while comprehensive automation ensures resilience without an operational burden.
Future enhancements will focus on machine learning integration for anomaly detection within audit patterns and predictive capacity planning based on ingestion trends. The platform's architecture provides proven scalability for continued growth whilst maintaining the cost efficiency that made year-long retention feasible.

Need a resilient audit logging platform? Let's discuss your requirements.
Contact Adaptavist today to discuss how we can help.