Case study: Cost-efficient audit logging with OpenSearch

When a UK enterprise needed a resilient audit logging platform capable of year-long retention without breaking the budget, traditional approaches proved economically unfeasible. Our two-AZ Amazon OpenSearch solution delivered fast search performance on recent data, whilst making extended compliance retention affordable through intelligent storage tiering.

Requirements at a glance

Resilient, cost-efficient audit logging platform with predictable long retention
Low-latency search capabilities for operational investigations
Approximately 120GB daily ingestion with 10-15% monthly growth accommodation
Multi-AZ availability with RTO approximately 1 hour and RPO 24 hours
Least-privilege SSO integration with comprehensive audit trails

Industry: Enterprise IT

Solution: Amazon OpenSearch Service with UltraWarm

Result: Predictable year-long retention at reduced cost

Key metrics: 120GB daily ingestion | 346-day retention | Sub-second recent queries

Summary

A UK-based enterprise organisation required an audit logging platform that balanced regulatory compliance demands for extended retention with operational needs for rapid search capabilities. With approximately 120GB of audit data generated daily and growth projections of 10-15% monthly, the challenge demanded both performance and cost efficiency.

Working with our infrastructure specialists, the organisation implemented a two-AZ Amazon OpenSearch Service platform with sophisticated storage tiering. Through the strategic use of NVMe-backed hot nodes for recent data and UltraWarm storage for historical retention, we delivered a solution that provides predictable year-long retention while maintaining sub-second query performance on recent logs and acceptable performance for historical compliance searches.

The challenge

The organisation faced the classic audit logging dilemma: regulatory requirements demanded long-term retention whilst operational budgets couldn't sustain premium storage for rarely accessed historical data.

Technical challenges

Balanced performance requirements: Need for sub-second queries on recent operational data, whilst maintaining acceptable search performance across year-long retention for compliance investigations
Write-heavy workload optimisation: Consistent 120GB daily ingestion requiring high-throughput write performance without impacting query responsiveness
Growth accommodation: Architecture needed to support 10-15% monthly growth without performance degradation or cost surprises
Resilience constraints: Multi-AZ availability with specific recovery objectives (RTO approximately 1 hour, RPO 24 hours) whilst optimising storage costs
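
To make the growth requirement concrete, here is a rough projection sketch. It assumes the 10-15% monthly growth compounds on the 120GB daily baseline, which the requirements do not state explicitly. The function name and the 12-month horizon are illustrative.

```python
def projected_daily_ingest_gb(baseline_gb: float, monthly_growth: float, months: int) -> float:
    """Daily ingestion after `months` of compound monthly growth (assumed model)."""
    return baseline_gb * (1.0 + monthly_growth) ** months


if __name__ == "__main__":
    baseline = 120.0  # GB per day, from the requirements
    for rate in (0.10, 0.15):
        after_year = projected_daily_ingest_gb(baseline, rate, 12)
        print(f"{rate:.0%} monthly growth: ~{after_year:.0f} GB/day after 12 months")
```

Under that compounding assumption, daily volume would reach roughly 377GB to 642GB after twelve months, so tier sizing needs regular review rather than a one-off estimate.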

Operational requirements

The platform required least-privilege access controls integrated with existing SSO infrastructure, enabling secure self-service access for audit reviews whilst maintaining comprehensive audit trails of platform access itself.

Automated recovery became essential because the cost-saving architectural decisions depended on robust backup and restore procedures. Manual intervention during recovery scenarios would put the RTO objective at risk.

Traditional approaches of maintaining replicas across all retention periods proved economically prohibitive at year-long retention scales, requiring innovative solutions to balance resilience with cost efficiency.

The solution

We designed a two-AZ Amazon OpenSearch Service architecture that separates performance concerns across storage tiers, optimises for write-heavy audit workloads, and automates resilience operations.

Core architecture components

The implementation utilised purpose-built node configurations optimised for distinct workload phases:

Performance tier:

Hot nodes on i3.2xlarge.search instances leveraging local NVMe storage for high write throughput and fast queries on recent data
Three m5.large.search master nodes providing a stable cluster quorum and coordination

Retention tier:

Seven ultrawarm1.medium.search nodes delivering economical historical search capabilities for compliance investigations
Strategic tier separation enabling year-long retention without premium storage costs

Intelligent lifecycle management

The breakthrough in cost optimisation came through sophisticated Index State Management policies that automated data transitions:

ISM lifecycle strategy:

Hot tier: 31 days on NVMe-backed nodes for operational investigations and real-time audit analysis
UltraWarm tier: 315 days on cost-optimised storage for compliance retention and historical investigations
Automated deletion: After 346 days total retention, meeting regulatory requirements whilst controlling storage growth
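
The lifecycle above could be expressed as an ISM policy along the following lines. This is a minimal sketch, not the policy used in the project: the index pattern `audit-logs-*`, the function name, and the description text are assumptions. The `warm_migration` action and `min_index_age` condition follow the ISM policy format on Amazon OpenSearch Service, and the resulting body would typically be sent with `PUT _plugins/_ism/policies/<policy_id>`.

```python
import json


def build_ism_policy(hot_days: int = 31, warm_days: int = 315) -> dict:
    """Build a hot -> warm -> delete ISM policy body.

    Index age is measured from index creation, so the delete
    transition fires at hot_days + warm_days.
    """
    total_days = hot_days + warm_days
    return {
        "policy": {
            "description": "Audit logs: hot, then UltraWarm, then delete (illustrative)",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm", "conditions": {"min_index_age": f"{hot_days}d"}}
                    ],
                },
                {
                    "name": "warm",
                    # Moves the index from hot storage to UltraWarm.
                    "actions": [{"warm_migration": {}}],
                    "transitions": [
                        {"state_name": "delete", "conditions": {"min_index_age": f"{total_days}d"}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Hypothetical index pattern; the real naming scheme is not given here.
            "ism_template": {"index_patterns": ["audit-logs-*"], "priority": 100},
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy(), indent=2))
```

Deriving the delete threshold from the two tier durations keeps the 346-day total consistent if either tier is retuned later.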

Shard optimisation: Daily index rollover with approximately 40GB target shard size, utilising zero replicas in the hot tier whilst relying on automated snapshots for resilience. Compared with a single-replica configuration, this halves hot-tier storage whilst keeping recovery within the stated RTO and RPO objectives.
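
The arithmetic behind the shard and replica choices can be sketched as follows. It assumes a flat 120GB per day and ignores growth, index overhead, and compression, so the figures are indicative only. Both function names are illustrative.

```python
import math


def primary_shards_per_index(daily_gb: float, target_shard_gb: float) -> int:
    """Primary shards needed so each shard stays near the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


def tier_storage_gb(daily_gb: float, days_in_tier: int, replicas: int = 0) -> float:
    """Steady-state storage for a tier: primaries plus any replica copies."""
    return daily_gb * days_in_tier * (1 + replicas)


if __name__ == "__main__":
    daily_gb = 120.0
    print("primary shards per daily index:", primary_shards_per_index(daily_gb, 40.0))
    print("hot tier, zero replicas (GB):", tier_storage_gb(daily_gb, 31, replicas=0))
    print("hot tier, one replica (GB):", tier_storage_gb(daily_gb, 31, replicas=1))
    print("UltraWarm tier (GB):", tier_storage_gb(daily_gb, 315))
```

With these inputs a daily index needs three primary shards, and the 31-day hot tier holds about 3,720GB of primaries, against 7,440GB if one replica were kept.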

Resilience engineering

Rather than maintaining expensive replicas across all retention tiers, we implemented comprehensive backup and recovery automation:

Daily automated domain snapshots at 00:00 UTC with 14-day retention
Lambda-based red index recovery workflow enabling automated remediation without manual intervention
Multiple restore paths keeping RTO objectives achievable even during complex failure scenarios
Serverless ingestion and remediation functions sized for concurrency and reliability
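
The red index recovery workflow is not described in detail, so the following is only a sketch of the planning step such a Lambda function might perform. It takes the JSON output of `_cat/indices?format=json` and returns the REST calls needed to restore each red index from a snapshot. The function name, the snapshot identifier, and the decision to skip system indices are assumptions. `cs-automated` is the repository Amazon OpenSearch Service uses for automated snapshots (`cs-automated-enc` on domains with encryption at rest).

```python
from typing import Dict, List, Tuple

# (HTTP method, path, JSON body)
Step = Tuple[str, str, dict]


def plan_red_index_recovery(
    cat_indices: List[Dict[str, str]],
    snapshot_id: str,
    repository: str = "cs-automated",
) -> List[Step]:
    """Return the REST calls needed to restore every red index from a snapshot."""
    steps: List[Step] = []
    for entry in cat_indices:
        name = entry["index"]
        # Skip healthy indices and system indices (names starting with a dot).
        if entry.get("health") != "red" or name.startswith("."):
            continue
        # A restore cannot overwrite an existing open index, so delete it first.
        steps.append(("DELETE", f"/{name}", {}))
        steps.append((
            "POST",
            f"/_snapshot/{repository}/{snapshot_id}/_restore",
            {"indices": name, "include_global_state": False},
        ))
    return steps
```

Keeping the planning logic free of network calls makes it easy to test. A real handler would still need to sign and send the requests, pick the most recent snapshot that contains the index, and re-ingest any data written since that snapshot, in line with the 24-hour RPO.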

Security and access management

Identity-first security architecture simplified management whilst strengthening governance:

Okta SAML SSO with multi-factor authentication for secure access
Admin and Viewer roles mapped to organisational groups, with no local user accounts
IAM federation and STS for short-lived access credentials, reducing credential management overhead
Comprehensive CloudWatch logging of all platform activities, with alerts to Slack via Amazon SNS and EventBridge
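
As an illustration of group-based role mapping, the sketch below builds request bodies for the OpenSearch security plugin's role mapping API (`PUT _plugins/_security/api/rolesmapping/<role_name>`). The Okta group names are hypothetical, and mapping Admin and Viewer to the predefined `all_access` and `readall` roles is an assumption; the project may have used custom roles.

```python
from typing import Dict


def build_role_mappings(group_to_role: Dict[str, str]) -> Dict[str, dict]:
    """Return {role_name: request_body} with SSO groups as backend roles."""
    mappings: Dict[str, dict] = {}
    for group, role in group_to_role.items():
        # "users" stays empty: access comes only from SSO groups, not local accounts.
        body = mappings.setdefault(role, {"backend_roles": [], "users": []})
        body["backend_roles"].append(group)
    return mappings


if __name__ == "__main__":
    # Hypothetical group names as they might arrive in the SAML assertion.
    example = build_role_mappings({
        "okta-audit-admins": "all_access",
        "okta-audit-viewers": "readall",
    })
    for role, body in example.items():
        print(f"PUT _plugins/_security/api/rolesmapping/{role} {body}")
```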

The result and business impact

Three months after deployment, our implementation delivered the cost efficiency needed for year-long retention whilst maintaining the performance required for operational investigations.

Performance and cost achievements

Predictable year-long retention at reduced cost through UltraWarm storage tiering
Sub-second query performance for recent operational data across the 31-day hot tier
Acceptable search performance for historical compliance investigations across 315-day UltraWarm retention
Stable ingestion at approximately 120GB daily with proven headroom for 10-15% monthly growth

Operational excellence

Multiple restore paths and automated remediation workflows reduced mean time to recovery. The zero-replica strategy in the hot tier was validated through tested recovery procedures, eliminating replica storage costs without compromising resilience.

Comprehensive CloudWatch integration provided visibility into platform health and ingestion patterns, enabling proactive capacity management and performance optimisation.

Business value realisation

Compliance enablement: Year-long retention became economically viable through intelligent storage tiering, meeting regulatory requirements within budget constraints.

Operational efficiency: Identity-first security with SSO and short-lived credentials simplified access management on public endpoints whilst strengthening audit trails.

Predictable costs: Tiered architecture enabled accurate capacity planning and predictable operational expenditure as data volumes grew, eliminating storage cost surprises.

Key learnings and best practices

Workload-driven configuration

Real workload metrics proved essential for ISM timing, shard sizing, and rollover configuration. Initial estimates required refinement based on actual ingestion patterns and query behaviours, with regular review cycles maintaining optimisation as workloads evolved.

NVMe and Auto-Tune synergy

NVMe-backed hot nodes, combined with Auto-Tune, stabilised write-heavy audit workloads, delivering consistent ingestion performance without manual tuning intervention. Local NVMe storage eliminated network bottlenecks during high-volume ingestion periods.

Zero replicas viability

Zero replicas in the hot tier proved viable when snapshot and recovery playbooks were comprehensive and tested. This approach required strong operational discipline but delivered substantial cost savings without compromising actual resilience capabilities.

Identity-first security benefits

SSO integration with short-lived credentials simplified management whilst improving security posture. Eliminating local user accounts reduced credential sprawl and improved auditability of platform access patterns.

Looking forward

This implementation demonstrates how strategic architecture decisions can make compliance-driven retention requirements economically sustainable. Purpose-built storage tiers enable the independent optimisation of performance and cost characteristics, while comprehensive automation ensures resilience without an operational burden.

Future enhancements will focus on machine learning integration for anomaly detection within audit patterns and predictive capacity planning based on ingestion trends. The platform's architecture provides proven scalability for continued growth whilst maintaining the cost efficiency that made year-long retention feasible.

Unlock powerful search and analytics with expert Amazon OpenSearch implementation

As an Amazon OpenSearch Service Delivery Partner, Adaptavist brings validated expertise in implementing search, analytics, and observability solutions that transform how you manage and analyse your data. With proven success and deep AWS expertise across the full spectrum of cloud services, our certified specialists help you build, optimise, and scale your OpenSearch deployments with comprehensive solutions tailored to your business needs.

Discover our AWS solutions and expertise

Need a resilient audit logging platform? Let's discuss your requirements.

Contact Adaptavist today to discuss how we can help.
