High-scale observability on Amazon OpenSearch with ISM and UltraWarm
When a leading SaaS provider needed to scale their log analytics platform to handle multi-terabyte daily ingestion, they faced the classic observability challenge: balancing rapid search performance on recent data with cost-effective historical retention. Our multi-AZ Amazon OpenSearch solution transformed their monitoring capabilities whilst dramatically reducing long-term storage costs.

Requirements at a glance
- Cost-optimised, resilient log analytics platform for multi-terabyte daily ingestion
- Rapid search performance on recent data with affordable historical retention
- Least-privilege SSO integration with role-based access control
- Automated recovery capabilities with minimal manual intervention
- Support for growth above 15% monthly, with consistent performance during burst ingestion

Industry: Software-as-a-Service provider

Solution: Amazon OpenSearch Service with ISM and UltraWarm

Result: 65% reduction in long-term storage costs

Key metrics: 3TB daily ingestion | 36-day retention | Zero query timeouts
Summary
A UK-based SaaS provider processing approximately 3TB of log data daily required a robust observability platform that could scale with aggressive growth projections, whilst maintaining cost efficiency. The organisation needed rapid access to recent logs for troubleshooting whilst retaining historical data for compliance and trend analysis.
Working with our infrastructure specialists, the organisation implemented a multi-AZ Amazon OpenSearch Service platform with sophisticated data lifecycle management. Through the strategic use of hot data nodes, UltraWarm storage tiers, and Index State Management policies, we delivered a solution that reduced long-term storage costs by approximately 65%, while eliminating query timeouts and improving overall system resilience.
The organisation's observability requirements presented a complex balancing act between performance, cost, and operational resilience as its platform scaled rapidly.
The challenge
Technical challenges
- Massive ingestion volumes: Consistent 3TB daily ingestion with burst capacity requirements during incident investigations and deployment windows
- Performance during scale: Need to maintain query responsiveness during high-volume ingestion periods without resource contention
- Cost-effective retention: Balance between immediate access to recent data and economical long-term storage for compliance and analysis
- Growth accommodation: Platform architecture needed to support monthly growth exceeding 15% without performance degradation
Operational requirements
The platform required automated lifecycle management to eliminate manual intervention as data volumes grew. The traditional approach of storing all data on high-performance storage proved economically unsustainable at scale.
Resilience became paramount as the observability platform itself became a critical piece of infrastructure. Any downtime or data loss would blind engineering teams to production issues, creating a cascade of operational risks.
The organisation also required least-privilege access controls integrated with existing SSO infrastructure, enabling secure self-service access for engineering teams whilst maintaining audit compliance.

The solution
We designed a sophisticated multi-AZ Amazon OpenSearch Service architecture that separated concerns across dedicated node roles, implemented intelligent data lifecycle management, and automated resilience operations.
Core architecture components
The implementation utilised purpose-built node configurations optimised for specific workload characteristics:
Cluster management layer:
- Master nodes on m5.xlarge instances, providing stable cluster state management and quorum operations
- Dedicated coordinator nodes to offload query parsing and aggregation, reducing pressure on data nodes during complex analytical queries
Storage tier architecture:
- Hot data nodes backed by gp3 storage, tuned specifically for high-throughput ingestion and low-latency queries on recent data
- 45 UltraWarm nodes providing economical historical search capabilities for older log data
- Strategic tier separation enabling performance optimisation without cost penalties
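As a sanity check on the tier sizing above, the figures quoted in this case study imply roughly 96TB resident in the warm tier, or a little over 2TB per UltraWarm node. A back-of-envelope sketch that ignores replica overhead and index compression:

```python
# Rough capacity check for the warm tier, using the figures quoted in
# this case study (3 TB/day ingestion, 32 warm days, 45 UltraWarm nodes).
# Replica overhead and compression are deliberately ignored here.

DAILY_INGEST_TB = 3        # sustained daily ingestion
WARM_RETENTION_DAYS = 32   # days held in UltraWarm
ULTRAWARM_NODES = 45       # warm node count

warm_data_tb = DAILY_INGEST_TB * WARM_RETENTION_DAYS  # 96 TB resident in warm
tb_per_node = warm_data_tb / ULTRAWARM_NODES          # ~2.13 TB per node

print(f"Warm tier holds ~{warm_data_tb} TB, ~{tb_per_node:.2f} TB per node")
```

Numbers like these are a starting point for node sizing, not a substitute for measuring actual on-disk index sizes.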
Intelligent lifecycle management
The breakthrough in cost optimisation came through sophisticated Index State Management policies that automated data transitions across storage tiers:
ISM lifecycle strategy:
- Hot tier: 4 days on high-performance gp3-backed nodes for rapid troubleshooting and real-time analysis
- UltraWarm tier: 32 days on cost-optimised storage for historical analysis and compliance retention
- Automated deletion: data removed after 36 days of total retention to control storage growth
Resilience engineering: The policies incorporated 10 retries with 12-hour delays between transition attempts, ensuring reliable data movement even during peak ingestion periods when cluster resources experienced contention.
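The exact policy isn't reproduced in this case study, but an ISM policy implementing the strategy above could look roughly like the following sketch. The `logs-*` index pattern and the policy description are illustrative assumptions; the tier timings and retry settings come from the figures quoted here.

```python
import json

# Hedged sketch of an ISM policy matching the strategy described above:
# 4 hot days, migration to UltraWarm, deletion at 36 days total retention,
# and 10 retries at 12-hour intervals on each action.
ism_policy = {
    "policy": {
        "description": "Hot for 4d, UltraWarm until 36d, then delete",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "warm", "conditions": {"min_index_age": "4d"}}
                ],
            },
            {
                "name": "warm",
                "actions": [
                    {
                        # Retry settings per the case study: 10 attempts, 12h apart
                        "retry": {"count": 10, "backoff": "constant", "delay": "12h"},
                        "warm_migration": {},  # migrate the index to UltraWarm
                    }
                ],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "36d"}}
                ],
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {"count": 10, "backoff": "constant", "delay": "12h"},
                        "delete": {},
                    }
                ],
                "transitions": [],
            },
        ],
        # Index pattern is illustrative, not from the engagement
        "ism_template": [{"index_patterns": ["logs-*"], "priority": 100}],
    }
}

print(json.dumps(ism_policy, indent=2))
```

A policy like this would be applied with `PUT _plugins/_ism/policies/<policy_id>`; the `min_index_age` conditions are measured from index creation, which is why the delete transition uses the total 36-day figure rather than 32.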
Performance optimisation
Beyond storage tiering, we implemented comprehensive performance tuning addressing specific observability workload patterns:
Query optimisation:
- Auto Tune enabled for continuous performance adjustment based on workload characteristics
- indices.query.bool.max_clause_count tuned to accommodate complex filter queries common in log analysis
- Coordinator nodes preventing query overhead from impacting ingestion performance
Operational resilience:
- Encrypted hourly snapshots to Amazon S3, providing point-in-time recovery capabilities
- Automated red index recovery procedures, reducing the mean time to recovery
- Comprehensive monitoring through CloudWatch metrics and logs with alerts to Slack via Amazon SNS and AWS Chatbot
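The recovery automation itself isn't published here, but its core decision step can be sketched as pure logic over the parsed output of `GET _cluster/health?level=indices`. The repository name and restore parameters below are illustrative assumptions, not details from this engagement:

```python
# Sketch of the decision logic behind automated red index recovery.
# Assumes a helper elsewhere fetches and parses
# GET _cluster/health?level=indices from the cluster.

def red_indices(cluster_health):
    """Return names of indices reporting red status from parsed
    _cluster/health?level=indices output."""
    return [name for name, info in cluster_health.get("indices", {}).items()
            if info.get("status") == "red"]

def restore_request(index, repository="hourly-snapshots"):
    """Build a snapshot-restore body for one red index. In practice the
    red index would be deleted first, then restored from the most recent
    snapshot in `repository` (an illustrative name)."""
    return {
        "indices": index,
        "include_global_state": False,
    }

# Example against a synthetic health response:
health = {"indices": {"logs-2024.05.01": {"status": "red"},
                      "logs-2024.05.02": {"status": "green"}}}
print(red_indices(health))  # ['logs-2024.05.01']
```

With hourly snapshots, logic like this bounds data loss on a red index to roughly one hour of writes.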
Security and access management
Least-privilege security architecture integrated seamlessly with existing identity infrastructure:
- Okta SAML SSO with multi-factor authentication for secure access
- Role-based access control enabling self-service whilst maintaining security boundaries
- Infrastructure as Code using HashiCorp Terraform, ensuring consistent, auditable deployments
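Role-based access control in OpenSearch is typically wired up by mapping IdP groups onto security-plugin roles via `PUT _plugins/_security/api/rolesmapping/<role>`. The mappings below are a hypothetical sketch; the role names and Okta group names are illustrative, not from this engagement:

```python
import json

# Illustrative least-privilege role mappings: engineers get read-only
# self-service access to logs, while only the platform team holds
# administrative rights. Role and group names are assumptions.
role_mappings = {
    "logs_read_only": {"backend_roles": ["okta-engineering"]},
    "all_access": {"backend_roles": ["okta-platform-admins"]},
}

for role, mapping in role_mappings.items():
    print(f"PUT _plugins/_security/api/rolesmapping/{role}")
    print(json.dumps(mapping))
```

Managing these mappings in Terraform alongside the domain keeps access changes auditable through the same review process as infrastructure changes.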
The result and business impact
Three months after deployment, our implementation delivered measurable improvements across cost, performance, and operational efficiency.
Performance and cost achievements
- Zero query timeouts during burst ingestion periods
- Stable ingestion at approximately 3TB daily, with headroom to absorb the 15%+ monthly growth requirement
- Approximately 65% reduction in long-term storage costs through UltraWarm and automated lifecycle management
- Faster dashboard rendering, enabling real-time troubleshooting during incidents
Operational excellence
Hourly encrypted snapshots, combined with automated recovery procedures, significantly reduced the mean time to recovery. Automated ISM policies and Auto Tune optimisation minimised manual intervention requirements whilst comprehensive CloudWatch integration enabled proactive capacity management.
Business value realisation
Extended 36-day retention became economically feasible, supporting audit requirements without budget constraints. The elimination of query timeouts enhanced incident response capabilities, while the platform architecture demonstrated the ability to support aggressive growth projections without performance degradation.

Key learnings and best practices
Coordinator node impact
Dedicated coordinator nodes delivered material performance improvements by offloading query parsing and aggregation from data nodes. This separation proved especially valuable during complex analytical queries across large time ranges, preventing query overhead from impacting ingestion performance.
Auto Tune with analysis
Enabling Auto Tune provided baseline optimisation, but combining it with slow query log analysis produced superior results. Regular review of query patterns enabled targeted index configuration adjustments that Auto Tune alone couldn't address.
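One way to pair Auto Tune with slow query log review is to rank queries by `took_millis` from the search slow logs. A minimal sketch, using an abbreviated sample line format (real slow-log lines carry more fields):

```python
import re

# Rank search slow-log entries by took_millis so the heaviest queries
# can be reviewed alongside Auto Tune's adjustments.
SLOWLOG = re.compile(r"\[(?P<index>[^\]]+)\]\[\d+\].*took_millis\[(?P<ms>\d+)\]")

def slowest(lines, top=3):
    """Return up to `top` (took_millis, index) pairs, slowest first."""
    hits = []
    for line in lines:
        m = SLOWLOG.search(line)
        if m:
            hits.append((int(m.group("ms")), m.group("index")))
    return sorted(hits, reverse=True)[:top]

# Abbreviated, synthetic slow-log lines for illustration:
sample = [
    "[logs-2024.05.01][0] took[12.3s], took_millis[12300], source[...]",
    "[logs-2024.05.02][1] took[800ms], took_millis[800], source[...]",
]
print(slowest(sample))  # [(12300, 'logs-2024.05.01'), (800, 'logs-2024.05.02')]
```

Reviewing the slowest queries per index reveals missing filters or oversized aggregations that no automatic tuner can fix on its own.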
Resilience through automation
Hourly snapshots initially seemed excessive but proved invaluable when optimising for cost. The ability to rapidly recover from red index states or misconfigured lifecycle policies enabled aggressive cost optimisation without operational risk.
Tested recovery runbooks transformed theoretical backup capabilities into practical operational confidence, enabling faster incident resolution.
Lifecycle calibration importance
ISM policies require careful calibration based on real traffic patterns rather than theoretical models. We iterated timing and shard sizing using actual workload data, discovering that generic recommendations often missed workload-specific optimisation opportunities.
Starting conservatively with longer hot retention and gradually optimising based on query patterns proved more reliable than an aggressive initial configuration.
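Shard sizing from real workload data can start from simple arithmetic: divide daily ingestion by a target shard size and round up. The 50GB target below follows common OpenSearch shard-sizing guidance rather than a figure from this engagement:

```python
import math

# Rough primary-shard count for one day's index, from the case study's
# ~3 TB/day ingestion figure. The 50 GB target shard size is general
# guidance (commonly 10-50 GB per shard), not a number from this project.
daily_ingest_gb = 3 * 1024   # ~3 TB/day
target_shard_gb = 50         # upper end of common guidance

primaries_per_daily_index = math.ceil(daily_ingest_gb / target_shard_gb)
print(primaries_per_daily_index)  # 62
```

A figure like this is only the starting point; per-index ingestion varies, so the iteration against actual workload data described above remains essential.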
Looking forward
This implementation demonstrates how sophisticated architecture and intelligent automation can resolve the traditional tension between observability depth and operational cost at scale. Purpose-built node roles enable independent optimisation of different workload characteristics, whilst intelligent data lifecycle management transforms retention economics.
Future enhancements will focus on integrating machine learning for predictive capacity planning and automating shard sizing based on ingestion patterns. The platform's architecture provides proven scalability beyond current requirements, positioning the organisation for continued rapid growth without observability constraints.

Need a resilient log analytics platform? Let's discuss your requirements.
Contact Adaptavist today to discuss how we can help.