High-scale observability on Amazon OpenSearch with ISM and UltraWarm
When a leading SaaS provider needed to scale their log analytics platform to handle multi-terabyte daily ingestion, they faced the classic observability challenge: balancing rapid search performance on recent data with cost-effective historical retention. Our multi-AZ Amazon OpenSearch solution transformed their monitoring capabilities whilst dramatically reducing long-term storage costs.

Requirements at a glance
- Cost-optimised, resilient log analytics platform for multi-terabyte daily ingestion
- Rapid search performance on recent data with affordable historical retention
- Least-privilege SSO integration with role-based access control
- Automated recovery capabilities with minimal manual intervention
- Support for growth above 15% monthly, with consistent performance during burst ingestion

Industry: Software-as-a-Service provider

Solution: Amazon OpenSearch Service with ISM and UltraWarm

Result: 65% reduction in long-term storage costs

Key metrics: 3TB daily ingestion | 36-day retention | Zero query timeouts
Summary
A UK-based SaaS provider processing approximately 3TB of log data daily required a robust observability platform that could scale with aggressive growth projections, whilst maintaining cost efficiency. The organisation needed rapid access to recent logs for troubleshooting whilst retaining historical data for compliance and trend analysis.
Working with our infrastructure specialists, the organisation implemented a multi-AZ Amazon OpenSearch Service platform with sophisticated data lifecycle management. Through the strategic use of hot data nodes, UltraWarm storage tiers, and Index State Management policies, we delivered a solution that reduced long-term storage costs by approximately 65%, while eliminating query timeouts and improving overall system resilience.
The organisation's observability requirements presented a complex balancing act between performance, cost, and operational resilience as its platform scaled rapidly.
The challenge
Technical challenges
- Massive ingestion volumes: Consistent 3TB daily ingestion with burst capacity requirements during incident investigations and deployment windows
- Performance during scale: Need to maintain query responsiveness during high-volume ingestion periods without resource contention
- Cost-effective retention: Balance between immediate access to recent data and economical long-term storage for compliance and analysis
- Growth accommodation: Platform architecture needed to support monthly growth exceeding 15% without performance degradation
Operational requirements
The platform required automated lifecycle management to eliminate manual intervention as data volumes grew. The traditional approach of storing all data on high-performance storage proved economically unsustainable at scale.
Resilience became paramount as the observability platform itself became a critical piece of infrastructure. Any downtime or data loss would blind engineering teams to production issues, creating a cascade of operational risks.
The organisation also required least-privilege access controls integrated with existing SSO infrastructure, enabling secure self-service access for engineering teams whilst maintaining audit compliance.

The solution
We designed a sophisticated multi-AZ Amazon OpenSearch Service architecture that separated concerns across dedicated node roles, implemented intelligent data lifecycle management, and automated resilience operations.
Core architecture components
The implementation utilised purpose-built node configurations optimised for specific workload characteristics:
Cluster management layer:
- Master nodes on m5.xlarge instances, providing stable cluster state management and quorum operations
- Dedicated coordinator nodes to offload query parsing and aggregation, reducing pressure on data nodes during complex analytical queries
Storage tier architecture:
- Hot data nodes backed by gp3 storage, tuned specifically for high-throughput ingestion and low-latency queries on recent data
- 45 UltraWarm nodes providing economical historical search capabilities for older log data
- Strategic tier separation enabling performance optimisation without cost penalties
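As a sanity check on the tier sizing above, the figures quoted in this case study imply roughly 96TB resident in the warm tier, or a little over 2TB per UltraWarm node. A back-of-envelope sketch that ignores replica overhead and index compression:

```python
# Rough capacity check for the warm tier, using the figures quoted in
# this case study (3 TB/day ingestion, 32 warm days, 45 UltraWarm nodes).
# Replica overhead and compression are deliberately ignored here.

DAILY_INGEST_TB = 3        # sustained daily ingestion
WARM_RETENTION_DAYS = 32   # days held in UltraWarm
ULTRAWARM_NODES = 45       # warm node count

warm_data_tb = DAILY_INGEST_TB * WARM_RETENTION_DAYS  # 96 TB resident in warm
tb_per_node = warm_data_tb / ULTRAWARM_NODES          # ~2.13 TB per node

print(f"Warm tier holds ~{warm_data_tb} TB, ~{tb_per_node:.2f} TB per node")
```

Numbers like these are a starting point for node sizing, not a substitute for measuring actual on-disk index sizes.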
Intelligent lifecycle management
The breakthrough in cost optimisation came through sophisticated Index State Management policies that automated data transitions across storage tiers:
ISM lifecycle strategy:
- Hot tier: 4 days on high-performance gp3-backed nodes for rapid troubleshooting and real-time analysis
- UltraWarm tier: 32 days on cost-optimised storage for historical analysis and compliance retention
- Automated deletion: data removed after 36 days of total retention to control storage growth
Resilience engineering: The policies incorporated 10 retries with 12-hour delays between transition attempts, ensuring reliable data movement even during peak ingestion periods when cluster resources experienced contention.
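The exact policy isn't reproduced in this case study, but an ISM policy implementing the strategy above could look roughly like the following sketch. The `logs-*` index pattern and the policy description are illustrative assumptions; the tier timings and retry settings come from the figures quoted here.

```python
import json

# Hedged sketch of an ISM policy matching the strategy described above:
# 4 hot days, migration to UltraWarm, deletion at 36 days total retention,
# and 10 retries at 12-hour intervals on each action.
ism_policy = {
    "policy": {
        "description": "Hot for 4d, UltraWarm until 36d, then delete",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "warm", "conditions": {"min_index_age": "4d"}}
                ],
            },
            {
                "name": "warm",
                "actions": [
                    {
                        # Retry settings per the case study: 10 attempts, 12h apart
                        "retry": {"count": 10, "backoff": "constant", "delay": "12h"},
                        "warm_migration": {},  # migrate the index to UltraWarm
                    }
                ],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "36d"}}
                ],
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {"count": 10, "backoff": "constant", "delay": "12h"},
                        "delete": {},
                    }
                ],
                "transitions": [],
            },
        ],
        # Index pattern is illustrative, not from the engagement
        "ism_template": [{"index_patterns": ["logs-*"], "priority": 100}],
    }
}

print(json.dumps(ism_policy, indent=2))
```

A policy like this would be applied with `PUT _plugins/_ism/policies/<policy_id>`; the `min_index_age` conditions are measured from index creation, which is why the delete transition uses the total 36-day figure rather than 32.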
Performance optimisation
Beyond storage tiering, we implemented comprehensive performance tuning addressing specific observability workload patterns:
Query optimisation:
- Auto Tune enabled for continuous performance adjustment based on workload characteristics
- indices.query.bool.max_clause_count tuned to accommodate complex filter queries common in log analysis
- Coordinator nodes preventing query overhead from impacting ingestion performance
Operational resilience:
- Encrypted hourly snapshots to Amazon S3, providing point-in-time recovery capabilities
- Automated red index recovery procedures, reducing the mean time to recovery
- Comprehensive monitoring through CloudWatch metrics and logs with alerts to Slack via Amazon SNS and AWS Chatbot
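The recovery automation itself isn't published here, but its core decision step can be sketched as pure logic over the parsed output of `GET _cluster/health?level=indices`. The repository name and restore parameters below are illustrative assumptions, not details from this engagement:

```python
# Sketch of the decision logic behind automated red index recovery.
# Assumes a helper elsewhere fetches and parses
# GET _cluster/health?level=indices from the cluster.

def red_indices(cluster_health):
    """Return names of indices reporting red status from parsed
    _cluster/health?level=indices output."""
    return [name for name, info in cluster_health.get("indices", {}).items()
            if info.get("status") == "red"]

def restore_request(index, repository="hourly-snapshots"):
    """Build a snapshot-restore body for one red index. In practice the
    red index would be deleted first, then restored from the most recent
    snapshot in `repository` (an illustrative name)."""
    return {
        "indices": index,
        "include_global_state": False,
    }

# Example against a synthetic health response:
health = {"indices": {"logs-2024.05.01": {"status": "red"},
                      "logs-2024.05.02": {"status": "green"}}}
print(red_indices(health))  # ['logs-2024.05.01']
```

With hourly snapshots, logic like this bounds data loss on a red index to roughly one hour of writes.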
Security and access management
Least-privilege security architecture integrated seamlessly with existing identity infrastructure:
- Okta SAML SSO with multi-factor authentication for secure access
- Role-based access control enabling self-service whilst maintaining security boundaries
- Infrastructure as Code using HashiCorp Terraform, ensuring consistent, auditable deployments
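Role-based access control in OpenSearch is typically wired up by mapping IdP groups onto security-plugin roles via `PUT _plugins/_security/api/rolesmapping/<role>`. The mappings below are a hypothetical sketch; the role names and Okta group names are illustrative, not from this engagement:

```python
import json

# Illustrative least-privilege role mappings: engineers get read-only
# self-service access to logs, while only the platform team holds
# administrative rights. Role and group names are assumptions.
role_mappings = {
    "logs_read_only": {"backend_roles": ["okta-engineering"]},
    "all_access": {"backend_roles": ["okta-platform-admins"]},
}

for role, mapping in role_mappings.items():
    print(f"PUT _plugins/_security/api/rolesmapping/{role}")
    print(json.dumps(mapping))
```

Managing these mappings in Terraform alongside the domain keeps access changes auditable through the same review process as infrastructure changes.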
The result and business impact
Three months after deployment, our implementation delivered measurable improvements across cost, performance, and operational efficiency.
Performance and cost achievements
- Zero query timeouts during burst ingestion periods
- Stable ingestion at approximately 3TB daily, with headroom to absorb the 15%+ monthly growth requirement
- Approximately 65% reduction in long-term storage costs through UltraWarm and automated lifecycle management
- Faster dashboard rendering, enabling real-time troubleshooting during incidents
Operational excellence
Hourly encrypted snapshots, combined with automated recovery procedures, significantly reduced the mean time to recovery. Automated ISM policies and Auto Tune optimisation minimised manual intervention requirements whilst comprehensive CloudWatch integration enabled proactive capacity management.
Business value realisation
Extended 36-day retention became economically feasible, supporting audit requirements without budget constraints. The elimination of query timeouts enhanced incident response capabilities, while the platform architecture demonstrated the ability to support aggressive growth projections without performance degradation.

Key learnings and best practices
Coordinator node impact
Dedicated coordinator nodes delivered material performance improvements by offloading query parsing and aggregation from data nodes. This separation proved especially valuable during complex analytical queries across large time ranges, preventing query overhead from impacting ingestion performance.
Auto Tune with analysis
Enabling Auto Tune provided baseline optimisation, but combining it with slow query log analysis produced superior results. Regular review of query patterns enabled targeted index configuration adjustments that Auto Tune alone couldn't address.
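One way to pair Auto Tune with slow query log review is to rank queries by `took_millis` from the search slow logs. A minimal sketch, using an abbreviated sample line format (real slow-log lines carry more fields):

```python
import re

# Rank search slow-log entries by took_millis so the heaviest queries
# can be reviewed alongside Auto Tune's adjustments.
SLOWLOG = re.compile(r"\[(?P<index>[^\]]+)\]\[\d+\].*took_millis\[(?P<ms>\d+)\]")

def slowest(lines, top=3):
    """Return up to `top` (took_millis, index) pairs, slowest first."""
    hits = []
    for line in lines:
        m = SLOWLOG.search(line)
        if m:
            hits.append((int(m.group("ms")), m.group("index")))
    return sorted(hits, reverse=True)[:top]

# Abbreviated, synthetic slow-log lines for illustration:
sample = [
    "[logs-2024.05.01][0] took[12.3s], took_millis[12300], source[...]",
    "[logs-2024.05.02][1] took[800ms], took_millis[800], source[...]",
]
print(slowest(sample))  # [(12300, 'logs-2024.05.01'), (800, 'logs-2024.05.02')]
```

Reviewing the slowest queries per index reveals missing filters or oversized aggregations that no automatic tuner can fix on its own.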
Resilience through automation
Hourly snapshots initially seemed excessive but proved invaluable when optimising for cost. The ability to rapidly recover from red index states or misconfigured lifecycle policies enabled aggressive cost optimisation without operational risk.
Tested recovery runbooks transformed theoretical backup capabilities into practical operational confidence, enabling faster incident resolution.
Lifecycle calibration importance
ISM policies require careful calibration based on real traffic patterns rather than theoretical models. We iterated timing and shard sizing using actual workload data, discovering that generic recommendations often missed workload-specific optimisation opportunities.
Starting conservatively with longer hot retention and gradually optimising based on query patterns proved more reliable than an aggressive initial configuration.
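Shard sizing from real workload data can start from simple arithmetic: divide daily ingestion by a target shard size and round up. The 50GB target below follows common OpenSearch shard-sizing guidance rather than a figure from this engagement:

```python
import math

# Rough primary-shard count for one day's index, from the case study's
# ~3 TB/day ingestion figure. The 50 GB target shard size is general
# guidance (commonly 10-50 GB per shard), not a number from this project.
daily_ingest_gb = 3 * 1024   # ~3 TB/day
target_shard_gb = 50         # upper end of common guidance

primaries_per_daily_index = math.ceil(daily_ingest_gb / target_shard_gb)
print(primaries_per_daily_index)  # 62
```

A figure like this is only the starting point; per-index ingestion varies, so the iteration against actual workload data described above remains essential.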
Looking forward
This implementation demonstrates how sophisticated architecture and intelligent automation can resolve the traditional tension between observability depth and operational cost at scale. Purpose-built node roles enable independent optimisation of different workload characteristics, whilst intelligent data lifecycle management transforms retention economics.
Future enhancements will focus on integrating machine learning for predictive capacity planning and automating shard sizing based on ingestion patterns. The platform's architecture provides proven scalability beyond current requirements, positioning the organisation for continued rapid growth without observability constraints.

Need a resilient log analytics platform? Let's discuss your requirements.
Contact Adaptavist today to discuss how we can help.