As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including persistence stability, dynamic network topology changes, and resource isolation and scheduling.
This best practices document provides a standardized reference for Redis deployments in production environments. It covers full lifecycle management, from architecture selection, resource planning, and client integration to observability and operations. By following this guide, users can build an enterprise-grade Redis data service with high availability (HA), high performance, and maintainability.
The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements:
Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.
Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails.
Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.
Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance.
Note that multi-key commands (e.g., MGET) are restricted by slot distribution.

When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity.
| Feature | Sentinel Mode | Cluster Mode |
|---|---|---|
| Scenarios | Small/Medium business, Read-heavy/Write-light, moderate data | Large business, High concurrency R/W, massive data |
| High Availability | Via Sentinel monitoring and auto-failover | Via node auto-failure detection and recovery |
| Scalability | Vertical (Scale-up), Horizontal (Read-only) | Horizontal (R/W), supports dynamic resharding |
| Read/Write Separation | Supported (Client support required) | Supported (Usually direct connection to shard master, client support required) |
| Data Sharding | None (Single node stores full data) | Yes (Data auto-sharded across multiple nodes) |
| Ops Complexity | Lower, simple architecture | Higher, involves sharding, hash slots, migration |
| Client Protocol | Requires client support for the Sentinel protocol | Requires client support for the Cluster protocol |
Recommendations:
Alauda Cache Service for Redis OSS currently supports 5.0, 6.0, and 7.2 stable versions. All three versions have undergone complete automated testing and production verification.
For new deployments, we strongly recommend choosing Redis 7.2:
Lifecycle
- 5.0 / 6.0: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications.
- 7.2: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come.

Compatibility
- 7.2 maintains high compatibility with 5.0 and 6.0 data commands; most business code can migrate smoothly without modification.
- Persistence files are not backward compatible (data files generated by 7.2 cannot be loaded by 6.0), but this does not affect new services.

Key Features
For more details on Redis 7.2 features, please refer to the official Redis 7.2 Release Notes.
To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level:
Memory Allocation (vm.overcommit_memory)
Recommended value: 1 (Always). This ensures the kernel allows memory allocation during Redis fork operations (RDB snapshot / AOF rewrite) even if physical memory appears insufficient, effectively preventing persistence failures caused by allocation errors.

Connection Queue (net.core.somaxconn)
Recommended value: 2048 or higher. net.core.somaxconn should be increased to avoid dropping client connection requests.

Transparent Huge Pages (THP)
Recommended value: never (disabled). THP inflates copy-on-write costs during fork and can cause latency spikes, so Redis recommends disabling it.
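How these settings are applied depends on your environment; one common approach is a privileged DaemonSet that tunes each node at startup. The sketch below is an illustration under that assumption only (namespace, image, and labels are placeholders); a node bootstrap script or your distribution's node-tuning mechanism works equally well.

```yaml
# Sketch only: applies the recommended kernel settings on every node.
# Namespace, image, and labels are assumptions; adapt to your environment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redis-node-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: redis-node-tuning
  template:
    metadata:
      labels:
        app: redis-node-tuning
    spec:
      hostNetwork: true            # so net.core.* applies to the host network namespace
      initContainers:
        - name: sysctl
          image: busybox:1.36
          securityContext:
            privileged: true       # required to write host sysctls and /sys
          command:
            - sh
            - -c
            - |
              sysctl -w vm.overcommit_memory=1
              sysctl -w net.core.somaxconn=2048
              echo never > /sys/kernel/mm/transparent_hugepage/enabled
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # keeps the Pod running after tuning
```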
Redis uses a snapshot mechanism to asynchronously persist in-memory data to disk for long-term storage. This keeps Redis performant but carries a risk of data loss between snapshots. In Kubernetes containerized environments, we therefore recommend a tiered memory management strategy:
While single instances on physical machines often run with 32 GB of memory or more, the 8 GB limit in cloud-native environments follows from these core technical constraints:
Fork Blocking & Page Table Copy
Redis calls fork() during RDB snapshots and AOF rewrites. Although memory pages are shared via copy-on-write (CoW), the process page tables must be fully copied, blocking the main thread; the larger the instance, the longer the pause.

Failure Recovery Efficiency (RTO)
To avoid the container being OOM-killed during persistence due to memory expansion, adhere strictly to the following principles:

- Never set maxmemory to 100% of the container memory limit; set it to 70% ~ 80% of the limit.
- Set vm.overcommit_memory = 1 so the kernel can fork without reserving an equivalent amount of physical memory (relying on CoW), preventing fork failures.
- Resource reservation formula: Container_Memory_Limit ≈ Redis_MaxMemory / 0.7
Redis core command execution is single-threaded, but persistence (fork) and other background operations spawn child processes. Therefore, allocate at least 2 cores per Redis instance.
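As a rough illustration of the 2-core minimum and the 70% memory rule, the fragment below sketches how the resource limits and maxmemory could line up. The exact CR kind and field layout are assumptions; spec.customConfig is the parameter mechanism described later in this guide.

```yaml
# Sketch only: instance CR fragment (field layout assumed).
spec:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "2"            # at least 2 cores: main thread plus fork/background work
      memory: 8Gi         # container memory limit
  customConfig:
    maxmemory: "5734mb"   # ~70% of the 8Gi limit (8192Mi * 0.7 ≈ 5734Mi)
```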
Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks.
When to Enable?
Configuration Best Practices:
Note that multi-threaded I/O parallelizes only network reads and writes; command execution (e.g., SORT, KEYS) remains on the single main thread.
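If you do enable multi-threaded I/O, the parameters would typically be set through spec.customConfig, as sketched below (field layout assumed; per the parameter table further down, io-threads and io-threads-do-reads require an instance restart).

```yaml
# Sketch only: enable multi-threaded I/O (restart required).
spec:
  customConfig:
    io-threads: "4"              # keep below the number of CPU cores available to the Pod
    io-threads-do-reads: "yes"   # also offload reads to the I/O threads
```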
Persistence mode directly determines disk quota requirements. Refer to the following calculation formula:

| Mode | Recommended Quota Formula | Details |
|---|---|---|
| Diskless (Cache) | 0 (No PVC) | Used as pure cache, no RDB/AOF. Logs collected via stdout in K8s, no persistence disk needed. |
| RDB (Snapshot) | MaxMemory * 2 | RDB uses CoW. During snapshot generation, both "old snapshot" and "new snapshot being written" exist on disk. Recommendation: Reserve at least 2x memory space. |
| AOF (Append Only) | MaxMemory * 3 | AOF grows with write operations. The default config (auto-aof-rewrite-percentage 100) triggers a rewrite when the AOF reaches 2x the data size. The disk must hold: 1. the old AOF file (2x); 2. the new AOF file produced by the rewrite (1x); peak total 3x. Recommend reserving at least 3x space. |
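As a worked example: for an instance with maxmemory set to 8 GiB, reserve no PVC for the Diskless template, at least 16 GiB for the RDB template, and at least 24 GiB for the AOF template.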
For AOF persistence, the recommended fsync policy is the default (appendfsync everysec).

Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields.
Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance.
| Template Name | Description | Scenarios | Risks |
|---|---|---|---|
| rdb-redis-<version>-<sentinel|cluster> | Enables RDB persistence, periodic snapshots to disk. | Balanced: Limited resources, balances performance/reliability, accepts minute-level data loss. | Data loss depends on save config, usually minute-level RPO. |
| aof-redis-<version>-<sentinel|cluster> | Enables AOF persistence, logs every write op. | Secure: Ample resources, high data security (second-level loss), slight performance compromise. | Frequent fsync requires high-performance storage, high IO pressure. |
| diskless-redis-<version>-<sentinel|cluster> | Disables persistence, pure in-memory. | High-Perf Cache: Acceleration only, data loss acceptable or rebuildable from source. | Restart or failure leads to full data loss. |
`<version>` represents the Redis version, e.g., `6.0`, `7.2`.
Key parameter differences:
| Parameter | RDB Template | AOF Template | Diskless Template | Explanation |
|---|---|---|---|---|
| appendonly | no | yes | no | Enables AOF logging. |
| save | 60 10000 300 100 600 1 | "" (Disabled) | "" (Disabled) | RDB snapshot triggers. |
| repl-diskless-sync | no | no | yes | Master-replica full sync via socket without touching disk. |
| repl-diskless-sync-delay | 5 | 5 | 0 | Delay before starting diskless sync; 0 for the Diskless template to speed up sync. |
The AOF template uses appendfsync everysec, providing second-level data protection.

Redis supports running RDB and AOF together, but this is generally not recommended in Kubernetes:
Recommendation: Choose Single Persistence Mode (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space.
Redis parameters are categorized by application method:
| Category | Parameters | Behavior |
|---|---|---|
| Hot Update | Most runtime params (maxmemory, loglevel, etc.) | Immediate effect after modification, no restart. |
| Restart Update | databases, rename-command, rdbchecksum, tcp-backlog, io-threads, io-threads-do-reads | Requires Instance Restart to take effect. |
| Immutable | bind, protected-mode, port, supervised, pidfile, dir, etc. | Managed by system, modification may cause anomalies. |
Always back up data before modifying parameters that require a restart.
Update Data Node Parameters: Configure via spec.customConfig.
Update Sentinel Node Parameters: Configure via spec.sentinel.monitorConfig.
Currently supported Sentinel parameters: down-after-milliseconds, failover-timeout, parallel-syncs.
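A combined sketch of both fields follows; the surrounding CR layout is an assumption, while the spec.customConfig and spec.sentinel.monitorConfig paths are the ones described above.

```yaml
# Sketch only: data-node parameters and Sentinel monitor parameters side by side.
spec:
  customConfig:
    maxmemory-policy: "allkeys-lru"     # hot-update data-node parameter (illustrative)
    loglevel: "notice"
  sentinel:
    monitorConfig:
      down-after-milliseconds: "30000"  # time before a node is judged subjectively down
      failover-timeout: "180000"
      parallel-syncs: "1"               # replicas re-synced in parallel after a failover
```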
Deploy resources according to your actual business scenario.
| Persistence | Template | Instance Spec | Replica / Sentinel | Sentinel Pod | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AOF | aof-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (Reserve resources) | 4.5c / 4.8G | Evaluate based on actual write volume | | |
| | aof-redis-<version>-sentinel | 4c8g | | | | 4c8g | | 8.5c / 8.8G | | | |
| RDB | rdb-redis-<version>-sentinel | 2c4g | | | | 2c4g | | 4.5c / 4.8G | 8G | 28G | 28G |
| | rdb-redis-<version>-sentinel | 4c8g | | | | 4c8g | | 8.5c / 8.8G | 16G | 56G | 56G |
| Diskless | diskless-redis-<version>-sentinel | 2c4g | | | | 2c4g | | 4.5c / 4.8G | / | 28G | 28G |
| | diskless-redis-<version>-sentinel | 4c8g | | | | 4c8g | | 8.5c / 8.8G | | 56G | 56G |
| Persistence | Template | Instance Spec | Sharding / Replica | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
|---|---|---|---|---|---|---|---|---|---|---|
| AOF | aof-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (Reserve resources) | 12.6c / 25.8G | Evaluate based on actual write volume | | |
| | aof-redis-<version>-cluster | 4c8g | | | 4c8g | | 24.6c / 49.8G | | | |
| RDB | rdb-redis-<version>-cluster | 2c4g | | | 2c4g | | 12.6c / 25.8G | 24G | 84G | 84G |
| | rdb-redis-<version>-cluster | 4c8g | | | 4c8g | | 24.6c / 49.8G | 48G | 168G | 168G |
| Diskless | diskless-redis-<version>-cluster | 2c4g | | | 2c4g | | 12.6c / 25.8G | / | 84G | 84G |
| | diskless-redis-<version>-cluster | 4c8g | | | 4c8g | | 24.6c / 49.8G | | 168G | 168G |
`<version>` represents the Redis version, e.g., `6.0`, `7.2`.
Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments.
You can use the spec.nodeSelector field to specify which nodes Redis Pods should be scheduled on. This is typically used with Kubernetes Node Labels to isolate database workloads to dedicated node pools.
Persistence Limitation: If your Redis instance mounts Non-Network Storage (e.g., Local PV) PVCs, be cautious when updating nodeSelector. Since local data resides on specific nodes and cannot migrate with Pods, the updated nodeSelector set MUST include the node where the Pod currently resides. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction.
Use spec.tolerations to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., key=redis:NoSchedule), preventing other non-critical workloads from preempting resources.
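A hedged sketch combining both mechanisms is shown below; the node label is illustrative, the toleration matches the example taint key=redis:NoSchedule from this section, and the surrounding CR layout is an assumption.

```yaml
# Sketch only: pin Redis Pods to a dedicated node pool and tolerate its taint.
spec:
  nodeSelector:
    node-role/redis: "true"   # illustrative label on the dedicated nodes
  tolerations:
    - key: "key"              # matches the example taint key=redis:NoSchedule
      operator: "Equal"
      value: "redis"
      effect: "NoSchedule"
```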
To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode.
Immutable: To ensure consistency and reliability, anti-affinity configurations (both affinityPolicy and affinity) cannot be modified after instance creation. Please plan ahead.
In Cluster mode, the system prioritizes spec.affinityPolicy. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet.
- Priority: spec.affinityPolicy > spec.affinity.
- When affinityPolicy is unset: Alauda Cache Service for Redis OSS checks spec.affinity. If you need custom topology rules beyond the enums below, leave affinityPolicy empty and configure native spec.affinity.

| Policy Name | affinityPolicy Value | Behavior | Pros/Cons | Scenario |
|---|---|---|---|---|
| All Pods Forced Anti-Affinity | AntiAffinity | Forces ALL Pods in the cluster (including primaries/replicas of different shards) onto different nodes. | Scheduling fails if node count < total Pod count. | Cluster mode core business: ample resources, strict HA requirements. |
| Shard Primary-Replica Forced Anti-Affinity | AntiAffinityInSharding | Forces the primary and replicas within the same shard onto different nodes. | Pods from different shards can coexist on a node. | Production standard: balances resource usage and data safety. |
| Shard Primary-Replica Soft Anti-Affinity | SoftAntiAffinity | Prioritizes spreading a shard's primary/replicas across nodes. | If impossible (e.g., insufficient nodes), allows scheduling on the same node. | Test/Dev environments, or resource-constrained edge environments. |
Important: Sentinel mode does not support spec.affinityPolicy.
For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules:
- Data nodes: configured via spec.affinity.
- Sentinel nodes: configured via spec.sentinel.affinity.

You need to manually write complete Affinity rules for each.
To force anti-affinity for all Pods (Data + Sentinel), refer to the example below.
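This is a minimal sketch of hard anti-affinity for both node groups; the Pod label keys/values are assumptions and must match the labels your instance Pods actually carry.

```yaml
# Sketch only: required (hard) anti-affinity for data nodes and Sentinel nodes.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: redis            # assumed data-node Pod label
          topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: redis-sentinel # assumed Sentinel Pod label
            topologyKey: kubernetes.io/hostname
```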
Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via RedisUser CRD, supporting ACLs.
Compatibility: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control.
The platform pre-defines permission profiles for common scenarios:
| Profile | ACL Rule | Explanation |
|---|---|---|
| NotDangerous | +@all -@dangerous ~* | Allows all commands except dangerous ones (e.g., FLUSHDB). |
| ReadWrite | -@all +@write +@read -@dangerous ~* | Allows read/write, blocks dangerous ops. |
| ReadOnly | -@all +@read -keys ~* | Allows read-only operations. |
| Administrator | +@all -acl ~* | Admin privileges, allows all commands except ACL management. |
For custom ACLs, see Redis ACL Documentation.
- RedisUser creation and updates undergo Webhook validation that forcibly removes the acl permission, preventing privilege escalation.
- The subcommands cluster|slots, cluster|nodes, cluster|info, cluster|keyslot, cluster|getkeysinslot, and cluster|countkeysinslot are automatically permitted to ensure client awareness of the cluster topology.
- The &* (Pub/Sub channel) permission is automatically included to ensure consistency with the new channel ACLs in 7.x.

Each Redis instance automatically generates a system account named operator. Its roles include:
Recommendations:

- ReadOnly.
- ReadWrite.
- NotDangerous or custom permissions.
- Administrator: avoid unless absolutely necessary.
- Use key patterns (e.g., ~app1:*) to restrict applications to specific key prefixes.

For operation steps, see the User Management documentation.
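For orientation only, here is a rough sketch of a RedisUser resource granting ReadWrite-style access restricted to one key prefix. Apart from the RedisUser kind itself, the API group, every field name, and the Secret reference below are assumptions; consult the User Management documentation for the real schema.

```yaml
# Sketch only: all field names and the apiVersion are assumptions.
apiVersion: redis.middleware.alauda.io/v1     # assumed API group/version
kind: RedisUser
metadata:
  name: app1-readwrite
spec:
  accountName: app1                           # assumed: login name for the application
  redisName: my-redis-instance                # assumed: target Redis instance
  aclRules: "-@all +@write +@read -@dangerous ~app1:*"   # ReadWrite profile limited to app1:* keys
  passwordSecrets:
    - app1-redis-password                     # assumed: Secret holding the password
```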
Both Sentinel and Cluster modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes:
- Sentinel mode: the client queries SENTINEL get-master-addr-by-name mymaster to obtain the master IP/Port.
- Cluster mode: the client queries CLUSTER SLOTS / CLUSTER NODES to obtain the slot distribution; on redirection the server returns MOVED/ASK, and the client must refresh its topology.

Both protocols return real node IPs. If a reverse proxy (HAProxy/Nginx) is used, clients still receive backend real IPs, which may be unreachable from outside the cluster. Therefore, each Redis Pod needs an independent external address (NodePort/LoadBalancer), not a single proxy address.
Alauda Cache Service for Redis OSS supports multiple access methods:
| Method | Recommended | Description |
|---|---|---|
| ClusterIP | ✅ Internal Preferred | Access Sentinel via K8s Service (port 26379). Clients auto-discover Master. Lowest latency, highest security. |
| LoadBalancer | ✅ External Preferred | Exposes Sentinel via MetalLB/Cloud LB. Stable external entry, no port management. |
| NodePort | ⚠️ External Backup | Exposes Sentinel via Node ports. Requires manual port management, risky, potential multi-NIC binding issues. |
| Method | Recommended | Description |
|---|---|---|
| ClusterIP | ✅ Internal Preferred | Access via K8s Service. Client must support Cluster protocol. |
| LoadBalancer | ✅ External Preferred | Configure LB for each shard Master. Stable external access. Client must handle MOVED/ASK. |
| NodePort | ⚠️ External Backup | Expose underlying Pod NodePorts. Client connects directly. Complex port management. |
We provide best practice examples for go-redis, Jedis, Lettuce, and Redisson:
Master Group Name: In Sentinel mode, the master name is fixed to mymaster.
Timeouts
Retry Strategy
Connection Pooling
Set MaxTotal (maximum pool size) reasonably to avoid hitting the Redis maxclients limit.

Topology Refresh (Cluster)
Enable the client's cluster topology refresh and MOVED/ASK handling.

The platform Backup Center provides convenient data management. You can back up instances, manage them centrally, and offload backups to S3. Restoring historical backups to specific instances is also supported.
See Backup & Restore.
See Upgrade.
When changing specs (CPU/Mem) or expanding:
When reducing replicas or specs, ensure the current data volume and load fit the new specs to avoid data loss or crashes.
Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus.
Variables {{.namespace}} and {{.name}} should be replaced with actual values.
Pay particular attention to connection counts approaching maxclients.

Recommended production alerts:
| Metric | Threshold | Note |
|---|---|---|
| Memory Usage | > 80% | Risk of eviction/OOM. |
| CPU Usage | > 80% (Sustained) | Latency spikes. |
| Hit Rate | < 80% | Caching strategy issue or insufficient capacity. |
| Failovers | > 0 | Check network/node health. |
| Connections | Near maxclients | New connections rejected. |
| Storage Usage | > 80% | Ensure space for AOF/RDB. |
| Response Time | > 10ms | Slow queries/bottlenecks. |
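As an illustration of wiring these thresholds into Prometheus, the sketch below expresses the memory alert as a PrometheusRule. The metric names assume the standard redis_exporter; labels and thresholds should be adapted to your environment.

```yaml
# Sketch only: memory-usage alert based on standard redis_exporter metrics.
# Note: redis_memory_max_bytes is 0 when maxmemory is unset.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-memory-alerts
spec:
  groups:
    - name: redis.rules
      rules:
        - alert: RedisMemoryUsageHigh
          expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis memory usage above 80% (risk of eviction/OOM)"
```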
For specific issues, search the Customer Portal.