Best Practices

Overview

As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including persistence stability, dynamic network topology changes, and resource isolation and scheduling.

This Best Practices document aims to provide a standardized reference guide for Redis deployments in production environments. It covers the full lifecycle, from architecture selection, resource planning, and client integration through observability and operations. By following this guide, users can build an enterprise-class Redis data service that is highly available, high-performance, and maintainable.

Architecture Selection

The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements:

Sentinel Mode

Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.

Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails.

  • Pros: Simple architecture, mature operations, lower requirements for client protocols.
  • Cons: Write capacity is limited to a single node; storage capacity cannot scale horizontally.

Cluster Mode

Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.

Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance.

  • Pros: Truly distributed, highly available storage; supports dynamic resharding.
  • Cons: More complex client protocol; multi-key commands (e.g., MGET) are restricted by slot distribution.

Selection Guide

When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity.

| Feature | Sentinel Mode | Cluster Mode |
| --- | --- | --- |
| Scenarios | Small/medium business, read-heavy/write-light, moderate data volume | Large business, high-concurrency reads/writes, massive data |
| High Availability | Via Sentinel monitoring and automatic failover | Via built-in node failure detection and recovery |
| Scalability | Vertical (scale-up); horizontal for read-only replicas | Horizontal (read/write); supports dynamic resharding |
| Read/Write Separation | Supported (client support required) | Supported (usually direct connection to shard masters; client support required) |
| Data Sharding | None (a single node stores the full dataset) | Yes (data auto-sharded across multiple nodes) |
| Ops Complexity | Lower; simple architecture | Higher; involves sharding, hash slots, and slot migration |
| Client Protocol | Requires client support for the Sentinel protocol | Requires client support for the Cluster protocol |

Recommendations:

  • If data volume is small (fits in single node memory) and simplicity/stability is priority, Sentinel Mode is preferred.
  • If data volume is massive or write pressure is extremely high and cannot be supported by a single node, choose Cluster Mode.

Version Selection

Alauda Cache Service for Redis OSS currently supports 5.0, 6.0, and 7.2 stable versions. All three versions have undergone complete automated testing and production verification.

For new deployments, we strongly recommend choosing Redis 7.2:

  1. Lifecycle

    • 5.0 / 6.0: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications.
    • 7.2: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come.
  2. Compatibility

    • Redis 7.2 maintains high compatibility with 5.0 and 6.0 data commands. Most business code can migrate smoothly without modification.
    • Note: RDB persistence file format (v11) is not backward compatible (i.e., RDB generated by 7.2 cannot be loaded by 6.0), but this does not affect new services.
  3. Key Features

    • ACL v2: Provides granular access control (Key-based permission selectors), significantly enhancing security in multi-tenant environments.
    • Redis Functions: Introduces Server-side Scripting standards, resolving issues with Lua script loss and replication, keeping logic closer to data.
    • Sharded Pub/Sub: Resolves network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding (see the example below).
    • Performance Optimization: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency.
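
For example, sharded Pub/Sub routes each channel to a hash slot on a single shard instead of broadcasting to every node. A quick illustration with redis-cli (channel and message are illustrative):

# Subscriber: the shard channel is routed to one shard, not broadcast cluster-wide
redis-cli -c SSUBSCRIBE orders:events

# Publisher (from another session)
redis-cli -c SPUBLISH orders:events "order-created:1001"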

For more details on Redis 7.2 features, please refer to the official Redis 7.2 Release Notes.

Resource Planning

Kernel Tuning

To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level:

  1. Memory Allocation (vm.overcommit_memory)

    • Recommended: 1
    • Explanation: Setting to 1 (Always) ensures the kernel allows memory allocation during Redis Fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors.
  2. Connection Queue (net.core.somaxconn)

    • Recommended: 2048 or higher
    • Explanation: Redis default tcp-backlog is 511. In high concurrency scenarios, system net.core.somaxconn should be increased to avoid dropping client connection requests.
  3. Transparent Huge Pages (THP)

    • Action: Disable (never)
    • Explanation: THP causes significant latency spikes during memory allocation in Redis, especially during Copy-on-Write (CoW) after Fork. It is recommended to disable this on the host or via startup scripts.
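
The following is a minimal sketch of applying these three settings on a node. It assumes direct shell access (in practice this is usually done via node-provisioning tooling or a privileged DaemonSet); add the sysctls to /etc/sysctl.conf to persist them across reboots:

# Allow fork() to succeed under memory pressure (relies on Copy-on-Write)
sysctl -w vm.overcommit_memory=1

# Raise the accept queue above Redis's tcp-backlog default (511)
sysctl -w net.core.somaxconn=2048

# Disable Transparent Huge Pages to avoid CoW latency spikes
echo never > /sys/kernel/mm/transparent_hugepage/enabled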

Memory Specifications

Redis persists in-memory data to disk asynchronously via snapshots for long-term storage. This keeps Redis fast, but data written between snapshots can be lost on failure.

In Kubernetes containerized environments, we recommend a tiered memory management strategy:

  • ✅ Standard Specs (< 8GB): Strongly Recommended. Ensures extremely low Fork latency and fast failure recovery (RTO < 60s); the most robust production choice.
  • ⚠️ High-Performance Specs (8GB - 16GB): Acceptable. Requires high-performance host and THP must be disabled. Fork is controllable but may cause ~100ms jitter under high load.
  • ❌ High-Risk Specs (> 16GB): Not Recommended. Single point of failure impact is too large, and full synchronization can easily saturate network bandwidth. Recommend horizontal splitting into Cluster mode.

Why Limit to 8GB?

While single instances on physical machines often run 32GB or more, the 8GB guideline for cloud-native environments rests on the following core technical constraints:

  1. Fork Blocking & Page Table Copy

    • Redis calls fork() during RDB/AOF Rewrite. Although memory pages are CoW, Process Page Tables must be fully copied, blocking the main thread.
    • Estimation: 10GB memory ≈ 20MB of page tables ≈ 10~50ms of blocking (depending on virtualization overhead). Beyond 8GB, blocking time grows rapidly and threatens the SLA.
  2. Failure Recovery Efficiency (RTO)

    • Container restart loading an RDB is a single-threaded, CPU-bound task (object deserialization). Tests show loading 8GB of data takes 30-50s even on SSD; a 32GB instance could take several minutes to start, contradicting the Kubernetes "fast self-healing" philosophy.

Memory Configuration Best Practices

To avoid the container being OOM-killed due to memory expansion during persistence, strictly adhere to the following principles:

  1. Set MaxMemory: Do not set maxmemory to 100% of the container Memory Limit. Recommend setting to 70% ~ 80% of the Limit.
  2. Reserve CoW Space: Redis Forks a child process during RDB/AOF Rewrite. If there are heavy write updates, OS Copy-on-Write mechanisms duplicate memory pages; in extreme cases, memory usage can double from 8GB to 16GB.
  3. Overcommit Config: Ensure host vm.overcommit_memory = 1 to allow kernel forks without requesting equivalent physical memory (relying on CoW), preventing fork failures.
INFO

Resource Reservation Formula: Container_Memory_Limit ≥ Redis_MaxMemory / 0.7

  • Example: To store 8GB data, configure Container Memory Limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead.
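
For instance, a minimal sketch of the container resources for an instance with maxmemory 8gb, following the formula above (field names are standard Kubernetes; values rounded up):

# 8GB / 0.7 ≈ 11.5GB -> round up to 12Gi
resources:
  requests:
    memory: 12Gi
  limits:
    memory: 12Gi   # ~4Gi of headroom for CoW and fragmentation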

CPU Resources

Redis core command execution is single-threaded, but persistence (Fork) and other operations require child processes. Therefore, allocate at least 2 Cores per Redis instance:

  • Core 1: Handles main thread requests and commands.
  • Core 2: Handles persistence fork, background tasks, and system overhead.

Multi-threading

Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks.

  • When to Enable?

    • Bottleneck Analysis: When Redis CPU usage nears 100% and profiling shows the time is spent on kernel-space network I/O (system CPU) rather than user-space command execution.
    • Traffic Profile: Typically beneficial when single instance QPS > 80,000 or network traffic is huge (> 1GB/s).
    • Resource Conditions: Ensure node has sufficient CPU cores (at least 4 cores).
  • Configuration Best Practices:

    • Thread Count: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gain.
    • Config Example:
      io-threads 4
      io-threads-do-reads yes
    • Note: Multi-threaded I/O only improves network throughput; it does NOT improve execution speed of single complex commands (e.g., SORT, KEYS).

Storage Planning

Capacity Planning

Persistence mode directly determines disk quota requirements. Refer to the following calculation formula:

| Mode | Recommended Quota Formula | Details |
| --- | --- | --- |
| Diskless (Cache) | 0 (no PVC) | Pure cache, no RDB/AOF. Logs are collected via stdout in Kubernetes; no persistence disk needed. |
| RDB (Snapshot) | MaxMemory * 2 | During snapshot generation, the old snapshot and the new snapshot being written coexist on disk. Reserve at least 2x memory. |
| AOF (Append Only) | MaxMemory * 3 | The AOF grows with write operations; the default auto-aof-rewrite-percentage 100 triggers a rewrite once the AOF reaches 2x the data size. The disk must hold the old AOF (2x) plus the new rewritten AOF (1x), a peak of 3x. Reserve at least 3x memory. |
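
Worked example: for an instance with maxmemory = 4GB, reserve a PVC of at least 8G under the RDB template and at least 12G under the AOF template, per the formulas above.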

Performance Requirements

  • With AOF: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when appendfsync everysec).
  • Media: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks.

Parameter Configuration

Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields.

Built-in Templates

Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance.

| Template Name | Description | Scenarios | Risks |
| --- | --- | --- | --- |
| rdb-redis-<version>-<sentinel\|cluster> | Enables RDB persistence; periodic snapshots to disk. | Balanced: limited resources, balances performance and reliability, accepts minute-level data loss. | Data-loss window depends on the save config; usually minute-level RPO. |
| aof-redis-<version>-<sentinel\|cluster> | Enables AOF persistence; logs every write operation. | Secure: ample resources, high data-security needs (second-level loss), slight performance trade-off. | Frequent fsync requires high-performance storage; high I/O pressure. |
| diskless-redis-<version>-<sentinel\|cluster> | Disables persistence; pure in-memory. | High-Perf Cache: acceleration only; data loss acceptable or rebuildable from the source. | Restart or failure leads to full data loss. |

<version> represents Redis version, e.g., 6.0, 7.2.

Key parameter differences:

| Parameter | RDB Template | AOF Template | Diskless Template | Explanation |
| --- | --- | --- | --- | --- |
| appendonly | no | yes | no | Enables AOF logging. |
| save | 60 10000 300 100 600 1 | "" (disabled) | "" (disabled) | RDB snapshot triggers. |
| repl-diskless-sync | no | no | yes | Master-replica full sync streamed over the socket, without touching disk. |
| repl-diskless-sync-delay | 5 | 5 | 0 | Delay before starting a diskless sync; 0 in the Diskless template to speed up syncs. |

Persistence Selection Recommendations

  1. Pure Cache: Choose the Diskless template. Data is rebuildable, there is no persistence overhead, and performance is best.
  2. General Business: Choose the RDB template. Periodic snapshots provide minute-level RPO with moderate resource usage.
  3. Financial/High-Reliability: Choose the AOF template with appendfsync everysec for second-level protection.
WARNING

Redis supports running RDB and AOF together, but it is generally not recommended in Kubernetes:

  • Performance: AOF fsync creates IO pressure; adding RDB fork + disk write significantly increases resource contention.
  • Storage Doubling: Requires space for both RDB snapshots and AOF files, complicating PVC planning.
  • Recovery Priority: Redis loads AOF first on start (more complete data); RDB acts only as backup, offering limited benefit.
  • Platform Backup: Alauda Cache Service for Redis OSS provides independent auto/manual backup, removing reliance on RDB snapshots for extra insurance.

Recommendation: Choose Single Persistence Mode (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space.

Parameter Update

Redis parameters are categorized by application method:

| Category | Parameters | Behavior |
| --- | --- | --- |
| Hot Update | Most runtime params (maxmemory, loglevel, etc.) | Take effect immediately after modification; no restart. |
| Restart Update | databases, rename-command, rdbchecksum, tcp-backlog, io-threads, io-threads-do-reads | Require an instance restart to take effect. |
| Immutable | bind, protected-mode, port, supervised, pidfile, dir, etc. | Managed by the system; modification may cause anomalies. |
TIP

Always back up data before modifying parameters that require a restart.

Modification Examples

Update Data Node Parameters: Configure via spec.customConfig.

# Example: Modify save strategy (Hot update)
kubectl -n <namespace> patch redis <instance-name> --type=merge --patch='{"spec": {"customConfig": {"save":"600 1"}}}'

Update Sentinel Node Parameters: Configure via spec.sentinel.monitorConfig.

Currently supports down-after-milliseconds, failover-timeout, parallel-syncs.

# Example: Modify failover timeout
kubectl -n <namespace> patch redis <instance-name> --type=merge --patch='{"spec": {"sentinel": {"monitorConfig": {"down-after-milliseconds":"30000"}}}}'

Resource Specs

Deploy resources according to your actual business scenario.

Sentinel Mode Specs

| Persistence | Template | Instance Spec | Replica / Sentinel | Sentinel Pod | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOF | aof-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| AOF | aof-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| RDB | rdb-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | 8G | 28G | 28G |
| RDB | rdb-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | 16G | 56G | 56G |
| Diskless | diskless-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | / | 28G | 28G |
| Diskless | diskless-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | / | 56G | 56G |

Cluster Mode Specs

| Persistence | Template | Instance Spec | Shards / Replicas | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOF | aof-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| AOF | aof-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| RDB | rdb-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | 24G | 84G | 84G |
| RDB | rdb-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | 48G | 168G | 168G |
| Diskless | diskless-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | / | 84G | 84G |
| Diskless | diskless-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | / | 168G | 168G |

<version> represents Redis version, e.g., 6.0, 7.2.

Scheduling

Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments.

Node Selection

You can use the spec.nodeSelector field to specify which nodes Redis Pods should be scheduled on. This is typically combined with Kubernetes node labels to isolate database workloads onto dedicated node pools (see the combined example under Taint Toleration below).

WARNING

Persistence Limitation: If your Redis instance mounts Non-Network Storage (e.g., Local PV) PVCs, be cautious when updating nodeSelector. Since local data resides on specific nodes and cannot migrate with Pods, the updated nodeSelector set MUST include the node where the Pod currently resides. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction.

Taint Toleration

Use spec.tolerations to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., key=redis:NoSchedule), preventing other non-critical workloads from preempting resources.
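
A minimal sketch combining spec.nodeSelector (above) and spec.tolerations for a dedicated Redis node pool; the node label is hypothetical, and the toleration matches the key=redis:NoSchedule taint mentioned above:

spec:
  nodeSelector:
    dedicated: redis          # hypothetical label on the dedicated node pool
  tolerations:
  - key: key
    operator: Equal
    value: redis
    effect: NoSchedule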

Anti-Affinity

To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode.

CAUTION

Immutable: To ensure consistency and reliability, anti-affinity configurations (both affinityPolicy and affinity) cannot be modified after instance creation. Please plan ahead.

Cluster Mode

In Cluster mode, the system prioritizes spec.affinityPolicy. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet.

  • Priority: spec.affinityPolicy > spec.affinity.
  • If affinityPolicy is unset: Alauda Cache Service for Redis OSS checks spec.affinity. If you need custom topology rules beyond the enums below, leave affinityPolicy empty and configure native spec.affinity.

| Policy Name | affinityPolicy Value | Behavior | Pros / Cons | Scenario |
| --- | --- | --- | --- | --- |
| All Pods Forced Anti-Affinity | AntiAffinity | Forces ALL Pods in the cluster (including primaries/replicas of different shards) onto different nodes. Scheduling fails if node count < total Pod count. | Pros: highest disaster tolerance, minimal single-node failure impact. Cons: extremely high resource requirement; node count must be >= total Pods. | Cluster-mode core business: ample resources, strict HA requirements. |
| Shard Primary-Replica Forced Anti-Affinity | AntiAffinityInSharding | Forces the primary and replicas within the same shard onto different nodes; Pods from different shards may coexist. | Pros: guarantees physical isolation of a shard's replicas, preventing data loss from a single node failure. Cons: scheduling fails if live nodes < the shard's replica count; primaries of different shards may land on the same node (single-point risk). | Production standard: balances resource usage and data safety. |
| Shard Primary-Replica Soft Anti-Affinity | SoftAntiAffinity | Prefers spreading a shard's primary/replicas; if impossible (e.g., insufficient nodes), allows them on the same node. | Pros: highest deployment success rate; runs with limited resources. Cons: primary and replica may share a node in extreme cases, risking data loss. | Test/dev environments, or resource-constrained edge environments. |
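
For instance, a minimal sketch selecting the production-standard policy at creation time (remember that this field is immutable after creation; the enum values are those listed above):

spec:
  affinityPolicy: AntiAffinityInSharding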

Sentinel Mode

IMPORTANT

Sentinel mode does not support spec.affinityPolicy.

For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules:

  • Redis Data Nodes: Configured via spec.affinity.
  • Sentinel Nodes: Configured via spec.sentinel.affinity.

You need to write complete Affinity rules manually. The following example enforces anti-affinity among the Data nodes and among the Sentinel nodes respectively:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - redis
          - key: redisfailovers.databases.spotahome.com/name
            operator: In
            values:
            - <instance name>
        topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
              - sentinel
            - key: redissentinels.databases.spotahome.com/name
              operator: In
              values:
              - <instance name>
          topologyKey: kubernetes.io/hostname

To force anti-affinity across ALL Pods (Data and Sentinel combined), refer to:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: middleware.instance/type
            operator: In
            values:
            - redis-failover
          - key: middleware.instance/name
            operator: In
            values:
            - <instance name>
        topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: middleware.instance/type
              operator: In
              values:
              - redis-failover
            - key: middleware.instance/name
              operator: In
              values:
              - <instance name>
          topologyKey: kubernetes.io/hostname

User Management

Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via RedisUser CRD, supporting ACLs.

TIP

Compatibility: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control.

Permission Profiles

The platform pre-defines permission profiles for common scenarios:

| Profile | ACL Rule | Explanation |
| --- | --- | --- |
| NotDangerous | +@all -@dangerous ~* | Allows all commands except dangerous ones (e.g., FLUSHDB). |
| ReadWrite | -@all +@write +@read -@dangerous ~* | Allows read/write; blocks dangerous operations. |
| ReadOnly | -@all +@read -keys ~* | Allows read-only operations (the KEYS command is additionally denied). |
| Administrator | +@all -acl ~* | Admin privileges; allows all commands except ACL management. |

For custom ACLs, see Redis ACL Documentation.
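
As an illustration, a custom rule that grants an application read/write access only to its own key prefix could look like this (user name, password, and prefix are hypothetical):

# Read/write on keys matching app1:*, with dangerous commands denied
ACL SETUSER app1 on >S3cretPass ~app1:* -@all +@read +@write -@dangerous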

Security Mechanisms

  1. ACL Force Revocation: Every RedisUser creation or update is validated by a webhook that forcibly removes acl permissions, preventing privilege escalation.
  2. Cluster Command Injection: In Cluster mode, Alauda Cache Service for Redis OSS automatically injects the topology commands cluster|slots, cluster|nodes, cluster|info, cluster|keyslot, cluster|getkeysinslot, and cluster|countkeysinslot so that clients can discover the cluster topology.
  3. 6.0 -> 7.2 Upgrade Compatibility: When upgrading from 6.0 to 7.2, the operator adds the &* (Pub/Sub channel) permission to keep behavior consistent with the channel ACLs introduced in 7.x.

System Account

Each Redis instance automatically generates a system account named operator. Its roles include:

  1. Cluster Init: Slot assignment, node joining.
  2. Config Simplification: Unified system account reduces user configuration complexity.
  3. Operations: Used for health checks, failovers, scaling.
  4. Avoid Restarts: Password updates for business users don't affect this account, avoiding restarts.
CAUTION
  • Complexity: Random 64-character string (alphanumeric plus special characters).
  • Privilege: Highest level (includes user management).
  • Restriction: The password cannot be updated online. Do NOT manually modify or delete this account; doing so may cause irreversible failures.

Production Best Practices

  1. App Isolation: Create independent user accounts for each app/microservice. Avoid sharing accounts to enable auditing and isolation.
  2. Principle of Least Privilege:
    • Read-Only App: Use ReadOnly.
    • Read-Write App: Use ReadWrite.
    • Ops Tools: Use NotDangerous or custom permissions.
    • Avoid Administrator: Unless absolutely necessary.
  3. Key Namespace Isolation: Combine ACL Key patterns (e.g., ~app1:*) to restrict apps to specific key prefixes.
  4. Password Rotation: Establish mechanisms to regularly rotate app passwords.

For operation steps, see User Management Docs.

Client Access

Topology Discovery

Both Sentinel and Cluster modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes:

Sentinel Mode

  1. Client connects to Sentinel Node.
  2. Client sends SENTINEL get-master-addr-by-name mymaster to get Master IP/Port.
  3. Client directly connects to Master.
  4. On failover, Sentinel notifies client (or client polls) to switch to new Master.

Cluster Mode

  1. Client connects to any Cluster Node.
  2. Sends CLUSTER SLOTS / CLUSTER NODES to get Slot Distribution.
  3. Calculates hash slot for Key and directly connects to target node.
  4. If slot migrates, node returns MOVED/ASK; client must refresh topology.

Both protocols return real node IPs. If a reverse proxy (HAProxy/Nginx) is placed in front, clients still receive the backend's real IPs, which may be unreachable from outside the cluster. Therefore, each Redis Pod needs an independent external address (NodePort/LoadBalancer) rather than a single proxy address.

Network Access Strategies

Alauda Cache Service for Redis OSS supports multiple access methods:

Sentinel Mode

| Method | Recommended | Description |
| --- | --- | --- |
| ClusterIP | Internal preferred | Access Sentinel via a Kubernetes Service (port 26379); clients auto-discover the master. Lowest latency, highest security. |
| LoadBalancer | External preferred | Exposes Sentinel via MetalLB/cloud LB. Stable external entry; no port management. |
| NodePort | ⚠️ External fallback | Exposes Sentinel via node ports. Requires manual port management; riskier; potential multi-NIC binding issues. |

Cluster Mode

| Method | Recommended | Description |
| --- | --- | --- |
| ClusterIP | Internal preferred | Access via a Kubernetes Service. Client must support the Cluster protocol. |
| LoadBalancer | External preferred | Configure an LB for each shard master. Stable external access; client must handle MOVED/ASK redirects. |
| NodePort | ⚠️ External fallback | Exposes the underlying Pod NodePorts; clients connect directly. Complex port management. |
WARNING
  • Port Management: NodePort range is limited (30000-32767); conflicts are likely when running multiple instances.
  • Security: Increases the attack surface.
  • Multi-NIC: Redis binds to the default NIC; clients may fail to connect if the advertised IP does not match the reachable one.
  • No LB Proxy: Sentinel/Cluster protocols require direct node connections and cannot be fronted by a standard load-balancer proxy.
INFO
  • Sentinel (1P1R + 3 Sentinels): Needs 8 NodePorts/LBs.
  • Cluster (3 Shards x 1P1R): Needs 7 NodePorts/LBs.

Code Examples

We provide best practice examples for go-redis, Jedis, Lettuce, and Redisson:

INFO

Master Group Name: In Sentinel mode, the master name is fixed to mymaster.
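
A minimal go-redis sketch for Sentinel mode, assuming a reachable Sentinel Service address and password (placeholders below); the go-redis v9 failover client discovers the current master via the Sentinels automatically:

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Failover client: queries the Sentinels for the current master,
	// connects to it directly, and follows failovers.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",                           // fixed master group name
		SentinelAddrs: []string{"<sentinel-service>:26379"}, // placeholder address
		Password:      "<password>",
	})
	defer rdb.Close()

	ctx := context.Background()
	if err := rdb.Set(ctx, "greeting", "hello", 0).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "greeting").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(val) // "hello"
}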

Client Reliability Best Practices

  1. Timeouts

    • Connect Timeout: distinct from Read Timeout. Recommend 1-3s.
    • Read/Write Timeout: Based on SLA, usually hundreds of ms.
  2. Retry Strategy

    • Exponential Backoff: Do not retry immediately on failure; use backoff (100ms, 200ms...) to avoid retry storms.
  3. Connection Pooling

    • Reuse: Always use pooling (JedisPool, go-redis Pool) to save handshake costs.
    • Max Connections: Set MaxTotal reasonably to avoid hitting Redis maxclients.
  4. Topology Refresh (Cluster)

    • Auto-refresh: Ensure client enables MOVED/ASK handling.
    • Periodic refresh: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes.
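
The points above map directly onto client configuration. A minimal go-redis sketch for Cluster mode with the recommended timeouts, bounded retries with exponential backoff, and pooling (addresses and sizes are illustrative; the go-redis cluster client follows MOVED/ASK redirects and refreshes topology automatically):

package client

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// NewClusterClient returns a Cluster-mode client configured per the
// reliability recommendations above.
func NewClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:           []string{"<node-1>:6379", "<node-2>:6379"}, // placeholder addresses
		Password:        "<password>",
		DialTimeout:     2 * time.Second,        // connection timeout, distinct from read timeout
		ReadTimeout:     500 * time.Millisecond, // per-command read timeout, per your SLA
		WriteTimeout:    500 * time.Millisecond,
		MaxRetries:      3,                      // bounded retries...
		MinRetryBackoff: 100 * time.Millisecond, // ...with exponential backoff
		MaxRetryBackoff: 1 * time.Second,
		PoolSize:        50, // keep total connections below maxclients
	})
}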

Observability & Operations

Backup & Security

The platform Backup Center provides convenient data management: you can back up instances, manage backups centrally, offload them to S3-compatible storage, and restore historical backups to a specified instance.

See Backup & Restore.

Upgrade & Scaling

Upgrade

See Upgrade.

Scaling Notes

When changing specs (CPU/Mem) or expanding:

  1. Assess Resources: Ensure cluster has capacity.
  2. Progressive: Rolling updates to minimize interruption.
  3. Off-peak: Execute during low traffic.
CAUTION

When reducing replicas or specs, ensure the current data and load fit within the new specs to avoid data loss or crashes.

Monitoring

Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus.

Built-in Metrics

Variables {{.namespace}} and {{.name}} should be replaced with actual values.

Key Hit Rate
  • Desc: Cache hit rate.
  • Unit: %
  • Expr:
    1/(1+(avg(irate(redis_keyspace_misses_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service) / (avg(irate(redis_keyspace_hits_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service)+1)))
Average Response Time
  • Desc: Avg command latency. High = slow queries/bottleneck.
  • Unit: s
  • Expr:
    avg((redis_commands_duration_seconds_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"} / redis_commands_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"})) by (namespace,service)
Role Switching
  • Desc: Master-Replica switches in 5m. Non-zero = failover occurred.
  • Unit: Count
  • Expr:
    sum by(namespace,service) (changes((sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",role="master"}) OR (sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",}) * 0))[5m:10s]))
Instance Status
  • Desc: Health status. 0 = Abnormal.
  • Expr:
    ((count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster"}) % count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"})) == bool 0 and count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"}) >= bool 3) or (count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"rfr-({{.name}})",redisarch="sentinel",role="master"})) > bool 0
Node Input Bandwidth
  • Desc: Peak ingress traffic.
  • Unit: Bps
  • Expr:
    max by(namespace,service)(irate(redis_net_input_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m]))
Node Output Bandwidth
  • Desc: Peak egress traffic.
  • Unit: Bps
  • Expr:
    max by(namespace,service)(irate(redis_net_output_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m]))
Node Connections
  • Desc: Peak client connections. Watch if near maxclients.
  • Unit: Count
  • Expr:
    max by(namespace,service)(redis_connected_clients{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*"})
CPU Usage
  • Desc: Node CPU usage. Sustained high = perf impact.
  • Unit: %
  • Expr:
    avg by(namespace,pod_name)(irate(container_cpu_usage_seconds_total{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}[5m]))/avg by(namespace,pod_name)(container_spec_cpu_quota{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})*100000
Memory Usage
  • Desc: Node memory usage. >80% suggests scaling.
  • Unit: %
  • Expr:
    avg by(namespace,pod_name)(container_memory_usage_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"} - container_memory_cache{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) / avg by(namespace,pod_name)(container_spec_memory_limit_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})
Storage Usage
  • Desc: PVC usage. Full = persistence failure.
  • Unit: %
  • Expr:
    avg(kubelet_volume_stats_used_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim)

Key Metrics & Alert Recommendations

Recommended production alerts:

| Metric | Threshold | Note |
| --- | --- | --- |
| Memory Usage | > 80% | Risk of eviction/OOM. |
| CPU Usage | > 80% (sustained) | Latency spikes. |
| Hit Rate | < 80% | Caching-strategy issue or insufficient capacity. |
| Failovers | > 0 | Check network/node health. |
| Connections | Near maxclients | New connections rejected. |
| Storage Usage | > 80% | Ensure space for AOF/RDB. |
| Response Time | > 10ms | Slow queries or bottlenecks. |
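
As one example, the memory alert could be implemented as a PrometheusRule. This is a minimal sketch assuming the Prometheus Operator is installed; label names (container vs. container_name) depend on your Kubernetes/cAdvisor version, and the expression is a simplified form of the Memory Usage metric above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-memory-alert
  namespace: <namespace>
spec:
  groups:
  - name: redis.rules
    rules:
    - alert: RedisMemoryHigh
      # container working-set memory vs. limit, as in the Memory Usage expression above
      expr: |
        container_memory_working_set_bytes{container="redis"}
          / container_spec_memory_limit_bytes{container="redis"} > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Redis memory usage above 80%: risk of eviction/OOM"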

Troubleshooting

For specific issues, search the Customer Portal.