Best Practices

Overview

As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including persistence stability, dynamic network topology changes, and resource isolation and scheduling.

This Best Practices document aims to provide a standardized reference guide for Redis deployments in production environments. It covers the full lifecycle, from architecture selection, resource planning, and client integration through observability and operations. By following this guide, users can build an enterprise-class Redis data service that is highly available, high-performance, and maintainable.

Architecture Selection

The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements:

Sentinel Mode

Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.

Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails.

  • Pros: Simple architecture, mature operations, lower requirements for client protocols.
  • Cons: Write capacity is limited to a single node; storage capacity cannot scale horizontally.

Cluster Mode

Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.

Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance.

  • Pros: Truly distributed, highly available storage; supports dynamic resharding.
  • Cons: More complex client protocol; multi-key commands (e.g., MGET) are restricted by slot distribution.

Selection Guide

When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity.

| Feature | Sentinel Mode | Cluster Mode |
| --- | --- | --- |
| Scenarios | Small/medium business, read-heavy/write-light, moderate data volume | Large business, high-concurrency reads/writes, massive data |
| High Availability | Via Sentinel monitoring and automatic failover | Via built-in node failure detection and recovery |
| Scalability | Vertical (scale-up); horizontal for read-only replicas | Horizontal (read/write); supports dynamic resharding |
| Read/Write Separation | Supported (client support required) | Supported (usually direct connection to shard masters; client support required) |
| Data Sharding | None (a single node stores the full dataset) | Yes (data auto-sharded across multiple nodes) |
| Ops Complexity | Lower; simple architecture | Higher; involves sharding, hash slots, and slot migration |
| Client Protocol | Requires client support for the Sentinel protocol | Requires client support for the Cluster protocol |

Recommendations:

  • If data volume is small (fits in single node memory) and simplicity/stability is priority, Sentinel Mode is preferred.
  • If data volume is massive or write pressure is extremely high and cannot be supported by a single node, choose Cluster Mode.

Version Selection

Alauda Cache Service for Redis OSS currently supports 5.0, 6.0, and 7.2 stable versions. All three versions have undergone complete automated testing and production verification.

For new deployments, we strongly recommend choosing Redis 7.2:

  1. Lifecycle

    • 5.0 / 6.0: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications.
    • 7.2: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come.
  2. Compatibility

    • Redis 7.2 maintains high compatibility with 5.0 and 6.0 data commands. Most business code can migrate smoothly without modification.
    • Note: RDB persistence file format (v11) is not backward compatible (i.e., RDB generated by 7.2 cannot be loaded by 6.0), but this does not affect new services.
  3. Key Features

    • ACL v2: Provides granular access control (Key-based permission selectors), significantly enhancing security in multi-tenant environments.
    • Redis Functions: Introduces Server-side Scripting standards, resolving issues with Lua script loss and replication, keeping logic closer to data.
    • Sharded Pub/Sub: Resolves network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding (see the example below).
    • Performance Optimization: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency.
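
For example, sharded Pub/Sub routes each channel to a hash slot on a single shard instead of broadcasting to every node. A quick illustration with redis-cli (channel and message are illustrative):

# Subscriber: the shard channel is routed to one shard, not broadcast cluster-wide
redis-cli -c SSUBSCRIBE orders:events

# Publisher (from another session)
redis-cli -c SPUBLISH orders:events "order-created:1001"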

For more details on Redis 7.2 features, please refer to the official Redis 7.2 Release Notes.

Resource Planning

Kernel Tuning

To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level:

  1. Memory Allocation (vm.overcommit_memory)

    • Recommended: 1
    • Explanation: Setting to 1 (Always) ensures the kernel allows memory allocation during Redis Fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors.
  2. Connection Queue (net.core.somaxconn)

    • Recommended: 2048 or higher
    • Explanation: Redis default tcp-backlog is 511. In high concurrency scenarios, system net.core.somaxconn should be increased to avoid dropping client connection requests.
  3. Transparent Huge Pages (THP)

    • Action: Disable (never)
    • Explanation: THP causes significant latency spikes during memory allocation in Redis, especially during Copy-on-Write (CoW) after Fork. It is recommended to disable this on the host or via startup scripts.
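
The following is a minimal sketch of applying these three settings on a node. It assumes direct shell access (in practice this is usually done via node-provisioning tooling or a privileged DaemonSet); add the sysctls to /etc/sysctl.conf to persist them across reboots:

# Allow fork() to succeed under memory pressure (relies on Copy-on-Write)
sysctl -w vm.overcommit_memory=1

# Raise the accept queue above Redis's tcp-backlog default (511)
sysctl -w net.core.somaxconn=2048

# Disable Transparent Huge Pages to avoid CoW latency spikes
echo never > /sys/kernel/mm/transparent_hugepage/enabled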

Memory Specifications

Redis persists in-memory data to disk asynchronously via snapshots for long-term storage. This keeps Redis fast, but data written between snapshots can be lost on failure.

In Kubernetes containerized environments, we recommend a tiered memory management strategy:

  • ✅ Standard Specs (< 8GB): Strongly Recommended. Ensures extremely low Fork latency and fast failure recovery (RTO < 60s); the most robust production choice.
  • ⚠️ High-Performance Specs (8GB - 16GB): Acceptable. Requires high-performance host and THP must be disabled. Fork is controllable but may cause ~100ms jitter under high load.
  • ❌ High-Risk Specs (> 16GB): Not Recommended. Single point of failure impact is too large, and full synchronization can easily saturate network bandwidth. Recommend horizontal splitting into Cluster mode.

Why Limit to 8GB?

While single instances on physical machines often run 32GB or more, the 8GB guideline for cloud-native environments rests on the following core technical constraints:

  1. Fork Blocking & Page Table Copy

    • Redis calls fork() during RDB/AOF Rewrite. Although memory pages are CoW, Process Page Tables must be fully copied, blocking the main thread.
    • Estimation: 10GB memory ≈ 20MB of page tables ≈ 10~50ms of blocking (depending on virtualization overhead). Beyond 8GB, blocking time grows rapidly and threatens the SLA.
  2. Failure Recovery Efficiency (RTO)

    • Container restart loading an RDB is a single-threaded, CPU-bound task (object deserialization). Tests show loading 8GB of data takes 30-50s even on SSD; a 32GB instance could take several minutes to start, contradicting the Kubernetes "fast self-healing" philosophy.

Memory Configuration Best Practices

To avoid the container being OOM-killed due to memory expansion during persistence, strictly adhere to the following principles:

  1. Set MaxMemory: Do not set maxmemory to 100% of the container Memory Limit. Recommend setting to 70% ~ 80% of the Limit.
  2. Reserve CoW Space: Redis Forks a child process during RDB/AOF Rewrite. If there are heavy write updates, OS Copy-on-Write mechanisms duplicate memory pages; in extreme cases, memory usage can double from 8GB to 16GB.
  3. Overcommit Config: Ensure host vm.overcommit_memory = 1 to allow kernel forks without requesting equivalent physical memory (relying on CoW), preventing fork failures.
INFO

Resource Reservation Formula: Container_Memory_Limit ≥ Redis_MaxMemory / 0.7

  • Example: To store 8GB data, configure Container Memory Limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead.
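
For instance, a minimal sketch of the container resources for an instance with maxmemory 8gb, following the formula above (field names are standard Kubernetes; values rounded up):

# 8GB / 0.7 ≈ 11.5GB -> round up to 12Gi
resources:
  requests:
    memory: 12Gi
  limits:
    memory: 12Gi   # ~4Gi of headroom for CoW and fragmentation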

CPU Resources

Redis core command execution is single-threaded, but persistence (Fork) and other operations require child processes. Therefore, allocate at least 2 Cores per Redis instance:

  • Core 1: Handles main thread requests and commands.
  • Core 2: Handles persistence fork, background tasks, and system overhead.

Multi-threading

Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks.

  • When to Enable?

    • Bottleneck Analysis: When Redis CPU usage nears 100% and profiling shows the time is spent on kernel-space network I/O (system CPU) rather than user-space command execution.
    • Traffic Profile: Typically beneficial when single instance QPS > 80,000 or network traffic is huge (> 1GB/s).
    • Resource Conditions: Ensure node has sufficient CPU cores (at least 4 cores).
  • Configuration Best Practices:

    • Thread Count: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gain.
    • Config Example:
      io-threads 4
      io-threads-do-reads yes
    • Note: Multi-threaded I/O only improves network throughput; it does NOT improve execution speed of single complex commands (e.g., SORT, KEYS).

Storage Planning

Capacity Planning

Persistence mode directly determines disk quota requirements. Refer to the following calculation formula:

| Mode | Recommended Quota Formula | Details |
| --- | --- | --- |
| Diskless (Cache) | 0 (no PVC) | Pure cache, no RDB/AOF. Logs are collected via stdout in Kubernetes; no persistence disk needed. |
| RDB (Snapshot) | MaxMemory * 2 | During snapshot generation, the old snapshot and the new snapshot being written coexist on disk. Reserve at least 2x memory. |
| AOF (Append Only) | MaxMemory * 3 | The AOF grows with write operations; the default auto-aof-rewrite-percentage 100 triggers a rewrite once the AOF reaches 2x the data size. The disk must hold the old AOF (2x) plus the new rewritten AOF (1x), a peak of 3x. Reserve at least 3x memory. |
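
Worked example: for an instance with maxmemory = 4GB, reserve a PVC of at least 8G under the RDB template and at least 12G under the AOF template, per the formulas above.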

Performance Requirements

  • With AOF: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when appendfsync everysec).
  • Media: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks.

Parameter Configuration

Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields.

Built-in Templates

Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance.

| Template Name | Description | Scenarios | Risks |
| --- | --- | --- | --- |
| rdb-redis-<version>-<sentinel\|cluster> | Enables RDB persistence; periodic snapshots to disk. | Balanced: limited resources, balances performance and reliability, accepts minute-level data loss. | Data-loss window depends on the save config; usually minute-level RPO. |
| aof-redis-<version>-<sentinel\|cluster> | Enables AOF persistence; logs every write operation. | Secure: ample resources, high data-security needs (second-level loss), slight performance trade-off. | Frequent fsync requires high-performance storage; high I/O pressure. |
| diskless-redis-<version>-<sentinel\|cluster> | Disables persistence; pure in-memory. | High-Perf Cache: acceleration only; data loss acceptable or rebuildable from the source. | Restart or failure leads to full data loss. |

<version> represents Redis version, e.g., 6.0, 7.2.

Key parameter differences:

| Parameter | RDB Template | AOF Template | Diskless Template | Explanation |
| --- | --- | --- | --- | --- |
| appendonly | no | yes | no | Enables AOF logging. |
| save | 60 10000 300 100 600 1 | "" (disabled) | "" (disabled) | RDB snapshot triggers. |
| repl-diskless-sync | no | no | yes | Master-replica full sync streamed over the socket, without touching disk. |
| repl-diskless-sync-delay | 5 | 5 | 0 | Delay before starting a diskless sync; 0 in the Diskless template to speed up syncs. |

Persistence Selection Recommendations

  1. Pure Cache: Choose the Diskless template. Data is rebuildable, there is no persistence overhead, and performance is best.
  2. General Business: Choose the RDB template. Periodic snapshots provide minute-level RPO with moderate resource usage.
  3. Financial/High-Reliability: Choose the AOF template with appendfsync everysec for second-level protection.
WARNING

Redis supports running RDB and AOF together, but it is generally not recommended in Kubernetes:

  • Performance: AOF fsync creates IO pressure; adding RDB fork + disk write significantly increases resource contention.
  • Storage Doubling: Requires space for both RDB snapshots and AOF files, complicating PVC planning.
  • Recovery Priority: Redis loads AOF first on start (more complete data); RDB acts only as backup, offering limited benefit.
  • Platform Backup: Alauda Cache Service for Redis OSS provides independent auto/manual backup, removing reliance on RDB snapshots for extra insurance.

Recommendation: Choose Single Persistence Mode (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space.

Parameter Update

Redis parameters are categorized by application method:

| Category | Parameters | Behavior |
| --- | --- | --- |
| Hot Update | Most runtime params (maxmemory, loglevel, etc.) | Take effect immediately after modification; no restart. |
| Restart Update | databases, rename-command, rdbchecksum, tcp-backlog, io-threads, io-threads-do-reads | Require an instance restart to take effect. |
| Immutable | bind, protected-mode, port, supervised, pidfile, dir, etc. | Managed by the system; modification may cause anomalies. |
TIP

Always back up data before modifying parameters that require a restart.

Modification Examples

Update Data Node Parameters: Configure via spec.customConfig.

# Example: Modify save strategy (Hot update)
kubectl -n <namespace> patch redis <instance-name> --type=merge --patch='{"spec": {"customConfig": {"save":"600 1"}}}'

Update Sentinel Node Parameters: Configure via spec.sentinel.monitorConfig.

Currently supports down-after-milliseconds, failover-timeout, parallel-syncs.

# Example: Modify failover timeout
kubectl -n <namespace> patch redis <instance-name> --type=merge --patch='{"spec": {"sentinel": {"monitorConfig": {"down-after-milliseconds":"30000"}}}}'

Resource Specs

Deploy resources according to your actual business scenario.

Sentinel Mode Specs

| Persistence | Template | Instance Spec | Replica / Sentinel | Sentinel Pod | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOF | aof-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| AOF | aof-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| RDB | rdb-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | 8G | 28G | 28G |
| RDB | rdb-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | 16G | 56G | 56G |
| Diskless | diskless-redis-<version>-sentinel | 2c4g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 2c4g | Unlimited (reserve resources) | 4.5c / 4.8G | / | 28G | 28G |
| Diskless | diskless-redis-<version>-sentinel | 4c8g | 1 / 3 | 100m / 128Mi | 100m / 200Mi | 4c8g | Unlimited (reserve resources) | 8.5c / 8.8G | / | 56G | 56G |

Cluster Mode Specs

| Persistence | Template | Instance Spec | Shards / Replicas | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AOF | aof-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| AOF | aof-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | Evaluate by actual write volume | Evaluate by actual write volume | Evaluate by actual write volume |
| RDB | rdb-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | 24G | 84G | 84G |
| RDB | rdb-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | 48G | 168G | 168G |
| Diskless | diskless-redis-<version>-cluster | 2c4g | 3 / 1 | 100m / 300Mi | 2c4g | Unlimited (reserve resources) | 12.6c / 25.8G | / | 84G | 84G |
| Diskless | diskless-redis-<version>-cluster | 4c8g | 3 / 1 | 100m / 300Mi | 4c8g | Unlimited (reserve resources) | 24.6c / 49.8G | / | 168G | 168G |

<version> represents Redis version, e.g., 6.0, 7.2.

Scheduling

Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments.

Node Selection

You can use the spec.nodeSelector field to specify which nodes Redis Pods should be scheduled on. This is typically combined with Kubernetes node labels to isolate database workloads onto dedicated node pools (see the combined example under Taint Toleration below).

WARNING

Persistence Limitation: If your Redis instance mounts Non-Network Storage (e.g., Local PV) PVCs, be cautious when updating nodeSelector. Since local data resides on specific nodes and cannot migrate with Pods, the updated nodeSelector set MUST include the node where the Pod currently resides. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction.

Taint Toleration

Use spec.tolerations to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., key=redis:NoSchedule), preventing other non-critical workloads from preempting resources.
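
A minimal sketch combining spec.nodeSelector (above) and spec.tolerations for a dedicated Redis node pool; the node label is hypothetical, and the toleration matches the key=redis:NoSchedule taint mentioned above:

spec:
  nodeSelector:
    dedicated: redis          # hypothetical label on the dedicated node pool
  tolerations:
  - key: key
    operator: Equal
    value: redis
    effect: NoSchedule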

Anti-Affinity

To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode.

CAUTION

Immutable: To ensure consistency and reliability, anti-affinity configurations (both affinityPolicy and affinity) cannot be modified after instance creation. Please plan ahead.

Cluster Mode

In Cluster mode, the system prioritizes spec.affinityPolicy. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet.

  • Priority: spec.affinityPolicy > spec.affinity.
  • If affinityPolicy is unset: Alauda Cache Service for Redis OSS checks spec.affinity. If you need custom topology rules beyond the enums below, leave affinityPolicy empty and configure native spec.affinity.

| Policy Name | affinityPolicy Value | Behavior | Pros / Cons | Scenario |
| --- | --- | --- | --- | --- |
| All Pods Forced Anti-Affinity | AntiAffinity | Forces ALL Pods in the cluster (including primaries/replicas of different shards) onto different nodes. Scheduling fails if node count < total Pod count. | Pros: highest disaster tolerance, minimal single-node failure impact. Cons: extremely high resource requirement; node count must be >= total Pods. | Cluster-mode core business: ample resources, strict HA requirements. |
| Shard Primary-Replica Forced Anti-Affinity | AntiAffinityInSharding | Forces the primary and replicas within the same shard onto different nodes; Pods from different shards may coexist. | Pros: guarantees physical isolation of a shard's replicas, preventing data loss from a single node failure. Cons: scheduling fails if live nodes < the shard's replica count; primaries of different shards may land on the same node (single-point risk). | Production standard: balances resource usage and data safety. |
| Shard Primary-Replica Soft Anti-Affinity | SoftAntiAffinity | Prefers spreading a shard's primary/replicas; if impossible (e.g., insufficient nodes), allows them on the same node. | Pros: highest deployment success rate; runs with limited resources. Cons: primary and replica may share a node in extreme cases, risking data loss. | Test/dev environments, or resource-constrained edge environments. |
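
For instance, a minimal sketch selecting the production-standard policy at creation time (remember that this field is immutable after creation; the enum values are those listed above):

spec:
  affinityPolicy: AntiAffinityInSharding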

Sentinel Mode

IMPORTANT

Sentinel mode does not support spec.affinityPolicy.

For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules:

  • Redis Data Nodes: Configured via spec.affinity.
  • Sentinel Nodes: Configured via spec.sentinel.affinity.

You need to write complete Affinity rules manually. The following example enforces anti-affinity among the Data nodes and among the Sentinel nodes respectively:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - redis
          - key: redisfailovers.databases.spotahome.com/name
            operator: In
            values:
            - <instance name>
        topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
              - sentinel
            - key: redissentinels.databases.spotahome.com/name
              operator: In
              values:
              - <instance name>
          topologyKey: kubernetes.io/hostname

To force anti-affinity across ALL Pods (Data and Sentinel combined), refer to:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: middleware.instance/type
            operator: In
            values:
            - redis-failover
          - key: middleware.instance/name
            operator: In
            values:
            - <instance name>
        topologyKey: kubernetes.io/hostname
  sentinel:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: middleware.instance/type
              operator: In
              values:
              - redis-failover
            - key: middleware.instance/name
              operator: In
              values:
              - <instance name>
          topologyKey: kubernetes.io/hostname

User Management

Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via RedisUser CRD, supporting ACLs.

TIP

Compatibility: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control.

Permission Profiles

The platform pre-defines permission profiles for common scenarios:

| Profile | ACL Rule | Explanation |
| --- | --- | --- |
| NotDangerous | +@all -@dangerous ~* | Allows all commands except dangerous ones (e.g., FLUSHDB). |
| ReadWrite | -@all +@write +@read -@dangerous ~* | Allows read/write; blocks dangerous operations. |
| ReadOnly | -@all +@read -keys ~* | Allows read-only operations (the KEYS command is additionally denied). |
| Administrator | +@all -acl ~* | Admin privileges; allows all commands except ACL management. |

For custom ACLs, see Redis ACL Documentation.
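
As an illustration, a custom rule that grants an application read/write access only to its own key prefix could look like this (user name, password, and prefix are hypothetical):

# Read/write on keys matching app1:*, with dangerous commands denied
ACL SETUSER app1 on >S3cretPass ~app1:* -@all +@read +@write -@dangerous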

Security Mechanisms

  1. ACL Force Revocation: Every RedisUser creation or update is validated by a webhook that forcibly removes acl permissions, preventing privilege escalation.
  2. Cluster Command Injection: In Cluster mode, Alauda Cache Service for Redis OSS automatically injects the topology commands cluster|slots, cluster|nodes, cluster|info, cluster|keyslot, cluster|getkeysinslot, and cluster|countkeysinslot so that clients can discover the cluster topology.
  3. 6.0 -> 7.2 Upgrade Compatibility: When upgrading from 6.0 to 7.2, the operator adds the &* (Pub/Sub channel) permission to keep behavior consistent with the channel ACLs introduced in 7.x.

System Account

Each Redis instance automatically generates a system account named operator. Its roles include:

  1. Cluster Init: Slot assignment, node joining.
  2. Config Simplification: Unified system account reduces user configuration complexity.
  3. Operations: Used for health checks, failovers, scaling.
  4. Avoid Restarts: Password updates for business users don't affect this account, avoiding restarts.
CAUTION
  • Complexity: Random 64-character string (alphanumeric plus special characters).
  • Privilege: Highest level (includes user management).
  • Restriction: The password cannot be updated online. Do NOT manually modify or delete this account; doing so may cause irreversible failures.

Production Best Practices

  1. App Isolation: Create independent user accounts for each app/microservice. Avoid sharing accounts to enable auditing and isolation.
  2. Principle of Least Privilege:
    • Read-Only App: Use ReadOnly.
    • Read-Write App: Use ReadWrite.
    • Ops Tools: Use NotDangerous or custom permissions.
    • Avoid Administrator: Unless absolutely necessary.
  3. Key Namespace Isolation: Combine ACL Key patterns (e.g., ~app1:*) to restrict apps to specific key prefixes.
  4. Password Rotation: Establish mechanisms to regularly rotate app passwords.

For operation steps, see User Management Docs.

Client Access

Topology Discovery

Both Sentinel and Cluster modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes:

Sentinel Mode

  1. Client connects to Sentinel Node.
  2. Client sends SENTINEL get-master-addr-by-name mymaster to get Master IP/Port.
  3. Client directly connects to Master.
  4. On failover, Sentinel notifies client (or client polls) to switch to new Master.

Cluster Mode

  1. Client connects to any Cluster Node.
  2. Sends CLUSTER SLOTS / CLUSTER NODES to get Slot Distribution.
  3. Calculates hash slot for Key and directly connects to target node.
  4. If slot migrates, node returns MOVED/ASK; client must refresh topology.

Both protocols return real node IPs. If a reverse proxy (HAProxy/Nginx) is placed in front, clients still receive the backend's real IPs, which may be unreachable from outside the cluster. Therefore, each Redis Pod needs an independent external address (NodePort/LoadBalancer) rather than a single proxy address.

Network Access Strategies

Alauda Cache Service for Redis OSS supports multiple access methods:

Sentinel Mode

| Method | Recommended | Description |
| --- | --- | --- |
| ClusterIP | Internal preferred | Access Sentinel via a Kubernetes Service (port 26379); clients auto-discover the master. Lowest latency, highest security. |
| LoadBalancer | External preferred | Exposes Sentinel via MetalLB/cloud LB. Stable external entry; no port management. |
| NodePort | ⚠️ External fallback | Exposes Sentinel via node ports. Requires manual port management; riskier; potential multi-NIC binding issues. |

Cluster Mode

| Method | Recommended | Description |
| --- | --- | --- |
| ClusterIP | Internal preferred | Access via a Kubernetes Service. Client must support the Cluster protocol. |
| LoadBalancer | External preferred | Configure an LB for each shard master. Stable external access; client must handle MOVED/ASK redirects. |
| NodePort | ⚠️ External fallback | Exposes the underlying Pod NodePorts; clients connect directly. Complex port management. |
WARNING
  • Port Management: NodePort range is limited (30000-32767); conflicts are likely when running multiple instances.
  • Security: Increases the attack surface.
  • Multi-NIC: Redis binds to the default NIC; clients may fail to connect if the advertised IP does not match the reachable one.
  • No LB Proxy: Sentinel/Cluster protocols require direct node connections and cannot be fronted by a standard load-balancer proxy.
INFO
  • Sentinel (1P1R + 3 Sentinels): Needs 8 NodePorts/LBs.
  • Cluster (3 Shards x 1P1R): Needs 7 NodePorts/LBs.

Code Examples

We provide best practice examples for go-redis, Jedis, Lettuce, and Redisson:

INFO

Master Group Name: In Sentinel mode, the master name is fixed to mymaster.
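
A minimal go-redis sketch for Sentinel mode, assuming a reachable Sentinel Service address and password (placeholders below); the go-redis v9 failover client discovers the current master via the Sentinels automatically:

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Failover client: queries the Sentinels for the current master,
	// connects to it directly, and follows failovers.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",                           // fixed master group name
		SentinelAddrs: []string{"<sentinel-service>:26379"}, // placeholder address
		Password:      "<password>",
	})
	defer rdb.Close()

	ctx := context.Background()
	if err := rdb.Set(ctx, "greeting", "hello", 0).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "greeting").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(val) // "hello"
}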

Client Reliability Best Practices

  1. Timeouts

    • Connect Timeout: distinct from Read Timeout. Recommend 1-3s.
    • Read/Write Timeout: Based on SLA, usually hundreds of ms.
  2. Retry Strategy

    • Exponential Backoff: Do not retry immediately on failure; use backoff (100ms, 200ms...) to avoid retry storms.
  3. Connection Pooling

    • Reuse: Always use pooling (JedisPool, go-redis Pool) to save handshake costs.
    • Max Connections: Set MaxTotal reasonably to avoid hitting Redis maxclients.
  4. Topology Refresh (Cluster)

    • Auto-refresh: Ensure client enables MOVED/ASK handling.
    • Periodic refresh: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes.
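
The points above map directly onto client configuration. A minimal go-redis sketch for Cluster mode with the recommended timeouts, bounded retries with exponential backoff, and pooling (addresses and sizes are illustrative; the go-redis cluster client follows MOVED/ASK redirects and refreshes topology automatically):

package client

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// NewClusterClient returns a Cluster-mode client configured per the
// reliability recommendations above.
func NewClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:           []string{"<node-1>:6379", "<node-2>:6379"}, // placeholder addresses
		Password:        "<password>",
		DialTimeout:     2 * time.Second,        // connection timeout, distinct from read timeout
		ReadTimeout:     500 * time.Millisecond, // per-command read timeout, per your SLA
		WriteTimeout:    500 * time.Millisecond,
		MaxRetries:      3,                      // bounded retries...
		MinRetryBackoff: 100 * time.Millisecond, // ...with exponential backoff
		MaxRetryBackoff: 1 * time.Second,
		PoolSize:        50, // keep total connections below maxclients
	})
}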

Observability & Operations

Backup & Security

The platform Backup Center provides convenient data management: you can back up instances, manage backups centrally, offload them to S3-compatible storage, and restore historical backups to a specified instance.

See Backup & Restore.

Upgrade & Scaling

Upgrade

See Upgrade.

Scaling Notes

When changing specs (CPU/Mem) or expanding:

  1. Assess Resources: Ensure cluster has capacity.
  2. Progressive: Rolling updates to minimize interruption.
  3. Off-peak: Execute during low traffic.
CAUTION

When reducing replicas or specs, ensure the current data and load fit within the new specs to avoid data loss or crashes.

Monitoring

Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus.

Built-in Metrics

Variables {{.namespace}} and {{.name}} should be replaced with actual values.

Key Hit Rate
  • Desc: Cache hit rate.
  • Unit: %
  • Expr:
    1/(1+(avg(irate(redis_keyspace_misses_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service) / (avg(irate(redis_keyspace_hits_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service)+1)))
Average Response Time
  • Desc: Avg command latency. High = slow queries/bottleneck.
  • Unit: s
  • Expr:
    avg((redis_commands_duration_seconds_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"} / redis_commands_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"})) by (namespace,service)
Role Switching
  • Desc: Master-Replica switches in 5m. Non-zero = failover occurred.
  • Unit: Count
  • Expr:
    sum by(namespace,service) (changes((sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",role="master"}) OR (sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",}) * 0))[5m:10s]))
Instance Status
  • Desc: Health status. 0 = Abnormal.
  • Expr:
    ((count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster"}) % count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"})) == bool 0 and count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"}) >= bool 3) or (count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"rfr-({{.name}})",redisarch="sentinel",role="master"})) > bool 0
Node Input Bandwidth
  • Desc: Peak ingress traffic.
  • Unit: Bps
  • Expr:
    max by(namespace,service)(irate(redis_net_input_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m]))
Node Output Bandwidth
  • Desc: Peak egress traffic.
  • Unit: Bps
  • Expr:
    max by(namespace,service)(irate(redis_net_output_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m]))
Node Connections
  • Desc: Peak client connections. Watch if near maxclients.
  • Unit: Count
  • Expr:
    max by(namespace,service)(redis_connected_clients{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*"})
CPU Usage
  • Desc: Node CPU usage. Sustained high = perf impact.
  • Unit: %
  • Expr:
    avg by(namespace,pod_name)(irate(container_cpu_usage_seconds_total{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}[5m]))/avg by(namespace,pod_name)(container_spec_cpu_quota{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})*100000
Memory Usage
  • Desc: Node memory usage. >80% suggests scaling.
  • Unit: %
  • Expr:
    avg by(namespace,pod_name)(container_memory_usage_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"} - container_memory_cache{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) / avg by(namespace,pod_name)(container_spec_memory_limit_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})
Storage Usage
  • Desc: PVC usage. Full = persistence failure.
  • Unit: %
  • Expr:
    avg(kubelet_volume_stats_used_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim)

Key Metrics & Alert Recommendations

Recommended production alerts:

| Metric | Threshold | Note |
| --- | --- | --- |
| Memory Usage | > 80% | Risk of eviction/OOM. |
| CPU Usage | > 80% (sustained) | Latency spikes. |
| Hit Rate | < 80% | Caching-strategy issue or insufficient capacity. |
| Failovers | > 0 | Check network/node health. |
| Connections | Near maxclients | New connections rejected. |
| Storage Usage | > 80% | Ensure space for AOF/RDB. |
| Response Time | > 10ms | Slow queries or bottlenecks. |
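
As one example, the memory alert could be implemented as a PrometheusRule. This is a minimal sketch assuming the Prometheus Operator is installed; label names (container vs. container_name) depend on your Kubernetes/cAdvisor version, and the expression is a simplified form of the Memory Usage metric above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-memory-alert
  namespace: <namespace>
spec:
  groups:
  - name: redis.rules
    rules:
    - alert: RedisMemoryHigh
      # container working-set memory vs. limit, as in the Memory Usage expression above
      expr: |
        container_memory_working_set_bytes{container="redis"}
          / container_spec_memory_limit_bytes{container="redis"} > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Redis memory usage above 80%: risk of eviction/OOM"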

Troubleshooting

For specific issues, search the Customer Portal.