Best Practices
This topic outlines recommended best practices for achieving secure and efficient production-grade operations.
Graceful Node Consolidation with Smart Karpenter on OCI
To ensure reliable and predictable node consolidations when using Smart Karpenter on Oracle Cloud Infrastructure (OCI), review the following recommended practices:
- Configure an Appropriate consolidateAfter Value

  OCI nodes typically require around 3 minutes to provision. The consolidateAfter buffer period allows new nodes to initialize and applications to stabilize before Smart Karpenter initiates consolidation or other voluntary disruptions.

  If this parameter is not configured, Smart Karpenter may consolidate nodes too aggressively. To prevent overly aggressive consolidation:

  - Confirm that consolidateAfter is explicitly defined in the NodePool configuration.
  - Set consolidateAfter to approximately 10 minutes.

- Use Pod Disruption Budgets (PDBs) and Anti-Affinity Rules

  Implement PDBs and anti-affinity rules for workloads, especially mission-critical or stateful applications. These safeguards help maintain availability during rescheduling events initiated by Smart Karpenter or Kubernetes, and they are standard best practice even outside Smart Karpenter environments.

- Protect Highly Sensitive Workloads from Voluntary Disruption

  For workloads that must never be disrupted during consolidation, add the following annotation to the pod specification:

  karpenter.sh/do-not-disrupt: "true"

  This prevents Smart Karpenter from evicting the pod during voluntary disruption workflows, including consolidation operations.
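The three practices above can be sketched as manifests. This is a minimal illustration, not a definitive configuration. It assumes Smart Karpenter follows the upstream Karpenter NodePool schema, where consolidateAfter sits under spec.disruption. The file names, the payments workload, its labels, replica count, and container image are hypothetical placeholders.

```shell
# Sketch only: writes two illustrative manifests to the current directory.
# Assumption: consolidateAfter lives under spec.disruption, as in upstream Karpenter.
cat > consolidate-after-patch.yaml <<'EOF'
spec:
  disruption:
    consolidateAfter: 10m
EOF

# Hypothetical workload "payments": names, labels, and image are placeholders.
cat > payments-safeguards.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
      annotations:
        # Blocks voluntary disruption (including consolidation) of these pods.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      affinity:
        podAntiAffinity:
          # Keep replicas on different nodes so one drain cannot take them all.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: payments
              topologyKey: kubernetes.io/hostname
      containers:
        - name: payments
          image: registry.example.com/payments:1.0.0
EOF
```

The files could then be applied with `kubectl patch nodepool <nodepool-name> --type merge --patch-file consolidate-after-patch.yaml` and `kubectl apply -f payments-safeguards.yaml`. The annotation and the PDB appear together only for compactness. In practice, reserve the annotation for workloads that must never be disrupted, because a node hosting such a pod cannot be consolidated while the pod is running.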
If node consolidation issues continue, share the Smart Karpenter logs, NodePool definition, and relevant application manifests with Avesha Support. This information helps the support team diagnose the issue accurately and offer more targeted guidance.
Node Rotation on OCI
Smart Karpenter rotates nodes through disruption actions such as drift handling, consolidation, expiration, or explicit user-triggered operations.
Graceful Time-Based Node Rotation on OCI
Smart Karpenter 1.2.0 introduces support for graceful, scheduled node rotation through the
new rotateAfter parameter. This capability enables organizations to rotate nodes predictably while
minimizing workload disruption.
| Property | Description |
|---|---|
| Field | rotateAfter (optional) |
| Type | Duration string (for example, "168h" for one week, "720h" for 30 days) |
| Default | "" (disabled) |
| Purpose | Initiates graceful node replacement based on configured node age |
How It Works
Smart Karpenter continuously evaluates the age of nodes in the cluster:
- When a node exceeds the configured rotateAfter duration, it is marked as drifted.
- Smart Karpenter provisions a new replacement node and waits for it to reach the Ready state.
- The existing node is gracefully drained, adhering to Pod Disruption Budgets (PDBs).
- After all workloads are evicted safely, the old node is deleted.
- The process strictly respects configured disruption policies.
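The age check in the first step can be illustrated with a small shell helper. This is only a sketch of the comparison described above, not Smart Karpenter's implementation. The function name exceeds_rotate_after is hypothetical, and it handles only hour-based values such as "168h".

```shell
# Hypothetical helper: returns success (0) when a node's age in hours
# is greater than a rotateAfter value given in hours, for example "168h".
exceeds_rotate_after() {
  age_hours="$1"
  rotate_after="$2"
  # An empty value means rotation is disabled (the documented default).
  [ -n "$rotate_after" ] || return 1
  limit_hours="${rotate_after%h}"   # strip the trailing "h"
  [ "$age_hours" -gt "$limit_hours" ]
}

# Example: a 200-hour-old node with a one-week (168h) rotation window.
if exceeds_rotate_after 200 "168h"; then
  echo "node would be marked as drifted"
else
  echo "node is within its rotation window"
fi
```

With a one-week window, a node that is 200 hours old is past the threshold, while a node that is 100 hours old is not.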
Key Benefits
- Predictable, scheduled node rotation for compliance or operational policies.
- Zero-downtime rotations when used with appropriate PDB settings.
- Non-disruptive alternative to expireAfter, providing more controlled lifecycle management.
Manual Node Rotation with Smart Karpenter on OCI
You can manually mark nodes as drifted so that Smart Karpenter treats them as needing replacement.
```shell
# Trigger manual rotation
kubectl annotate nodepool <nodepool-name> \
  karpenter.sh/nodepool-hash="manual-rotation-$(date +%s)" \
  --overwrite --namespace smart-scaler

# View current annotation
kubectl get nodepool <nodepool-name> -n smart-scaler \
  -o jsonpath='{.metadata.annotations.karpenter\.sh/nodepool-hash}'
```
Manual node rotation proceeds in the following sequence:
- Smart Karpenter identifies drift and marks all nodes in the NodePool as drifted.
- New nodes are provisioned to replace the drifted ones.
- Smart Karpenter waits until the new nodes become Ready and can safely take over workloads.
- Old nodes are tainted with karpenter.sh/disrupted:NoSchedule to prevent new workloads from being scheduled on them.
- Existing workloads on old nodes are drained, honoring any configured Pod Disruption Budgets (PDBs).
- Old nodes are deleted once draining is complete.
- Throughout the process, Smart Karpenter respects all configured disruption budgets to ensure controlled and safe rotation.
Disruption Budgets on OCI
Configure disruption budgets to prevent simultaneous multi-node rotation that can cause service disruptions.
Disruption Budget Parameters
| Parameter | Description |
|---|---|
| nodes | Percentage ("20%") or absolute count ("5") |
| reasons | Filter by disruption reason (Empty, Drifted, Underutilized, or Expired) |
| schedule | Cron expression for time-based budgets |
| duration | How long the budget remains in effect |
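A budget combining these parameters might look like the patch below. It is a sketch that assumes Smart Karpenter accepts the upstream Karpenter layout, with budgets listed under spec.disruption. The file name, percentage, and schedule are illustrative values only.

```shell
# Sketch only: writes an illustrative NodePool patch to the current directory.
# Assumption: budgets live under spec.disruption, as in upstream Karpenter.
cat > disruption-budgets-patch.yaml <<'EOF'
spec:
  disruption:
    budgets:
      # At most 20% of nodes disrupted at once for drift or underutilization.
      - nodes: "20%"
        reasons:
          - Drifted
          - Underutilized
      # No voluntary disruption for 8 hours starting at 09:00, Monday to Friday.
      - nodes: "0"
        schedule: "0 9 * * 1-5"
        duration: 8h
EOF
```

The patch could be applied with `kubectl patch nodepool <nodepool-name> --type merge --patch-file disruption-budgets-patch.yaml`. A merge patch replaces the whole budgets list, so include every budget that should remain in effect.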
Automatic Kubernetes Version Detection on OCI
Smart Karpenter automatically detects the Kubernetes version from the OKE control plane and selects the appropriate node image, ensuring consistent and compatible node provisioning without manual intervention.
How It Works
Smart Karpenter performs the following steps to select the correct OKE node image:
- Detects the Kubernetes version from the OKE control plane
- Searches the OCI image catalog for corresponding OKE images that match the detected version
- Selects the latest compatible image from the available options
- After a control plane upgrade and a subsequent Smart Karpenter restart:
  - Nodes using older images are automatically marked as drifted.
  - Smart Karpenter initiates graceful node rotation, respecting Pod Disruption Budgets (PDBs) and disruption policies.
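The matching and selection steps can be illustrated with a small filter over image names. This is a rough sketch, not Smart Karpenter's actual logic. The function name pick_oke_image is hypothetical, the sample image names are illustrative and assume that OKE image names embed the Kubernetes version as "-OKE-<version>-", and "latest" is approximated here by a version-aware name sort (`sort -V`, as provided by GNU coreutils).

```shell
# Hypothetical helper: reads image names on stdin, keeps those whose name
# contains "-OKE-<version>-", and prints the one that sorts last.
pick_oke_image() {
  version="$1"
  # -F treats the pattern as a fixed string so dots in the version match literally.
  grep -F -- "-OKE-${version}-" | sort -V | tail -n 1
}

# Illustrative catalog; real names come from the OCI image catalog.
printf '%s\n' \
  "Oracle-Linux-8.9-2024.05.29-0-OKE-1.29.1-707" \
  "Oracle-Linux-8.9-2024.05.29-0-OKE-1.30.1-707" \
  "Oracle-Linux-8.10-2024.09.30-0-OKE-1.30.1-747" |
  pick_oke_image "1.30.1"
```

For the sample catalog, the helper prints the Oracle-Linux-8.10 image, the newer of the two entries built for Kubernetes 1.30.1.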
Image Selection Priority
Smart Karpenter uses the following priority order to determine which image to apply:
- Environment variables (highest priority, if configured)
- Auto-detected version from the OKE control plane
- NodePool image configuration (fallback option)
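The priority order above amounts to "use the first source that provides a value". The following sketch shows that fallback logic only. The function name resolve_image and the placeholder image identifiers are hypothetical, and the names of the real environment variables are not shown because the source does not specify them.

```shell
# Hypothetical resolver: prints the first non-empty candidate in priority order.
resolve_image() {
  env_override="$1"     # value from environment variables (highest priority)
  auto_detected="$2"    # image found through OKE control plane version detection
  nodepool_image="$3"   # image from the NodePool configuration (fallback)
  for candidate in "$env_override" "$auto_detected" "$nodepool_image"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1              # no source provided an image
}

# No environment override is set, so the auto-detected image is chosen.
resolve_image "" "image-auto" "image-nodepool"
```

Passing a non-empty first argument would select the override instead, and the NodePool image is used only when both earlier sources are empty.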
Key Benefits
The automatic Kubernetes version detection:
- Eliminates the need for manual image ID management
- Ensures automatic compatibility with the control plane version
- Reduces operational overhead across multiple clusters
- Simplifies and accelerates cluster upgrade workflows
Image detection automatically searches for matching OKE images in both your OCI compartment and Oracle's public compartment.