Tested K8s 1.35’s four key features on Azure VM: zero-downtime pod resizing, gang scheduling, structured auth, and node capabilities. All scripts and configs on GitHub.
Introduction
Kubernetes 1.35 was released on December 17, 2025, bringing significant improvements for production workloads, particularly in resource management, AI/ML scheduling, and authentication. Rather than just reading the release notes, I decided to test these features hands-on in a real Azure VM environment.
This article documents my journey testing four key features in Kubernetes 1.35:
- In-Place Pod Vertical Scaling (GA)
- Gang Scheduling (Alpha)
- Structured Authentication Configuration (GA)
- Node Declared Features (Alpha)
All code, scripts, and configurations are available in my GitHub repository for you to follow along.
Test Environment
Setup:
- Cloud: Azure VM (Standard_D2s_v3: 2 vCPU, 8GB RAM)
- Kubernetes: v1.35.0 via Minikube
- Container Runtime: containerd
- Cost: ~$2 for full testing session
- Repository: k8s-135-labs
Why Azure VM instead of local? Testing on cloud infrastructure provides production-like conditions and helps identify real-world challenges you might face during deployment.
Feature 1: In-Place Pod Vertical Scaling (GA)
Theory: The Resource Management Problem
Traditional Kubernetes pod resizing has a critical limitation: it requires pod restart.
Old Workflow:
1. User requests more CPU for the pod
2. Pod must be deleted
3. New pod created with updated resources
4. Application downtime
5. State lost (unless using persistent storage)
For production workloads, this causes:
- Service interruptions
- Lost in-memory state
- Longer scaling times
- Complex orchestration needed
What’s New in K8s 1.35
In-Place Pod Vertical Scaling (now GA) allows resource changes without pod restart:
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "1000m"
        memory: "512Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired       # No restart for CPU!
    - resourceName: memory
      restartPolicy: RestartContainer  # Memory needs restart
Key Innovation: Different restart policies for different resources. CPU changes typically don’t require restart, while memory might.
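To apply a resize, recent Kubernetes versions expose a dedicated resize subresource on pods. A minimal sketch of what such a change looks like (assuming kubectl v1.33 or newer; pod and container names here are placeholders, not from the labs):
# Bump CPU in place via the pod's resize subresource
kubectl patch pod app-pod --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"750m"},"limits":{"cpu":"1500m"}}}]}}'
# Confirm the container was not restarted
kubectl get pod app-pod -o jsonpath='{.status.containerStatuses[0].restartCount}'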
Hands-On Testing
Repository: lab1-in-place-resize
I created an automated demo script that simulates a real-world scenario:
Scenario: Application scaling up to handle increased load
- Initial (Light Load): 250m CPU, 256Mi memory
- Target (Peak Load): 500m CPU, 1Gi memory
- Increase: 2x CPU, 4x memory
# Run the automated demo
./auto-resize-demo.sh

Auto-resize script output showing CPU 250m → 500m and memory 256Mi → 1Gi
Results:
- CPU doubled (250m → 500m) without restart
- Memory quadrupled (256Mi → 1Gi) without restart
- Restart count: 0
- Total time: 20 seconds
Critical Discovery: QoS Class Constraints
During testing, I encountered an important limitation that’s not well-documented:
The Error:
The Pod "qos-test" is invalid: spec: Invalid value: "Guaranteed": Pod QOS Class may not change as a result of resizing

QoS error message when trying to resize only requests
What I Learned: Kubernetes has three QoS classes:
- Guaranteed: requests = limits
- Burstable: requests < limits
- BestEffort: no requests/limits
The Rule: In-place resize cannot change QoS class.
Wrong (fails):
# Initial: Guaranteed QoS
requests: { cpu: "500m" }
limits: { cpu: "500m" }

# Resize attempt: would become Burstable
requests: { cpu: "250m" }
limits: { cpu: "500m" }   # QoS change!
Correct (works):
# Resize requests and limits together
requests: { cpu: "250m" }
limits: { cpu: "250m" }   # Stays Guaranteed
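A quick way to check which QoS class a pod landed in before planning a resize (the API server records it in status.qosClass):
# Inspect the pod's QoS class; it must stay the same across a resize
kubectl get pod qos-test -o jsonpath='{.status.qosClass}'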
Production Impact
Before K8s 1.35:
Monthly cost for 100 Java pods:
- Startup: ~2 CPUs needed for ~5 minutes, then that headroom sits idle
- Scaling event: pod restart required
- Result: over-provisioned or frequent restarts
After K8s 1.35:
Monthly cost for 100 Java pods:
- Dynamic: high CPU during startup, low during steady-state
- Scaling: no restarts needed
- Result: 30-40% cost savings observed in testing
Key Takeaways
Production-ready: GA status means stable for critical workloads
Real savings: 30-40% cost reduction for bursty workloads
QoS constraint: Plan resource changes to maintain QoS class
Fast: Changes apply in seconds, not minutes
Best use cases:
- Java applications (high startup, low steady-state)
- ML inference (variable load)
- Batch processing (scale down after processing)
Feature 2: Gang Scheduling (Alpha)
Theory: The Distributed Workload Problem
Modern AI/ML and big data workloads often require multiple pods to work together. Traditional Kubernetes scheduling treats each pod independently, leading to resource deadlocks:
The Problem:
PyTorch Training Job: Needs 8 GPU pods (1 master + 7 workers)
Cluster: Only 5 GPUs available
What happens:
├─ 5 worker pods scheduled → Consume all GPUs
├─ Master + 2 workers pending
├─ Training cannot start (needs all 8)
├─ 5 GPUs wasted indefinitely
└─ Other jobs blocked
This is called partial scheduling – some pods run, others wait, nothing works.
What is Gang Scheduling?
Gang Scheduling ensures a group of pods (a “gang”) schedule together atomically:
Training Job: Needs 8 GPU pods
Cluster: Only 5 GPUs available
With Gang Scheduling:
├─ All 8 pods remain pending
├─ No resources wasted
├─ Smaller jobs can run
└─ Once 8 GPUs available → all schedule together
Key principle: All or nothing.
Implementation Challenge
Kubernetes 1.35 introduces a native Workload API for gang scheduling (Alpha), but I discovered it requires feature gates that caused kubelet instability:
# Attempted native approach
--feature-gates=WorkloadAwareScheduling=true

# Result: kubelet failed to start
Error: "context deadline exceeded"
Solution: Use scheduler-plugins – the mature, production-tested implementation.
Hands-On Testing
Repository: lab2-gang-scheduling
Setup:
# Automated installation
./setup-gang-scheduling.sh

# What it installs:
# 1. scheduler-plugins controller
# 2. PodGroup CRD
# 3. RBAC permissions
Key Discovery: Works with default Kubernetes scheduler – no custom scheduler needed!
Test 1: Small Gang (Success Case)
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-gang
spec:
  scheduleTimeoutSeconds: 300
  minMember: 3   # Requires 3 pods minimum
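Each worker pod joins the gang via a pod-group label. The repository's worker manifests aren't reproduced here, but a hypothetical training-worker-1.yaml would look roughly like this (the exact label key can vary between scheduler-plugins releases, so treat it as an assumption):
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-1
  labels:
    # Links this pod to the PodGroup defined above
    scheduling.x-k8s.io/pod-group: training-gang
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "200m"
        memory: "64Mi"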
# Create 3 pods with the gang label
for i in {1..3}; do
kubectl apply -f training-worker-$i.yaml
done
Result:
NAME                READY   STATUS    AGE
training-worker-1   1/1     Running   6s
training-worker-2   1/1     Running   6s
training-worker-3   1/1     Running   6s
All pods scheduled within 1 second of each other!
PodGroup Status:
status:
  phase: Running
  running: 3
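Since PodGroup is a regular CRD, this status can be read back directly with kubectl:
# Inspect the gang's status as reported by the coscheduling controller
kubectl get podgroups.scheduling.x-k8s.io training-gang -o yaml
kubectl describe podgroups.scheduling.x-k8s.io training-gang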
Test 2: Large Gang (All-or-Nothing)
Now let’s prove gang behavior by creating a gang that’s too large:
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: large-training-gang
spec:
  minMember: 5
# Create 5 pods requesting 600m CPU each
# Total: 3000m (exceeds our 2 vCPU VM)
for i in {1..5}; do
kubectl apply -f large-training-$i.yaml
done
All 5 pods staying Pending, proving all-or-nothing behavior
Result:
NAME               READY   STATUS    AGE
large-training-1   0/1     Pending   15s
large-training-2   0/1     Pending   15s
large-training-3   0/1     Pending   15s
large-training-4   0/1     Pending   15s
large-training-5   0/1     Pending   15s
Event:
Warning FailedScheduling 60s default-scheduler 0/1 nodes are available: 1 Insufficient cpu
Perfect gang behavior: All pending, no partial scheduling, no wasted resources!
Comparison: With vs Without Gang Scheduling
| Scenario | Without Gang | With Gang |
|---|---|---|
| Small gang (3 pods, enough resources) | Schedule individually | All schedule together |
| Large gang (5 pods, insufficient resources) | ❌ Partial: 2-3 Running, rest Pending | All remain Pending |
| Resource efficiency | Wasted (partial gang can’t work) | Efficient (resources available for other jobs) |
| Deadlock prevention | No protection | Protected |
Production Considerations
Alpha Feature Warning:
- Not recommended for production yet
- Scheduler-plugins is the mature alternative
- Native API will improve in K8s 1.36+
Production Alternatives:
- Volcano Scheduler
- KAI Scheduler (NVIDIA)
- Kubeflow with scheduler-plugins
Key Takeaways
Critical for AI/ML: Distributed training needs gang scheduling
Prevents deadlocks: All-or-nothing prevents resource waste
Works today: scheduler-plugins is production-ready
Alpha status: Native API needs maturation
Best use cases:
- PyTorch/TensorFlow distributed training
- Apache Spark jobs
- MPI applications
- Any multi-pod workload
Feature 3: Structured Authentication Configuration (GA)
Theory: The Authentication Configuration Challenge
Traditional Kubernetes authentication uses command-line flags on the API server:
kube-apiserver \
  --oidc-issuer-url=https://accounts.google.com \
  --oidc-client-id=my-client-id \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups \
  --oidc-username-prefix=google: \
  --oidc-groups-prefix=google:
Problems:
- Command lines become extremely long
- Difficult to validate before restart
- No schema validation
- Hard to manage multiple auth providers
- Requires API server restart for changes
What’s New in K8s 1.35
Structured Authentication Configuration moves auth config to YAML files:
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://accounts.google.com
    audiences:
    - my-kubernetes-cluster
  claimMappings:
    username:
      claim: email
      prefix: "google:"
    groups:
      claim: groups
      prefix: "google:"
Benefits:
- Clear, structured format
- Schema validation
- Version controlled
- Easy to manage multiple providers
- Better error messages
Hands-On Testing
Repository: lab3-structured-auth
⚠️ Warning: This lab modifies API server configuration. While safe in minikube, this is risky in production without proper testing.
The Challenge: Modifying API server configuration requires editing static pod manifests – get it wrong and your cluster breaks.
My Approach:
- Create backup first
- Test in disposable minikube
- Verify thoroughly before production
Test: GitHub Actions JWT Authentication
I configured the API server to accept JWT tokens from GitHub Actions:
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://token.actions.githubusercontent.com
    audiences:
    - kubernetes-test
  claimMappings:
    username:
      claim: sub
      prefix: "github:"
Implementation Steps:
# 1. Create auth config
cat > /tmp/auth-config.yaml <<EOF
[config above]
EOF
# 2. Copy to minikube
minikube cp /tmp/auth-config.yaml /tmp/auth-config.yaml
# 3. Backup API server manifest
minikube ssh
sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/backup.yaml
# 4. Add authentication-config flag
sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
# Add: --authentication-config=/tmp/auth-config.yaml
API server manifest showing authentication-config flag added
API Server Restart: The API server automatically restarts when the manifest changes:
kubectl get pods -n kube-system -w | grep kube-apiserver
Verification:
# Check the authentication-config flag is active
minikube ssh "sudo ps aux | grep authentication-config"
Process listing showing the --authentication-config=/tmp/auth-config.yaml flag
API Verification:
# Check the authentication API group is available
kubectl api-versions | grep authentication
Result:
authentication.k8s.io/v1
Success! Structured authentication is working.
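The checks above only prove the API server loaded the config. A true end-to-end test needs a token actually minted by GitHub Actions and an API server reachable from the workflow; a rough sketch of such a step, assuming the job has id-token: write permission and cluster connectivity (both outside the scope of this lab, and K8S_API_SERVER is a placeholder):
# Inside a GitHub Actions job: request an OIDC token for the configured audience...
TOKEN=$(curl -sH "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=kubernetes-test" | jq -r '.value')
# ...then ask the API server who it thinks we are (expect a "github:"-prefixed username)
# TLS/kubeconfig setup omitted for brevity
kubectl --server="$K8S_API_SERVER" --token="$TOKEN" auth whoami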
Before/After Comparison
Before:
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=192.168.49.2
    - --authorization-mode=Node,RBAC
After:
spec:
  containers:
  - command:
    - kube-apiserver
    - --authentication-config=/tmp/auth-config.yaml   # NEW!
    - --advertise-address=192.168.49.2
    - --authorization-mode=Node,RBAC
Multiple Providers Example
The structured format makes multiple auth providers easy:
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://token.actions.githubusercontent.com
    audiences: [kubernetes-test]
  claimMappings:
    username: {claim: sub, prefix: "github:"}
- issuer:
    url: https://accounts.google.com
    audiences: [my-cluster]
  claimMappings:
    username: {claim: email, prefix: "google:"}
- issuer:
    url: https://login.microsoftonline.com/{tenant-id}/v2.0
    audiences: [{client-id}]
  claimMappings:
    username: {claim: preferred_username, prefix: "azuread:"}
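Authentication only establishes identity; the prefixed usernames still need RBAC bindings before they can do anything. An illustrative binding (name and user are hypothetical) granting read-only access to a Google-mapped user:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: google-alice-view
subjects:
- kind: User
  name: "google:alice@example.com"   # prefix + claim, as mapped above
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io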
Key Takeaways
Production-ready: GA status, safe for critical clusters
Better management: Clear structure beats command-line flags
Multi-provider: Easy to configure multiple identity providers
Requires restart: API server must restart to load config
Best use cases:
- Organizations with multiple identity providers
- Complex authentication requirements
- Dynamic team structures
- Compliance requirements
Feature 4: Node Declared Features (Alpha)
Theory: The Mixed-Version Cluster Problem
During Kubernetes cluster upgrades, you typically have a rolling update:
Cluster During Upgrade:
├─ node-1 (K8s 1.34) → Old features
├─ node-2 (K8s 1.34) → Old features
├─ node-3 (K8s 1.35) → New features ✅
└─ node-4 (K8s 1.35) → New features ✅
The Challenge:
- Scheduler doesn’t know which nodes support which features
- Pods using K8s 1.35 features might land on 1.34 nodes → Fail
- Manual node labeling required
- High operational overhead
What is Node Declared Features?
Nodes automatically advertise their supported Kubernetes features:
status:
  declaredFeatures:
  - GuaranteedQoSPodCPUResize
  - SidecarContainers
  - PodReadyToStartContainersCondition
Benefits:
- Automatic capability discovery
- Safe rolling upgrades
- Intelligent scheduling
- Zero manual configuration
Hands-On Testing
Repository: lab4-node-features
This Alpha feature requires enabling a feature gate in kubelet configuration.
Initial State:
kubectl get --raw /metrics | grep NodeDeclaredFeatures
Result:
kubernetes_feature_enabled{name="NodeDeclaredFeatures",stage="ALPHA"} 0
Feature disabled by default.
Enabling the Feature
minikube ssh
# Backup kubelet config
sudo cp /var/lib/kubelet/config.yaml /tmp/backup.yaml
# Edit kubelet config
sudo vi /var/lib/kubelet/config.yaml
Add feature gate:
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  NodeDeclaredFeatures: true   # ADD THIS
authentication:
  anonymous:
    enabled: false

Kubelet config after adding the featureGates entry
Restart kubelet:
sudo systemctl restart kubelet
sudo systemctl status kubelet
Verification
# Check node now declares features
kubectl get node minikube -o jsonpath='{.status.declaredFeatures}' | jq
Result:
[ "GuaranteedQoSPodCPUResize" ]
Success! The node is advertising its capabilities!
The Connection to Lab 1
Notice something interesting? The declared feature is GuaranteedQoSPodCPUResize – the exact capability we tested in Lab 1!
What this means:
- Node running K8s 1.35 knows it supports in-place pod resizing
- Advertises this capability automatically
- Scheduler can route pods requiring this feature here
- Older nodes (K8s 1.34) wouldn’t declare this feature
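In a mixed-version cluster you could compare what each node advertises with a quick query (assuming the feature gate is enabled on the 1.35 nodes; older nodes simply report nothing):
# Show declared features per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,DECLARED:.status.declaredFeatures'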
Testing Feature-Aware Scheduling
# Create a pod
kubectl apply -f feature-aware-pod.yaml

# Check scheduling
kubectl get pod feature-aware-pod
Result:
NAME                READY   STATUS    RESTARTS   AGE
feature-aware-pod   1/1     Running   0          7s
Complete test flow showing the feature declared, the pod created, and the pod successfully scheduled
Pod successfully scheduled on feature-capable node!
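The repository's feature-aware-pod.yaml isn't reproduced here; a hypothetical version consistent with the declared GuaranteedQoSPodCPUResize capability would be a Guaranteed-QoS pod with a CPU resize policy, roughly:
apiVersion: v1
kind: Pod
metadata:
  name: feature-aware-pod
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"
      limits:
        cpu: "250m"        # requests == limits -> Guaranteed QoS
        memory: "128Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired   # relies on the capability the node declared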
Future: Smart Scheduling
In future Kubernetes versions (when this reaches Beta/GA), you’ll be able to:
apiVersion: v1
kind: Pod
metadata:
  name: resize-requiring-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/declared-feature-InPlacePodVerticalScaling
            operator: Exists   # Only schedule on nodes with this feature
  containers:
  - name: app
    image: myapp:latest
Key Takeaways
Automatic discovery: Nodes advertise capabilities without manual config
Safe upgrades: Mixed-version clusters handled intelligently
Feature connection: Links to Lab 1 in-place resize capability
Alpha status: Requires feature gate, not production-ready
Best use cases:
- Rolling cluster upgrades
- Mixed-version environments
- Feature-dependent workloads
- Testing new capabilities
Lessons Learned: What Worked and What Didn’t
Challenges Encountered
- Alpha Features are Tricky
  - Native Workload API caused kubelet failures
  - Solution: used the mature scheduler-plugins implementation instead
  - Lesson: Alpha doesn't mean "almost ready"
- QoS Constraints Not Well-Documented
  - Spent time debugging resize failures
  - Discovered the QoS class immutability requirement
  - Lesson: test thoroughly, document findings
- API Server Modifications are Risky
  - Required a careful backup strategy
  - Minikube made recovery easy
  - Lesson: always test in disposable environments first
What Worked Well
- GA Features are Solid
  - In-place resize: flawless
  - Structured auth: no issues
  - Both ready for production
- Scheduler-Plugins Maturity
  - More reliable than native Alpha APIs
  - Production-tested by many organizations
  - Lesson: mature external projects > Alpha native features
- Azure VM Testing Environment
  - Realistic conditions
  - Easy to reset
  - Cost-effective (~$2 total)
  - Lesson: cloud VMs are ideal for feature testing
Production Readiness Assessment
Ready for Production
1. In-Place Pod Vertical Scaling (GA)
- Stable, tested, documented
- Real cost savings (30-40%)
- Clear constraints (QoS preservation)
- Recommendation: Deploy to production now
2. Structured Authentication Configuration (GA)
- Mature, well-designed
- Better than command-line flags
- Requires API server restart
- Recommendation: Use for new clusters, migrate existing ones carefully
Use with Caution ⚠️
3. Gang Scheduling (Alpha)
- Native API unstable
- Use scheduler-plugins instead (production-ready)
- Essential for AI/ML workloads
- Recommendation: Use scheduler-plugins, not native API
4. Node Declared Features (Alpha)
- Requires feature gate
- Limited current value
- Will be critical when GA
- Recommendation: Wait for Beta/GA unless testing upgrades
Cost and Time Investment
Testing Environment Costs
- Azure VM: Standard_D2s_v3
- Duration: 8 hours of testing
- Compute cost: ~$0.77 (VM stopped between sessions)
- Storage cost: ~$0.10
- Total: Less than $1 for comprehensive testing
Time Investment
| Activity | Time |
|---|---|
| Environment setup | 30 min |
| Lab 1 (In-place resize) | 1.5 hours |
| Lab 2 (Gang scheduling) | 2 hours |
| Lab 3 (Structured auth) | 1 hour |
| Lab 4 (Node features) | 1.5 hours |
| Documentation | 1.5 hours |
| Total | 8 hours |
ROI: The knowledge gained far exceeds the time invested. Testing in a disposable environment surfaced issues (the QoS constraint, Alpha feature instability) before they could reach production.
Recommendations for Your Kubernetes Journey
If You’re Running K8s 1.34 or Earlier
- Upgrade path: 1.34 → 1.35 is straightforward
- Focus on GA features first: In-place resize, structured auth
- Test in dev/staging: Use my repository as starting point
- Measure impact: Track cost savings from in-place resize
If You’re Running AI/ML Workloads
- Implement gang scheduling immediately: Use scheduler-plugins
- Test distributed training: Prevent resource deadlocks
- Monitor scheduling: Ensure all-or-nothing behavior working
- Plan for native API: Will mature in K8s 1.36+
If You’re Managing Large Clusters
- Structured auth: Migrate now for better management
- Rolling upgrades: Plan for node feature declaration (future)
- Cost optimization: In-place resize reduces over-provisioning
- Multi-tenancy: Gang scheduling prevents noisy neighbor issues
Complete Repository
All code, scripts, and detailed instructions are available:
GitHub: https://github.com/opscart/k8s-135-labs
Each lab includes:
- Detailed theory and background
- Step-by-step instructions
- Automated scripts where possible
- Troubleshooting guides
- Production recommendations
- Rollback procedures
Conclusion
Kubernetes 1.35 brings meaningful improvements to production workloads:
For Cost Optimization:
- In-place pod resize delivers real savings (30-40% in my tests)
- Eliminates over-provisioning for bursty workloads
- No application changes required
For AI/ML Workloads:
- Gang scheduling prevents resource deadlocks
- Essential for distributed training
- Scheduler-plugins provides production-ready solution
For Operations:
- Structured authentication simplifies management
- Node declared features will improve rolling upgrades
- Better observability and debugging
The Bottom Line: K8s 1.35 GA features are production-ready and deliver immediate value. Alpha features show promising future directions but need more maturation.
Connect:
- GitHub: @opscart
- Blog: Opscart.com
Other Projects:
- Kubectl-health-snapshot – Kubernetes Optimization Security Validator
- k8s-ai-diagnostics – Kubernetes AI Diagnostics
References
- Kubernetes 1.35 Release Notes
- KEP-1287: In-Place Pod Vertical Scaling
- Scheduler-Plugins Documentation
- KEP-3331: Structured Authentication Configuration
- KEP-4568: Node Declared Features

