Published: May 10, 2022
Author: Fernando McKenzie
Tags: Kubernetes, Containers, Microservices, DevOps, Scalability
Building on our ML-driven predictive maintenance success in 2021, we faced a new challenge: our monolithic application architecture was becoming a bottleneck for innovation and scaling. This article chronicles our journey from that monolith to a distributed microservices architecture orchestrated by Kubernetes, achieving a 10x increase in deployment frequency and 99.99% uptime.
Monolith Limitations (2021):
├── Deployment frequency: Weekly releases (52/year)
├── Average deployment time: 45 minutes
├── Failed deployment rate: 12% (rollback required)
├── Mean time to recovery: 2.5 hours
├── Resource utilization: 35% average CPU/memory
└── Developer velocity: 3 story points/developer/sprint
Infrastructure as Code (Terraform):
# EKS cluster configuration
resource "aws_eks_cluster" "main" {
name = "supply-chain-cluster"
role_arn = aws_iam_role.cluster.arn
version = "1.21"
vpc_config {
subnet_ids = module.vpc.private_subnets
endpoint_private_access = true
endpoint_public_access = true
public_access_cidrs = ["0.0.0.0/0"]
}
encryption_config {
provider {
key_arn = aws_kms_key.eks.arn
}
resources = ["secrets"]
}
enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
depends_on = [
aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy,
]
}
# Node groups with different instance types for workload optimization
resource "aws_eks_node_group" "general" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "general-workload"
node_role_arn = aws_iam_role.node.arn
subnet_ids = module.vpc.private_subnets
instance_types = ["t3.medium", "t3.large"]
capacity_type = "ON_DEMAND"
scaling_config {
desired_size = 3
max_size = 10
min_size = 2
}
labels = {
workload = "general"
}
depends_on = [
aws_iam_role_policy_attachment.node_AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.node_AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.node_AmazonEC2ContainerRegistryReadOnly,
]
}
resource "aws_eks_node_group" "ml_workload" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "ml-workload"
node_role_arn = aws_iam_role.node.arn
subnet_ids = module.vpc.private_subnets
instance_types = ["c5.xlarge", "c5.2xlarge"]
capacity_type = "SPOT" # Cost optimization for ML workloads
scaling_config {
desired_size = 1
max_size = 5
min_size = 0
}
labels = {
workload = "ml-processing"
}
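# Only pods that tolerate the taint below (e.g. the ML prediction service) can schedule onto these nodes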
taint {
key = "workload"
value = "ml"
effect = "NO_SCHEDULE"
}
}
Service Boundaries Definition:
# services-architecture.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: service-boundaries
data:
inventory-service: |
Responsibilities:
- Product catalog management
- Stock level tracking
- Inventory allocation/deallocation
- Low stock alerts
Dependencies:
- PostgreSQL database
- Redis cache
- Notification service
order-service: |
Responsibilities:
- Order creation and management
- Order status tracking
- Payment processing coordination
- Order fulfillment workflow
Dependencies:
- Inventory service
- Payment service
- Shipping service
- Customer service
shipping-service: |
Responsibilities:
- Carrier integration
- Shipment tracking
- Delivery scheduling
- Route optimization
Dependencies:
- Order service
- External carrier APIs
- Geolocation service
ml-prediction-service: |
Responsibilities:
- Predictive maintenance models
- Demand forecasting
- Anomaly detection
- Model training and deployment
Dependencies:
- Time series database
- Model registry
- Feature store
Container Deployment Configurations:
# inventory-service deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: inventory-service
namespace: supply-chain
labels:
app: inventory-service
version: v1.2.3
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: inventory-service
template:
metadata:
labels:
app: inventory-service
version: v1.2.3
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: inventory-service-sa
containers:
- name: inventory-service
image: your-registry/inventory-service:v1.2.3
ports:
- containerPort: 8080
name: http
- containerPort: 9090
name: grpc
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: url
- name: REDIS_URL
valueFrom:
configMapKeyRef:
name: redis-config
key: url
- name: LOG_LEVEL
value: "INFO"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
volumeMounts:
- name: config-volume
mountPath: /app/config
readOnly: true
volumes:
- name: config-volume
configMap:
name: inventory-service-config
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- inventory-service
topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
name: inventory-service
namespace: supply-chain
labels:
app: inventory-service
monitoring: enabled  # selected by the ServiceMonitor defined later
spec:
selector:
app: inventory-service
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
- name: grpc
port: 9090
targetPort: 9090
protocol: TCP
type: ClusterIP
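The liveness and readiness probes above assume each service exposes /health and /ready on port 8080. A minimal sketch of those handlers; checkDatabase and checkRedis are hypothetical stand-ins for real dependency pings:

package main

import (
    "context"
    "net/http"
)

// Placeholders for real dependency checks (PostgreSQL pool ping, Redis PING).
func checkDatabase(ctx context.Context) error { return nil }
func checkRedis(ctx context.Context) error    { return nil }

func main() {
    mux := http.NewServeMux()
    // Liveness: the process is up; kubelet restarts the container if this fails.
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    // Readiness: dependencies are reachable; the pod is removed from Service
    // endpoints (and mesh routing) while this fails.
    mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
        if err := checkDatabase(r.Context()); err != nil {
            http.Error(w, "database unavailable", http.StatusServiceUnavailable)
            return
        }
        if err := checkRedis(r.Context()); err != nil {
            http.Error(w, "cache unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8080", mux)
}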
# Istio service mesh configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: inventory-service
namespace: supply-chain
spec:
hosts:
- inventory-service
http:
- match:
- headers:
version:
exact: v2
route:
- destination:
host: inventory-service
subset: v2
weight: 100
- route:
- destination:
host: inventory-service
subset: v1
weight: 100
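# Fault injection: delay 0.1% of requests by 5s to continuously exercise client timeouts and retries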
fault:
delay:
percentage:
value: 0.1
fixedDelay: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure,refused-stream
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: inventory-service
namespace: supply-chain
spec:
host: inventory-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
maxRequestsPerConnection: 10
# Outlier detection is Istio's circuit breaker: pods that keep failing
# are temporarily ejected from the load-balancing pool
outlierDetection:
consecutive5xxErrors: 3
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 50
subsets:
- name: v1
labels:
version: v1.2.3
- name: v2
labels:
version: v1.3.0
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: supply-chain-mtls
namespace: supply-chain
spec:
mtls:
mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: inventory-service-authz
namespace: supply-chain
spec:
selector:
matchLabels:
app: inventory-service
rules:
- from:
- source:
principals: ["cluster.local/ns/supply-chain/sa/order-service-sa"]
- source:
principals: ["cluster.local/ns/supply-chain/sa/ml-prediction-service-sa"]
to:
- operation:
methods: ["GET", "POST", "PUT"]
- from:
- source:
principals: ["cluster.local/ns/supply-chain/sa/admin-sa"]
to:
- operation:
methods: ["*"]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: inventory-service-hpa
namespace: supply-chain
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: inventory-service
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
selectPolicy: Min
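Note that http_requests_per_second is not a built-in Kubernetes metric: the Pods metric above assumes a custom-metrics adapter (something like prometheus-adapter) computing a rate over a counter the services already export. A sketch of the counter side, with illustrative names:

package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// The adapter derives http_requests_per_second from a rate() over this counter.
var httpRequestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests served",
    },
    []string{"method", "code"},
)

// CountRequests wraps a handler so every request increments the counter.
func CountRequests(next http.Handler) http.Handler {
    return promhttp.InstrumentHandlerCounter(httpRequestsTotal, next)
}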
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: ml-prediction-service-vpa
namespace: supply-chain
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: ml-prediction-service
updatePolicy:
updateMode: "Auto"
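# "Auto" lets the VPA evict and recreate pods to apply updated resource requests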
resourcePolicy:
containerPolicies:
- containerName: ml-prediction-service
maxAllowed:
cpu: "4"
memory: "8Gi"
minAllowed:
cpu: "500m"
memory: "1Gi"
controlledResources: ["cpu", "memory"]
# Custom resource for ML model deployments
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: mlmodels.ml.supplychain.io
spec:
group: ml.supplychain.io
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
modelName:
type: string
modelVersion:
type: string
framework:
type: string
enum: ["tensorflow", "pytorch", "sklearn"]
resourceRequirements:
type: object
properties:
memory:
type: string
cpu:
type: string
gpu:
type: string
replicas:
type: integer
minimum: 1
maximum: 10
status:
type: object
properties:
deploymentStatus:
type: string
enum: ["pending", "deploying", "ready", "failed"]
endpoint:
type: string
lastUpdated:
type: string
format: date-time
scope: Namespaced
names:
plural: mlmodels
singular: mlmodel
kind: MLModel
---
# Example ML model deployment using custom resource
apiVersion: ml.supplychain.io/v1
kind: MLModel
metadata:
name: demand-forecasting-v2
namespace: supply-chain
spec:
modelName: "demand-forecasting"
modelVersion: "v2.1.0"
framework: "tensorflow"
resourceRequirements:
memory: "2Gi"
cpu: "1000m"
gpu: "1"
replicas: 3
# ArgoCD application configuration
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: supply-chain-services
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/supply-chain-k8s
targetRevision: HEAD
path: manifests/production
destination:
server: https://kubernetes.default.svc
namespace: supply-chain
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
# .github/workflows/deploy.yml
name: Build and Deploy to Kubernetes
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
CLUSTER_NAME: supply-chain-cluster
CLUSTER_REGION: us-west-2
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Go
uses: actions/setup-go@v3
with:
go-version: 1.18
- name: Run tests
run: |
go test ./... -v -race -coverprofile=coverage.out
go tool cover -html=coverage.out -o coverage.html
- name: Security scan
uses: securecodewarrior/github-action-add-sarif@v1
with:
sarif-file: 'security-scan-results.sarif'
build:
needs: test
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.version }}
image-digest: ${{ steps.build.outputs.digest }}
steps:
- uses: actions/checkout@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Log in to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ github.repository }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=sha,prefix={{branch}}-
type=raw,value=latest,enable={{is_default_branch}}
- name: Build and push Docker image
id: build
uses: docker/build-push-action@v4
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
VERSION=${{ steps.meta.outputs.version }}
COMMIT_SHA=${{ github.sha }}
security-scan:
needs: build
runs-on: ubuntu-latest
steps:
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ env.REGISTRY }}/${{ github.repository }}:${{ needs.build.outputs.image-tag }}
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy scan results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
deploy-staging:
needs: [build, security-scan]
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
environment: staging
steps:
- uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.CLUSTER_REGION }}
- name: Update kubeconfig
run: |
aws eks update-kubeconfig --name ${{ env.CLUSTER_NAME }} --region ${{ env.CLUSTER_REGION }}
- name: Deploy to staging
run: |
# Update image in Kustomization
cd k8s/overlays/staging
kustomize edit set image app=${{ env.REGISTRY }}/${{ github.repository }}:${{ needs.build.outputs.image-tag }}
# Apply manifests
kubectl apply -k .
# Wait for rollout
kubectl rollout status deployment/inventory-service -n staging --timeout=600s
kubectl rollout status deployment/order-service -n staging --timeout=600s
- name: Run smoke tests
run: |
# Wait for services to be ready
kubectl wait --for=condition=ready pod -l app=inventory-service -n staging --timeout=300s
# Run integration tests
go test ./tests/integration/... -tags=staging
deploy-production:
needs: [build, deploy-staging]
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
environment: production
steps:
- uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.CLUSTER_REGION }}
- name: Update kubeconfig
run: |
aws eks update-kubeconfig --name ${{ env.CLUSTER_NAME }} --region ${{ env.CLUSTER_REGION }}
- name: Blue-Green Deployment
run: |
# Update ArgoCD application with new image
argocd app sync supply-chain-services --force
argocd app wait supply-chain-services --timeout 600
# Verify deployment health
kubectl get pods -n supply-chain
kubectl top pods -n supply-chain
- name: Post-deployment verification
run: |
# Health check: fail the job explicitly if the API is unhealthy
if ! curl -f http://api.internal/health; then
echo "Deployment verification failed"
exit 1
fi
# Capacity and autoscaler state
kubectl top nodes
kubectl get hpa -n supply-chain
# Custom ServiceMonitor for application metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: supply-chain-services
namespace: supply-chain
labels:
app: supply-chain-services
spec:
selector:
matchLabels:
monitoring: enabled
endpoints:
- port: http
path: /metrics
interval: 30s
scrapeTimeout: 10s
relabelings:
- sourceLabels: [__meta_kubernetes_service_name]
targetLabel: service
- sourceLabels: [__meta_kubernetes_namespace]
targetLabel: namespace
- sourceLabels: [__meta_kubernetes_pod_name]
targetLabel: pod
// Application metrics in Go service
package metrics
import (
"net/http"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
// Business metrics
OrdersProcessed = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "orders_processed_total",
Help: "Total number of orders processed",
},
[]string{"status", "customer_type"},
)
InventoryLevels = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "inventory_levels",
Help: "Current inventory levels by SKU",
},
[]string{"sku", "location"},
)
OrderProcessingDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "order_processing_duration_seconds",
Help: "Time taken to process orders",
Buckets: prometheus.ExponentialBuckets(0.1, 2, 10),
},
[]string{"order_type"},
)
// Technical metrics
DatabaseConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "database_connections_active",
Help: "Number of active database connections",
},
)
CacheHitRate = promauto.NewGaugeVec(
prometheus.GaugeOpts{
Name: "cache_hit_rate",
Help: "Cache hit rate percentage",
},
[]string{"cache_name"},
)
)
// Middleware to track HTTP request metrics
func PrometheusMiddleware(next http.Handler) http.Handler {
return promhttp.InstrumentHandlerDuration(
promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests",
},
[]string{"method", "endpoint", "status_code"},
),
next,
)
}
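Wiring it together: a sketch of how a service's main package might mount the middleware and the /metrics endpoint the scrape annotations point at (the module path and handler are illustrative):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"

    "example.com/supply-chain/internal/metrics" // hypothetical module path
)

func ordersHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok")) // placeholder business logic
}

func main() {
    mux := http.NewServeMux()
    // Business routes wrapped by the duration middleware defined above.
    mux.Handle("/orders", metrics.PrometheusMiddleware(http.HandlerFunc(ordersHandler)))
    // Serves every promauto-registered metric for Prometheus to scrape.
    mux.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", mux)
}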
Deployment Improvements:
├── Frequency: Weekly (52/year) → Daily (250+/year)
├── Duration: 45 minutes → 3 minutes (93% reduction)
├── Success rate: 88% → 99.2% first-time success
├── Rollback time: 2.5 hours → 2 minutes (98% reduction)
└── Zero-downtime: 0% → 100% of deployments
Operational Improvements:
# Performance comparison metrics
operational_metrics = {
'system_availability': {
'before': 99.2, # 99.2% uptime
'after': 99.99, # 99.99% uptime (~4 minutes downtime/month)
'improvement': 'monthly downtime cut from ~6 hours to ~4 minutes'
},
'resource_utilization': {
'before': 35, # 35% average utilization
'after': 78, # 78% average utilization
'cost_savings': '$45,000/month in infrastructure'
},
'scaling_response': {
'before': 900, # 15 minutes to scale manually
'after': 30, # 30 seconds automatic scaling
'improvement': '97% faster response to demand'
},
'developer_productivity': {
'before': 3, # story points per developer per sprint
'after': 12, # story points per developer per sprint
'improvement': '300% increase in velocity'
}
}
Cost Analysis:
Monthly Infrastructure Costs:
Monolith (Previous):
├── EC2 instances (over-provisioned): $8,500
├── Load balancers: $450
├── Database (single instance): $1,200
├── Monitoring: $200
└── Total: $10,350
Kubernetes (New):
├── EKS cluster: $220
├── Worker nodes (auto-scaled): $4,200
├── Load balancers (ALB): $150
├── Databases (per-service): $1,800
├── Service mesh: $300
├── Monitoring (Prometheus/Grafana): $450
└── Total: $7,120
Monthly Savings: $3,230 (31% reduction)
Annual Savings: $38,760
Problem: Maintaining data consistency without distributed transactions
Solution: Saga Pattern Implementation
// Saga orchestrator for order processing
package saga
import (
"context"
"fmt"
"time"
)
type OrderSaga struct {
orderID string
steps []SagaStep
currentStep int
completed bool
}
type SagaStep struct {
Name string
Execute func(ctx context.Context, data interface{}) error
Compensate func(ctx context.Context, data interface{}) error
}
func NewOrderProcessingSaga(orderID string) *OrderSaga {
return &OrderSaga{
orderID: orderID,
steps: []SagaStep{
{
Name: "ReserveInventory",
Execute: reserveInventory,
Compensate: releaseInventory,
},
{
Name: "ProcessPayment",
Execute: processPayment,
Compensate: refundPayment,
},
{
Name: "CreateShipment",
Execute: createShipment,
Compensate: cancelShipment,
},
{
Name: "UpdateOrderStatus",
Execute: updateOrderStatus,
Compensate: revertOrderStatus,
},
},
}
}
func (s *OrderSaga) Execute(ctx context.Context, data interface{}) error {
for i, step := range s.steps {
s.currentStep = i
err := step.Execute(ctx, data)
if err != nil {
// Compensation: rollback previous steps
for j := i - 1; j >= 0; j-- {
if compErr := s.steps[j].Compensate(ctx, data); compErr != nil {
// Log compensation failure but continue
fmt.Printf("Compensation failed for step %s: %v\n", s.steps[j].Name, compErr)
}
}
return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
}
// Log progress
fmt.Printf("Saga step %s completed successfully\n", step.Name)
}
s.completed = true
return nil
}
func reserveInventory(ctx context.Context, data interface{}) error {
// Call inventory service to reserve items
orderData := data.(*OrderData)
client := inventory.NewClient()
err := client.ReserveItems(ctx, orderData.Items)
if err != nil {
return fmt.Errorf("failed to reserve inventory: %w", err)
}
orderData.InventoryReserved = true
return nil
}
func releaseInventory(ctx context.Context, data interface{}) error {
// Compensate by releasing reserved inventory
orderData := data.(*OrderData)
if orderData.InventoryReserved {
client := inventory.NewClient()
return client.ReleaseItems(ctx, orderData.Items)
}
return nil
}
// Similar implementations for other steps...
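Invoking the saga from the order service looks roughly like this; OrderData and Item stand in for the real order model, which is not shown here:

// Hypothetical caller in the order service.
func processOrder(ctx context.Context, orderID string, items []Item) error {
    data := &OrderData{Items: items}
    saga := NewOrderProcessingSaga(orderID)
    // Each step runs in order; on failure, compensations run in reverse,
    // so no partially processed order survives.
    return saga.Execute(ctx, data)
}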
Problem: Services need to discover and communicate with each other reliably
Solution: Service Mesh + Custom Discovery
// Service discovery client
package discovery
import (
"context"
"sync"
"time"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
type ServiceRegistry struct {
client kubernetes.Interface
services map[string][]ServiceEndpoint
mu sync.RWMutex
updateChan chan ServiceUpdate
}
type ServiceEndpoint struct {
Address string
Port int
Healthy bool
Metadata map[string]string
LastSeen time.Time
}
type ServiceUpdate struct {
ServiceName string
Endpoints []ServiceEndpoint
}
func NewServiceRegistry() (*ServiceRegistry, error) {
config, err := rest.InClusterConfig()
if err != nil {
return nil, err
}
client, err := kubernetes.NewForConfig(config)
if err != nil {
return nil, err
}
sr := &ServiceRegistry{
client: client,
services: make(map[string][]ServiceEndpoint),
updateChan: make(chan ServiceUpdate, 100),
}
go sr.watchServices()
return sr, nil
}
func (sr *ServiceRegistry) GetServiceEndpoints(serviceName string) []ServiceEndpoint {
sr.mu.RLock()
defer sr.mu.RUnlock()
endpoints, exists := sr.services[serviceName]
if !exists {
return nil
}
// Filter healthy endpoints
var healthy []ServiceEndpoint
for _, ep := range endpoints {
if ep.Healthy && time.Since(ep.LastSeen) < 30*time.Second {
healthy = append(healthy, ep)
}
}
return healthy
}
func (sr *ServiceRegistry) watchServices() {
// Consume endpoint updates (produced elsewhere by a Kubernetes watch,
// elided here) and periodically re-check endpoint health
for {
select {
case update := <-sr.updateChan:
sr.mu.Lock()
sr.services[update.ServiceName] = update.Endpoints
sr.mu.Unlock()
case <-time.After(10 * time.Second):
// Periodic health check of services
sr.healthCheckServices()
}
}
}
func (sr *ServiceRegistry) healthCheckServices() {
sr.mu.Lock()
defer sr.mu.Unlock()
for serviceName, endpoints := range sr.services {
for i := range endpoints {
// Perform health check (checkEndpointHealth, an HTTP probe helper, is elided here)
healthy := sr.checkEndpointHealth(endpoints[i])
sr.services[serviceName][i].Healthy = healthy
sr.services[serviceName][i].LastSeen = time.Now()
}
}
}
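Callers then pick from the healthy set. A sketch using naive round-robin (the package-level counter is illustrative, not what we shipped):

package discovery

import (
    "fmt"
    "sync/atomic"
)

var rrCounter uint64

// PickEndpoint returns an address for the named service, rotating over
// the currently healthy endpoints from the registry.
func PickEndpoint(sr *ServiceRegistry, serviceName string) (string, error) {
    eps := sr.GetServiceEndpoints(serviceName)
    if len(eps) == 0 {
        return "", fmt.Errorf("no healthy endpoints for %s", serviceName)
    }
    n := atomic.AddUint64(&rrCounter, 1)
    ep := eps[int(n)%len(eps)]
    return fmt.Sprintf("%s:%d", ep.Address, ep.Port), nil
}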
Problem: Managing configuration across dozens of microservices
Solution: External Secrets Operator + ConfigMap Hierarchy
# External secrets configuration
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secrets-manager
namespace: supply-chain
spec:
provider:
aws:
service: SecretsManager
region: us-west-2
auth:
jwt:
serviceAccountRef:
name: external-secrets-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: database-credentials
namespace: supply-chain
spec:
refreshInterval: 15s
secretStoreRef:
name: aws-secrets-manager
kind: SecretStore
target:
name: database-credentials
creationPolicy: Owner
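# The synced Secret keeps the name (database-credentials) that the deployments reference via secretKeyRef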
data:
- secretKey: url
remoteRef:
key: production/database
property: connection_string
- secretKey: username
remoteRef:
key: production/database
property: username
- secretKey: password
remoteRef:
key: production/database
property: password
---
# Hierarchical configuration with Kustomize
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Base configuration
resources:
- ../base
# Environment-specific patches
patchesStrategicMerge:
- config-patch.yaml
- resource-patch.yaml
# Environment-specific config
configMapGenerator:
- name: app-config
literals:
- ENVIRONMENT=production
- LOG_LEVEL=INFO
- DATABASE_POOL_SIZE=20
- CACHE_TTL=300
- API_RATE_LIMIT=1000
Learning: Begin with non-critical services to build expertise
Learning: Distributed systems require distributed observability
Three Pillars Implementation:
# Metrics (Prometheus)
monitoring:
business_metrics: ["orders/second", "inventory_turnover", "fulfillment_time"]
technical_metrics: ["response_time", "error_rate", "throughput"]
infrastructure_metrics: ["cpu", "memory", "network", "disk"]
# Logs (ELK Stack)
logging:
structured_logging: true
correlation_ids: true
log_levels: ["ERROR", "WARN", "INFO", "DEBUG"]
retention: "30_days"
# Traces (Jaeger)
tracing:
sample_rate: 0.1 # 10% of requests
trace_timeout: "30s"
max_trace_depth: 20
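The correlation IDs above are attached by middleware in every service so that structured log lines from different services can be stitched together. A minimal sketch; the header name is our convention, not a standard:

package middleware

import (
    "net/http"

    "github.com/google/uuid"
)

const correlationHeader = "X-Correlation-ID"

// CorrelationID ensures every request carries an ID that downstream calls
// and structured log lines can reuse to reconstruct a request's path.
func CorrelationID(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get(correlationHeader)
        if id == "" {
            id = uuid.NewString()
        }
        // Echo the ID so clients, logs, and downstream services agree on it.
        w.Header().Set(correlationHeader, id)
        r.Header.Set(correlationHeader, id)
        next.ServeHTTP(w, r)
    })
}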
Learning: Implement security controls from day one
Security Checklist:
├── mTLS everywhere (Istio PeerAuthentication in STRICT mode)
├── Least-privilege service-to-service access (AuthorizationPolicies)
├── Secrets in AWS Secrets Manager, synced via External Secrets Operator
├── EKS secrets encrypted at rest with KMS
├── Container images scanned with Trivy in CI
└── Worker nodes in private subnets only
# Chaos engineering with Chaos Monkey
chaos_engineering:
tools: ["chaos-monkey", "litmus", "gremlin"]
experiments:
- pod_termination
- network_latency
- cpu_stress
- memory_pressure
schedule: "weekly"
blast_radius: "single_service"
# ML model deployment automation
def deploy_ml_model(model_name, version, replicas=3):
"""Deploy an ML model via the custom MLModel operator.

Note: the autoScaling and monitoring fields below assume the MLModel CRD
schema shown earlier is extended to accept them.
"""
ml_deployment = {
'apiVersion': 'ml.supplychain.io/v1',
'kind': 'MLModel',
'metadata': {
'name': f'{model_name}-{version}',
'namespace': 'ml-models'
},
'spec': {
'modelName': model_name,
'modelVersion': version,
'replicas': replicas,
'autoScaling': {
'enabled': True,
'minReplicas': 1,
'maxReplicas': 10,
'targetCPUUtilization': 70
},
'monitoring': {
'enabled': True,
'metricsPath': '/metrics',
'alertRules': [
'prediction_latency_high',
'model_accuracy_degraded'
]
}
}
}
return deploy_to_cluster(ml_deployment)  # assumed helper that applies the manifest via the Kubernetes API
Our Kubernetes transformation journey from monolith to microservices delivered exceptional results: 10x deployment frequency, 99.99% uptime, and 300% developer productivity improvement. The key was treating it as an organizational transformation, not just a technology migration.
The Kubernetes platform now serves as our foundation for innovation, enabling rapid experimentation with new technologies like serverless computing, edge processing, and advanced ML workflows.
2022 taught us that container orchestration success depends more on organizational readiness than on technical complexity: the teams that embraced DevOps practices adapted fastest to the microservices paradigm.
Planning a Kubernetes migration? Let’s connect on LinkedIn to discuss your containerization strategy.