Deployment & DevOps
Microservices only deliver on “independent deployability” when every service ships as a scanned container, runs on Kubernetes with correct probes and resource limits, promotes through CI/CD with GitOps audit trails, and separates deployment (code on servers) from release (users see new behavior) via flags and canaries.
Deployment & DevOps in microservices
Each service owns its container image, Kubernetes manifests, and pipeline—platform teams provide the cluster, golden Helm chart, and GitOps controller; product teams own the application image tag and config values.
The flow from commit to production: build and test the JAR → package into an immutable container image → scan for CVEs → push to registry with a unique tag → update GitOps repo → controller syncs cluster → rolling/canary rollout with probes confirming health. Metrics from Observability → SLOs gate canary promotion; secrets never enter the image—see Security → Secrets.
flowchart LR
BUILD[Build + test] --> SCAN[Image scan]
SCAN --> REG[Registry]
REG --> GIT[GitOps PR]
GIT --> K8S[Kubernetes]
K8S --> REL[Rolling or canary]
Containerization — Docker best practices for Java/Spring Boot
A production image contains only the JRE (or a minimal runtime), your fat JAR, and a non-root user—never Maven, source code, or secrets.
Multi-stage builds
Stage one compiles with JDK + Maven/Gradle; stage two copies only the artifact into a slim runtime image. Build tools and intermediate layers stay out of production—smaller attack surface, faster pulls, no accidental mvn in prod debug sessions.
Non-root user
Create a dedicated app user (UID ≥ 10000) and set USER before ENTRYPOINT, preferably by numeric UID: with runAsNonRoot: true and no runAsUser set, the kubelet cannot verify a named user and refuses to start the container. Pair with Kubernetes securityContext.runAsNonRoot: true and readOnlyRootFilesystem: true where the app allows it (mount emptyDir for /tmp). Root in a container is still root on the node if a kernel escape occurs, so non-root is baseline hygiene.
Base image choice: eclipse-temurin vs distroless
| Base | Pros | Cons |
|---|---|---|
| eclipse-temurin (Alpine or Ubuntu JRE) | Familiar, shell for emergency exec, easy debugging, official OpenJDK builds | Larger than distroless; shell increases attack surface if compromised |
| gcr.io/distroless/java21-debian12 | Minimal packages—no shell, smaller CVE surface, Google-maintained | Harder to kubectl exec debug; must get JVM flags right at build time |
Most Spring Boot teams start with eclipse-temurin:21-jre-alpine for operability; move to distroless when security policy or image size demands it and you rely on logs/traces instead of shell debugging. Container support (-XX:+UseContainerSupport) is enabled by default since JDK 10, so the JVM already respects cgroup memory limits; passing the flag explicitly is optional.
# Runtime stage only; assumes an earlier build stage named "build" that produces the JAR
FROM gcr.io/distroless/java21-debian12:nonroot
COPY --from=build /app/target/order-service.jar /app/app.jar
WORKDIR /app
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]
- Tag images with git commit SHA, not latest
- One process per container—the JVM only; sidecars attach at pod level in K8s
- Never bake secrets into ARG, ENV, or layers—inject at runtime
- Pin base image digests in security-sensitive environments
Dockerfile for Spring Boot — optimized layer caching
Docker rebuilds a layer when any file in that layer changes. Copy dependency descriptors before source so code edits do not invalidate the expensive Maven download layer.
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
# Layer 1: wrapper + POM only; rebuilds when dependencies change
COPY mvnw pom.xml ./
COPY .mvn .mvn
RUN ./mvnw dependency:go-offline -B
# Layer 2: source; rebuilds on every code change
COPY src ./src
RUN ./mvnw package -DskipTests -B

FROM eclipse-temurin:21-jre-alpine
# Numeric UID >= 10000 so Kubernetes runAsNonRoot can verify the user
RUN addgroup -S -g 10001 app && adduser -S -u 10001 -G app app
USER 10001
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]
Why this order matters: developers change src/ dozens of times daily but pom.xml rarely. Without the layer split, every build re-runs dependency:go-offline, and the CI minutes add up across twenty microservices. For Gradle, copy build.gradle, settings.gradle, and the Gradle wrapper first, run ./gradlew dependencies, then copy src/.
Copying the entire repo with COPY . . before Maven invalidates cache on any file change—including README and .git if not in .dockerignore.
Buildpacks — Spring Boot build-image
Spring Boot 2.3+ ships Cloud Native Buildpack integration—no Dockerfile required; Paketo Buildpacks produce layered, OCI-compliant images with sensible defaults.
<!-- pom.xml: spring-boot-maven-plugin; ${git.commit.id.abbrev} assumes the git-commit-id plugin is configured -->
<configuration>
  <image>
    <name>registry.example.com/order-service:${git.commit.id.abbrev}</name>
    <buildpacks>
      <buildpack>paketo-buildpacks/java</buildpack>
    </buildpacks>
  </image>
</configuration>

./mvnw spring-boot:build-image -Dspring-boot.build-image.imageName=order-service:local
Buildpacks automatically: create layered JAR (dependencies, snapshot dependencies, application code separate for cache efficiency), set non-root user, apply JVM memory defaults for containers, and rebuild only changed layers. Trade-off: less control than a hand-written Dockerfile—corporate base-image policies may require custom builders.
Alternatives: Google Jib (Maven/Gradle plugin, no Docker daemon), ko for Go-heavy polyglot shops. Pick one standard per organization—three packaging paths means three CVE patch processes.
For debugging, run docker run --rm -it --entrypoint sh <image> on temurin images (the Alpine variants ship sh, not bash, and the image ENTRYPOINT is java). Buildpack images often have no shell, so design for observability instead.
Image scanning — Trivy and Snyk
Base images and transitive JAR dependencies carry CVEs. Scan every image in CI before push to registry—block critical/high severities on main branch merges.
Trivy (open source)
Aqua’s Trivy scans OS packages, language libraries (including JARs inside the image), and misconfigurations. Runs locally, in CI, and as admission controller in cluster. Fast, no SaaS required—fits GitHub Actions and GitLab CI.
trivy image --severity CRITICAL,HIGH --exit-code 1 \
  registry.example.com/order-service:${GIT_SHA}
Snyk (commercial + free tier)
Snyk Container integrates with registries for continuous monitoring after deploy—not just point-in-time CI scan. Strong IDE and PR comments for dependency upgrades; pairs with Snyk Open Source for Maven pom.xml CVEs before they reach the image. Many enterprises standardize on Snyk for unified Java dependency + container policy.
| Tool | Best for |
|---|---|
| Trivy | Free CI gate, air-gapped, K8s admission, quick local scans |
| Snyk | Developer UX, registry monitoring, org-wide policy dashboards |
| Both | Trivy blocks merge; Snyk tracks drift and opens fix PRs—common combo |
Also scan: dependency CVEs (OWASP Dependency-Check, Snyk) at build stage; secrets (gitleaks, Trivy secret scanner) so API keys never reach the image layer history.
Kubernetes for microservices — core resources
One Deployment per microservice per environment; Services provide stable DNS; ConfigMaps and Secrets inject config; Ingress exposes HTTP north-south.
| Resource | Role in microservices |
|---|---|
| Deployment | Desired replica count, rolling update strategy, pod template (image, env, probes, resources) |
| Service | Stable cluster DNS (order-service.orders.svc.cluster.local) and load balancing to pod IPs |
| ConfigMap | Non-secret config: feature defaults, public URLs, application.yaml overrides |
| Secret | DB passwords, API keys—prefer External Secrets syncing from Vault |
| Ingress | HTTP/S routing from outside cluster to Services—TLS termination, host/path rules |
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  SPRING_PROFILES_ACTIVE: "production"
  LOG_LEVEL: "INFO"
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:abc123f
          envFrom:
            - configMapRef:
                name: order-service-config
          ports:
            - containerPort: 8080
Label pods with app, version, and team: Service selectors, Istio subsets, and Prometheus ServiceMonitors all depend on consistent labels. Internal service-to-service traffic uses ClusterIP Services; only the API gateway or mesh ingress needs public Ingress rules.
Resource requests vs limits — and HPA
Requests drive scheduling; limits cap usage. Mis-set JVM memory limits cause OOMKilled pods; missing CPU requests cause noisy-neighbor scheduling.
Requests vs limits
| Setting | Effect |
|---|---|
| CPU request | Guaranteed CPU share for scheduling; HPA % utilization calculated against request |
| CPU limit | Throttled when exceeded—JVM may stall GC threads under heavy throttle |
| Memory request | Scheduler ensures node has capacity; used for bin-packing |
| Memory limit | Exceeded → kernel OOM-kills the container with SIGKILL; no graceful shutdown, and preStop hooks do not run |
Spring Boot on Java 21: set memory limit ~25% above expected heap + metaspace + native overhead; use -XX:MaxRAMPercentage=75.0 instead of fixed -Xmx so heap scales with cgroup limit. CPU: start with requests: 250m, limits: 1000m for typical REST service—load test and adjust.
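To make the sizing rule concrete, the sketch below estimates the maximum heap the JVM would choose for a given container memory limit under -XX:MaxRAMPercentage, and what remains for metaspace, thread stacks, and other native memory. The class and method names are hypothetical, the 768 MiB limit is only an example value, and the real JVM may round the heap slightly for alignment, so treat the numbers as approximate.

```java
/** Hypothetical helper: estimates heap vs. non-heap budget from a container memory limit. */
final class HeapSizing {
    private HeapSizing() {}

    /** Approximate max heap in MiB for a cgroup limit (MiB) and a -XX:MaxRAMPercentage value. */
    static long maxHeapMiB(long limitMiB, double maxRamPercentage) {
        return (long) Math.floor(limitMiB * maxRamPercentage / 100.0);
    }

    /** Memory left under the limit for metaspace, thread stacks, code cache, direct buffers. */
    static long nonHeapBudgetMiB(long limitMiB, double maxRamPercentage) {
        return limitMiB - maxHeapMiB(limitMiB, maxRamPercentage);
    }

    public static void main(String[] args) {
        long limitMiB = 768; // example value, not a recommendation
        System.out.println("heap ~" + maxHeapMiB(limitMiB, 75.0) + " MiB, non-heap budget ~"
                + nonHeapBudgetMiB(limitMiB, 75.0) + " MiB");
    }
}
```

With a 768 MiB limit and 75 percent, roughly 576 MiB goes to heap and 192 MiB is left for everything else. If measured non-heap usage exceeds that budget, raise the limit or lower the percentage rather than letting the kernel OOM-kill the pod.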
Horizontal Pod Autoscaler (HPA)
HPA adjusts Deployment replica count from metrics—default CPU and memory utilization, or custom metrics (requests/sec, Kafka lag) via Prometheus adapter. Requires requests set on containers; otherwise utilization is undefined.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
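JVM warm-up makes CPU-based HPA unreliable: new pods take time to reach full throughput and burn CPU on JIT compilation while starting, so utilization can reflect startup cost rather than load. Consider custom metrics from Micrometer for request rate. Ensure the cluster node pool can run maxReplicas simultaneously before Black Friday.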
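The scaling decision itself follows the formula documented for the HPA: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The sketch below reproduces that arithmetic for utilization targets like the ones in the manifest above. The class name is hypothetical, and the controller's tolerance band and stabilization windows are deliberately left out.

```java
/** Hypothetical sketch of the core HPA replica calculation (tolerance and stabilization omitted). */
final class HpaMath {
    private HpaMath() {}

    /**
     * @param currentReplicas pods currently running
     * @param currentUtil     observed average utilization, as a percent of the request
     * @param targetUtil      averageUtilization from the HPA spec
     */
    static int desiredReplicas(int currentReplicas, double currentUtil, double targetUtil,
                               int minReplicas, int maxReplicas) {
        int desired = (int) Math.ceil(currentReplicas * (currentUtil / targetUtil));
        // Clamp to the bounds declared in the HPA spec
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 3 pods averaging 140% of their CPU request against a 70% target
        System.out.println(desiredReplicas(3, 140.0, 70.0, 3, 30));
    }
}
```

Because utilization is measured against the request, halving the CPU request doubles the reported utilization for the same load, which is why requests must be set deliberately before enabling autoscaling.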
Readiness vs Liveness probes — consequences of misconfiguration
Probes are how Kubernetes knows when to route traffic and when to restart a container. Wrong probe on wrong endpoint causes cascading outages.
| Probe | Question | On failure | Misconfiguration consequence |
|---|---|---|---|
| Startup | Has JVM finished booting? | Holds off liveness/readiness until it succeeds; container is restarted once failureThreshold is exhausted | Budget too short → container killed mid-startup, ending in CrashLoopBackOff |
| Readiness | Can this pod take traffic? | Removed from Service endpoints | DB check placed on the liveness path instead → unnecessary restarts; never fails when DB is down → traffic to a broken pod |
| Liveness | Is JVM deadlocked? | Container restart | Checks external deps → all pods restart together during dependency outage |
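Spring Boot Actuator (2.3+): /actuator/health/liveness for liveness, /actuator/health/readiness for readiness. By default each group reports only the application's own availability state, so add dependency indicators to readiness explicitly (for example management.endpoint.health.group.readiness.include=readinessState,db) and keep liveness lightweight.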
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
Anti-pattern: using the same heavy health check for liveness and readiness. When PostgreSQL blips, Kubernetes restarts every pod instead of temporarily removing them from load balancing.
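The probe settings translate into time budgets that are worth computing before deploying. As a rough sketch, the budget is failureThreshold × periodSeconds; it ignores initialDelaySeconds and timeoutSeconds, and the type names are hypothetical.

```java
/** Hypothetical sketch: approximate time budgets implied by probe settings. */
final class ProbeBudgets {
    private ProbeBudgets() {}

    /** The two probe fields that determine how long failures are tolerated. */
    record Probe(int failureThreshold, int periodSeconds) {
        /** Approximate seconds of consecutive failures before the kubelet acts. */
        int budgetSeconds() {
            return failureThreshold * periodSeconds;
        }
    }

    public static void main(String[] args) {
        Probe startup = new Probe(30, 10);   // time allowed for the JVM to boot
        Probe readiness = new Probe(3, 5);   // time before the pod leaves Service endpoints
        Probe liveness = new Probe(3, 10);   // time before the container is restarted
        System.out.println("startup=" + startup.budgetSeconds() + "s, readiness="
                + readiness.budgetSeconds() + "s, liveness=" + liveness.budgetSeconds() + "s");
    }
}
```

For the values in the manifest above this gives about 300 seconds to start, about 15 seconds before an unready pod stops receiving traffic, and about 30 seconds before a hung container is restarted. If the service's slowest observed startup approaches the startup budget, raise failureThreshold rather than loosening liveness.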
Pod Disruption Budget — safe rolling updates
PDB limits how many pods can be voluntarily evicted at once (node drains, cluster upgrades, autoscaler scale-down) so you never drop below minimum availability.
Cluster maintenance evicts pods through the Eviction API, which honors the PDB; a Deployment's rolling update does not, and is bounded by maxUnavailable and maxSurge instead. Without a PDB, a node drain during a deploy could take down all three replicas simultaneously. Set both: together they define “how many can be down” during planned disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-service
For three replicas, minAvailable: 2 allows one pod evicted at a time. Alternative: maxUnavailable: 1. Payment and auth services often require stricter budgets than internal admin APIs. Pair with maxUnavailable: 0 and maxSurge: 1 on Deployment for zero-downtime rolls.
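The budget arithmetic can be sketched as follows: the number of evictions currently allowed is the count of healthy pods minus the number that must stay healthy. The class and method names are hypothetical, and the sketch only covers absolute values, not percentages.

```java
/** Hypothetical sketch of how many voluntary evictions a PDB currently permits. */
final class PdbMath {
    private PdbMath() {}

    /** Budget expressed as minAvailable. */
    static int allowedWithMinAvailable(int currentHealthy, int minAvailable) {
        return Math.max(0, currentHealthy - minAvailable);
    }

    /** Budget expressed as maxUnavailable, relative to the expected pod count. */
    static int allowedWithMaxUnavailable(int expectedPods, int currentHealthy, int maxUnavailable) {
        int desiredHealthy = expectedPods - maxUnavailable;
        return Math.max(0, currentHealthy - desiredHealthy);
    }

    public static void main(String[] args) {
        System.out.println(allowedWithMinAvailable(3, 2)); // all three replicas healthy
        System.out.println(allowedWithMinAvailable(2, 2)); // one replica already down
    }
}
```

The second case is the one that matters in practice: once one replica is already unhealthy, a drain blocks until it recovers, which is the intended behavior and not a stuck node.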
Namespace-based isolation per team/environment
Namespaces partition cluster resources—combine with RBAC, NetworkPolicy, and resource quotas for multi-team microservice estates.
Common layouts:
- By environment — orders-dev, orders-staging, orders-prod (separate clusters for prod is stronger isolation)
- By team — team-checkout owns all their services in one namespace
- Hybrid — prod cluster, namespace per team; staging shared with RBAC
Each namespace gets: ResourceQuota (CPU/memory caps), LimitRange (default container limits), NetworkPolicy (deny cross-namespace except from ingress/mesh), and dedicated ServiceAccounts for GitOps deploy roles. Avoid one mega-namespace with forty Deployments—blast radius and RBAC become unmanageable.
Rolling Update — gradual replacement, zero downtime
Kubernetes default: replace pods incrementally while keeping the Service above minimum capacity—old and new code coexist briefly during rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
maxUnavailable: 0 ensures capacity never drops below desired replicas during deploy—new pod must pass readiness before old pod terminates. Requires backward-compatible changes: new code must handle old schema/data during the roll. Rollback: revert GitOps image tag or kubectl rollout undo—previous ReplicaSet still exists.
Best for: stateless REST services, bugfix releases, config changes. Not sufficient alone for high-risk payment logic—add canary metrics gating.
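To see what maxSurge and maxUnavailable imply for capacity, the sketch below computes the pod-count bounds during a roll. When the values are given as percentages, Kubernetes rounds maxSurge up and maxUnavailable down; the class and method names here are hypothetical.

```java
/** Hypothetical sketch of pod-count bounds during a RollingUpdate. */
final class RollingBounds {
    private RollingBounds() {}

    /** maxSurge given as a percentage is rounded up. */
    static int surgeFromPercent(int replicas, int percent) {
        return (int) Math.ceil(replicas * percent / 100.0);
    }

    /** maxUnavailable given as a percentage is rounded down. */
    static int unavailableFromPercent(int replicas, int percent) {
        return (int) Math.floor(replicas * percent / 100.0);
    }

    /** Upper bound on old + new pods that exist at once. */
    static int maxPodsDuringRoll(int replicas, int maxSurge) {
        return replicas + maxSurge;
    }

    /** Lower bound on ready pods while the roll is in progress. */
    static int minAvailableDuringRoll(int replicas, int maxUnavailable) {
        return replicas - maxUnavailable;
    }

    public static void main(String[] args) {
        int replicas = 3;
        System.out.println("max pods: " + maxPodsDuringRoll(replicas, 1)
                + ", min available: " + minAvailableDuringRoll(replicas, 0));
    }
}
```

With three replicas, maxSurge: 1 and maxUnavailable: 0, the namespace needs quota for four pods during the roll while capacity never drops below three.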
Blue-Green — two production environments, instant switch
Run two full stacks (blue = current, green = new); flip Service selector or Ingress target in one step—rollback is another instant flip.
flowchart LR
IN[Ingress / Service] -->|100 percent| BLUE[Blue v1.4]
IN -.->|after switch| GREEN[Green v1.5]
GREEN --> TEST[Smoke tests on green]
TEST --> IN
Implementation options: two Deployments (order-service-blue, order-service-green) with Service label switch; Argo Rollouts blueGreen strategy; or duplicate entire namespace for large releases. Green environment receives synthetic traffic or internal-only DNS before public cutover.
Pros: instant rollback, clear before/after, good for demos and schema-compatible major versions. Cons: doubles resource cost during cutover; database migrations still need expand/contract—both colors must tolerate schema during switch window.
Blue-green does not replace database migration discipline—if green expects a column blue lacks, the switch fails regardless of routing speed.
Canary — percentage-based traffic split, monitor, expand or rollback
Route a small fraction of real traffic to the new version; watch error rate and latency; increase weight if SLOs hold, rollback if not.
flowchart TB
DEP[Deploy v2 pods] --> CAN[5 percent traffic to v2]
CAN --> MET[Monitor SLO 15 min]
MET -->|pass| UP[25 then 50 then 100 percent]
MET -->|fail| RB[Rollback to v1]
Tools: Argo Rollouts (canary steps + Prometheus analysis), Flagger (Istio/NGINX weight adjustment), Istio VirtualService weights—see Service Mesh → Canary. Gate on business metrics too—checkout conversion drop aborts even if HTTP 5xx looks fine.
Start with header-based canaries (internal QA on v2) before random user percentage. Automated rollback beats human watching Grafana at 2 a.m.—wire alerts to rollout abort, not just Slack noise.
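The promote-or-rollback loop can be sketched in a few lines. This is a simplified illustration, not how Argo Rollouts or Flagger implement analysis: the type names, the two metrics, and the thresholds are all assumptions, and a real gate would query Prometheus over an observation window at each step.

```java
import java.util.List;

/** Hypothetical sketch of a stepwise canary gate driven by SLO checks. */
final class CanaryGate {
    private CanaryGate() {}

    enum Decision { PROMOTE, ROLLBACK }

    /** Metrics observed for the canary at one traffic weight (assumed shape). */
    record Metrics(double errorRate, double p99LatencyMs) {}

    /** Thresholds the canary must stay within (assumed shape). */
    record Slo(double maxErrorRate, double maxP99LatencyMs) {}

    static Decision evaluate(Metrics m, Slo slo) {
        boolean healthy = m.errorRate() <= slo.maxErrorRate()
                && m.p99LatencyMs() <= slo.maxP99LatencyMs();
        return healthy ? Decision.PROMOTE : Decision.ROLLBACK;
    }

    /**
     * Walks the weight steps in order; observed.get(i) is what was measured at steps.get(i).
     * Returns the final traffic weight for v2, where 0 means the rollout was aborted.
     */
    static int run(List<Integer> steps, List<Metrics> observed, Slo slo) {
        int weight = 0;
        for (int i = 0; i < steps.size(); i++) {
            weight = steps.get(i);
            if (evaluate(observed.get(i), slo) == Decision.ROLLBACK) {
                return 0; // abort: all traffic back to v1
            }
        }
        return weight;
    }

    public static void main(String[] args) {
        Slo slo = new Slo(0.01, 500.0);
        List<Metrics> observed = List.of(new Metrics(0.002, 310), new Metrics(0.03, 320));
        System.out.println(run(List.of(5, 25), observed, slo));
    }
}
```

Business metrics such as checkout conversion would slot in as further fields on the metrics and SLO records; the structure of the gate stays the same.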
Feature Flags — decouple deployment from release
Deploy code to production with features dark; enable per user, tenant, or percentage via flag service—ship Tuesday, release Thursday without redeploy.
Deployment = new binary running in pods. Release = users experience new behavior. Feature flags close the gap: merge incomplete work behind if (flags.isEnabled("new-checkout")), deploy daily, enable when tested. Kill switch disables bad features without rollback deploy.
| Product | Notes |
|---|---|
| LaunchDarkly | SaaS, rich targeting (user, segment, geo), experimentation, enterprise SSO |
| Unleash | Open source, self-hosted option, Spring Boot SDK, gradual rollout strategies |
| Flagsmith / ConfigCat | Alternatives for cost-sensitive or EU data residency requirements |
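import io.getunleash.Unleash; // package name in recent Unleash Java SDKs
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
public class CheckoutController {
    private final Unleash unleash;
    private final ExpressFlow expressFlow; // ExpressFlow/LegacyFlow: assumed type names, omitted in the original
    private final LegacyFlow legacyFlow;
    public CheckoutController(Unleash unleash, ExpressFlow expressFlow, LegacyFlow legacyFlow) {
        this.unleash = unleash;
        this.expressFlow = expressFlow;
        this.legacyFlow = legacyFlow;
    }
    @PostMapping("/checkout")
    public ResponseEntity<?> checkout(@RequestBody CartDto cart) {
        if (unleash.isEnabled("express-checkout-v2")) {
            return expressFlow.checkout(cart);
        }
        return legacyFlow.checkout(cart);
    }
}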
Flag hygiene: remove stale flags after full rollout; avoid hundreds of permanent toggles becoming undeletable debt. Do not use flags for secrets or security boundaries—use proper authZ.
A/B Testing — canary with user segmentation
A/B testing is canary deployment plus experiment design: stable user cohorts see variant A or B; measure conversion, revenue, or latency—not just error rates.
Technical stack overlaps canary (traffic split) but adds: consistent assignment (same user always sees B), statistical power (sample size, duration), and product metrics (click-through, order completion). LaunchDarkly and Unleash both support percentage rollouts with sticky user keys; dedicated tools (Optimizely, internal experiment platforms) add Bayesian analysis and guardrails.
Pattern: flag service assigns experiment.checkout-v2 = control | treatment based on user ID hash; observability tags spans and logs with experiment variant for downstream analysis in warehouse or Grafana. Coordinate with legal/privacy—experiments on EU users may need consent.
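Sticky assignment from a user ID hash can be sketched as below. This is a minimal illustration assuming CRC32 as the hash, chosen only because it ships with the JDK; it is not the algorithm LaunchDarkly or Unleash use, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch of deterministic experiment assignment by user ID hash. */
final class ExperimentAssigner {
    private ExperimentAssigner() {}

    /** Maps (experiment, user) to a stable bucket in the range 0..99. */
    static int bucket(String experiment, String userId) {
        CRC32 crc = new CRC32();
        // Salting with the experiment name keeps buckets independent across experiments
        crc.update((experiment + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** Users whose bucket falls below treatmentPercent see the treatment variant. */
    static String variant(String experiment, String userId, int treatmentPercent) {
        return bucket(experiment, userId) < treatmentPercent ? "treatment" : "control";
    }

    public static void main(String[] args) {
        // The same user gets the same answer on every call and every pod
        System.out.println(variant("checkout-v2", "user-42", 50));
        System.out.println(variant("checkout-v2", "user-42", 50));
    }
}
```

Because the assignment depends only on the inputs, no per-user state has to be stored, and raising treatmentPercent only moves users from control to treatment, never the other way.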
Contrast canary (ops safety—5xx rate) vs A/B (product hypothesis—conversion). Same infra, different success criteria and duration.
CI/CD pipeline — build → test → scan → package → deploy
Per-microservice pipeline with mandatory gates; failed scan or contract test blocks prod promotion—no manual SSH deploys.
flowchart LR
B[Build JAR] --> T[Test]
T --> S[Scan deps + image]
S --> P[Package push registry]
P --> D[Deploy via GitOps]
| Stage | Actions | Fail criteria |
|---|---|---|
| Build | Maven/Gradle compile, unit tests fast path | Compile error, unit test failure |
| Test | Integration (Testcontainers), Pact contract tests | Broken API contract, flaky tests quarantined not ignored |
| Scan | Snyk/Trivy deps, Trivy image, gitleaks secrets | Critical CVE, secret in repo |
| Package | Docker build or spring-boot:build-image, push with git SHA tag | Image push failure, unsigned image policy |
| Deploy | Update GitOps repo / trigger Argo CD; smoke test staging | Smoke test fail, sync error, SLO burn in canary |
Use OIDC federation for cloud/registry auth—no long-lived AWS keys in GitHub Secrets. Pipeline templates (reusable workflows) ensure service #47 gets the same gates as service #1. Contract tests from Service Design → Pact run on consumer and provider pipelines.
GitOps — Argo CD / Flux, declarative, Git as source of truth
Cluster state lives in Git; controllers continuously reconcile live resources to match—every prod change is a reviewed PR with full audit history.
| Tool | Characteristics |
|---|---|
| Argo CD | UI-rich, ApplicationSet for multi-service bootstrap, strong enterprise adoption, sync waves/hooks |
| Flux | GitOps Toolkit, native Helm/Kustomize controllers, lightweight, CNCF graduated |
flowchart LR
DEV[Developer merges PR] --> GIT[Git manifest repo]
CI[CI pushes image] --> GIT
GIT --> CTRL[Argo CD or Flux]
CTRL --> CLUSTER[Kubernetes cluster]
CI builds image and opens PR bumping image.tag in apps/order-service/overlays/prod/values.yaml—human or bot merge triggers sync. Drift detection alerts when someone kubectl-edits prod; rollback = git revert. Secrets via SOPS, Sealed Secrets, or External Secrets—never plain text in Git.
Helm charts — templating Kubernetes manifests
Helm packages Deployment, Service, Ingress, ConfigMap, HPA, and PDB into one chart with environment-specific values—platform golden chart, team fills values.
replicaCount: 3
image:
  repository: registry.example.com/order-service
  tag: "abc123f" # set by CI PR
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 768Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 30
ingress:
  enabled: true
  host: api.example.com
  path: /orders
Chart templates use {{ .Values.image.tag }} for the image; run helm upgrade --install in CI for ephemeral envs and use the Argo CD Helm source for prod. Version the chart (packaging) with semver independently of the app version, and document breaking chart changes. Kustomize is an alternative for teams preferring patches over Go templates; many repos use Helm for third-party charts (Kafka, Prometheus) and Kustomize for internal apps.
Semantic versioning + conventional commits for automated releases
Semver (MAJOR.MINOR.PATCH) communicates breaking vs additive vs fix changes; conventional commits feed tools that bump version and generate changelogs automatically.
Conventional commits format: feat(checkout): add express payment, fix(inventory): null reserve response, feat! or BREAKING CHANGE: footer for major bumps. Tools: semantic-release, release-please, git-cliff—parse commits on main merge, compute next version, tag Git, publish GitHub Release, optionally bump Helm chart appVersion.
| Bump | When | Example |
|---|---|---|
| PATCH | Bug fix, backward compatible | 1.4.2 → 1.4.3 |
| MINOR | New feature, backward compatible | 1.4.3 → 1.5.0 |
| MAJOR | Breaking API or behavior | 1.5.0 → 2.0.0 |
Container images still tag with git SHA for immutable deploy traceability—semver tags the release artifact and changelog; prod manifest pins SHA: image: order-service:abc123f (release v1.5.0). Consumer-driven contracts catch breaking changes before semver major is needed unexpectedly.
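The mapping from commit messages to a version bump can be sketched as follows. It handles only a simplified subset of the conventional commits format (feat, fix, the ! marker, and a BREAKING CHANGE: footer) and is not how semantic-release or release-please are implemented; the class and method names are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch: derive the next semver from conventional commit messages. */
final class SemverBump {
    private SemverBump() {}

    // Declared in ascending order so compareTo picks the strongest bump
    enum Bump { NONE, PATCH, MINOR, MAJOR }

    static Bump classify(String commitMessage) {
        String header = commitMessage.lines().findFirst().orElse("");
        boolean breaking = header.matches("^\\w+(\\([^)]*\\))?!:.*")
                || commitMessage.contains("\nBREAKING CHANGE:");
        if (breaking) return Bump.MAJOR;
        if (header.matches("^feat(\\([^)]*\\))?:.*")) return Bump.MINOR;
        if (header.matches("^fix(\\([^)]*\\))?:.*")) return Bump.PATCH;
        return Bump.NONE; // docs, chore, refactor, etc. do not trigger a release here
    }

    /** Applies the strongest bump found in the commits to a MAJOR.MINOR.PATCH version. */
    static String next(String current, List<String> commits) {
        Bump highest = Bump.NONE;
        for (String commit : commits) {
            Bump b = classify(commit);
            if (b.compareTo(highest) > 0) highest = b;
        }
        String[] parts = current.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        return switch (highest) {
            case MAJOR -> (major + 1) + ".0.0";
            case MINOR -> major + "." + (minor + 1) + ".0";
            case PATCH -> major + "." + minor + "." + (patch + 1);
            case NONE -> current;
        };
    }

    public static void main(String[] args) {
        System.out.println(next("1.4.3", List.of("feat(checkout): add express payment")));
    }
}
```

Taking the strongest bump across all commits since the last tag matches the table above: one feat among several fixes yields a single MINOR release, not several PATCH releases.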
Environment promotion — dev → staging → prod gates
Same artifact (image digest) promotes through environments with increasing scrutiny—never rebuild differently for prod.
flowchart LR
DEV[Dev auto deploy] --> STG[Staging soak + e2e]
STG --> APP{Approval gate}
APP --> PROD[Prod canary or sync]
| Environment | Purpose | Gate |
|---|---|---|
| Dev | Feature branch previews, fast feedback | CI green only; synthetic data |
| Staging | Prod-like config, integration e2e, load smoke | Automated smoke + Pact verify; optional manual QA sign-off |
| Prod | Real traffic | Change advisory for risky windows; canary SLO gate; on-call aware |
Promote by merging GitOps PR that updates staging → prod overlay image tag—same SHA tested in staging runs in prod. Staging should mirror prod topology (mesh, HPA, probe config) at smaller scale—"works in dev" with H2 database does not predict prod Kafka behavior. Manual approval step in pipeline (GitHub Environments, GitLab protected environments) for prod; dev/staging auto-sync on merge.
Teams often deploy to staging on every main merge; prod promote twice weekly unless hotfix—balance velocity with change fatigue and on-call load.