Deployment & DevOps

Deployment & DevOps in microservices

Each service owns its container image, Kubernetes manifests, and pipeline—platform teams provide the cluster, golden Helm chart, and GitOps controller; product teams own the application image tag and config values.

The flow from commit to production: build and test the JAR → package into an immutable container image → scan for CVEs → push to registry with a unique tag → update GitOps repo → controller syncs cluster → rolling/canary rollout with probes confirming health. SLO metrics gate canary promotion (see Observability → SLOs); secrets never enter the image (see Security → Secrets).

flowchart LR
  BUILD[Build + test] --> SCAN[Image scan]
  SCAN --> REG[Registry]
  REG --> GIT[GitOps PR]
  GIT --> K8S[Kubernetes]
  K8S --> REL[Rolling or canary]

Containerization — Docker best practices for Java/Spring Boot

A production image contains only the JRE (or a minimal runtime), your fat JAR, and a non-root user—never Maven, source code, or secrets.

Multi-stage builds

Stage one compiles with JDK + Maven/Gradle; stage two copies only the artifact into a slim runtime image. Build tools and intermediate layers stay out of production—smaller attack surface, faster pulls, no accidental mvn in prod debug sessions.

Non-root user

Create a dedicated app user (UID ≥ 10000) and set USER app before ENTRYPOINT. Pair with Kubernetes securityContext.runAsNonRoot: true and readOnlyRootFilesystem: true where the app allows it (mount emptyDir for /tmp). Root in a container is still root on the node if a kernel escape occurs—non-root is baseline hygiene.

Base image choice: eclipse-temurin vs distroless

Base	Pros	Cons
eclipse-temurin (Alpine or Ubuntu JRE)	Familiar, shell for emergency exec, easy debugging, official OpenJDK builds	Larger than distroless; shell increases attack surface if compromised
gcr.io/distroless/java21-debian12	Minimal packages—no shell, smaller CVE surface, Google-maintained	Harder to kubectl exec debug; must get JVM flags right at build time

Most Spring Boot teams start with eclipse-temurin:21-jre-alpine for operability, then move to distroless when security policy or image size demands it and they rely on logs and traces instead of shell debugging. -XX:+UseContainerSupport is enabled by default since JDK 10 (and 8u191), so the JVM already respects cgroup memory limits; passing the flag explicitly is harmless but not required.

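# Runtime stage only; assumes an earlier stage named "build" produced the JAR
FROM gcr.io/distroless/java21-debian12:nonroot
WORKDIR /app
COPY --from=build /app/target/order-service.jar /app/app.jar
# Heap scales with the container memory limit instead of a fixed -Xmx
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]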

Tag images with git commit SHA, not latest
One process per container—the JVM only; sidecars attach at pod level in K8s
Never bake secrets into ARG, ENV, or layers—inject at runtime
Pin base image digests in security-sensitive environments

Dockerfile for Spring Boot — optimized layer caching

Docker rebuilds a layer when any file in that layer changes. Copy dependency descriptors before source so code edits do not invalidate the expensive Maven download layer.

FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app

# Layer 1: wrapper + POM only — rebuilds when dependencies change
COPY mvnw pom.xml ./
COPY .mvn .mvn
RUN ./mvnw dependency:go-offline -B

# Layer 2: source — rebuilds on every code change
COPY src ./src
RUN ./mvnw package -DskipTests -B

FROM eclipse-temurin:21-jre-alpine
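# Fixed numeric UID/GID (>= 10000) so Kubernetes runAsNonRoot can verify the user is not root
RUN addgroup -S -g 10001 app && adduser -S -u 10001 -G app app
USER 10001:10001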
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]

Why this order matters: developers change src/ dozens of times a day but touch pom.xml rarely. Without the layer split, every code change invalidates the dependency layer and the build downloads the whole dependency tree again; those CI minutes add up across twenty microservices. For Gradle, copy build.gradle, settings.gradle, and the Gradle wrapper first, run ./gradlew dependencies, then copy src/.
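The invalidation rule behind this ordering can be sketched as a toy model. It assumes the simplified rule that a changed layer invalidates itself and every later layer; the class name LayerCache and the layer labels are illustrative and not part of Docker.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of Docker layer caching: a changed layer invalidates itself and all later layers. */
final class LayerCache {
    /**
     * @param layers  layer labels in Dockerfile order
     * @param changed labels of layers whose inputs changed since the last build
     * @return the layers that must be rebuilt
     */
    static List<String> rebuilt(List<String> layers, List<String> changed) {
        List<String> result = new ArrayList<>();
        boolean invalidated = false;
        for (String layer : layers) {
            if (changed.contains(layer)) {
                invalidated = true; // everything from here on misses the cache
            }
            if (invalidated) {
                result.add(layer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> split = List.of("COPY pom.xml", "RUN go-offline", "COPY src", "RUN package");
        // Only source changed: the dependency layers stay cached.
        System.out.println(rebuilt(split, List.of("COPY src")));

        List<String> single = List.of("COPY . .", "RUN package");
        // Any file change touches the single COPY layer, so every layer rebuilds.
        System.out.println(rebuilt(single, List.of("COPY . .")));
    }
}
```

With the split layout, a source edit rebuilds only the last two layers; with a single COPY, nothing is reused.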

⚠️ Pitfall

Copying the entire repo with COPY . . before Maven invalidates cache on any file change—including README and .git if not in .dockerignore.

Buildpacks — Spring Boot build-image

Spring Boot 2.3+ ships Cloud Native Buildpack integration—no Dockerfile required; Paketo Buildpacks produce layered, OCI-compliant images with sensible defaults.

<!-- pom.xml: spring-boot-maven-plugin; git.commit.id.abbrev assumes the git-commit-id plugin is configured -->
<configuration>
  <image>
    <name>registry.example.com/order-service:${git.commit.id.abbrev}</name>
    <buildpacks>
      <buildpack>paketo-buildpacks/java</buildpack>
    </buildpacks>
  </image>
</configuration>

./mvnw spring-boot:build-image -Dspring-boot.build-image.imageName=order-service:local

Buildpacks automatically: create layered JAR (dependencies, snapshot dependencies, application code separate for cache efficiency), set non-root user, apply JVM memory defaults for containers, and rebuild only changed layers. Trade-off: less control than a hand-written Dockerfile—corporate base-image policies may require custom builders.

Alternatives: Google Jib (Maven/Gradle plugin, no Docker daemon), ko for Go-heavy polyglot shops. Pick one standard per organization—three packaging paths means three CVE patch processes.

💡 Pro Tip

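Run docker run --rm -it --entrypoint sh <image> on temurin images for debugging (sh is present on both the Alpine and Ubuntu variants; bash is not guaranteed on Alpine). Buildpack images often have no shell, so design for observability instead.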

Image scanning — Trivy and Snyk

Base images and transitive JAR dependencies carry CVEs. Scan every image in CI before push to registry—block critical/high severities on main branch merges.

Trivy (open source)

Aqua’s Trivy scans OS packages, language libraries (including JARs inside the image), and misconfigurations. Runs locally, in CI, and as admission controller in cluster. Fast, no SaaS required—fits GitHub Actions and GitLab CI.

trivy image --severity CRITICAL,HIGH --exit-code 1 \
  registry.example.com/order-service:${GIT_SHA}

Snyk (commercial + free tier)

Snyk Container integrates with registries for continuous monitoring after deploy—not just point-in-time CI scan. Strong IDE and PR comments for dependency upgrades; pairs with Snyk Open Source for Maven pom.xml CVEs before they reach the image. Many enterprises standardize on Snyk for unified Java dependency + container policy.

Tool	Best for
Trivy	Free CI gate, air-gapped, K8s admission, quick local scans
Snyk	Developer UX, registry monitoring, org-wide policy dashboards
Both	Trivy blocks merge; Snyk tracks drift and opens fix PRs—common combo

Also scan: dependency CVEs (OWASP Dependency-Check, Snyk) at build stage; secrets (gitleaks, Trivy secret scanner) so API keys never reach the image layer history.

Kubernetes for microservices — core resources

One Deployment per microservice per environment; Services provide stable DNS; ConfigMaps and Secrets inject config; Ingress exposes HTTP north-south.

Resource	Role in microservices
Deployment	Desired replica count, rolling update strategy, pod template (image, env, probes, resources)
Service	Stable cluster DNS (order-service.orders.svc.cluster.local) and load balancing to pod IPs
ConfigMap	Non-secret config: feature defaults, public URLs, application.yaml overrides
Secret	DB passwords, API keys—prefer External Secrets syncing from Vault
Ingress	HTTP/S routing from outside cluster to Services—TLS termination, host/path rules

apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  SPRING_PROFILES_ACTIVE: "production"
  LOG_LEVEL: "INFO"
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:abc123f
          envFrom:
            - configMapRef:
                name: order-service-config
          ports:
            - containerPort: 8080

Label pods with app, version, and team. Service selectors, Istio subsets, and Prometheus ServiceMonitors all depend on consistent labels. Internal service-to-service traffic uses ClusterIP Services; only the API gateway or mesh ingress needs public Ingress rules.

Resource requests vs limits — and HPA

Requests drive scheduling; limits cap usage. Mis-set JVM memory limits cause OOMKilled pods; missing CPU requests cause noisy-neighbor scheduling.

Requests vs limits

Setting	Effect
CPU request	Guaranteed CPU share for scheduling; HPA % utilization calculated against request
CPU limit	Throttled when exceeded—JVM may stall GC threads under heavy throttle
Memory request	Scheduler ensures node has capacity; used for bin-packing
Memory limit	Exceeded → kernel OOM-kills the container with SIGKILL; no graceful shutdown, and preStop hooks do not run

Spring Boot on Java 21: set memory limit ~25% above expected heap + metaspace + native overhead; use -XX:MaxRAMPercentage=75.0 instead of fixed -Xmx so heap scales with cgroup limit. CPU: start with requests: 250m, limits: 1000m for typical REST service—load test and adjust.
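As a rough worked example of the sizing rule, the sketch below splits a container memory limit into heap and non-heap budgets. It assumes the JVM sizes its maximum heap at approximately limit × MaxRAMPercentage; the class HeapSizing and the 768 MiB figure are illustrative, and real non-heap overhead has to be measured per service.

```java
/** Back-of-envelope JVM memory split for a container limit (approximation, not a JVM API). */
final class HeapSizing {
    /** Approximate max heap for a cgroup limit under -XX:MaxRAMPercentage. */
    static long maxHeapMiB(long limitMiB, double maxRamPercentage) {
        return (long) (limitMiB * maxRamPercentage / 100.0);
    }

    /** Memory left for metaspace, thread stacks, code cache, and direct buffers. */
    static long nonHeapBudgetMiB(long limitMiB, double maxRamPercentage) {
        return limitMiB - maxHeapMiB(limitMiB, maxRamPercentage);
    }

    public static void main(String[] args) {
        long limitMiB = 768; // hypothetical container memory limit
        System.out.println("max heap MiB = " + maxHeapMiB(limitMiB, 75.0));       // 576
        System.out.println("non-heap MiB = " + nonHeapBudgetMiB(limitMiB, 75.0)); // 192
    }
}
```

If profiling shows the service needs more than the non-heap budget, raise the limit or lower the percentage rather than letting the kernel OOM-kill the pod.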

Horizontal Pod Autoscaler (HPA)

HPA adjusts Deployment replica count from metrics—default CPU and memory utilization, or custom metrics (requests/sec, Kafka lag) via Prometheus adapter. Requires requests set on containers; otherwise utilization is undefined.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

JVM warm-up makes CPU-based HPA laggy—consider custom metrics from Micrometer for request rate. Ensure cluster node pool can run maxReplicas simultaneously before Black Friday.
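The replica calculation behind the manifest can be sketched as follows. This is a simplified model of the formula in the Kubernetes HPA documentation (desired = ceil(current × currentMetric / targetMetric)), assuming a single metric and the default 10% tolerance; HpaMath is an illustrative name, not a Kubernetes API.

```java
/** Simplified model of the HPA replica formula for one metric. */
final class HpaMath {
    static final double TOLERANCE = 0.10; // assumed default; configurable on the controller

    static int desiredReplicas(int current, double currentUtil, double targetUtil,
                               int minReplicas, int maxReplicas) {
        double ratio = currentUtil / targetUtil;
        int desired = current;
        if (Math.abs(ratio - 1.0) > TOLERANCE) {
            // Outside the tolerance band: scale proportionally and round up.
            desired = (int) Math.ceil(current * ratio);
        }
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 3 pods averaging 105% of their CPU request, target 70%.
        System.out.println(desiredReplicas(3, 105, 70, 3, 30)); // 5
        // 72% vs 70% is inside the tolerance band, so no change.
        System.out.println(desiredReplicas(4, 72, 70, 3, 30));  // 4
    }
}
```

When several metrics are configured, as in the manifest with CPU and memory, the controller evaluates each one and uses the largest result.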

Readiness vs Liveness probes — consequences of misconfiguration

Probes are how Kubernetes knows when to route traffic and when to restart a container. Wrong probe on wrong endpoint causes cascading outages.

Probe	Question	On failure	Misconfiguration consequence
Startup	Has JVM finished booting?	Blocks liveness/readiness until success; container is killed if it never succeeds	Budget (failureThreshold × periodSeconds) too short → container killed mid-startup, ending in CrashLoopBackOff
Readiness	Can this pod take traffic?	Removed from Service endpoints	Shares the liveness path: a DB check there causes unnecessary restarts, while a check that never fails when the DB is down keeps sending traffic to a broken pod
Liveness	Is JVM deadlocked?	Container restart	Checks external deps → all pods restart together during dependency outage

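Spring Boot Actuator (2.3+): use /actuator/health/liveness for liveness and /actuator/health/readiness for readiness. Both endpoints are enabled automatically when the app detects Kubernetes, or with management.endpoint.health.probes.enabled=true. By default each group reports only the application's own availability state; add dependency indicators (for example db, or a Kafka indicator if your stack provides one) to the readiness group via management.endpoint.health.group.readiness.include, and keep liveness lightweight.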

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
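The time budgets implied by these settings can be worked out with a small helper. It is a rough upper bound that ignores probe timeouts and initial delays; ProbeBudget is an illustrative name.

```java
/** Rough time budget before Kubernetes acts on consecutive probe failures. */
final class ProbeBudget {
    /** failureThreshold consecutive failures, one probe every periodSeconds. */
    static int seconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    public static void main(String[] args) {
        System.out.println("startup budget:   " + seconds(30, 10) + "s"); // 300s to finish booting
        System.out.println("readiness budget: " + seconds(3, 5) + "s");   // 15s before leaving endpoints
        System.out.println("liveness budget:  " + seconds(3, 10) + "s");  // 30s before restart
    }
}
```

A service whose cold start can exceed the startup budget needs a higher failureThreshold, not a looser liveness probe.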

🚫 Anti-Pattern

Using the same heavy health check for liveness and readiness—when PostgreSQL blips, Kubernetes restarts every pod instead of temporarily removing them from load balancing.

Pod Disruption Budget — safe rolling updates

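A PDB limits how many pods can be voluntarily evicted at once (node drains, cluster upgrades, autoscaler scale-down) so planned maintenance never drops a service below minimum availability.

Evictions during cluster maintenance go through the Eviction API, which honors the PDB. A Deployment's own rolling update does not consult the PDB: it is bounded by the strategy's maxUnavailable and maxSurge. Without a PDB, draining several nodes during a deploy could take down all three replicas at once. Set both: the PDB covers node-level disruption, the rollout strategy covers deploys, and together they define "how many can be down" during planned disruption.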

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-service

For three replicas, minAvailable: 2 allows one pod evicted at a time. Alternative: maxUnavailable: 1. Payment and auth services often require stricter budgets than internal admin APIs. Pair with maxUnavailable: 0 and maxSurge: 1 on Deployment for zero-downtime rolls.
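The eviction arithmetic can be sketched as below. It assumes the usual rule that allowed disruptions equal currently healthy pods minus the number required to stay healthy, floored at zero; PdbMath and its method names are illustrative, not a Kubernetes API.

```java
/** Sketch of how many voluntary evictions a PDB permits at a given moment. */
final class PdbMath {
    /** Budget expressed as minAvailable. */
    static int allowedWithMinAvailable(int healthyPods, int minAvailable) {
        return Math.max(0, healthyPods - minAvailable);
    }

    /** Budget expressed as maxUnavailable, relative to the desired replica count. */
    static int allowedWithMaxUnavailable(int desiredPods, int healthyPods, int maxUnavailable) {
        int requiredHealthy = desiredPods - maxUnavailable;
        return Math.max(0, healthyPods - requiredHealthy);
    }

    public static void main(String[] args) {
        System.out.println(allowedWithMinAvailable(3, 2));      // 1: one pod may be evicted
        System.out.println(allowedWithMinAvailable(2, 2));      // 0: drain blocks until a pod recovers
        System.out.println(allowedWithMaxUnavailable(3, 3, 1)); // 1: equivalent budget for 3 replicas
    }
}
```

The second case is why a node drain can appear to hang: the eviction is refused until a replacement pod becomes ready elsewhere.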

Namespace-based isolation per team/environment

Namespaces partition cluster resources—combine with RBAC, NetworkPolicy, and resource quotas for multi-team microservice estates.

Common layouts:

By environment — orders-dev, orders-staging, orders-prod (separate clusters for prod is stronger isolation)
By team — team-checkout owns all their services in one namespace
Hybrid — prod cluster, namespace per team; staging shared with RBAC

Each namespace gets: ResourceQuota (CPU/memory caps), LimitRange (default container limits), NetworkPolicy (deny cross-namespace except from ingress/mesh), and dedicated ServiceAccounts for GitOps deploy roles. Avoid one mega-namespace with forty Deployments—blast radius and RBAC become unmanageable.

Rolling Update — gradual replacement, zero downtime

Kubernetes default: replace pods incrementally while keeping the Service above minimum capacity—old and new code coexist briefly during rollout.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

maxUnavailable: 0 ensures capacity never drops below desired replicas during deploy—new pod must pass readiness before old pod terminates. Requires backward-compatible changes: new code must handle old schema/data during the roll. Rollback: revert GitOps image tag or kubectl rollout undo—previous ReplicaSet still exists.

Best for: stateless REST services, bugfix releases, config changes. Not sufficient alone for high-risk payment logic—add canary metrics gating.
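The capacity bounds during a rollout follow directly from maxSurge and maxUnavailable. The sketch below also shows the rounding Kubernetes documents for percentage values (maxSurge rounds up, maxUnavailable rounds down); RollingUpdateMath is an illustrative name.

```java
/** Pod-count bounds during a RollingUpdate, including percentage rounding. */
final class RollingUpdateMath {
    /** maxSurge given as a percentage is rounded up. */
    static int surgeFromPercent(int replicas, int percent) {
        return (int) Math.ceil(replicas * percent / 100.0);
    }

    /** maxUnavailable given as a percentage is rounded down. */
    static int unavailableFromPercent(int replicas, int percent) {
        return (int) Math.floor(replicas * percent / 100.0);
    }

    /** Upper bound on old + new pods existing at once. */
    static int maxPods(int replicas, int maxSurge) {
        return replicas + maxSurge;
    }

    /** Lower bound on ready pods during the rollout. */
    static int minReady(int replicas, int maxUnavailable) {
        return replicas - maxUnavailable;
    }

    public static void main(String[] args) {
        System.out.println(maxPods(3, 1));                 // 4 pods at peak
        System.out.println(minReady(3, 0));                // 3 ready at all times
        System.out.println(surgeFromPercent(10, 25));      // 3
        System.out.println(unavailableFromPercent(10, 25)); // 2
    }
}
```

The peak pod count is what the namespace quota and node pool must have headroom for; with maxSurge: 1 the rollout needs capacity for one extra pod.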

Blue-Green — two production environments, instant switch

Run two full stacks (blue = current, green = new); flip Service selector or Ingress target in one step—rollback is another instant flip.

flowchart LR
  IN[Ingress / Service] -->|100 percent| BLUE[Blue v1.4]
  IN -.->|after switch| GREEN[Green v1.5]
  GREEN --> TEST[Smoke tests on green]
  TEST --> IN

Implementation options: two Deployments (order-service-blue, order-service-green) with Service label switch; Argo Rollouts blueGreen strategy; or duplicate entire namespace for large releases. Green environment receives synthetic traffic or internal-only DNS before public cutover.

Pros: instant rollback, clear before/after, good for demos and schema-compatible major versions. Cons: doubles resource cost during cutover; database migrations still need expand/contract—both colors must tolerate schema during switch window.

⚖️ Trade-off

Blue-green does not replace database migration discipline—if green expects a column blue lacks, the switch fails regardless of routing speed.

Canary — percentage-based traffic split, monitor, expand or rollback

Route a small fraction of real traffic to the new version; watch error rate and latency; increase weight if SLOs hold, rollback if not.

flowchart TB
  DEP[Deploy v2 pods] --> CAN[5 percent traffic to v2]
  CAN --> MET[Monitor SLO 15 min]
  MET -->|pass| UP[25 then 50 then 100 percent]
  MET -->|fail| RB[Rollback to v1]

Tools: Argo Rollouts (canary steps + Prometheus analysis), Flagger (Istio/NGINX weight adjustment), Istio VirtualService weights—see Service Mesh → Canary. Gate on business metrics too—checkout conversion drop aborts even if HTTP 5xx looks fine.

Start with header-based canaries (internal QA on v2) before random user percentage. Automated rollback beats human watching Grafana at 2 a.m.—wire alerts to rollout abort, not just Slack noise.
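A minimal sketch of the promote-or-rollback decision, assuming two gating metrics (error rate and p99 latency) and the 5 → 25 → 50 → 100 steps from the diagram. CanaryGate, its records, and the threshold values are hypothetical; tools such as Argo Rollouts or Flagger express the same idea in their own configuration.

```java
import java.util.List;

/** Toy canary gate: compare one analysis window against SLO thresholds. */
final class CanaryGate {
    enum Decision { PROMOTE, ROLLBACK }

    /** Metrics observed for the canary during one analysis window. */
    record Metrics(double errorRate, double p99LatencyMs) {}

    /** Thresholds supplied by the caller; illustrative, not from a real service. */
    record Slo(double maxErrorRate, double maxP99LatencyMs) {}

    static final List<Integer> STEPS = List.of(5, 25, 50, 100); // traffic weights in percent

    static Decision evaluate(Metrics m, Slo slo) {
        boolean healthy = m.errorRate() <= slo.maxErrorRate()
                && m.p99LatencyMs() <= slo.maxP99LatencyMs();
        return healthy ? Decision.PROMOTE : Decision.ROLLBACK;
    }

    /** Advance one step on PROMOTE; send all traffic back to v1 on ROLLBACK. */
    static int nextWeight(int currentWeight, Decision decision) {
        if (decision == Decision.ROLLBACK) {
            return 0;
        }
        int i = STEPS.indexOf(currentWeight);
        if (i < 0) {
            throw new IllegalArgumentException("unknown step: " + currentWeight);
        }
        return STEPS.get(Math.min(i + 1, STEPS.size() - 1));
    }

    public static void main(String[] args) {
        Slo slo = new Slo(0.01, 250.0);
        System.out.println(nextWeight(5, evaluate(new Metrics(0.001, 180.0), slo)));  // 25
        System.out.println(nextWeight(25, evaluate(new Metrics(0.05, 180.0), slo)));  // 0
    }
}
```

A business metric such as checkout conversion would be one more field in the metrics record and one more condition in the gate.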

Feature Flags — decouple deployment from release

Deploy code to production with features dark; enable per user, tenant, or percentage via flag service—ship Tuesday, release Thursday without redeploy.

Deployment = new binary running in pods. Release = users experience new behavior. Feature flags close the gap: merge incomplete work behind if (flags.isEnabled("new-checkout")), deploy daily, enable when tested. Kill switch disables bad features without rollback deploy.

Product	Notes
LaunchDarkly	SaaS, rich targeting (user, segment, geo), experimentation, enterprise SSO
Unleash	Open source, self-hosted option, Spring Boot SDK, gradual rollout strategies
Flagsmith / ConfigCat	Alternatives for cost-sensitive or EU data residency requirements

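import io.getunleash.Unleash; // recent Unleash Java clients; older SDKs use a different package
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

// ExpressCheckoutFlow, LegacyCheckoutFlow, and CartDto are application types assumed for this example.
@RestController
public class CheckoutController {
  private final Unleash unleash;
  private final ExpressCheckoutFlow expressFlow;
  private final LegacyCheckoutFlow legacyFlow;

  public CheckoutController(Unleash unleash, ExpressCheckoutFlow expressFlow, LegacyCheckoutFlow legacyFlow) {
    this.unleash = unleash;
    this.expressFlow = expressFlow;
    this.legacyFlow = legacyFlow;
  }

  @PostMapping("/checkout")
  public ResponseEntity<?> checkout(@RequestBody CartDto cart) {
    // Evaluated per request, so toggling the flag needs no redeploy.
    if (unleash.isEnabled("express-checkout-v2")) {
      return expressFlow.checkout(cart);
    }
    return legacyFlow.checkout(cart);
  }
}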

Flag hygiene: remove stale flags after full rollout; avoid hundreds of permanent toggles becoming undeletable debt. Do not use flags for secrets or security boundaries—use proper authZ.

A/B Testing — canary with user segmentation

A/B testing is canary deployment plus experiment design: stable user cohorts see variant A or B; measure conversion, revenue, or latency—not just error rates.

Technical stack overlaps canary (traffic split) but adds: consistent assignment (same user always sees B), statistical power (sample size, duration), and product metrics (click-through, order completion). LaunchDarkly and Unleash both support percentage rollouts with sticky user keys; dedicated tools (Optimizely, internal experiment platforms) add Bayesian analysis and guardrails.

Pattern: flag service assigns experiment.checkout-v2 = control | treatment based on user ID hash; observability tags spans and logs with experiment variant for downstream analysis in warehouse or Grafana. Coordinate with legal/privacy—experiments on EU users may need consent.
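Sticky assignment by user ID hash can be sketched as follows. This is one possible scheme, assuming SHA-256 over "experimentKey:userId" mapped to 100 buckets; ExperimentAssigner is a hypothetical helper, and flag services such as LaunchDarkly or Unleash use their own hashing internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Deterministic experiment assignment: the same user always lands in the same bucket. */
final class ExperimentAssigner {
    /** Maps (experiment, user) to a stable bucket in [0, 100). */
    static int bucket(String experimentKey, String userId) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] h = sha.digest((experimentKey + ":" + userId).getBytes(StandardCharsets.UTF_8));
            // Interpret the first four bytes as an unsigned 32-bit integer.
            long n = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                    | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
            return (int) (n % 100);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available on every Java platform", e);
        }
    }

    /** Users in buckets below treatmentPercent see the treatment; the rest see control. */
    static String variant(String experimentKey, String userId, int treatmentPercent) {
        return bucket(experimentKey, userId) < treatmentPercent ? "treatment" : "control";
    }

    public static void main(String[] args) {
        String first = variant("experiment.checkout-v2", "user-42", 50);
        String second = variant("experiment.checkout-v2", "user-42", 50);
        System.out.println(first.equals(second)); // true: assignment is sticky
    }
}
```

Including the experiment key in the hash keeps cohorts independent across experiments, so the same users are not always the ones in treatment.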

🎯 Interview Tip

Contrast canary (ops safety—5xx rate) vs A/B (product hypothesis—conversion). Same infra, different success criteria and duration.

CI/CD pipeline — build → test → scan → package → deploy

Per-microservice pipeline with mandatory gates; failed scan or contract test blocks prod promotion—no manual SSH deploys.

flowchart LR
  B[Build JAR] --> T[Test]
  T --> S[Scan deps + image]
  S --> P[Package push registry]
  P --> D[Deploy via GitOps]

Stage	Actions	Fail criteria
Build	Maven/Gradle compile, unit tests fast path	Compile error, unit test failure
Test	Integration (Testcontainers), Pact contract tests	Broken API contract, flaky tests quarantined not ignored
Scan	Snyk/Trivy deps, Trivy image, gitleaks secrets	Critical CVE, secret in repo
Package	Docker build or spring-boot:build-image, push with git SHA tag	Image push failure, unsigned image policy
Deploy	Update GitOps repo / trigger Argo CD; smoke test staging	Smoke test fail, sync error, SLO burn in canary

Use OIDC federation for cloud/registry auth—no long-lived AWS keys in GitHub Secrets. Pipeline templates (reusable workflows) ensure service #47 gets the same gates as service #1. Contract tests from Service Design → Pact run on consumer and provider pipelines.

GitOps — Argo CD / Flux, declarative, Git as source of truth

Cluster state lives in Git; controllers continuously reconcile live resources to match—every prod change is a reviewed PR with full audit history.

Tool	Characteristics
Argo CD	UI-rich, ApplicationSet for multi-service bootstrap, strong enterprise adoption, sync waves/hooks
Flux	GitOps Toolkit, native Helm/Kustomize controllers, lightweight, CNCF graduated

flowchart LR
  DEV[Developer merges PR] --> GIT[Git manifest repo]
  CI[CI pushes image] --> GIT
  GIT --> CTRL[Argo CD or Flux]
  CTRL --> CLUSTER[Kubernetes cluster]

CI builds image and opens PR bumping image.tag in apps/order-service/overlays/prod/values.yaml—human or bot merge triggers sync. Drift detection alerts when someone kubectl-edits prod; rollback = git revert. Secrets via SOPS, Sealed Secrets, or External Secrets—never plain text in Git.
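The tag bump that CI commits to the manifest repo is a one-line text change. The sketch below does it with a regular expression, assuming the tag is a double-quoted value on an indented tag: line; ImageTagBump is a hypothetical helper, and real pipelines often use a YAML-aware tool instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rewrites the first indented `tag: "..."` line in a values file, keeping trailing comments. */
final class ImageTagBump {
    // Group 1 captures the indentation and key so they are preserved verbatim.
    private static final Pattern TAG_LINE = Pattern.compile("(?m)^([ \\t]+tag:[ \\t]*)\"[^\"]*\"");

    static String bump(String valuesYaml, String newTag) {
        Matcher m = TAG_LINE.matcher(valuesYaml);
        if (!m.find()) {
            throw new IllegalArgumentException("no quoted tag: line found");
        }
        // quoteReplacement guards against '$' or '\' in the new tag.
        return m.replaceFirst("$1" + Matcher.quoteReplacement("\"" + newTag + "\""));
    }

    public static void main(String[] args) {
        String values = "image:\n"
                + "  repository: registry.example.com/order-service\n"
                + "  tag: \"abc123f\"  # set by CI PR\n";
        System.out.print(bump(values, "def456a"));
    }
}
```

Failing loudly when no tag line matches is deliberate: a silent no-op would leave prod on the old image while the pipeline reports success.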

Helm charts — templating Kubernetes manifests

Helm packages Deployment, Service, Ingress, ConfigMap, HPA, and PDB into one chart with environment-specific values—platform golden chart, team fills values.

replicaCount: 3
image:
  repository: registry.example.com/order-service
  tag: "abc123f"  # set by CI PR
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 768Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 30
ingress:
  enabled: true
  host: api.example.com
  path: /orders

Chart templates reference {{ .Values.image.tag }} for the image. Use helm upgrade --install in CI for ephemeral environments and the Argo CD Helm source for prod. Version the chart (packaging) with semver independently of the app version, and document breaking chart changes. Kustomize is an alternative for teams that prefer patches over Go templates; many repos use Helm for third-party charts (Kafka, Prometheus) and Kustomize for internal apps.

Semantic versioning + conventional commits for automated releases

Semver (MAJOR.MINOR.PATCH) communicates breaking vs additive vs fix changes; conventional commits feed tools that bump version and generate changelogs automatically.

Conventional commits format: feat(checkout): add express payment, fix(inventory): null reserve response, feat! or BREAKING CHANGE: footer for major bumps. Tools: semantic-release, release-please, git-cliff—parse commits on main merge, compute next version, tag Git, publish GitHub Release, optionally bump Helm chart appVersion.

Bump	When	Example
PATCH	Bug fix, backward compatible	1.4.2 → 1.4.3
MINOR	New feature, backward compatible	1.4.3 → 1.5.0
MAJOR	Breaking API or behavior	1.5.0 → 2.0.0
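The bump rules in the table can be sketched as a small function over commit messages. This is a simplified reading of the conventional commits format (type, optional scope, optional ! marker, BREAKING CHANGE: footer); SemverBump is an illustrative name, and tools such as semantic-release apply more rules (pre-releases, 0.x handling) than shown here.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Computes the next MAJOR.MINOR.PATCH version from conventional commit messages. */
final class SemverBump {
    private static final Pattern BREAKING = Pattern.compile("^\\w+(\\([^)]*\\))?!:");
    private static final Pattern FEAT = Pattern.compile("^feat(\\([^)]*\\))?:");
    private static final Pattern FIX = Pattern.compile("^fix(\\([^)]*\\))?:");

    static String next(String current, List<String> commitMessages) {
        boolean major = false, minor = false, patch = false;
        for (String msg : commitMessages) {
            if (BREAKING.matcher(msg).find() || msg.contains("BREAKING CHANGE:")) {
                major = true;
            } else if (FEAT.matcher(msg).find()) {
                minor = true;
            } else if (FIX.matcher(msg).find()) {
                patch = true;
            }
        }
        String[] p = current.split("\\.");
        int ma = Integer.parseInt(p[0]), mi = Integer.parseInt(p[1]), pa = Integer.parseInt(p[2]);
        if (major) return (ma + 1) + ".0.0";
        if (minor) return ma + "." + (mi + 1) + ".0";
        if (patch) return ma + "." + mi + "." + (pa + 1);
        return current; // nothing release-worthy, e.g. only chore or docs commits
    }

    public static void main(String[] args) {
        System.out.println(next("1.4.2", List.of("fix(inventory): null reserve response")));  // 1.4.3
        System.out.println(next("1.4.3", List.of("feat(checkout): add express payment")));    // 1.5.0
        System.out.println(next("1.5.0", List.of("feat!: drop v1 checkout API")));            // 2.0.0
    }
}
```

The highest-ranked change in the batch wins: one breaking commit among ten fixes still produces a major bump.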

Container images still tag with git SHA for immutable deploy traceability—semver tags the release artifact and changelog; prod manifest pins SHA: image: order-service:abc123f (release v1.5.0). Consumer-driven contracts catch breaking changes before semver major is needed unexpectedly.

Environment promotion — dev → staging → prod gates

Same artifact (image digest) promotes through environments with increasing scrutiny—never rebuild differently for prod.

flowchart LR
  DEV[Dev auto deploy] --> STG[Staging soak + e2e]
  STG --> APP{Approval gate}
  APP --> PROD[Prod canary or sync]

Environment	Purpose	Gate
Dev	Feature branch previews, fast feedback	CI green only; synthetic data
Staging	Prod-like config, integration e2e, load smoke	Automated smoke + Pact verify; optional manual QA sign-off
Prod	Real traffic	Change advisory for risky windows; canary SLO gate; on-call aware

Promote by merging GitOps PR that updates staging → prod overlay image tag—same SHA tested in staging runs in prod. Staging should mirror prod topology (mesh, HPA, probe config) at smaller scale—"works in dev" with H2 database does not predict prod Kafka behavior. Manual approval step in pipeline (GitHub Environments, GitLab protected environments) for prod; dev/staging auto-sync on merge.

📦 Real World

Teams often deploy to staging on every main merge; prod promote twice weekly unless hotfix—balance velocity with change fatigue and on-call load.