Spring Batch

Spring Batch is the standard for offline, high-volume data processing in Spring—ETL, nightly reconciliations, file imports, and report generation. Jobs are composed of steps; chunk steps read, transform, and write in batches with durable execution metadata in a JobRepository.

Level: mid to senior · Spring Boot 3.x

Batch architecture

Spring Batch separates what to run (job definitions) from how runs are tracked (metadata in JobRepository) and who launches them (JobLauncher). This lets jobs restart after failure, run on schedule, and report status to operators.

flowchart TB
  JL[JobLauncher] --> JR[(JobRepository)]
  JL --> J[Job]
  J --> S1[Step 1 — chunk]
  J --> S2[Step 2 — tasklet]
  S1 --> R[ItemReader]
  S1 --> P[ItemProcessor]
  S1 --> W[ItemWriter]
  JR --> JE[JobExecution]
  JR --> SE[StepExecution]

| Component | Role |
| --- | --- |
| JobRepository | Persists JobInstance, JobExecution, StepExecution; enables restart and audit trail |
| JobLauncher | Entry point: accepts a Job + JobParameters, returns JobExecution |
| Job | Named workflow: ordered (or branched) steps; one logical batch process |
| Step | Atomic unit of work inside a job, either chunk-oriented or tasklet |
| Chunk | Batch of items processed in one transaction: read N → process N → write N → commit |
| Tasklet | Single callback executed repeatedly until RepeatStatus.FINISHED: file cleanup, stored proc, one-shot task |

Job → Steps → chunk pipeline

The dominant pattern is a chunk step: loop until the reader returns null, accumulating items up to the chunk size, then process and write the batch in one transaction. Non-item work (archiving a file, sending a summary email) fits a tasklet step.
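
The loop described above can be sketched in plain Java. This is a simplified model, not Spring Batch's implementation: `ChunkLoopSketch`, its `run` method, and the use of `Supplier` / `Function` / `Consumer` in place of ItemReader / ItemProcessor / ItemWriter are assumptions made for illustration, and the transaction is reduced to a commit counter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java model of a chunk-oriented step; hypothetical names, not Spring Batch code.
final class ChunkLoopSketch {

  /** Runs the read/process/write loop and returns the number of "commits". */
  static <I, O> int run(Supplier<I> reader, Function<I, O> processor,
                        Consumer<List<O>> writer, int chunkSize) {
    int commits = 0;
    boolean exhausted = false;
    while (!exhausted) {
      // 1. Read up to chunkSize items; null signals end of input.
      List<I> items = new ArrayList<>();
      while (items.size() < chunkSize) {
        I item = reader.get();
        if (item == null) {
          exhausted = true;
          break;
        }
        items.add(item);
      }
      if (items.isEmpty()) {
        break;
      }

      // 2. Process each item; a null result filters the item out.
      List<O> outputs = new ArrayList<>();
      for (I item : items) {
        O out = processor.apply(item);
        if (out != null) {
          outputs.add(out);
        }
      }

      // 3. Write the whole chunk in one call, then count the commit.
      writer.accept(outputs);
      commits++;
    }
    return commits;
  }

  public static void main(String[] args) {
    Iterator<Integer> source = List.of(1, 2, 3, 4, 5, 6, 7).iterator();
    List<List<String>> written = new ArrayList<>();
    int commits = ChunkLoopSketch.<Integer, String>run(
        () -> source.hasNext() ? source.next() : null,
        n -> n % 2 == 0 ? null : "item-" + n,   // drop even numbers
        written::add,
        3);
    System.out.println(commits + " commits, chunks written: " + written);
  }
}
```

With seven inputs and a chunk size of 3, `main` prints `3 commits, chunks written: [[item-1, item-3], [item-5], [item-7]]`: the writer is called once per chunk, and filtered items never reach it.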

Minimal job definition
@Configuration
class ImportJobConfig {

  @Bean
  Job importCustomersJob(JobRepository repo, Step importStep, Step archiveStep) {
    return new JobBuilder("importCustomersJob", repo)
        .start(importStep)
        .next(archiveStep)
        .build();
  }

  @Bean
  Step importStep(JobRepository repo, PlatformTransactionManager tx,
                  ItemReader<CustomerRecord> reader,
                  ItemProcessor<CustomerRecord, Customer> processor,
                  ItemWriter<Customer> writer) {
    return new StepBuilder("importStep", repo)
        .<CustomerRecord, Customer>chunk(500, tx)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .build();
  }

  @Bean
  Step archiveStep(JobRepository repo, PlatformTransactionManager tx) {
    return new StepBuilder("archiveStep", repo)
        .tasklet(archiveFileTasklet(), tx)   // archiveFileTasklet(): method returning a Tasklet, not shown
        .build();
  }
}
🌍 Real World

Batch jobs complement online APIs from Web MVC: the API accepts uploads and enqueues work; Batch processes millions of rows overnight with checkpointing. Never run long synchronous batch work inside HTTP request threads.

ItemReader, ItemProcessor, ItemWriter

Three interfaces define the chunk pipeline. Generics flow I → O: reader outputs I, processor maps I → O, writer accepts O.

| Interface | Contract | Notes |
| --- | --- | --- |
| ItemReader<T> | read() returns next item or null at end | Must be restartable for fault tolerance; save state in ExecutionContext |
| ItemProcessor<I,O> | process(I) returns transformed item or null to filter out | Optional; use identity processor if no transform |
| ItemWriter<T> | write(Chunk<? extends T>) persists a chunk | Called once per committed chunk, not per item |

Custom components
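@Component
@StepScope
class CsvCustomerReader implements ItemReader<CustomerRecord> {
  private final String path;
  private Iterator<CustomerRecord> lines;

  // Late-bound from job parameters; resolvable because the bean is @StepScope
  CsvCustomerReader(@Value("#{jobParameters['filePath']}") String path) {
    this.path = path;
  }

  // Not restartable as written: implement ItemStream to save the position in ExecutionContext
  @Override
  public CustomerRecord read() {
    if (lines == null) {
      lines = parseCsv(path).iterator();   // open lazily on first read; parseCsv(...) is a helper not shown
    }
    return lines.hasNext() ? lines.next() : null;   // null signals end of input
  }
}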

@Component
class CustomerProcessor implements ItemProcessor<CustomerRecord, Customer> {
  @Override
  public Customer process(CustomerRecord row) {
    if (!row.isValid()) return null;           // filter invalid rows
    return new Customer(row.email(), row.name());
  }
}

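@Component
class CustomerWriter implements ItemWriter<Customer> {
  private final CustomerRepository repo;

  CustomerWriter(CustomerRepository repo) {
    this.repo = repo;
  }

  @Override
  public void write(Chunk<? extends Customer> chunk) {
    repo.saveAll(chunk.getItems());   // one call per chunk, inside the chunk transaction
  }
}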
🔬 Under the Hood

Each chunk is one transaction. If item 437 in a 500-item chunk throws, the whole chunk rolls back (unless skip/retry policies apply). On restart, the reader's saved offset in ExecutionContext determines where reading resumes.
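
The rollback and restart behaviour can be modelled in a few lines of plain Java. This is a sketch under simplifying assumptions: `RestartSketch`, its `committed` list (standing in for the target table) and `savedOffset` field (standing in for reader state in ExecutionContext) are hypothetical, and a "transaction" is just a buffer that is either appended or thrown away.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of chunk rollback and restart; hypothetical names, not Spring Batch code.
final class RestartSketch {
  final List<String> committed = new ArrayList<>();   // stands in for the target table
  int savedOffset = 0;                                // stands in for reader state in ExecutionContext

  /** Processes input starting at savedOffset; returns true if the end was reached. */
  boolean run(List<String> input, int chunkSize, String poison) {
    while (savedOffset < input.size()) {
      int end = Math.min(savedOffset + chunkSize, input.size());
      List<String> buffer = new ArrayList<>();        // uncommitted writes for this chunk
      try {
        for (String item : input.subList(savedOffset, end)) {
          if (item.equals(poison)) {
            throw new IllegalStateException("bad item " + item);
          }
          buffer.add(item.toUpperCase());
        }
      } catch (IllegalStateException e) {
        return false;                                 // "rollback": buffer dropped, offset unchanged
      }
      committed.addAll(buffer);                       // "commit": data and offset advance together
      savedOffset = end;
    }
    return true;
  }

  public static void main(String[] args) {
    RestartSketch job = new RestartSketch();
    List<String> input = List.of("a", "b", "c", "d", "e", "f", "g");
    System.out.println("first run finished: " + job.run(input, 3, "e") + ", committed " + job.committed);
    System.out.println("restart finished: " + job.run(input, 3, null) + ", committed " + job.committed);
  }
}
```

In the first run the chunk containing `e` fails, so `d` is discarded along with it and only the first chunk is committed. The second run resumes at offset 3 and reprocesses `d`, `e`, `f` without duplicating `a` to `c`.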

Chunk-oriented processing

Chunk processing balances throughput and memory: read a batch, transform, write, commit—repeat. Chunk size is the primary tuning knob alongside reader type and writer batching strategy.

@EnableBatchProcessing (Boot 2) vs autoconfigured (Boot 3)

| Spring Boot 2.x | Spring Boot 3.x |
| --- | --- |
| @EnableBatchProcessing on a @Configuration class wired JobRepository, JobLauncher, builders | Add spring-boot-starter-batch; auto-config provides infrastructure beans |
| Manual bean setup common in older tutorials | Inject JobRepository, use JobBuilder / StepBuilder in @Bean methods |
| Batch schema via spring.batch.initialize-schema (renamed to spring.batch.jdbc.initialize-schema in Boot 2.5) | spring.batch.jdbc.initialize-schema; Boot creates BATCH_* tables in your datasource (or use external job repository) |

Boot 3 — no @EnableBatchProcessing needed
# application.yml
spring:
  batch:
    jdbc:
      initialize-schema: always   # dev only; use Flyway/Liquibase in prod
    job:
      enabled: false              # don't auto-run all jobs on startup
📌 Version Note

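Spring Batch 5 (Boot 3) uses Jakarta EE namespaces and requires Java 17+. JobBuilder / StepBuilder replace deprecated JobBuilderFactory / StepBuilderFactory. In Boot 3, adding @EnableBatchProcessing makes Boot's batch auto-configuration back off, so leave it out unless you intend to configure the infrastructure yourself.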

Chunk size tuning

  • Too small — excessive commits, slow throughput, DB round-trip overhead
  • Too large — long transactions, memory pressure, large rollback on failure
  • Starting point — 100–500 for JDBC; align with DB batch insert size and connection pool
  • Measure — items/sec, transaction log growth, GC pauses; tune under production-like volume
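
The trade-off in the list above comes down to simple arithmetic: the number of commits is the item count divided by the chunk size, rounded up, while the chunk size is also the most work a single rollback can discard. The helper below is an illustrative sketch; `ChunkSizeMath` and the 1,000,000-item volume are assumptions, not measurements.

```java
// Back-of-the-envelope arithmetic for chunk sizing; hypothetical helper, not a Spring Batch API.
final class ChunkSizeMath {

  /** Number of commits needed: ceil(totalItems / chunkSize). */
  static long commits(long totalItems, int chunkSize) {
    return (totalItems + chunkSize - 1) / chunkSize;
  }

  public static void main(String[] args) {
    long totalItems = 1_000_000;                      // assumed volume for illustration
    for (int chunkSize : new int[] {10, 500, 50_000}) {
      System.out.println("chunk=" + chunkSize
          + " commits=" + commits(totalItems, chunkSize)
          + " max items lost per rollback=" + chunkSize);
    }
  }
}
```

For one million items, a chunk size of 10 means 100,000 commits, 500 means 2,000, and 50,000 means only 20 but with up to 50,000 items redone after a single failure.
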
Chunk with fault tolerance hooks
return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(250, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retryLimit(3)
    .retry(TransientDataAccessException.class)
    .skipLimit(50)
    .skip(ValidationException.class)
    .build();

Built-in readers and writers

FlatFileItemReader

FlatFileItemReader
@Bean
@StepScope
FlatFileItemReader<CustomerRecord> csvReader(
    @Value("#{jobParameters['inputFile']}") Resource input) {
  return new FlatFileItemReaderBuilder<CustomerRecord>()
      .name("customerCsvReader")
      .resource(input)
      .delimited()
      .names("id", "email", "name")
      .targetType(CustomerRecord.class)
      .linesToSkip(1)                    // header row
      .build();
}

JdbcCursorItemReader

Streams rows with a forward-only cursor—good for large tables when paging is awkward. Holds a DB connection for the step duration.

JdbcCursorItemReader
@Bean
@StepScope
JdbcCursorItemReader<OrderRow> orderCursorReader(DataSource ds) {
  return new JdbcCursorItemReaderBuilder<OrderRow>()
      .name("orderCursorReader")
      .dataSource(ds)
      .sql("SELECT id, status, total FROM orders WHERE status = 'PENDING'")
      .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
      .fetchSize(500)
      .build();
}

JpaPagingItemReader

Pages through JPA entities using the same EntityManagerFactory that backs Spring Data JPA. Each page is a separate query; a stable ORDER BY on a unique key keeps pages consistent and restart-friendly.

JpaPagingItemReader
@Bean
@StepScope
JpaPagingItemReader<Customer> customerPagingReader(EntityManagerFactory emf) {
  return new JpaPagingItemReaderBuilder<Customer>()
      .name("customerPagingReader")
      .entityManagerFactory(emf)
      .queryString("SELECT c FROM Customer c WHERE c.synced = false ORDER BY c.id")
      .pageSize(500)
      .build();
}
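
One caveat worth checking with a query like the one above, stated here as an assumption about how the step is used: if the same step sets `synced = true` on the rows it has just read, offset-based paging can drift, because each new page is computed over a result set that has shrunk. The simulation below models that effect in plain Java; `PagingDriftSketch` and its in-memory "table" are hypothetical, and it does not call JPA.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates OFFSET/LIMIT paging over "WHERE synced = false ORDER BY id"; hypothetical model, no JPA.
final class PagingDriftSketch {

  /** Returns the ids a paging reader would see; markSynced models the writer updating the filter column. */
  static List<Integer> readAll(int totalRows, int pageSize, boolean markSynced) {
    boolean[] synced = new boolean[totalRows];
    List<Integer> seen = new ArrayList<>();
    for (int page = 0; ; page++) {
      int offset = page * pageSize;
      int passed = 0;
      List<Integer> rows = new ArrayList<>();
      for (int id = 0; id < totalRows && rows.size() < pageSize; id++) {
        if (synced[id]) {
          continue;                      // filtered out by WHERE synced = false
        }
        if (passed++ < offset) {
          continue;                      // skipped by OFFSET
        }
        rows.add(id);
      }
      if (rows.isEmpty()) {
        return seen;                     // empty page: reader reports end of input
      }
      seen.addAll(rows);
      if (markSynced) {
        for (int id : rows) {
          synced[id] = true;             // writer commits the flag before the next page is fetched
        }
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("filter column untouched: " + readAll(10, 2, false));
    System.out.println("filter column updated:   " + readAll(10, 2, true));
  }
}
```

With ten rows and a page size of 2, the first call sees all ids 0 to 9, while the second sees only `[0, 1, 4, 5, 8, 9]`. If that matches your job, options include reading with a cursor, or filtering on a column the step does not modify.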

JdbcBatchItemWriter & FlatFileItemWriter

Writers
// JDBC batch insert — one round-trip per chunk
@Bean
JdbcBatchItemWriter<Customer> jdbcWriter(DataSource ds) {
  return new JdbcBatchItemWriterBuilder<Customer>()
      .dataSource(ds)
      .sql("INSERT INTO customers (email, name) VALUES (:email, :name)")
      .beanMapped()
      .build();
}

// Export to CSV
@Bean
@StepScope
FlatFileItemWriter<CustomerExport> csvWriter(
    @Value("#{jobParameters['outputFile']}") String outputPath) {
  return new FlatFileItemWriterBuilder<CustomerExport>()
      .name("exportWriter")
      .resource(new FileSystemResource(outputPath))   // Batch 5 writers expect a WritableResource
      .delimited()
      .names("email", "name", "createdAt")
      .build();
}

CompositeItemProcessor

Chain multiple processors without nesting logic in one class—validation, enrichment, mapping as separate beans.

CompositeItemProcessor
@Bean
CompositeItemProcessor<CustomerRecord, Customer> compositeProcessor(
    ValidateProcessor validate,
    EnrichProcessor enrich,
    MapToEntityProcessor map) {

  CompositeItemProcessor<CustomerRecord, Customer> pipeline =
      new CompositeItemProcessor<>();
  pipeline.setDelegates(List.of(validate, enrich, map));
  return pipeline;
}

// Each delegate: ItemProcessor<CustomerRecord, CustomerRecord> until the last maps to Customer
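
The chaining idea can be shown without Spring: each stage feeds the next, and a null from any stage filters the item so later stages never run (CompositeItemProcessor behaves the same way). The sketch below uses `java.util.function.Function` as a stand-in for ItemProcessor; `ProcessorChainSketch`, `chain`, and the three lambda stages are hypothetical names chosen for illustration.

```java
import java.util.function.Function;

// Plain-Java model of a processor chain with null filtering; hypothetical names, not Spring Batch code.
final class ProcessorChainSketch {
  record Customer(String email) {}

  /** Composes two stages; a null from the first stage short-circuits the second. */
  static <A, B, C> Function<A, C> chain(Function<A, B> first, Function<B, C> second) {
    return input -> {
      B intermediate = first.apply(input);
      return intermediate == null ? null : second.apply(intermediate);
    };
  }

  /** validate → enrich → map, mirroring the three delegates above. */
  static Function<String, Customer> pipeline() {
    Function<String, String> validate = raw -> raw.contains("@") ? raw : null;
    Function<String, String> enrich = raw -> raw.trim().toLowerCase();
    Function<String, Customer> map = Customer::new;
    return chain(chain(validate, enrich), map);
  }

  public static void main(String[] args) {
    Function<String, Customer> pipeline = pipeline();
    System.out.println(pipeline.apply("  Ann@Example.com "));
    System.out.println(pipeline.apply("not-an-email"));
  }
}
```

The first input comes out as a `Customer` with a trimmed, lower-cased email; the second is filtered by the validation stage and yields null.
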
💡 Pro Tip

Mark readers/writers with @StepScope when they depend on jobParameters or stepExecutionContext—one instance per step execution, not singleton.

Job control

Batch jobs are long-running and failure-prone by nature. Spring Batch's metadata model makes jobs re-runnable, restartable, and observable: the difference between a script and production-grade ETL.

JobParameters — making jobs re-runnable

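A JobInstance is identified by job name plus its identifying parameters. Launching a job again with identical parameters is rejected once that instance has completed (JobInstanceAlreadyCompleteException), unless the parameters differ, for example through a RunIdIncrementer or a unique value such as a timestamp. The incrementer only applies when the next instance is requested (JobOperator.startNextInstance or Boot's startup job runner); a direct JobLauncher.run(...) call uses exactly the parameters you pass.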
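
The identity rule can be illustrated with a small plain-Java model. It is a sketch only: `JobIdentitySketch`, `InstanceKey`, and the in-memory set are hypothetical stand-ins for the JobRepository, and every launch is assumed to complete successfully.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of JobInstance identity; hypothetical names, not Spring Batch code.
final class JobIdentitySketch {

  /** Job name plus identifying parameters form the instance key. */
  record InstanceKey(String jobName, Map<String, String> identifyingParams) {
    InstanceKey {
      identifyingParams = Map.copyOf(identifyingParams);   // defensive, order-independent copy
    }
  }

  private final Set<InstanceKey> completed = new HashSet<>();

  /** Returns true if the run was accepted, false if that instance has already completed. */
  boolean launch(String jobName, Map<String, String> params) {
    return completed.add(new InstanceKey(jobName, params));
  }

  public static void main(String[] args) {
    JobIdentitySketch repo = new JobIdentitySketch();
    Map<String, String> params = Map.of("filePath", "/data/a.csv");
    System.out.println("first launch accepted:  " + repo.launch("importCustomersJob", params));
    System.out.println("same params accepted:   " + repo.launch("importCustomersJob", params));
    System.out.println("with run.id accepted:   " + repo.launch("importCustomersJob",
        Map.of("filePath", "/data/a.csv", "run.id", "2")));
  }
}
```

The second launch is refused because its key equals the first one; adding `run.id` produces a different key and therefore a new instance.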

Launch with parameters
@Bean
Job importJob(JobRepository repo, Step importStep) {
  return new JobBuilder("importCustomersJob", repo)
      // Adds run.id when the next instance is requested, e.g. JobOperator.startNextInstance(...);
      // a direct JobLauncher.run(...) call does not consult the incrementer
      .incrementer(new RunIdIncrementer())
      .start(importStep)
      .build();
}

@Service
class JobService {
  private final JobLauncher launcher;
  private final Job importJob;

  JobService(JobLauncher launcher, Job importJob) {
    this.launcher = launcher;
    this.importJob = importJob;
  }

  void runImport(String filePath) throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("filePath", filePath)
        .addLong("timestamp", System.currentTimeMillis())   // unique value: every call creates a new JobInstance
        .toJobParameters();
    JobExecution execution = launcher.run(importJob, params);
    // execution.getStatus() → STARTED, COMPLETED, FAILED, …
  }
}
⚠ Pitfall

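Identifying parameters must be stable for restart semantics but unique for new runs. Use filePath + run.id, not only a static job name. A fresh timestamp on every launch always creates a new instance, so restarting a failed run means relaunching with its original parameters. Document which parameters form the instance identity.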

JobExecution & StepExecution — status tracking

| Entity | Tracks | Key fields |
| --- | --- | --- |
| JobExecution | One run of a job | status, startTime, endTime, exitStatus |
| StepExecution | One run of a step within a job | readCount, writeCount, skipCount, commitCount |
| ExecutionContext | Restart state (reader offset, custom flags) | Serialized key-value per step/job |

Query execution status
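@Component
class BatchMonitor {
  private final JobExplorer jobExplorer;

  BatchMonitor(JobExplorer jobExplorer) {
    this.jobExplorer = jobExplorer;
  }

  // Report DTO; named to avoid clashing with Spring Batch's own BatchStatus enum
  record JobReport(BatchStatus status, List<String> steps) {}

  JobReport report(Long executionId) {
    JobExecution job = jobExplorer.getJobExecution(executionId);   // null if the id is unknown
    if (job == null) {
      throw new IllegalArgumentException("No job execution with id " + executionId);
    }
    return new JobReport(
        job.getStatus(),
        job.getStepExecutions().stream()
            .map(s -> s.getStepName() + ": read=" + s.getReadCount()
                + " write=" + s.getWriteCount()
                + " skip=" + s.getSkipCount())
            .toList()
    );
  }
}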

Failed jobs with restartable steps can be relaunched with the same identifying parameters: Batch skips completed steps and resumes from the last committed chunk. Expose status via Actuator or admin UI (see upcoming Observability chapter).

Retry and skip policies

| Policy | Behavior | Use when |
| --- | --- | --- |
| Retry | Re-attempt failed chunk/item up to limit | Transient DB deadlock, network blip |
| Skip | Log bad record, continue processing | Malformed CSV row, business validation failure |
| No rollback skip | Skip without rolling back entire chunk | Item-level failures in otherwise good batch |

Retry, skip, listener
return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(100, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retry(DeadlockLoserDataAccessException.class)
    .retryLimit(3)
    .skip(FlatFileParseException.class)
    .skip(ValidationException.class)
    .skipLimit(100)
    .listener(new SkipLoggingListener())
    .build();

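class SkipLoggingListener implements SkipListener<CustomerRecord, Customer> {
  private static final Logger log = LoggerFactory.getLogger(SkipLoggingListener.class);   // SLF4J

  @Override
  public void onSkipInRead(Throwable t) { log.warn("Skip read: {}", t.getMessage()); }
  @Override
  public void onSkipInProcess(CustomerRecord item, Throwable t) {
    log.warn("Skip row: {} ({})", item, t.getMessage());
  }
  @Override
  public void onSkipInWrite(Customer item, Throwable t) {
    log.warn("Skip write: {} ({})", item, t.getMessage());
  }
}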
🎯 Interview

"Retry vs skip?" — Retry assumes success on a later attempt (transient failure). Skip assumes the item is permanently bad—record it and move on. Always cap both with limits so a poison pill cannot loop forever.
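
The distinction can be made concrete with a plain-Java loop. It is a deliberately simplified sketch: `RetrySkipSketch`, `TransientFailure`, and `BadItem` are hypothetical, items are handled one at a time rather than per chunk, and `retryLimit` is read as the total number of attempts per item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of retry and skip limits; hypothetical names, not Spring Batch code.
final class RetrySkipSketch {
  static class TransientFailure extends RuntimeException {}
  static class BadItem extends RuntimeException {}

  record Result<O>(List<O> written, List<Object> skipped) {}

  static <I, O> Result<O> run(List<I> items, Function<I, O> op, int retryLimit, int skipLimit) {
    List<O> written = new ArrayList<>();
    List<Object> skipped = new ArrayList<>();
    for (I item : items) {
      int attempts = 0;
      while (true) {
        try {
          attempts++;
          written.add(op.apply(item));
          break;                                  // success: move to the next item
        } catch (TransientFailure e) {
          if (attempts >= retryLimit) {
            throw e;                              // retries exhausted: fail the step
          }
          // otherwise loop and try the same item again
        } catch (BadItem e) {
          if (skipped.size() >= skipLimit) {
            throw e;                              // too many bad items: fail the step
          }
          skipped.add(item);                      // record it and move on
          break;
        }
      }
    }
    return new Result<>(written, skipped);
  }

  public static void main(String[] args) {
    int[] callsForItem2 = {0};
    Result<Integer> result = RetrySkipSketch.<Integer, Integer>run(List.of(1, 2, 3, 4, 5), n -> {
      if (n == 2 && ++callsForItem2[0] < 3) throw new TransientFailure();   // succeeds on 3rd attempt
      if (n == 4) throw new BadItem();                                       // permanently bad
      return n * 10;
    }, 3, 1);
    System.out.println("written=" + result.written() + " skipped=" + result.skipped());
  }
}
```

Item 2 succeeds on its third attempt and item 4 is skipped, so `main` prints `written=[10, 20, 30, 50] skipped=[4]`. Lowering either limit makes the same input fail the run instead of looping.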

Partitioning for parallelism

Split one large step into partitions—each worker processes a slice of data (ID ranges, file segments, hash buckets). A manager step delegates to worker steps, often on a TaskExecutor thread pool.

flowchart TB
  M[Manager Step] --> P1[Worker partition 0\nrows 1–100k]
  M --> P2[Worker partition 1\nrows 100,001–200k]
  M --> P3[Worker partition 2\nrows 200,001–300k]
  P1 --> DB[(Database)]
  P2 --> DB
  P3 --> DB
Partition step
@Bean
Step partitionedImportStep(JobRepository repo, Step workerStep,
                           Partitioner rangePartitioner, TaskExecutor batchExecutor) {
  return new StepBuilder("partitionedImportStep", repo)
      .partitioner("workerStep", rangePartitioner)
      .step(workerStep)
      .gridSize(8)
      .taskExecutor(batchExecutor)
      .build();
}

@Bean
Partitioner rangePartitioner(DataSource ds) {
  return gridSize -> {
    Map<String, ExecutionContext> map = new HashMap<>();
    int min = 1, max = 800_000;   // fixed for illustration; in practice query MIN(id)/MAX(id) via ds
    int range = (max - min + gridSize) / gridSize;   // ceil((max - min + 1) / gridSize)
    for (int i = 0; i < gridSize; i++) {
      int lo = min + i * range;
      ExecutionContext ctx = new ExecutionContext();
      ctx.putInt("minId", lo);
      ctx.putInt("maxId", Math.min(lo + range - 1, max));   // clamp so the last slice ends at max
      map.put("partition" + i, ctx);
    }
    return map;
  };
}

@Bean
@StepScope
JdbcCursorItemReader<OrderRow> partitionedReader(
    @Value("#{stepExecutionContext['minId']}") int minId,
    @Value("#{stepExecutionContext['maxId']}") int maxId,
    DataSource ds) {
  return new JdbcCursorItemReaderBuilder<OrderRow>()
      .name("partitionedReader")   // required while saveState is on (the default)
      .sql("SELECT * FROM orders WHERE id BETWEEN ? AND ? ORDER BY id")   // stable order for restart
      .preparedStatementSetter(ps -> { ps.setInt(1, minId); ps.setInt(2, maxId); })
      .dataSource(ds)
      .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
      .build();
}
💡 Pro Tip

Partition on an indexed column with non-overlapping ranges. Avoid hot partitions (e.g. all recent rows in one range); use hash partitioning or time-based splits for skewed data.
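
As a rough illustration of the hash option, the sketch below assigns ids to partitions by `id mod gridSize`. It is plain Java with hypothetical names (`HashPartitionSketch`, `bucketOf`, `assign`); in a real step the worker query would need an equivalent predicate, which may not be able to use a plain index on the id column, so measure before adopting it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of hash partitioning by id; hypothetical names, not Spring Batch code.
final class HashPartitionSketch {

  /** Bucket for an id; floorMod keeps the result non-negative even for negative ids. */
  static int bucketOf(long id, int gridSize) {
    return (int) Math.floorMod(id, (long) gridSize);
  }

  /** Groups ids by bucket so the distribution can be inspected. */
  static Map<Integer, List<Long>> assign(List<Long> ids, int gridSize) {
    Map<Integer, List<Long>> buckets = new TreeMap<>();
    for (int b = 0; b < gridSize; b++) {
      buckets.put(b, new ArrayList<>());
    }
    for (long id : ids) {
      buckets.get(bucketOf(id, gridSize)).add(id);
    }
    return buckets;
  }

  public static void main(String[] args) {
    // 10,000 "recent" ids that a range split of 1..800,000 into 8 slices would send to one worker
    List<Long> recentIds = new ArrayList<>();
    for (long id = 790_001; id <= 800_000; id++) {
      recentIds.add(id);
    }
    assign(recentIds, 8).forEach((bucket, ids) ->
        System.out.println("partition" + bucket + " -> " + ids.size() + " rows"));
  }
}
```

Each of the eight partitions receives 1,250 of the 10,000 ids, whereas equal-width ranges over 1 to 800,000 would place all of them in the last slice.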

Scheduling — @Scheduled or Quartz

Batch jobs are typically triggered on a schedule, not on every app startup (spring.batch.job.enabled=false).

@Scheduled launch
@Component
@EnableScheduling
class NightlyImportScheduler {
  private final JobLauncher launcher;
  private final Job importJob;

  NightlyImportScheduler(JobLauncher launcher, Job importJob) {
    this.launcher = launcher;
    this.importJob = importJob;
  }

  @Scheduled(cron = "0 0 2 * * *", zone = "America/New_York")   // 02:00 every day
  void runNightlyImport() throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("inputFile", "/data/incoming/customers.csv")
        .addLong("runAt", System.currentTimeMillis())   // makes each nightly run a new JobInstance
        .toJobParameters();
    launcher.run(importJob, params);
  }
}

For clustered environments, complex calendars, or misfire handling, use Quartz with Spring—store triggers in JDBC so only one node fires a given job window.

Quartz JobDetail (cluster-safe)
@Bean
JobDetail batchJobDetail() {
  // Fully qualified: Spring Batch also has a class named JobBuilder
  return org.quartz.JobBuilder.newJob(BatchLauncherJob.class)
      .withIdentity("nightlyImport")
      .usingJobData("jobName", "importCustomersJob")
      .storeDurably()
      .build();
}

// BatchLauncherJob implements org.quartz.Job and calls JobLauncher inside execute()
🌍 Real World

In Kubernetes, prefer CronJob resources that invoke your app via HTTP/actuator endpoint or a one-shot pod—keeps scheduling outside the JVM. Use in-app @Scheduled for single-instance batch workers or when Quartz clustering is already in place.

🔬 Under the Hood

BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, and BATCH_STEP_EXECUTION tables are the source of truth. Back them up, monitor growth, and archive old executions—millions of step rows slow the metadata queries that power restart and dashboards.