Spring Batch

Spring Batch is the standard for offline, high-volume data processing in Spring—ETL, nightly reconciliations, file imports, and report generation. Jobs are composed of steps; chunk steps read, transform, and write in batches with durable execution metadata in a JobRepository.

mid senior Spring Boot 3.x

Batch architecture

Spring Batch separates what to run (job definitions) from how runs are tracked (metadata in JobRepository) and who launches them (JobLauncher). This lets jobs restart after failure, run on schedule, and report status to operators.

flowchart TB
  JL[JobLauncher] --> JR[(JobRepository)]
  JL --> J[Job]
  J --> S1[Step 1 — chunk]
  J --> S2[Step 2 — tasklet]
  S1 --> R[ItemReader]
  S1 --> P[ItemProcessor]
  S1 --> W[ItemWriter]
  JR --> JE[JobExecution]
  JR --> SE[StepExecution]

Component	Role
JobRepository	Persists JobInstance, JobExecution, StepExecution—enables restart and audit trail
JobLauncher	Entry point: accepts a Job + JobParameters, returns JobExecution
Job	Named workflow: ordered (or branched) steps; one logical batch process
Step	Atomic unit of work inside a job—either chunk-oriented or tasklet
Chunk	Batch of items processed in one transaction: read N → process N → write N → commit
Tasklet	Single callback executed repeatedly until RepeatStatus.FINISHED—file cleanup, stored proc, one-shot task

Job → Steps → chunk pipeline

The dominant pattern is a chunk step: loop until the reader returns null, accumulating items up to the chunk size, then process and write the batch in one transaction. Non-item work (archiving a file, sending a summary email) fits a tasklet step.

@Configuration
class ImportJobConfig {

  @Bean
  Job importCustomersJob(JobRepository repo, Step importStep, Step archiveStep) {
    return new JobBuilder("importCustomersJob", repo)
        .start(importStep)
        .next(archiveStep)
        .build();
  }

  @Bean
  Step importStep(JobRepository repo, PlatformTransactionManager tx,
                  ItemReader<CustomerRecord> reader,
                  ItemProcessor<CustomerRecord, Customer> processor,
                  ItemWriter<Customer> writer) {
    return new StepBuilder("importStep", repo)
        .<CustomerRecord, Customer>chunk(500, tx)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .build();
  }

  @Bean
  Step archiveStep(JobRepository repo, PlatformTransactionManager tx) {
    return new StepBuilder("archiveStep", repo)
        .tasklet(archiveFileTasklet(), tx)
        .build();
  }
}

🌍 Real World

Batch jobs complement online APIs from Web MVC: the API accepts uploads and enqueues work; Batch processes millions of rows overnight with checkpointing. Never run long synchronous batch work inside HTTP request threads.

ItemReader, ItemProcessor, ItemWriter

Three interfaces define the chunk pipeline. Generics flow I → O: reader outputs I, processor maps I → O, writer accepts O.

Interface	Contract	Notes
ItemReader<T>	read() returns next item or null at end	Must be restartable for fault tolerance—save state in ExecutionContext
ItemProcessor<I,O>	process(I) returns transformed item or null to filter out	Optional—use identity processor if no transform
ItemWriter<T>	write(Chunk<? extends T>) persists a chunk	Called once per committed chunk, not per item

@Component
@StepScope
class CsvCustomerReader implements ItemReader<CustomerRecord> {
  private Iterator<CustomerRecord> lines;

  @BeforeStep
  void open(@Value("#{jobParameters['filePath']}") String path) {
    lines = parseCsv(path).iterator();
  }

  @Override
  public CustomerRecord read() {
    return lines.hasNext() ? lines.next() : null;
  }
}

@Component
class CustomerProcessor implements ItemProcessor<CustomerRecord, Customer> {
  @Override
  public Customer process(CustomerRecord row) {
    if (!row.isValid()) return null;           // filter invalid rows
    return new Customer(row.email(), row.name());
  }
}

@Component
class CustomerWriter implements ItemWriter<Customer> {
  private final CustomerRepository repo;
  @Override
  public void write(Chunk<? extends Customer> chunk) {
    repo.saveAll(chunk.getItems());
  }
}

🔬 Under the Hood

Each chunk is one transaction. If item 437 in a 500-item chunk throws, the whole chunk rolls back (unless skip/retry policies apply). On restart, the reader's saved offset in ExecutionContext determines where reading resumes.

Chunk-oriented processing

Chunk processing balances throughput and memory: read a batch, transform, write, commit—repeat. Chunk size is the primary tuning knob alongside reader type and writer batching strategy.

@EnableBatchProcessing (Boot 2) vs autoconfigured (Boot 3)

Spring Boot 2.x	Spring Boot 3.x
@EnableBatchProcessing on a @Configuration class wired JobRepository, JobLauncher, builders	Add spring-boot-starter-batch—auto-config provides infrastructure beans
Manual bean setup common in older tutorials	Inject JobRepository, use JobBuilder / StepBuilder in @Bean methods
Batch schema via spring.batch.initialize-schema	Same—Boot creates BATCH_* tables in your datasource (or use external job repository)

# application.yml
spring:
  batch:
    jdbc:
      initialize-schema: always   # dev only; use Flyway/Liquibase in prod
    job:
      enabled: false              # don't auto-run all jobs on startup

📌 Version Note

Spring Batch 5 (Boot 3) uses Jakarta EE namespaces and requires Java 17+. JobBuilder / StepBuilder replace deprecated JobBuilderFactory / StepBuilderFactory.

Chunk size tuning

Too small — excessive commits, slow throughput, DB round-trip overhead
Too large — long transactions, memory pressure, large rollback on failure
Starting point — 100–500 for JDBC; align with DB batch insert size and connection pool
Measure — items/sec, transaction log growth, GC pauses; tune under production-like volume

return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(250, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retryLimit(3)
    .retry(TransientDataAccessException.class)
    .skipLimit(50)
    .skip(ValidationException.class)
    .build();

Built-in readers and writers

FlatFileItemReader

@Bean
@StepScope
FlatFileItemReader<CustomerRecord> csvReader(
    @Value("#{jobParameters['inputFile']}") Resource input) {
  return new FlatFileItemReaderBuilder<CustomerRecord>()
      .name("customerCsvReader")
      .resource(input)
      .delimited()
      .names("id", "email", "name")
      .targetType(CustomerRecord.class)
      .linesToSkip(1)                    // header row
      .build();
}

JdbcCursorItemReader

Streams rows with a forward-only cursor—good for large tables when paging is awkward. Holds a DB connection for the step duration.

@Bean
@StepScope
JdbcCursorItemReader<OrderRow> orderCursorReader(DataSource ds) {
  return new JdbcCursorItemReaderBuilder<OrderRow>()
      .name("orderCursorReader")
      .dataSource(ds)
      .sql("SELECT id, status, total FROM orders WHERE status = 'PENDING'")
      .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
      .fetchSize(500)
      .build();
}

JpaPagingItemReader

Pages through JPA entities—integrates with Spring Data JPA. Each page is a separate query; restart-friendly with sort keys.

@Bean
@StepScope
JpaPagingItemReader<Customer> customerPagingReader(EntityManagerFactory emf) {
  return new JpaPagingItemReaderBuilder<Customer>()
      .name("customerPagingReader")
      .entityManagerFactory(emf)
      .queryString("SELECT c FROM Customer c WHERE c.synced = false ORDER BY c.id")
      .pageSize(500)
      .build();
}

JdbcBatchItemWriter & FlatFileItemWriter

// JDBC batch insert — one round-trip per chunk
@Bean
JdbcBatchItemWriter<Customer> jdbcWriter(DataSource ds) {
  return new JdbcBatchItemWriterBuilder<Customer>()
      .dataSource(ds)
      .sql("INSERT INTO customers (email, name) VALUES (:email, :name)")
      .beanMapped()
      .build();
}

// Export to CSV
@Bean
@StepScope
FlatFileItemWriter<CustomerExport> csvWriter(
    @Value("#{jobParameters['outputFile']}") Resource output) {
  return new FlatFileItemWriterBuilder<CustomerExport>()
      .name("exportWriter")
      .resource(output)
      .delimited()
      .names("email", "name", "createdAt")
      .build();
}

CompositeItemProcessor

Chain multiple processors without nesting logic in one class—validation, enrichment, mapping as separate beans.

@Bean
CompositeItemProcessor<CustomerRecord, Customer> compositeProcessor(
    ValidateProcessor validate,
    EnrichProcessor enrich,
    MapToEntityProcessor map) {

  CompositeItemProcessor<CustomerRecord, Customer> pipeline =
      new CompositeItemProcessor<>();
  pipeline.setDelegates(List.of(validate, enrich, map));
  return pipeline;
}

// Each delegate: ItemProcessor<CustomerRecord, CustomerRecord> until the last maps to Customer

💡 Pro Tip

Mark readers/writers with @StepScope when they depend on jobParameters or stepExecutionContext—one instance per step execution, not singleton.

Next: Job control →

Job control

Batch jobs are long-running and failure-prone by nature. Spring Batch's metadata model makes jobs re-runnable, restartable, and observable— the difference between a script and production-grade ETL.

JobParameters — making jobs re-runnable

A JobInstance is identified by job name + parameters. Running the same job with identical parameters is rejected (already completed)—unless you use RunIdIncrementer or unique params like a timestamp.

@Bean
Job importJob(JobRepository repo, Step importStep) {
  return new JobBuilder("importCustomersJob", repo)
      .incrementer(new RunIdIncrementer())   // adds run.id — allows re-run
      .start(importStep)
      .build();
}

@Service
class JobService {
  private final JobLauncher launcher;
  private final Job importJob;

  void runImport(String filePath) throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("filePath", filePath)
        .addLong("timestamp", System.currentTimeMillis())
        .toJobParameters();
    JobExecution execution = launcher.run(importJob, params);
    // execution.getStatus() → STARTED, COMPLETED, FAILED, …
  }
}

⚠ Pitfall

Identifying parameters must be stable for restart semantics but unique for new runs. Use filePath + run.id, not only a static job name. Document which parameters form the instance identity.

JobExecution & StepExecution — status tracking

Entity	Tracks	Key fields
JobExecution	One run of a job	status, startTime, endTime, exitStatus
StepExecution	One run of a step within a job	readCount, writeCount, skipCount, commitCount
ExecutionContext	Restart state (reader offset, custom flags)	Serialized key-value per step/job

@Component
class BatchMonitor {
  private final JobExplorer jobExplorer;

  BatchStatus report(Long executionId) {
    JobExecution job = jobExplorer.getJobExecution(executionId);
    return new BatchStatus(
        job.getStatus(),
        job.getStepExecutions().stream()
            .map(s -> s.getStepName() + ": read=" + s.getReadCount()
                + " write=" + s.getWriteCount()
                + " skip=" + s.getSkipCount())
            .toList()
    );
  }
}

Failed jobs with restartable steps can be relaunched—Batch skips completed steps and resumes from the last committed chunk. Expose status via Actuator or admin UI (see upcoming Observability chapter).

Retry and skip policies

Policy	Behavior	Use when
Retry	Re-attempt failed chunk/item up to limit	Transient DB deadlock, network blip
Skip	Log bad record, continue processing	Malformed CSV row, business validation failure
No rollback skip	Skip without rolling back entire chunk	Item-level failures in otherwise good batch

return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(100, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retry(DeadlockLoserDataAccessException.class)
    .retryLimit(3)
    .skip(FlatFileParseException.class)
    .skip(ValidationException.class)
    .skipLimit(100)
    .listener(new SkipLoggingListener())
    .build();

class SkipLoggingListener implements SkipListener<CustomerRecord, Customer> {
  @Override
  public void onSkipInRead(Throwable t) { log.warn("Skip read: {}", t.getMessage()); }
  @Override
  public void onSkipInProcess(CustomerRecord item, Throwable t) {
    log.warn("Skip row: {} — {}", item, t.getMessage());
  }
}

🎯 Interview

"Retry vs skip?" — Retry assumes success on a later attempt (transient failure). Skip assumes the item is permanently bad—record it and move on. Always cap both with limits so a poison pill cannot loop forever.

Partitioning for parallelism

Split one large step into partitions—each worker processes a slice of data (ID ranges, file segments, hash buckets). A manager step delegates to worker steps, often on a TaskExecutor thread pool.

flowchart TB
  M[Manager Step] --> P1[Worker partition 0\nrows 1–100k]
  M --> P2[Worker partition 1\nrows 100k–200k]
  M --> P3[Worker partition 2\nrows 200k–300k]
  P1 --> DB[(Database)]
  P2 --> DB
  P3 --> DB

@Bean
Step partitionedImportStep(JobRepository repo, Step workerStep,
                           Partitioner rangePartitioner, TaskExecutor batchExecutor) {
  return new StepBuilder("partitionedImportStep", repo)
      .partitioner("workerStep", rangePartitioner)
      .step(workerStep)
      .gridSize(8)
      .taskExecutor(batchExecutor)
      .build();
}

@Bean
Partitioner rangePartitioner(DataSource ds) {
  return grid -> {
    Map<String, ExecutionContext> map = new HashMap<>();
    int min = 1, max = 800_000, gridSize = 8, range = (max - min) / gridSize;
    for (int i = 0; i < gridSize; i++) {
      ExecutionContext ctx = new ExecutionContext();
      ctx.putInt("minId", min + i * range);
      ctx.putInt("maxId", min + (i + 1) * range - 1);
      map.put("partition" + i, ctx);
    }
    return map;
  };
}

@Bean
@StepScope
JdbcCursorItemReader<OrderRow> partitionedReader(
    @Value("#{stepExecutionContext['minId']}") int minId,
    @Value("#{stepExecutionContext['maxId']}") int maxId,
    DataSource ds) {
  return new JdbcCursorItemReaderBuilder<OrderRow>()
      .sql("SELECT * FROM orders WHERE id BETWEEN ? AND ?")
      .preparedStatementSetter(ps -> { ps.setInt(1, minId); ps.setInt(2, maxId); })
      .dataSource(ds)
      .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
      .build();
}

💡 Pro Tip

Partition on a indexed column with non-overlapping ranges. Avoid hot partitions (e.g. all recent rows in one range)—use hash partitioning or time-based splits for skewed data.

Scheduling — @Scheduled or Quartz

Batch jobs are typically triggered on a schedule, not on every app startup (spring.batch.job.enabled=false).

@Component
@EnableScheduling
class NightlyImportScheduler {
  private final JobLauncher launcher;
  private final Job importJob;

  @Scheduled(cron = "0 0 2 * * *", zone = "America/New_York")
  void runNightlyImport() throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addString("inputFile", "/data/incoming/customers.csv")
        .addLong("runAt", System.currentTimeMillis())
        .toJobParameters();
    launcher.run(importJob, params);
  }
}

For clustered environments, complex calendars, or misfire handling, use Quartz with Spring—store triggers in JDBC so only one node fires a given job window.

@Bean
JobDetail batchJobDetail(JobLauncher launcher, Job importJob) {
  return JobBuilder.newJob(BatchLauncherJob.class)
      .withIdentity("nightlyImport")
      .usingJobData("jobName", "importCustomersJob")
      .storeDurably()
      .build();
}

// BatchLauncherJob implements org.quartz.Job — calls JobLauncher inside execute()

🌍 Real World

In Kubernetes, prefer CronJob resources that invoke your app via HTTP/actuator endpoint or a one-shot pod—keeps scheduling outside the JVM. Use in-app @Scheduled for single-instance batch workers or when Quartz clustering is already in place.

🔬 Under the Hood

BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, and BATCH_STEP_EXECUTION tables are the source of truth. Back them up, monitor growth, and archive old executions—millions of step rows slow the metadata queries that power restart and dashboards.