Spring Batch
Spring Batch is the standard for offline, high-volume data processing in Spring—ETL, nightly reconciliations, file imports, and report generation. Jobs are composed of steps; chunk steps read, transform, and write in batches with durable execution metadata in a JobRepository.
Batch architecture
Spring Batch separates what to run (job definitions) from how runs are tracked (metadata in JobRepository) and who launches them (JobLauncher). This lets jobs restart after failure, run on schedule, and report status to operators.
flowchart TB
    JL[JobLauncher] --> JR[(JobRepository)]
    JL --> J[Job]
    J --> S1[Step 1: chunk]
    J --> S2[Step 2: tasklet]
    S1 --> R[ItemReader]
    S1 --> P[ItemProcessor]
    S1 --> W[ItemWriter]
    JR --> JE[JobExecution]
    JR --> SE[StepExecution]
| Component | Role |
|---|---|
| JobRepository | Persists JobInstance, JobExecution, StepExecution—enables restart and audit trail |
| JobLauncher | Entry point: accepts a Job + JobParameters, returns JobExecution |
| Job | Named workflow: ordered (or branched) steps; one logical batch process |
| Step | Atomic unit of work inside a job—either chunk-oriented or tasklet |
| Chunk | Batch of items processed in one transaction: read N → process N → write N → commit |
| Tasklet | Single callback, invoked repeatedly until it returns RepeatStatus.FINISHED; suited to file cleanup, a stored procedure call, or another one-shot task |
Job → Steps → chunk pipeline
The dominant pattern is a chunk step: loop until the reader returns null, accumulating items up to the chunk size, then process and write the batch in one transaction. Non-item work (archiving a file, sending a summary email) fits a tasklet step.
@Configuration
class ImportJobConfig {

    @Bean
    Job importCustomersJob(JobRepository repo, Step importStep, Step archiveStep) {
        return new JobBuilder("importCustomersJob", repo)
            .start(importStep)
            .next(archiveStep)
            .build();
    }

    @Bean
    Step importStep(JobRepository repo, PlatformTransactionManager tx,
                    ItemReader<CustomerRecord> reader,
                    ItemProcessor<CustomerRecord, Customer> processor,
                    ItemWriter<Customer> writer) {
        return new StepBuilder("importStep", repo)
            .<CustomerRecord, Customer>chunk(500, tx) // 500 items per transaction
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
    }

    @Bean
    Step archiveStep(JobRepository repo, PlatformTransactionManager tx) {
        // archiveFileTasklet() is assumed to be a Tasklet factory method defined elsewhere
        return new StepBuilder("archiveStep", repo)
            .tasklet(archiveFileTasklet(), tx)
            .build();
    }
}
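The read, process, write loop described above can be modelled in a few lines of plain Java. This is a simplified sketch for intuition, not Spring Batch's actual implementation: `ChunkLoopSketch` and `run` are hypothetical names, an `Iterator` stands in for the `ItemReader`, and each inner list stands for what one writer call would receive.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of a chunk-oriented step (illustrative only).
class ChunkLoopSketch {

    // Groups input into chunks of chunkSize and returns what each writer call would receive.
    static <I, O> List<List<O>> run(Iterator<I> reader, Function<I, O> processor, int chunkSize) {
        List<List<O>> writes = new ArrayList<>();
        boolean exhausted = false;
        while (!exhausted) {
            // Read phase: pull up to chunkSize items, stopping when the source ends.
            List<I> inputs = new ArrayList<>();
            while (inputs.size() < chunkSize) {
                if (!reader.hasNext()) {
                    exhausted = true;
                    break;
                }
                inputs.add(reader.next());
            }
            // Process phase: a null result filters the item out.
            List<O> outputs = new ArrayList<>();
            for (I item : inputs) {
                O out = processor.apply(item);
                if (out != null) {
                    outputs.add(out);
                }
            }
            // Write phase: one call per chunk (skipped in this model when nothing survived).
            if (!outputs.isEmpty()) {
                writes.add(outputs);
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6, 7);
        // Keep odd numbers only; chunk size 3.
        List<List<String>> writes =
            ChunkLoopSketch.<Integer, String>run(source.iterator(), n -> n % 2 == 1 ? "item-" + n : null, 3);
        System.out.println(writes); // [[item-1, item-3], [item-5], [item-7]]
    }
}
```

Seven inputs with a chunk size of 3 give three writer calls, and filtered items never reach the writer.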
Batch jobs complement online APIs from Web MVC: the API accepts uploads and enqueues work; Batch processes millions of rows overnight with checkpointing. Never run long synchronous batch work inside HTTP request threads.
ItemReader, ItemProcessor, ItemWriter
Three interfaces define the chunk pipeline. Generics flow I → O: reader outputs I, processor maps I → O, writer accepts O.
| Interface | Contract | Notes |
|---|---|---|
| ItemReader<T> | read() returns next item or null at end | Must be restartable for fault tolerance—save state in ExecutionContext |
| ItemProcessor<I,O> | process(I) returns transformed item or null to filter out | Optional: omit it when no transformation is needed |
| ItemWriter<T> | write(Chunk<? extends T>) persists a chunk | Called once per chunk inside the chunk transaction, not per item |
@Component
@StepScope
class CsvCustomerReader implements ItemReader<CustomerRecord> {
    private final Iterator<CustomerRecord> lines;

    // Late binding: @StepScope lets the job parameter be injected when the step starts.
    // parseCsv is an application helper that is not shown here.
    CsvCustomerReader(@Value("#{jobParameters['filePath']}") String path) {
        this.lines = parseCsv(path).iterator();
    }

    // Note: this simple reader keeps no state in ExecutionContext, so it is not restartable.
    @Override
    public CustomerRecord read() {
        return lines.hasNext() ? lines.next() : null; // null signals end of input
    }
}

@Component
class CustomerProcessor implements ItemProcessor<CustomerRecord, Customer> {
    @Override
    public Customer process(CustomerRecord row) {
        if (!row.isValid()) return null; // filter invalid rows
        return new Customer(row.email(), row.name());
    }
}

@Component
class CustomerWriter implements ItemWriter<Customer> {
    private final CustomerRepository repo;

    CustomerWriter(CustomerRepository repo) {
        this.repo = repo;
    }

    @Override
    public void write(Chunk<? extends Customer> chunk) {
        repo.saveAll(chunk.getItems()); // one call for the whole chunk
    }
}
Each chunk is one transaction. If item 437 in a 500-item chunk throws, the whole chunk rolls back (unless skip/retry policies apply). On restart, the reader's saved offset in ExecutionContext determines where reading resumes.
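To make the rollback and restart behaviour concrete, here is a small plain-Java model. It is a sketch under simplifying assumptions: `ChunkRestartSketch` is a hypothetical class, the `committed` list stands in for the target table, and `savedOffset` stands in for the reader position that a restartable reader would store in ExecutionContext.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of chunk commits and restart (illustrative only).
class ChunkRestartSketch {
    final List<Integer> committed = new ArrayList<>(); // stands in for the target table
    int savedOffset = 0;                               // stands in for the saved reader offset

    // Runs from savedOffset; returns true when everything is committed, false if a chunk failed.
    boolean run(List<Integer> source, int chunkSize, int poisonItem) {
        int offset = savedOffset;
        while (offset < source.size()) {
            int end = Math.min(offset + chunkSize, source.size());
            List<Integer> buffer = new ArrayList<>();
            for (int item : source.subList(offset, end)) {
                if (item == poisonItem) {
                    return false; // rollback: the buffer is discarded, nothing from this chunk is kept
                }
                buffer.add(item);
            }
            committed.addAll(buffer); // commit the chunk...
            savedOffset = end;        // ...together with the restart position
            offset = end;
        }
        return true;
    }

    public static void main(String[] args) {
        ChunkRestartSketch step = new ChunkRestartSketch();
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(step.run(source, 4, 7)); // false: item 7 fails in the second chunk
        System.out.println(step.committed);         // [1, 2, 3, 4]
        System.out.println(step.run(source, 4, -1)); // true: restart after the bad row is fixed
        System.out.println(step.committed);          // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```

Items 5 and 6 are read twice but written once, which is why writers and side effects inside a chunk should tolerate re-processing after a rollback.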
Chunk-oriented processing
Chunk processing balances throughput and memory: read a batch, transform, write, commit—repeat. Chunk size is the primary tuning knob alongside reader type and writer batching strategy.
@EnableBatchProcessing (Boot 2) vs autoconfigured (Boot 3)
| Spring Boot 2.x | Spring Boot 3.x |
|---|---|
| @EnableBatchProcessing on a @Configuration class wired JobRepository, JobLauncher, builders | Add spring-boot-starter-batch; auto-config provides infrastructure beans and backs off if you add @EnableBatchProcessing |
| Manual bean setup common in older tutorials | Inject JobRepository, use JobBuilder / StepBuilder in @Bean methods |
| Batch schema via spring.batch.initialize-schema (renamed spring.batch.jdbc.initialize-schema in Boot 2.5) | spring.batch.jdbc.initialize-schema: Boot creates BATCH_* tables in your datasource (or use external job repository) |
# application.yml
spring:
  batch:
    jdbc:
      initialize-schema: always   # dev only; use Flyway/Liquibase in prod
    job:
      enabled: false              # don't auto-run all jobs on startup
Spring Batch 5 (Boot 3) uses Jakarta EE namespaces and requires Java 17+. JobBuilder / StepBuilder replace deprecated JobBuilderFactory / StepBuilderFactory.
Chunk size tuning
- Too small — excessive commits, slow throughput, DB round-trip overhead
- Too large — long transactions, memory pressure, large rollback on failure
- Starting point — 100–500 for JDBC; align with DB batch insert size and connection pool
- Measure — items/sec, transaction log growth, GC pauses; tune under production-like volume
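The trade-off in the list above comes down to simple arithmetic: the number of commits is the item count divided by the chunk size, rounded up, while the work lost on a failure grows with the chunk size. A minimal sketch, with a made-up volume of one million items and a hypothetical `ChunkSizeMath` helper:

```java
// Back-of-the-envelope arithmetic for chunk sizing (the volumes are illustrative).
class ChunkSizeMath {

    // Number of transactions needed: ceiling of totalItems / chunkSize.
    static long commits(long totalItems, int chunkSize) {
        return (totalItems + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long total = 1_000_000;
        for (int size : new int[] {10, 250, 5_000}) {
            System.out.printf("chunk=%d -> %d commits, up to %d items redone after a rollback%n",
                size, commits(total, size), size);
        }
    }
}
```

With these numbers a chunk size of 10 needs 100,000 commits, 250 needs 4,000, and 5,000 needs 200. Whether the larger transactions are worth it depends on the measurements listed above.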
return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(250, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retryLimit(3)
    .retry(TransientDataAccessException.class)
    .skipLimit(50)
    .skip(ValidationException.class)
    .build();
Built-in readers and writers
FlatFileItemReader
@Bean
@StepScope
FlatFileItemReader<CustomerRecord> csvReader(
        @Value("#{jobParameters['inputFile']}") Resource input) {
    return new FlatFileItemReaderBuilder<CustomerRecord>()
        .name("customerCsvReader")
        .resource(input)
        .delimited()
        .names("id", "email", "name")
        .targetType(CustomerRecord.class)
        .linesToSkip(1) // header row
        .build();
}
JdbcCursorItemReader
Streams rows with a forward-only cursor—good for large tables when paging is awkward. Holds a DB connection for the step duration.
@Bean
@StepScope
JdbcCursorItemReader<OrderRow> orderCursorReader(DataSource ds) {
    return new JdbcCursorItemReaderBuilder<OrderRow>()
        .name("orderCursorReader")
        .dataSource(ds)
        .sql("SELECT id, status, total FROM orders WHERE status = 'PENDING'")
        .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
        .fetchSize(500)
        .build();
}
JpaPagingItemReader
Pages through JPA entities with a JPQL query and an EntityManagerFactory. Each page is a separate query, so a deterministic ORDER BY is needed to keep paging and restart consistent.
@Bean
@StepScope
JpaPagingItemReader<Customer> customerPagingReader(EntityManagerFactory emf) {
    return new JpaPagingItemReaderBuilder<Customer>()
        .name("customerPagingReader")
        .entityManagerFactory(emf)
        .queryString("SELECT c FROM Customer c WHERE c.synced = false ORDER BY c.id")
        .pageSize(500)
        .build();
}
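One thing to watch with offset-based paging: if the step updates the very column the query filters on (here, setting synced to true), the result set shrinks between pages and later offsets can jump over unprocessed rows. The plain-Java simulation below illustrates the effect. It is a sketch with hypothetical names (`PagingShiftSketch`, `readWithOffsetPaging`), not the reader's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simulates "WHERE synced = false ORDER BY id" read page by page while each page is marked synced.
class PagingShiftSketch {

    static List<Integer> readWithOffsetPaging(int rowCount, int pageSize) {
        Set<Integer> unsynced = new TreeSet<>();
        for (int id = 1; id <= rowCount; id++) {
            unsynced.add(id);
        }
        List<Integer> seen = new ArrayList<>();
        for (int page = 0; ; page++) {
            // Re-run the query against the current table state, then apply OFFSET / LIMIT.
            List<Integer> result = new ArrayList<>(unsynced);
            int from = page * pageSize;
            if (from >= result.size()) {
                break;
            }
            List<Integer> rows = result.subList(from, Math.min(from + pageSize, result.size()));
            seen.addAll(rows);
            unsynced.removeAll(rows); // the writer flips synced = true for this page
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(readWithOffsetPaging(10, 3)); // [1, 2, 3, 7, 8, 9]
    }
}
```

Rows 4, 5, 6, and 10 are never read. Common ways around this are to filter on a column the step does not modify, to use a cursor-based reader, or to always re-read the first page; which one fits depends on the data and the restart requirements.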
JdbcBatchItemWriter & FlatFileItemWriter
// JDBC batch insert: one batched statement per chunk
@Bean
JdbcBatchItemWriter<Customer> jdbcWriter(DataSource ds) {
    return new JdbcBatchItemWriterBuilder<Customer>()
        .dataSource(ds)
        .sql("INSERT INTO customers (email, name) VALUES (:email, :name)")
        .beanMapped() // binds :email and :name from Customer properties
        .build();
}

// Export to CSV
@Bean
@StepScope
FlatFileItemWriter<CustomerExport> csvWriter(
        @Value("#{jobParameters['outputFile']}") String outputPath) {
    return new FlatFileItemWriterBuilder<CustomerExport>()
        .name("exportWriter")
        // Spring Batch 5 expects a WritableResource here; FileSystemResource is one
        .resource(new FileSystemResource(outputPath))
        .delimited()
        .names("email", "name", "createdAt")
        .build();
}
CompositeItemProcessor
Chain multiple processors without nesting logic in one class—validation, enrichment, mapping as separate beans.
@Bean
CompositeItemProcessor<CustomerRecord, Customer> compositeProcessor(
        ValidateProcessor validate,
        EnrichProcessor enrich,
        MapToEntityProcessor map) {
    CompositeItemProcessor<CustomerRecord, Customer> pipeline =
        new CompositeItemProcessor<>();
    pipeline.setDelegates(List.of(validate, enrich, map));
    return pipeline;
}
// Each delegate: ItemProcessor<CustomerRecord, CustomerRecord> until the last maps to Customer
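The chaining idea can be shown without Spring: each stage feeds the next, and a null from any stage filters the item so later stages never see it. The sketch below uses `java.util.function.Function` in place of `ItemProcessor`; `ProcessorChainSketch` and `chain` are hypothetical names for illustration.

```java
import java.util.function.Function;

// Plain-Java model of processor chaining with null-as-filter semantics (illustrative only).
class ProcessorChainSketch {

    // Composes two stages; if the first returns null the second is never called.
    static <A, B, C> Function<A, C> chain(Function<A, B> first, Function<B, C> second) {
        return input -> {
            B intermediate = first.apply(input);
            return intermediate == null ? null : second.apply(intermediate);
        };
    }

    public static void main(String[] args) {
        Function<String, String> validate = s -> s.contains("@") ? s : null; // filter bad emails
        Function<String, String> normalize = s -> s.trim().toLowerCase();
        Function<String, Integer> toLength = String::length;

        Function<String, Integer> pipeline = chain(chain(validate, normalize), toLength);
        System.out.println(pipeline.apply(" A@B.io ")); // 6
        System.out.println(pipeline.apply("nope"));     // null (filtered by validate)
    }
}
```

Keeping each stage in its own class, as the composite processor does, means validation, enrichment, and mapping can be unit-tested separately.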
Mark readers/writers with @StepScope when they depend on jobParameters or stepExecutionContext—one instance per step execution, not singleton.
Job control
Batch jobs are long-running and failure-prone by nature. Spring Batch's metadata model makes jobs re-runnable, restartable, and observable, which is the difference between a script and production-grade ETL.
JobParameters — making jobs re-runnable
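A JobInstance is identified by job name + identifying parameters. Launching a job whose instance has already completed with identical parameters is rejected with JobInstanceAlreadyCompleteException, unless the parameters differ, for example through RunIdIncrementer or a unique value such as a timestamp. The incrementer only takes effect when the next instance is requested (JobOperator.startNextInstance or Spring Boot's startup job runner); JobLauncher.run uses exactly the parameters you pass in.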
@Bean
Job importJob(JobRepository repo, Step importStep) {
    return new JobBuilder("importCustomersJob", repo)
        .incrementer(new RunIdIncrementer()) // adds run.id when the next instance is requested
        .start(importStep)
        .build();
}

@Service
class JobService {
    private final JobLauncher launcher;
    private final Job importJob;

    JobService(JobLauncher launcher, Job importJob) {
        this.launcher = launcher;
        this.importJob = importJob;
    }

    void runImport(String filePath) throws Exception {
        JobParameters params = new JobParametersBuilder()
            .addString("filePath", filePath)
            .addLong("timestamp", System.currentTimeMillis()) // makes every call a new JobInstance
            .toJobParameters();
        JobExecution execution = launcher.run(importJob, params);
        // execution.getStatus() → STARTED, COMPLETED, FAILED, …
    }
}
Identifying parameters must stay the same when you restart a failed run and differ when you start a new one. Combine a business key such as filePath with run.id or a timestamp rather than relying on the job name alone, and document which parameters form the instance identity.
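The identity rule can be pictured as a lookup keyed by job name plus identifying parameters. The toy model below is only an illustration: `JobInstanceSketch` is a hypothetical class, and the real check happens in the JobRepository against the BATCH_* tables.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of JobInstance identity (illustrative only).
class JobInstanceSketch {
    private final Set<String> completedInstances = new HashSet<>();

    // Identity = job name + identifying parameters, sorted so that key order does not matter.
    static String instanceKey(String jobName, Map<String, String> identifyingParams) {
        return jobName + new TreeMap<>(identifyingParams);
    }

    // Returns false when an instance with the same identity has already completed.
    boolean launch(String jobName, Map<String, String> identifyingParams) {
        return completedInstances.add(instanceKey(jobName, identifyingParams));
    }

    public static void main(String[] args) {
        JobInstanceSketch repository = new JobInstanceSketch();
        Map<String, String> first = Map.of("filePath", "/data/a.csv", "run.id", "1");
        Map<String, String> second = Map.of("filePath", "/data/a.csv", "run.id", "2");
        System.out.println(repository.launch("importCustomersJob", first));  // true
        System.out.println(repository.launch("importCustomersJob", first));  // false: same identity
        System.out.println(repository.launch("importCustomersJob", second)); // true: new run.id
    }
}
```

In this model every launch is assumed to complete; a failed execution in Spring Batch keeps its instance open so that the same parameters restart it.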
JobExecution & StepExecution — status tracking
| Entity | Tracks | Key fields |
|---|---|---|
| JobExecution | One run of a job | status, startTime, endTime, exitStatus |
| StepExecution | One run of a step within a job | readCount, writeCount, skipCount, commitCount |
| ExecutionContext | Restart state (reader offset, custom flags) | Serialized key-value per step/job |
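@Component
class BatchMonitor {
    private final JobExplorer jobExplorer;

    BatchMonitor(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    // Application-defined report type, named to avoid clashing with Spring Batch's BatchStatus enum
    record JobStatusReport(BatchStatus status, List<String> steps) {}

    JobStatusReport report(Long executionId) {
        JobExecution job = jobExplorer.getJobExecution(executionId);
        if (job == null) {
            throw new IllegalArgumentException("Unknown job execution: " + executionId);
        }
        return new JobStatusReport(
            job.getStatus(),
            job.getStepExecutions().stream()
                .map(s -> s.getStepName() + ": read=" + s.getReadCount()
                    + " write=" + s.getWriteCount()
                    + " skip=" + s.getSkipCount())
                .toList()
        );
    }
}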
Failed jobs with restartable steps can be relaunched—Batch skips completed steps and resumes from the last committed chunk. Expose status via Actuator or admin UI (see upcoming Observability chapter).
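At the step level, restart amounts to consulting the recorded status before running each step. The following plain-Java sketch models that idea with hypothetical names (`StepRestartSketch`, a map standing in for BATCH_STEP_EXECUTION); it ignores chunk-level resume and options such as allowStartIfComplete.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Toy model of step-level restart (illustrative only).
class StepRestartSketch {
    final Map<String, String> stepStatus = new LinkedHashMap<>(); // stands in for step metadata
    final List<String> executed = new ArrayList<>();              // which steps actually ran

    // Runs steps in order, skipping those already COMPLETED; stops at the first failure.
    boolean run(Map<String, BooleanSupplier> steps) {
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if ("COMPLETED".equals(stepStatus.get(step.getKey()))) {
                continue; // finished in an earlier execution
            }
            executed.add(step.getKey());
            boolean ok = step.getValue().getAsBoolean();
            stepStatus.put(step.getKey(), ok ? "COMPLETED" : "FAILED");
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] archiveHealthy = {false};
        Map<String, BooleanSupplier> steps = new LinkedHashMap<>();
        steps.put("importStep", () -> true);
        steps.put("archiveStep", () -> archiveHealthy[0]);

        StepRestartSketch job = new StepRestartSketch();
        System.out.println(job.run(steps)); // false: archiveStep fails
        archiveHealthy[0] = true;
        System.out.println(job.run(steps)); // true: only archiveStep runs again
        System.out.println(job.executed);   // [importStep, archiveStep, archiveStep]
    }
}
```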
Retry and skip policies
| Policy | Behavior | Use when |
|---|---|---|
| Retry | Re-attempt failed chunk/item up to limit | Transient DB deadlock, network blip |
| Skip | Log bad record, continue processing | Malformed CSV row, business validation failure |
| No rollback skip | Skip without rolling back entire chunk | Item-level failures in otherwise good batch |
return new StepBuilder("importStep", jobRepository)
    .<CustomerRecord, Customer>chunk(100, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .retry(DeadlockLoserDataAccessException.class)
    .retryLimit(3)
    .skip(FlatFileParseException.class)
    .skip(ValidationException.class)
    .skipLimit(100)
    .listener(new SkipLoggingListener())
    .build();

class SkipLoggingListener implements SkipListener<CustomerRecord, Customer> {
    // SLF4J logger (org.slf4j.Logger / LoggerFactory), assumed to be on the classpath
    private static final Logger log = LoggerFactory.getLogger(SkipLoggingListener.class);

    @Override
    public void onSkipInRead(Throwable t) {
        log.warn("Skip read: {}", t.getMessage());
    }

    @Override
    public void onSkipInProcess(CustomerRecord item, Throwable t) {
        log.warn("Skip row: {} ({})", item, t.getMessage());
    }
}
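The two policies can be compared side by side in a plain-Java loop. This is a deliberately simplified sketch: `RetrySkipSketch` and its exception types are hypothetical, it works item by item, and it does not reproduce how a fault-tolerant step re-processes a rolled-back chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified model of retry and skip limits (illustrative only).
class RetrySkipSketch {
    static class TransientFailure extends RuntimeException {}
    static class BadRecord extends RuntimeException {}

    final List<String> written = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    // Allows up to maxAttempts tries per item for TransientFailure and up to skipLimit BadRecord items.
    void process(List<String> items, Function<String, String> processor, int maxAttempts, int skipLimit) {
        for (String item : items) {
            for (int attempt = 1; ; attempt++) {
                try {
                    written.add(processor.apply(item));
                    break;
                } catch (TransientFailure e) {
                    if (attempt >= maxAttempts) {
                        throw e; // attempts exhausted: fail the step
                    }
                    // otherwise loop and try the same item again
                } catch (BadRecord e) {
                    if (skipped.size() >= skipLimit) {
                        throw e; // too many bad records: fail the step
                    }
                    skipped.add(item); // record the bad item and move on
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] failuresLeft = {1}; // "b" fails once, then succeeds
        RetrySkipSketch step = new RetrySkipSketch();
        step.process(List.of("a", "b", "x", "c"), item -> {
            if (item.equals("b") && failuresLeft[0]-- > 0) throw new TransientFailure();
            if (item.equals("x")) throw new BadRecord();
            return item.toUpperCase();
        }, 3, 10);
        System.out.println(step.written); // [A, B, C]
        System.out.println(step.skipped); // [x]
    }
}
```

Both branches end in a throw once their limit is reached, which is what keeps a poison item from looping indefinitely.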
"Retry vs skip?" — Retry assumes success on a later attempt (transient failure). Skip assumes the item is permanently bad—record it and move on. Always cap both with limits so a poison pill cannot loop forever.
Partitioning for parallelism
Split one large step into partitions—each worker processes a slice of data (ID ranges, file segments, hash buckets). A manager step delegates to worker steps, often on a TaskExecutor thread pool.
flowchart TB
    M[Manager Step] --> P1[Worker partition 0\nrows 1–100k]
    M --> P2[Worker partition 1\nrows 100k–200k]
    M --> P3[Worker partition 2\nrows 200k–300k]
    P1 --> DB[(Database)]
    P2 --> DB
    P3 --> DB
@Bean
Step partitionedImportStep(JobRepository repo, Step workerStep,
                           Partitioner rangePartitioner, TaskExecutor batchExecutor) {
    return new StepBuilder("partitionedImportStep", repo)
        .partitioner("workerStep", rangePartitioner)
        .step(workerStep)
        .gridSize(8)
        .taskExecutor(batchExecutor)
        .build();
}
@Bean
Partitioner rangePartitioner(DataSource ds) {
    // ds would normally supply MIN(id) / MAX(id); the bounds are hard-coded for brevity
    return gridSize -> {
        Map<String, ExecutionContext> map = new HashMap<>();
        int min = 1, max = 800_000;
        int range = (max - min + gridSize) / gridSize; // ceiling, so no ids are left over
        for (int i = 0; i < gridSize; i++) {
            int lo = min + i * range;
            int hi = Math.min(lo + range - 1, max); // clamp the last partition to max
            ExecutionContext ctx = new ExecutionContext();
            ctx.putInt("minId", lo);
            ctx.putInt("maxId", hi);
            map.put("partition" + i, ctx);
        }
        return map;
    };
}

@Bean
@StepScope
JdbcCursorItemReader<OrderRow> partitionedReader(
        @Value("#{stepExecutionContext['minId']}") int minId,
        @Value("#{stepExecutionContext['maxId']}") int maxId,
        DataSource ds) {
    return new JdbcCursorItemReaderBuilder<OrderRow>()
        .name("partitionedReader") // required while saveState is enabled (the default)
        .sql("SELECT * FROM orders WHERE id BETWEEN ? AND ? ORDER BY id")
        .preparedStatementSetter(ps -> { ps.setInt(1, minId); ps.setInt(2, maxId); })
        .dataSource(ds)
        .rowMapper(new BeanPropertyRowMapper<>(OrderRow.class))
        .build();
}
Partition on an indexed column with non-overlapping ranges. Avoid hot partitions (e.g. all recent rows in one range); use hash partitioning or time-based splits for skewed data.
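As a rough illustration of the hash alternative, the sketch below assigns ids to buckets by remainder, so consecutive (for example, recent) ids spread across all workers instead of landing in one range. `HashPartitionSketch` is a hypothetical helper; in a real step each worker would need a matching filter in its query, and the exact modulo syntax and index behaviour depend on the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of hash (modulo) partitioning (illustrative only).
class HashPartitionSketch {

    // Maps an id to one of gridSize buckets; floorMod keeps the result non-negative.
    static int bucket(long id, int gridSize) {
        return (int) Math.floorMod(id, (long) gridSize);
    }

    // Groups ids by bucket so the distribution can be inspected.
    static Map<Integer, List<Long>> partition(List<Long> ids, int gridSize) {
        Map<Integer, List<Long>> buckets = new TreeMap<>();
        for (int i = 0; i < gridSize; i++) {
            buckets.put(i, new ArrayList<>());
        }
        for (long id : ids) {
            buckets.get(bucket(id, gridSize)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Eight consecutive ids end up evenly spread over four workers.
        System.out.println(partition(List.of(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), 4));
        // {0=[4, 8], 1=[1, 5], 2=[2, 6], 3=[3, 7]}
    }
}
```

The trade-off is that a remainder filter usually cannot use a plain index on the id column the way a BETWEEN range can, so it is worth measuring both approaches.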
Scheduling — @Scheduled or Quartz
Batch jobs are typically triggered on a schedule, not on every app startup (spring.batch.job.enabled=false).
@Component
@EnableScheduling
class NightlyImportScheduler {
    private final JobLauncher launcher;
    private final Job importJob;

    NightlyImportScheduler(JobLauncher launcher, Job importJob) {
        this.launcher = launcher;
        this.importJob = importJob;
    }

    // 02:00 every day, New York time
    @Scheduled(cron = "0 0 2 * * *", zone = "America/New_York")
    void runNightlyImport() throws Exception {
        JobParameters params = new JobParametersBuilder()
            .addString("inputFile", "/data/incoming/customers.csv")
            .addLong("runAt", System.currentTimeMillis()) // unique per run
            .toJobParameters();
        launcher.run(importJob, params);
    }
}
For clustered environments, complex calendars, or misfire handling, use Quartz with Spring—store triggers in JDBC so only one node fires a given job window.
@Bean
JobDetail batchJobDetail() {
    // Quartz's JobBuilder, fully qualified to avoid clashing with Spring Batch's JobBuilder
    return org.quartz.JobBuilder.newJob(BatchLauncherJob.class)
        .withIdentity("nightlyImport")
        .usingJobData("jobName", "importCustomersJob")
        .storeDurably()
        .build();
}
// BatchLauncherJob implements org.quartz.Job and calls JobLauncher inside execute()
In Kubernetes, prefer CronJob resources that invoke your app via HTTP/actuator endpoint or a one-shot pod—keeps scheduling outside the JVM. Use in-app @Scheduled for single-instance batch workers or when Quartz clustering is already in place.
BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, and BATCH_STEP_EXECUTION tables are the source of truth. Back them up, monitor growth, and archive old executions—millions of step rows slow the metadata queries that power restart and dashboards.