Simple ETL with Spring Batch

Spring Batch is a batch processing framework built on the Spring Framework. I usually use it to develop simple ETL (Extract, Transform and Load) programs.

In this post, I’ll show you how to write a simple ETL program. (This sample was tested on Spring Batch 3.0.10.)

Prerequisites

  • Database (MySQL or Oracle)
  • Spring Batch context database
  • Spring Batch libraries (spring-batch-core, Spring Framework, spring-jdbc)
  • Database JDBC driver
  • DBCP (optional)

Spring Batch context database

The Spring Batch context database must be created before a Spring Batch job can run. The table creation DDL can be found in spring-batch-core-<version>.jar (the org.springframework.batch.core package contains DDL scripts for several databases).

It contains the following tables.

  • BATCH_JOB_INSTANCE
  • BATCH_JOB_EXECUTION
  • BATCH_JOB_EXECUTION_PARAMS
  • BATCH_STEP_EXECUTION
  • BATCH_STEP_EXECUTION_CONTEXT
  • BATCH_JOB_EXECUTION_CONTEXT
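If you would rather not run the DDL by hand, Spring's jdbc namespace can execute it at context startup. The following is a minimal sketch only: it assumes the jdbc namespace is declared in your context file, and it refers to the repositoryDataSource bean defined later in this post. The script name must match your database (spring-batch-core ships schema-mysql.sql, schema-oracle10g.sql and others under org/springframework/batch/core).

```xml
<!-- Sketch only: create the Spring Batch meta tables at startup using the DDL
     bundled inside spring-batch-core. Choose the script for your database. -->
<jdbc:initialize-database data-source="repositoryDataSource">
    <jdbc:script location="classpath:org/springframework/batch/core/schema-oracle10g.sql"/>
</jdbc:initialize-database>
```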

Test scenario

In this post, I use two tables: TB_SOURCE and TB_TARGET. The sample program reads from TB_SOURCE and writes all data into TB_TARGET.

CREATE TABLE TB_SOURCE (
    col1 VARCHAR2(10),
    col2 VARCHAR2(20),
    col3 VARCHAR2(20)
);

CREATE TABLE TB_TARGET (
    new_col1 VARCHAR2(10),
    new_col2 VARCHAR2(20),
    new_col3 VARCHAR2(20)
);

Writing base Spring context

To run a Spring Batch program, several beans need to be declared.

  • Spring Batch Context DataSource
  • DataSource for source and target
  • TransactionManager for Spring Batch Context
  • TransactionManager for source and target
  • Job Repository bean
  • Job Launcher (which is the starting point of a job)

The following is a snippet of context.xml.

<bean id="repositoryDataSource"
      class="org.apache.commons.dbcp.BasicDataSource"
      destroy-method="close">
    <property name="driverClassName" value="oracle.jdbc.OracleDriver" />
    <property name="url" value="jdbc:oracle:thin:@guest01:1521:myora" />
    <property name="username" value="scott" />
    <property name="password" value="tiger" />
    <property name="initialSize" value="2" />
    <property name="minIdle" value="2" />
    <property name="maxActive" value="10" />
    <property name="maxIdle" value="5" />
</bean>

<bean id="testDataSource"
      class="org.apache.commons.dbcp.BasicDataSource"
      destroy-method="close">
    <property name="driverClassName" value="oracle.jdbc.OracleDriver" />
    <property name="url" value="jdbc:oracle:thin:@guest01:1521:myora" />
    <property name="username" value="scott" />
    <property name="password" value="tiger" />
    <property name="initialSize" value="2" />
    <property name="minIdle" value="2" />
    <property name="maxActive" value="10" />
    <property name="maxIdle" value="5" />
</bean>

<bean id="repositoryTransactionManager" lazy-init="true"
      class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="repositoryDataSource" />
</bean>

<bean id="testTransactionManager" lazy-init="true"
      class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
    <property name="dataSource" ref="testDataSource" />
</bean>

<batch:job-repository id="jobRepository"
    data-source="repositoryDataSource"
    transaction-manager="repositoryTransactionManager"
    isolation-level-for-create="READ_COMMITTED"
    table-prefix="BATCH_" />

<bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository" />
</bean>

Writing Job flow

Now we are ready to write an ETL program. A Job is also declared in the Spring context.xml. The following is the basic structure of a Job.

[Figure: Spring Batch job structure. A Job contains Steps; each Step has a Reader, a Processor and a Writer.]

  • A Job can have several Steps (for simplicity, this sample uses a single Step)
  • Each Step has a Reader, a Processor and a Writer
  • The Reader reads data from a database, a file or some other data store
  • The Reader invokes a RowMapper to convert each raw row into a Source VO
  • The Processor reads a Source VO and converts it into a Target VO
  • The Writer writes the Target VOs to a database, a file or some other data store
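The chunk-oriented flow described above can be sketched in plain Java. Note that these interfaces are simplified stand-ins, not the real Spring Batch API; the sketch only illustrates how the Reader, Processor, Writer and commit-interval cooperate inside a Step.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Spring Batch's ItemReader/ItemProcessor/ItemWriter.
interface Reader<I> { I read(); }                   // returns null when exhausted
interface Processor<I, O> { O process(I item); }
interface Writer<O> { void write(List<O> chunk); }

public class ChunkLoopSketch {
    // One Step: read items one by one, process each, write in chunks.
    static <I, O> void runStep(Reader<I> r, Processor<I, O> p, Writer<O> w,
                               int commitInterval) {
        List<O> chunk = new ArrayList<>();
        I item;
        while ((item = r.read()) != null) {
            chunk.add(p.process(item));
            if (chunk.size() == commitInterval) { // commit point: one write per chunk
                w.write(chunk);
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) w.write(chunk);     // flush the final partial chunk
    }

    public static void main(String[] args) {
        List<String> source = List.of("a", "b", "c");
        List<String> target = new ArrayList<>();
        int[] idx = {0};
        runStep(
            () -> idx[0] < source.size() ? source.get(idx[0]++) : null, // Reader
            s -> s.toUpperCase(),                                       // Processor
            target::addAll,                                             // Writer
            2);                                                         // commit-interval
        System.out.println(target); // [A, B, C]
    }
}
```

In the real framework, each chunk is also wrapped in a transaction by the tasklet's transaction manager, which is why commit-interval directly controls commit frequency.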

The following is a sample ETL job definition.

<batch:job id="TestJob01" job-repository="jobRepository">
    <batch:step id="SimpleStep">
        <batch:tasklet transaction-manager="testTransactionManager" allow-start-if-complete="true">
            <batch:chunk reader="TestReader"
                         processor="TestProcessor"
                         writer="TestWriter"
                         commit-interval="1"
                         reader-transactional-queue="false" />
        </batch:tasklet>
    </batch:step>
</batch:job>

<bean id="SourceMapper" class="test.rowmapper.TbSourceMapper" />

<bean id="TestReader"
      class="org.springframework.batch.item.database.JdbcCursorItemReader"
      scope="step">
    <property name="dataSource" ref="testDataSource" />
    <property name="sql" value="SELECT COL1, COL2, COL3 FROM TB_SOURCE" />
    <property name="rowMapper" ref="SourceMapper" />
    <property name="fetchSize" value="100" />
    <property name="maxRows" value="0" />
</bean>

<bean id="TestProcessor" class="test.processor.TestProcessor" scope="step" />

<bean id="TestWriter"
      class="org.springframework.batch.item.database.JdbcBatchItemWriter"
      scope="step">
    <property name="assertUpdates" value="true" />
    <property name="itemSqlParameterSourceProvider">
        <bean class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
    </property>
    <property name="sql"
              value="INSERT INTO TB_TARGET(NEW_COL1, NEW_COL2, NEW_COL3)
                     VALUES(:newCol1, :newCol2, :newCol3)" />
    <property name="dataSource" ref="testDataSource" />
</bean>

As the sample above shows, when a Reader or Writer targets a database, SQL can be used directly, which is one of the strengths of Spring Batch.

Writing RowMapper

A RowMapper is a component that converts a raw source row into a Source VO. It must implement org.springframework.jdbc.core.RowMapper. A sample RowMapper follows.

import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.RowMapper;

import test.vo.SourceVO;

public class TbSourceMapper implements RowMapper<SourceVO> {
    @Override
    public SourceVO mapRow(ResultSet rs, int rowNum) throws SQLException {
        SourceVO vo = new SourceVO();
        vo.col1 = rs.getString(1);
        vo.col2 = rs.getString(2);
        vo.col3 = rs.getString(3);
        return vo;
    }
}

Writing Processor

The Processor is the core of a Spring Batch job: it can transform source data, validate it, or execute any additional logic. In this sample it simply maps the source columns to new column names. The sample Processor is as follows.

import org.springframework.batch.item.ItemProcessor;

import test.vo.SourceVO;
import test.vo.TargetVO;

public class TestProcessor implements ItemProcessor<SourceVO, TargetVO> {
    @Override
    public TargetVO process(SourceVO item) throws Exception {
        TargetVO targetVO = new TargetVO();
        targetVO.new_col1 = item.col1;
        targetVO.new_col2 = item.col2;
        targetVO.new_col3 = item.col3;
        return targetVO;
    }
}

Writing Source VO

public class SourceVO {
    public String col1;
    public String col2;
    public String col3;
}

Writing Target VO

In this sample, the Target VO must have getter methods: BeanPropertyItemSqlParameterSourceProvider resolves the named SQL parameters (:newCol1 and so on) through bean properties, i.e. the getters.

public class TargetVO {
    public String new_col1;
    public String new_col2;
    public String new_col3;

    public String getNewCol1() {
        return this.new_col1;
    }

    public String getNewCol2() {
        return this.new_col2;
    }

    public String getNewCol3() {
        return this.new_col3;
    }
}

 

And there are no more components to write, except for a program that invokes this sample Job.
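A minimal launcher might look like the following. This is only a sketch: the context file name context.xml and the use of a timestamp job parameter are assumptions, and compiling it requires the Spring Batch and Spring Framework jars on the classpath.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class JobMain {
    public static void main(String[] args) throws Exception {
        // Load the beans declared above; "context.xml" is an assumed file name.
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("context.xml");
        try {
            JobLauncher launcher = ctx.getBean("jobLauncher", JobLauncher.class);
            Job job = ctx.getBean("TestJob01", Job.class);
            // A unique parameter lets the same job be re-run as a new JobInstance.
            JobExecution exec = launcher.run(job,
                    new JobParametersBuilder()
                            .addLong("run.ts", System.currentTimeMillis())
                            .toJobParameters());
            System.out.println("Exit status: " + exec.getExitStatus());
        } finally {
            ctx.close();
        }
    }
}
```

Spring Batch also ships a ready-made command-line entry point, CommandLineJobRunner, which takes the context path and job name as arguments and can be used instead of writing your own main class.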

You can download the full source from GitHub.
