Accelerating Data Loading into a Data Lake Using CDC


Organizations are building Hadoop-based data lakes to gain insight into their enterprise data and perform analytics over it. People have realized the advantages of having an enterprise data lake, and the recent trend is to host these lakes in the cloud to get the agility, scalability and monetary benefits of the pay-as-you-go model. However, many have now started to hit roadblocks because they are unable to keep the data in their lakes fresh, and we all know how stale water can make your lake smell like a swamp!

So organizations must now find new ways to acquire, load, store, transform and manage data in their data lakes. In this article we will talk about one such approach, Change Data Capture (CDC), which will help you keep your data lake a pristine enterprise data lake rather than a swamp.

Gone are the days of SQOOPing; CDC is the new-era approach

There was a time when everyone felt the need to bring data from their traditional RDBMS into a Hadoop data lake using Apache Sqoop. It was, and still is today, a good solution for organizations that are just starting their journey. It is free, can perform full and incremental data loads, supports multiple databases, can be integrated with Apache Oozie for scheduling, and can load data directly into Apache Hive. So what's the issue?
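To make the Sqoop workflow concrete, here is a small Python sketch that assembles a typical incremental `sqoop import` invocation. The connection string, table and column names are hypothetical; only the flags themselves come from Sqoop's standard incremental-import interface.

```python
# Build a typical Sqoop incremental-import command line.
# All connection details, table and column names below are hypothetical.

def build_sqoop_import(jdbc_url, table, check_column, last_value, num_mappers=4):
    """Return the argv list for an incremental 'sqoop import' run."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--incremental", "append",          # fetch only rows newer than last_value
        "--check-column", check_column,     # needs a monotonically increasing column
        "--last-value", str(last_value),
        "--num-mappers", str(num_mappers),  # more mappers = more load on the source DB
        "--hive-import",                    # land the data directly in Hive
    ]

cmd = build_sqoop_import(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "ORDERS", "UPDATED_AT", "2018-01-01")
print(" ".join(cmd))
```

Note how the very flags that make Sqoop usable (`--num-mappers`, `--check-column`) are also the source of the issues discussed next: mapper count drives source-DB load, and the check column forces a schema requirement onto the source.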

Issues with Sqoop

  • Load on source system – Apache Sqoop uses MapReduce to load data from the source database. To increase throughput we need to increase the number of mapper tasks, since more mappers means more concurrent data-transfer tasks. However, this also increases the load on the source database, because more concurrent queries are executed against it, impacting the other queries running on the source system.
  • Limit on throughput – Since increasing the number of mappers puts more load on the source database, once you hit the saturation point of the source DB you cannot increase the number of mappers any further. This caps the achievable throughput.
  • High latency and operational challenges – Because Sqoop loads the source system, organizations in practice run their Sqoop jobs during off-peak hours for the source system. Depending on the business, this off-peak window could be at night, on weekends or at month end. This introduces latency and poses an operational challenge in keeping the data in your data lake fresh.
  • Bulk ingestion and uneven utilization of resources – As discussed above, organizations tend to run their Sqoop loads during the off-peak hours of the source system. This means bulk ingestion must be done, and the jobs must finish, within that off-peak window, which leads to uneven utilization of your resources.
  • Changes in source application and DB – To reduce the amount of data fetched, Sqoop lets you fetch only new/changed rows from source tables, but to identify new data you need a timestamp (or similar monotonically increasing) column in the source tables. If this information is not available, the source application and tables need to be changed.
  • Issues when the schema changes – Sqoop does not capture schema changes, so your ingestion process may fail when the schema changes, and you may have to do additional development to absorb those changes in your ingestion process.
  • Issues loading data from sources like mainframes – Sqoop has only very limited support for mainframe systems, so in practice you either have to load such data into an RDBMS first or ingest extracted files.
  • Inability to handle mid-job failures – If a Sqoop job fails partway through, you first have to clean up the data from the failed run and then restart the job. This means extra care is needed to handle these failure scenarios.

Here comes CDC  

With CDC tools like Attunity and IBM CDC on the market, you can get rid of all the above issues.

How CDC tools work and their advantages

These CDC tools read the transaction logs produced by the RDBMS to identify new changes and deliver those changes to HDFS or Kafka. You can then take this incremental data and apply it to your base data to produce a newer version of the base data set.
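The change stream such tools emit is essentially a sequence of change records: an operation type plus before/after row images, ordered by transaction. The exact format differs per vendor; the record shape below is a simplified, hypothetical one for illustration.

```python
# A simplified, hypothetical CDC change record, roughly modelling what
# log-based tools emit: an operation type plus before/after row images.
# Real Attunity / IBM CDC payloads have different field names.

def make_change(op, key, after=None, before=None, txn_id=0):
    assert op in ("I", "U", "D")  # insert, update, delete
    return {"op": op, "key": key, "before": before, "after": after, "txn_id": txn_id}

changes = [
    make_change("I", 101, after={"id": 101, "amount": 50}, txn_id=1),
    make_change("U", 101, before={"id": 101, "amount": 50},
                after={"id": 101, "amount": 75}, txn_id=2),
    make_change("D", 101, before={"id": 101, "amount": 75}, txn_id=3),
]
```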

Because these tools read the transaction logs produced by the source DBs, they do not put extra load on the databases. You can configure them to read the transaction logs remotely, so nothing needs to be installed on the source system. These CDC tools also take care of mid-transfer failure scenarios, since they track how far the data has already been delivered to the target.

CDC-SQOOP

In our recent project, we needed to ingest data from more than 400 systems into our data lake. We started by identifying the true CDC candidates. The criteria we used were: source systems whose data changes frequently (e.g. on an hourly basis) and sources that contain transactional data. We identified 50-60 sources for our first release. We are using Attunity and IBM CDC as our CDC tools; both are used for commercial reasons and because of our client's relationship with these vendors. Currently we are ingesting data from Oracle, IBM DB2 and mainframes (z-series, i-series).

[Image: CDC]

Additional challenges in our current project

Our client has already been using tools like IBM CDC and Attunity for the past few years, but only to bring data into a traditional EDW like Teradata. Though these CDC tools can send data to multiple targets, doing so means reading the transaction logs multiple times, which requires additional I/O.

The other challenge was the presence of multiple data lakes within the organization. Every department has created its own little data lake, which defeats the whole purpose of creating an enterprise data lake. Everybody needs data from the same sources, but nobody is ready to decommission their version of the data lake until it is proven that the enterprise data lake can fulfil their data requirements, along with clarity on other aspects like the operating model: how alerting and monitoring would work, infrastructure cost sharing among departments, the support model, and so on.

All these concerns are valid, and it will take time for everybody to shift to the enterprise data lake. The client also has a vision of moving the enterprise data lake to the cloud, so as a middle path we have gone with the architecture below.

[Image: CurrentArchCDC]

Pushing data to Kafka allows us to hook any number of consumers onto it, so all the existing departmental data lakes can still get their data from the central Kafka cluster. Looking at the future need to move to the cloud, NiFi will allow us to integrate with cloud services and push data to them.
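Kafka's consumer-group model is what makes this fan-out work: each group keeps its own offset into the topic, so every departmental lake can read the full stream independently without interfering with the others. A toy sketch of that idea (not a real Kafka client):

```python
# Toy model of Kafka-style fan-out: one append-only topic log, many
# consumer groups, each with its own committed offset. Not a real client.

class Topic:
    def __init__(self):
        self.log = []       # append-only record log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def poll(self, group):
        """Return all records this group has not yet seen, and advance its offset."""
        start = self.offsets.get(group, 0)
        records = self.log[start:]
        self.offsets[group] = len(self.log)
        return records

topic = Topic()
topic.produce({"table": "ORDERS", "op": "I"})
topic.produce({"table": "ORDERS", "op": "U"})

finance = topic.poll("finance-lake")      # each department consumes independently
marketing = topic.poll("marketing-lake")  # and still sees the full stream
```

This is why adding another departmental consumer costs the source systems nothing: the CDC tool writes each change once, and every group replays it from the log.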

Applying incremental updates to base data set

Once you get the incremental data set into your Hadoop environment on HDFS, you need to merge it with the base data set.

Attunity Compose

Attunity provides Attunity Compose for this task. It requires an additional license and Hortonworks Data Platform 2.6 or above, as it depends on the ACID MERGE feature introduced in HDP 2.6.

Spark based Consolidation framework

It was fairly simple to write such a component in Spark SQL that picks up the incremental data and merges it into the base data to create a new base data set, so that is what we have currently chosen.
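The real framework runs as Spark SQL, but the merge itself boils down to a simple rule: for each key, keep the latest record by transaction ID, and drop the key entirely if the latest change is a delete. A minimal pure-Python sketch of that logic, with hypothetical record fields:

```python
# Minimal sketch of the consolidation step: apply incremental CDC records
# (op = I/U/D, ordered by txn_id) on top of a base data set keyed by "id".
# Field names are hypothetical; the real framework does this in Spark SQL.

def consolidate(base, increments):
    """Return a new base data set with the incremental changes applied."""
    merged = {row["id"]: row for row in base}
    for change in sorted(increments, key=lambda c: c["txn_id"]):
        if change["op"] == "D":
            merged.pop(change["id"], None)    # delete removes the key
        else:                                 # insert or update: take latest row image
            merged[change["id"]] = change["row"]
    return list(merged.values())

base = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
increments = [
    {"op": "U", "id": 1, "row": {"id": 1, "amount": 15}, "txn_id": 101},
    {"op": "D", "id": 2, "row": None, "txn_id": 102},
    {"op": "I", "id": 3, "row": {"id": 3, "amount": 30}, "txn_id": 103},
]
new_base = consolidate(base, increments)
```

In Spark SQL the same effect is typically achieved with a union of base and increments, a window function to rank rows per key by transaction ID, and a filter that keeps rank 1 and discards deletes.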

[Image: CDC_Arch]

Other aspects

There are various other aspects as well, like versioning, partitioning of data in HDFS, schema change handling, Kafka topic modelling and partitioning strategy, etc.

Schema changes are handled using a schema registry and also through the consolidation framework. In addition, the format of the incremental data produced by IBM CDC differs from Attunity's, so we have written a component that brings both kinds of data into a common format so that the consolidation framework works for both.
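As an illustration of that normalisation component, here is a hypothetical sketch. The two input shapes below are invented for the example (the real IBM CDC and Attunity payloads differ), but the idea is the same: map each tool's record onto one common schema before consolidation.

```python
# Hypothetical sketch of normalising two CDC vendors' change records into
# one common format. The input shapes are invented for illustration; the
# real IBM CDC and Attunity payloads look different.

def to_common(record, source):
    if source == "toolA":     # e.g. {"operation": "UPDATE", "data": {...}, "seq": 7}
        op_map = {"INSERT": "I", "UPDATE": "U", "DELETE": "D"}
        return {"op": op_map[record["operation"]],
                "row": record["data"],
                "txn_id": record["seq"]}
    elif source == "toolB":   # e.g. {"op": "U", "after_image": {...}, "tx": 7}
        return {"op": record["op"],
                "row": record["after_image"],
                "txn_id": record["tx"]}
    raise ValueError("unknown source: %s" % source)

a = to_common({"operation": "UPDATE", "data": {"id": 1}, "seq": 7}, "toolA")
b = to_common({"op": "U", "after_image": {"id": 1}, "tx": 7}, "toolB")
```

Once both feeds look identical, a single consolidation job can process them without caring which vendor produced each record.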

Kafka topics are created on a per-source basis, and topic partitioning is done based on transaction ID.
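Keying messages by transaction ID means every record belonging to the same transaction lands in the same partition, preserving per-transaction ordering. The partition choice is essentially a hash of the key modulo the partition count, sketched below (Kafka's default partitioner actually uses murmur2; `crc32` is used here only to keep the sketch dependency-free):

```python
import zlib

# Sketch of key-based partitioning: records sharing a key (here, the
# transaction ID) always land in the same partition, so their relative
# order within that partition is preserved. Kafka's default partitioner
# uses murmur2; crc32 is a stand-in for illustration.

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("txn-42", 8)
p2 = partition_for("txn-42", 8)  # same key -> same partition, every time
```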

Currently, consolidation runs every 2 hours.
