Monday, March 2, 2015

Part 2: Incremental Load - Using Lookup Component and Hash Function


Continue with my Previous Posts:

Part 1: Incremental Load


In this approach, we maintain hash value in destination table; create hash value during transformation on source rows.


Figure: Package Screen Shot




Figure: (DFT) Load Data Into Destination Table




Source Sql in current example, That can either be based on CreateDate and LastModifiedDate, to get only new and modified  rows from source:
SELECT
  [CustomerAlternateKey] -- this is our business key (BK)
 ,[Title]
 ,[FirstName]
 ,[MiddleName]
 ,[LastName]
 ,[BirthDate]
 ,[MaritalStatus]
 ,[Suffix]
 ,[Gender]
 ,[EmailAddress]
 ,[YearlyIncome]
 ,[TotalChildren]
 ,[NumberChildrenAtHome]
 ,[EnglishEducation]
 ,[EnglishOccupation]
 ,[HouseOwnerFlag]
 ,[NumberCarsOwned]
FROM [dbo].[DimCustomer];

Destination table in current example: (Destination table has Surrogate Key column as well as Hash column)
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyDestination]')
               AND type in (N'U'))
BEGIN
 CREATE TABLE [dbo].[MyDestination](
  [SK_Customers] [int] IDENTITY(1,1) NOT NULL,
  [CustomerAlternateKey] [nvarchar](15) NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [DWH_Hash] [char](66) NULL,
  SCD_StartDate      DateTime Null,
  SCD_EndDate DateTime Null,
  CurrentFlag Bit NULL,
  CONSTRAINT [PK_MyDestination] PRIMARY KEY CLUSTERED
   ([SK_Customers] ASC)
 );
END
GO

The original incremental load pattern uses an OLE DB Command in the data flow to perform the updates. Mostly “OLEDB Command Component” dislike, because it issues a statement for every single row that passes through the component, which is horribly slow for large data sets and it can also bloat the transaction log. Therefore a staging table is created that will hold the update rows so that a set-based UPDATE statement can be issued in the third task. This improves performance immensely.
Task 1 ((SQL) Create Update Table): To Create staging table for modified rows, that can be drop later in task 4
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]')
           AND type in (N'U'))
   DROP TABLE [dbo].[UPD_MyDestination];
GO
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]')
               AND type in (N'U'))
BEGIN
 CREATE TABLE [dbo].[UPD_MyDestination](
  [SK_Customers] [int] NOT NULL,
  [CustomerAlternateKey] [nvarchar](15) NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [DWH_Hash] [char](66) NULL
 );
END
GO

Task 2 ((DFT) Load Data Into Destination Table):
A. Set the source
B.  Add a script task to create a HashByte value for each row using System.Security.Cryptography class, except primary key columns, that will never be change. There we will add the Hash column to the output with string data type and length 66.







C. Lookup task is configured with destination table as full cache, that’s return the matched rows (Modified rows) and unmatched rows (New row). Unmatched rows is directly pushed into destination table.
SELECT SK_Customers,CustomerAlternateKey,DWH_Hash FROM MyDestination;









D. Whereas Matched rows (Modified rows) can further be differentiating with conditional split by comparing hash value. And push only those rows where hash values are not matched with destination.



Task 3 ((SQL) Update Destination Table) Now we have the modified rows in staging table. That can be directly updated using update statement.
UPDATE dest
SET  [Title]    = upd.[Title]   
 ,[FirstName]   = upd.[FirstName]  
 ,[MiddleName]   = upd.[MiddleName]  
 ,[LastName]    = upd.[LastName]   
 ,[BirthDate]   = upd.[BirthDate]  
 ,[MaritalStatus]  = upd.[MaritalStatus] 
 ,[Suffix]    = upd.[Suffix]   
 ,[Gender]    = upd.[Gender]   
 ,[EmailAddress]   = upd.[EmailAddress]  
 ,[YearlyIncome]   = upd.[YearlyIncome]  
 ,[TotalChildren]  = upd.[TotalChildren] 
 ,[NumberChildrenAtHome] = upd.[NumberChildrenAtHome]
 ,[EnglishEducation]  = upd.[EnglishEducation] 
 ,[EnglishOccupation] = upd.[EnglishOccupation]
 ,[HouseOwnerFlag]  = upd.[HouseOwnerFlag] 
 ,[NumberCarsOwned]  = upd.[NumberCarsOwned] 
 ,DWH_Hash           =upd.[DWH_Hash]
FROM [dbo].[MyDestination]  dest
JOIN [dbo].[UPD_MyDestination] upd ON dest.SK_Customers = upd.SK_Customers;



No comments:

Post a Comment