Thursday, March 26, 2015

SQL Basic : Index Rebuild or Reorganize via Index Fragmentation






CREATE PROCEDURE dbo.[IndexRebuildOrRecreate]
AS
DECLARE @TableName NVARCHAR(500);
DECLARE @SQLIndex NVARCHAR(MAX);
DECLARE @RowCount INT;
DECLARE @Counter INT;

DECLARE @IndexAnalysis TABLE
    (
      AnalysisID INT IDENTITY(1, 1)
                     NOT NULL
                     PRIMARY KEY ,
      TableName NVARCHAR(500) ,
      SQLText NVARCHAR(MAX) ,
      IndexDepth INT ,
      AvgFragmentationInPercent FLOAT ,
      FragmentCount BIGINT ,
      AvgFragmentSizeInPages FLOAT ,
      PageCount BIGINT
    )

BEGIN
    INSERT  INTO @IndexAnalysis
            SELECT  [objects].name ,
                    'ALTER INDEX [' + [indexes].name + '] ON ['
                    + [schemas].name + '].[' + [objects].name + '] '
                    + ( CASE WHEN (   [dm_db_index_physical_stats].avg_fragmentation_in_percent >= 20
                                    AND [dm_db_index_physical_stats].avg_fragmentation_in_percent < 40
                                  ) THEN 'REORGANIZE'
                             WHEN [dm_db_index_physical_stats].avg_fragmentation_in_percent >= 40
                             THEN 'REBUILD'
                        END ) AS zSQL ,
                    [dm_db_index_physical_stats].index_depth ,
                    [dm_db_index_physical_stats].avg_fragmentation_in_percent ,
                    [dm_db_index_physical_stats].fragment_count ,
                    [dm_db_index_physical_stats].avg_fragment_size_in_pages ,
                    [dm_db_index_physical_stats].page_count
            FROM    [sys].[dm_db_index_physical_stats](DB_ID(), NULL, NULL,
                                                       NULL, 'LIMITED') AS   [dm_db_index_physical_stats]
                    INNER JOIN [sys].[objects] AS [objects] ON (   [dm_db_index_physical_stats].[object_id] = [objects].[object_id] )
                    INNER JOIN [sys].[schemas] AS [schemas] ON ( [objects].[schema_id]  = [schemas].[schema_id] )
                    INNER JOIN [sys].[indexes] AS [indexes] ON (  [dm_db_index_physical_stats].[object_id] = [indexes].[object_id]
                                                          AND  [dm_db_index_physical_stats].index_id = [indexes].index_id
                                                          )
            WHERE   index_type_desc <> 'HEAP'
                    AND [dm_db_index_physical_stats].avg_fragmentation_in_percent > 20
END

SELECT  @RowCount = COUNT(AnalysisID)
FROM    @IndexAnalysis

SET @Counter = 1
WHILE @Counter <= @RowCount
    BEGIN

        SELECT  @SQLIndex = SQLText
        FROM    @IndexAnalysis
        WHERE   AnalysisID = @Counter

        EXECUTE sp_executesql @SQLIndex

        SET @Counter = @Counter + 1

    END
GO
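
To run the maintenance procedure (for example from a scheduled SQL Agent job), a minimal usage sketch:

EXEC dbo.[IndexRebuildOrRecreate];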




Monday, March 9, 2015

SQL Basic - Constraints




Heap Tables
1. Heaps and clustered indexes directly affect how the data in their underlying tables is stored. Nonclustered indexes are independent of row storage.
2. When a table is first created, the initial storage structure is called a heap. Rows are inserted into the table in the order in which they are added. A table will use a heap until a clustered index is created on the table.

XML Index
1. Based on XML column.
2. For every node in an XML document an entry is made in the XML index. This information is persisted in internal tables.
3. Creating and maintaining XML indexes can be quite costly. Every time the index is updated, it needs to shred all of the nodes of the XML document into the XML index.
4. For example, in a library system that lists all books by their author/title, if we now want to list books based on the chapters they contain, we need chapter-level information per book. In the same way, an XML index makes SQL Server maintain node-level information in internal tables, tied to the clustered index on the table.

Primary Key Constraint
1. A primary key is a unique value that identifies a record within a table.
2. No NULL values are allowed in a primary key column.
3. There can be only one primary key per table.
4. Typically a primary key is created on a single column, but it can be a composite primary key on multiple columns. A composite primary key can include up to 16 columns and cannot exceed 900 bytes.

Unique Key Constraint
1. A unique key is similar to a primary key, with one exception: it can contain one NULL value.
2. There can be more than one unique key in a table.

Foreign Key Constraint: Defines referential integrity between two tables; the tables are logically connected through a primary key (parent table) and a foreign key (child table).

Check Constraint: A check constraint validates an expression when data is inserted or updated. During bulk inserts or heavy insert/update activity, check constraints can be disabled and re-enabled afterwards.

On Delete Cascade / On Update Cascade: cascading referential actions on a foreign key, often used in place of triggers for this purpose; when a row in the parent table is updated or deleted, the matching rows in the child table are automatically updated or deleted. A combined example of these constraints is sketched below.
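
A minimal sketch pulling these constraints together (the table and column names are hypothetical):

CREATE TABLE dbo.Customers
(
    CustomerID INT IDENTITY(1,1) NOT NULL,
    Email      NVARCHAR(256) NULL,
    Age        TINYINT NULL,
    CONSTRAINT PK_Customers PRIMARY KEY (CustomerID),    -- one per table, no NULLs allowed
    CONSTRAINT UQ_Customers_Email UNIQUE (Email),         -- allows a single NULL value
    CONSTRAINT CK_Customers_Age CHECK (Age >= 18)         -- checked on insert and update
);

CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL,
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
        REFERENCES dbo.Customers (CustomerID)
        ON DELETE CASCADE   -- child rows are removed automatically with the parent
);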


System Tables for Indexes:

sys.indexes: provides one row per index or heap for every table, view, and table-valued function, including the index type (clustered, nonclustered, heap).
sys.index_columns: lists all key columns as well as included columns within an index.
sys.xml_indexes: returns one row per XML index.
sys.column_store_dictionaries: returns one row per columnstore dictionary.
sys.column_store_segments: returns one row per column segment of a columnstore index.
A sample query against these views is sketched below.
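
For example, a quick way to list indexes with their key and included columns (a sketch that can run in any user database):

SELECT  o.name AS TableName,
        i.name AS IndexName,
        i.type_desc AS IndexType,            -- CLUSTERED / NONCLUSTERED
        c.name AS ColumnName,
        ic.is_included_column
FROM    sys.indexes AS i
        INNER JOIN sys.objects AS o ON o.[object_id] = i.[object_id]
        INNER JOIN sys.index_columns AS ic ON ic.[object_id] = i.[object_id]
                                          AND ic.index_id = i.index_id
        INNER JOIN sys.columns AS c ON c.[object_id] = ic.[object_id]
                                   AND c.column_id = ic.column_id
WHERE   o.is_ms_shipped = 0
ORDER BY o.name, i.name, ic.key_ordinal;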




SQL Basic - Fillfactor




“Fillfactor” is a setting for indexes in SQL Server. When you create or rebuild an index, you can tell SQL Server what percentage of each 8 KB data page in the “leaf” level of the index it should fill. By default, SQL Server uses a 100% fillfactor, which means the maximum number of rows is packed into each page. If a page is completely full and new data must be inserted into that page, a “page split” happens to accommodate the new data. This page split process is expensive in terms of resources. An example of setting the fillfactor during a rebuild is shown below.
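
For example, leaving 20% free space on each leaf page (the index and table names below are from the AdventureWorks sample database and are only illustrative):

ALTER INDEX IX_SalesOrderDetail_ProductID
ON Sales.SalesOrderDetail
REBUILD WITH (FILLFACTOR = 80);   -- leaf pages are filled to 80%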

References:
http://www.brentozar.com/archive/2013/04/five-things-about-fillfactor/




SQL Basic - Organizational Structure



Heap: A heap is the default organizational structure in SQL Server; it organizes the data pages and extents of a table when no clustered index has been created on it. In a heap, rows are not inserted in any particular order: as soon as a new row is inserted, it is placed at the end of the data pages and a pointer is added in the IAM (Index Allocation Map) page.

B-tree: A B-tree, or balanced tree, is the most commonly used structure for organizing indexes, both clustered and nonclustered. In a B-tree, pages are organized in a hierarchical tree structure. B-trees start with an IAM page that identifies where the first page of the B-tree is located. That first page is an index page, also known as the root-level page; it contains key values and page addresses for the next pages in the index. Depending on the size of the index, the next level may consist of data pages or of additional index pages known as intermediate levels. The level of pages below the root and intermediate levels, known as the leaf level, contains all the data pages of the index. In a clustered index these data pages contain the data itself, whereas in a nonclustered index they contain pointers to where the data rows are actually stored.



Columnar / Columnstore Index: This is a new organizational structure introduced in SQL Server 2012. SQL Server 2012 introduced nonclustered columnstore indexes, whereas SQL Server 2014 has both clustered and nonclustered columnstore indexes. They are mainly useful for data warehouse queries, not for OLTP systems.

A few limitations (of the SQL Server 2012 nonclustered columnstore index):
A columnstore index makes the table read-only.
A columnstore index cannot be created on a view, and only one columnstore index is allowed per table.
A columnstore index cannot have more than 1024 columns and cannot include sparse columns.
A columnstore index definition cannot be changed using the ALTER INDEX statement; we need to drop and re-create it instead.
The following data types are not supported in a columnstore index: binary, varbinary, text, ntext, varchar(max), nvarchar(max).
     
How it Works
In a rowstore, all of the column values for a row are stored together in pages. In a columnstore, each column is stored in its own segments (a segment holds roughly a million rows). The rows in every column segment follow the same order (the clustered index order, if one exists), which means that if a particular row is found at a given position in one segment, the values belonging to that row are found at the same position in the other column segments. A creation example is sketched below.
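
A minimal creation sketch (the table and column names are hypothetical):

-- SQL Server 2012+: nonclustered columnstore index on selected columns
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDateKey, ProductKey, SalesAmount);

-- SQL Server 2014+: clustered columnstore index covering the whole table
-- CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;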

References:
https://www.simple-talk.com/sql/database-administration/columnstore-indexes-in-sql-server-2012/
http://www.pythian.com/blog/column-store-indexes-in-sql-server-2012/



SQL Basic - Storage Basic




Pages
1. Pages are used by SQL Server to store everything in the database: data, metadata, indexes, everything.
2. When space is allocated to database data files, all of the space is divided into pages. During allocation, each page is created to use 8 KB (8192 bytes) of space, and pages are numbered starting at 0, incrementing by 1 for every page allocated.
3. There are three primary components to a page: the page header, records, and the offset array. All pages begin with the page header. The header is 96 bytes and contains meta-information about the page, such as the page number, the owning object, and the type of page. At the end of the page is the offset array. The offset array is 36 bytes and provides pointers to the byte locations where rows start on the page. Between these two areas are 8060 bytes where records are stored. As rows are added to a page, each row is placed in the first open position in the records area, and its starting location is then stored in the last available position in the offset array. In other words, every time a row is inserted, it is added at the end of the used space on the page and an entry is made in the offset array.


Extents
1. Pages are grouped together eight at a time into structures called extents. There are two types of extents used by SQL Server databases: mixed and uniform extents.
2. In mixed extents, the pages can be allocated to multiple objects. A table will use mixed extents as long as its total size is less than eight pages; once the number of pages in a table exceeds eight, it begins using uniform extents. In a uniform extent, all pages in the extent are allocated to a single object in the database. Because of this, pages for an object will be contiguous, which increases the number of pages of an object that can be read in a single read.

SQL Server stores information in the database on 8 KB pages. In general, records in tables are limited to that size; if they are smaller than 8 KB, SQL Server stores more than one record per page. When an updated row no longer fits on its page, SQL Server handles it in one of two ways:
Forwarded Records: The forwarded-records approach applies when the table is a heap. In a heap, when a record is updated so that it no longer fits within its page (8 KB), the record is removed from the current page and written at the end of the next available page (or a new page). For example, if a table has 4 records and each record consumes 2 KB, one page is essentially full. If we then update one of those records so that it grows to 2.5 KB, it no longer fits in the existing page, so SQL Server writes the record to the next available page in the heap and leaves a forwarding pointer to the new location. One drawback is that SQL Server does not physically free the space on the old page, and because the modified record lands at the end of the available pages, this becomes a significant source of fragmentation.

Page Split: The page-split approach applies when a table has either a clustered or a nonclustered index. In this approach, if a row is updated to a size that no longer fits on its data page, SQL Server takes half of the records on that page and places them on a new page (or the next available page), then attempts to write the row to the original page again. If the data now fits on the page, the page is written; if not, the process is repeated until it fits.

The main difference between page splits and forwarded records is that a page split moves records to make room for the row within the logical index ordering, whereas a forwarded record moves the updated record to the end of the next available space.

Performance issues occur with both approaches. The page that is being split must be locked exclusively while the records are divided between the two pages, and after the split the physical order of the data pages in the index is almost always out of line with their logical order, which means more pages have to be read into memory for a query to execute. Forwarded records and fragmentation can be checked as sketched below.
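
A sketch of checking fragmentation and forwarded records with sys.dm_db_index_physical_stats (the DETAILED mode reads every page, so it can be slow on large databases):

SELECT  OBJECT_NAME(ps.[object_id]) AS TableName,
        ps.index_type_desc,
        ps.avg_fragmentation_in_percent,
        ps.forwarded_record_count          -- populated only for heaps
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ps
ORDER BY ps.avg_fragmentation_in_percent DESC;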




SQL Basic - Data Types



Char / nChar: Fixed-length character data; nChar is used for Unicode / special characters.

VarChar / nVarChar / VarChar(Max) / nVarChar(Max): Variable-length character data. The maximum length of VarChar is 8,000 characters and of nVarChar is 4,000 characters; VarChar(Max) and nVarChar(Max) can each store up to 2 GB of data.

Datetime2 / DateTime / SmallDateTime / Date / Time: The DateTime range is January 1, 1753 through December 31, 9999, and a single datetime value needs 8 bytes of storage; internally the first four bytes represent the number of days before or after 1 January 1900, and the remaining 4 bytes represent the time as the number of clock ticks (1/300 of a second) after midnight.

DateTimeOffset: A datetime with a time zone offset; hours are in 24-hour format and the time zone offset ranges from -14:00 to +14:00.

TinyInt (0 to 255), SmallInt (-32,768 to 32,767), Int, BigInt

Decimal / Float / Real / Numeric: store decimal values (Decimal/Numeric are exact numeric types, Float/Real are approximate).

Bit: Stores 0, 1, or NULL; the smallest data type.
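
A quick sketch to check the storage size of a few of these types with DATALENGTH:

DECLARE @i INT = 42, @bi BIGINT = 42, @dt DATETIME = GETDATE(),
        @s VARCHAR(50) = 'abc', @ns NVARCHAR(50) = N'abc';
SELECT  DATALENGTH(@i)  AS IntBytes,       -- 4
        DATALENGTH(@bi) AS BigIntBytes,    -- 8
        DATALENGTH(@dt) AS DateTimeBytes,  -- 8
        DATALENGTH(@s)  AS VarCharBytes,   -- 3 (1 byte per character)
        DATALENGTH(@ns) AS NVarCharBytes;  -- 6 (2 bytes per character)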


Spatial Datatypes: Spatial data represents information about the physical location and shape of geometric objects.
There are 2 types of spatial datatypes in SQL Server: Geometry and Geography.
Point, LineString, CircularString, CompoundCurve, Polygon, CurvePolygon, MultiPoint, MultiPolygon, and MultiLineString are the supported instance types.
DECLARE @g1 geometry;
DECLARE @g2 geometry;
SET @g1 = geometry::STGeomFromText('LINESTRING(1 1, 5 1, 3 5, 1 1, 3 3)', 0);
--SET @g1 = geometry::STGeomFromText('LINESTRING(0 0, 2 2, 4 0)', 0);
SET @g2 = geometry::STGeomFromText('CIRCULARSTRING(0 0, 2 2, 4 0)', 0);
IF @g1.STIsValid() = 1 AND @g2.STIsValid() = 1
  BEGIN
      --SELECT @g1.ToString()--, @g2.ToString()
      --SELECT @g1.STLength() AS [LS Length]--, @g2.STLength() AS [CS Length]
   SELECT  @g1
  END


Sql_Variant: The sql_variant data type is used to store values of many different data types in a single column or variable. For example:
DECLARE @v SQL_VARIANT = 12345;
SELECT SQL_VARIANT_PROPERTY(@v,'BaseType')
     , SQL_VARIANT_PROPERTY(@v,'TotalBytes')
     , SQL_VARIANT_PROPERTY(@v,'MaxLength');

As Microsoft recommends, we should avoid using SQL Server’s sql_variant data type. Some limitations:
Variants won’t work with LIKE in a WHERE clause.
OLE DB and ODBC providers automatically convert variants to nvarchar(4000)




SQL Basic - Shrinking



Shrinking does NOT mean compressing. Shrinking in SQL Server means removing empty space from database files and releasing that empty space either to the operating system or to SQL Server.
The shrink command is a logged operation: when we perform a shrink, the operation is recorded in the log file. If there is no empty space in the log file, SQL Server cannot write to it, and that is why you cannot shrink a database in that case.


Shrinking best practices:
1. Use DBCC SHRINKFILE ('FileName', Estimated_File_Size_After_Shrink_in_MB) instead of the DBCC SHRINKDATABASE command.
2. Do not shrink a file/database in one big step; shrink in small intervals and issue the shrink command multiple times. For example, shrink individual database files in smaller chunks, say 50 MB at a time:
DBCC SHRINKFILE (DataFil1, 25000)
go
DBCC SHRINKFILE (DataFil1, 24950)
go
...and so on...
3. Do not shrink your database while backup jobs are running, otherwise the backup jobs will fail.
4. Always rebuild your indexes after you shrink a database, because shrinking rearranges data pages, which means fragmentation; in particular the clustered index should be rebuilt.

What are the USEDPAGES & ESTIMATEDPAGES that appear on the Results Pane after using the DBCC SHRINKDATABASE?
UsedPages: The number of 8-KB pages currently used by the file.
EstimatedPages: The number of 8-KB pages that SQL Server estimates the file could be shrunk down to.




SQL Basic - Checkpoint



A checkpoint is an internal process that writes all dirty pages (modified pages) from the buffer cache to the physical disk; apart from this, it also writes the log records from the log buffer to the physical log file.
It is a dedicated process that SQL Server runs automatically at specific intervals, and it runs for each database individually.
Checkpoints help reduce the recovery time for SQL Server in the event of an unexpected shutdown or system crash/failure.
Four types of checkpoints (a manual checkpoint example follows the list):
Automatic
Indirect
Manual
Internal
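
A manual checkpoint is just the CHECKPOINT statement, optionally with a target duration in seconds (the database name below is only illustrative):

USE AdventureWorks2008R2;
GO
CHECKPOINT;       -- flush dirty pages of the current database to disk
GO
CHECKPOINT 10;    -- request that the checkpoint complete within about 10 seconds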



SQL Basic - Database File Types & File Group



mdf: the main / primary database file, also known as the primary filegroup. It holds all metadata information (the database catalogs) such as sysobjects, sysindexes, syscolumns, system routines and users, as well as data; data can also be placed in secondary files (filegroups) if they are defined.

ldf: Transaction log file

ndf: Secondary data file, also known as a filegroup file.
A single database can have multiple transaction log files (ldf) and secondary files (ndf), but only a single primary data file (mdf). There are many advantages to storing data in different files: we can put read-only tables or SCD tables in one filegroup and read-write tables in another filegroup, and take a backup of only the filegroup that holds the read-write data. Creating additional files on different physical hard disks also improves I/O performance.

Use a separate filegroup for tables that are frequently joined
Use a separate filegroup for indexes
Use a separate disk for the log file
Using multiple files and filegroups can help improve the performance of join operations. By separating tables that are frequently joined into separate filegroups and then putting files within the filegroups on separate disks the separated I/O paths can result in improved performance. It is recommended for performance and recovery purposes that, if multiple filegroups are to be used, the primary filegroup should be used only for system objects and secondary filegroups should be used only for user objects. This approach improves the ability to recover from corruption. Spreading a database into multiple files, even on the same drive, makes it easy to move the database files onto separate drives, making future disk upgrades easier. SQL Server can take advantage of multiple filegroups by accessing tables and corresponding nonclustered
indexes using separate I/O paths. A nonclustered index can be created on a specific filegroup.

ALTER DATABASE AdventureWorks2008R2 ADD FILEGROUP Indexes ;
ALTER DATABASE AdventureWorks2008R2 ADD FILE (NAME = AdventureWorks2008_Data2,
FILENAME = 'C:\DATA\AdventureWorks2008_2.ndf',
SIZE = 1MB,
FILEGROWTH = 10%) TO FILEGROUP Indexes;

CREATE INDEX IndexBirthDate
ON HumanResources.Employee (BirthDate)
ON Indexes;




SQL Basic - Default Databases




Master: System tables, System and Information_Schema Views, All extended and system-stored procedures

Model: A template database; any new database created has to be at least as large as the model database. Default 100 MB size.

MSDB: Stores SQL Agent information such as jobs, schedules and related data, plus SSIS packages and policy-based management definitions

Tempdb: The temporary database, a key working area for the server; used for temporary storage during large/complex SQL operations and holds local and global temporary tables (a temp table referenced in the current database physically lives in tempdb). Tempdb is rebuilt from scratch every time you start SQL Server.

ReportServer: The database that serves Reporting Server configuration and model storage needs

ReportServerTempDB: The working database for Reporting Server



Saturday, March 7, 2015

SSIS - Password Protection



PackagePassword --> Package password
ProtectionLevel Property -->
  EncryptSensitiveWithUserKey : The default. Encrypts sensitive information such as connection strings with a key tied to the package author's user profile (Active Directory account). One drawback: if that account goes away, there is no key to unlock the package.
  EncryptAllWithUserKey : Same as above; the main difference is that it encrypts the full package, not only the sensitive information.
  EncryptSensitiveWithPassword : Encrypts sensitive information with a user-defined password.
  EncryptAllWithPassword : Encrypts the whole package, not only the sensitive information, with a user-defined password.

SSIS - Precedence Constraint




On Success : Processing moves on to the next task only when the current task completes successfully
On Failure : Processing moves on to the next task only when the current task completes with failure
On Completion : Processing moves on to the next task regardless of whether the current task succeeds or fails

Constraint Options: Precedence constraints are set with the following evaluation operations:
Constraint : evaluate only the constraint
Expression : evaluate only the expression
Expression and Constraint : both the expression and the constraint must evaluate to true
Expression or Constraint : either the expression or the constraint must evaluate to true

Moreover, if a task has more than one precedence constraint, we can define "Logical AND" or "Logical OR" between these constraints.

Logical AND / Logical OR apply only when a task has more than one incoming precedence constraint.



Tuesday, March 3, 2015

SSIS - Best Practice and Guidelines




  • Extract data in parallel: by using sequence containers in the control flow, or by defining different sources and destinations within a single Data Flow Task, we can extract data in parallel for non-dependent tables or files.
    • MaxConcurrentExecutables is a package-level property; its default value is -1, which means the maximum number of parallel processes equals the number of processors + 2.
    • EngineThreads is a DFT-level property; its default value is 10, which means a maximum of 10 threads can be created to execute the data flow task.
  • Extract only the required data: pull only the required set of data from any table or file. We need to avoid the tendency to pull everything available from the source and then delete the unwanted data; it eats up network bandwidth, consumes system resources (I/O and CPU), requires extra storage, and degrades the overall performance of the ETL system.
  • Try to avoid the use of asynchronous transformation components, which block the data flow pipeline.
  • Limit the use of events in event handlers. Events are very useful, but excessive use of events adds extra overhead to ETL execution.
  • Indexes on the destination table should be disabled during the load and rebuilt or re-created afterwards.
  • It is highly recommended to use the "fast load" access mode with the OLE DB destination. With this option we can also change properties such as Keep Identity, Check Constraints and Table Lock, and furthermore set the Rows Per Batch and Maximum Insert Commit Size properties. Their default value is blank, which means all data is pulled from the source and pushed in one go, which puts overhead on tempdb.
  • Avoid implicit data type casting. For example, when the source is a flat file, SSIS implicitly treats all columns as strings, which increases the data buffer size. We should explicitly convert the data types of the known columns to improve performance.


Reference: http://www.developer.com/db/top-10-methods-to-improve-etl-performance-using-ssis.html







SSIS - Data Flow Tasks



DataFlow Tasks - Source and Destination Connections

Excel Source
Flat File Source
OLEDB Source
ODBC Source
XML Source
Raw File Source: Retrieves data from a binary (raw) file. This component is one of the few that does not use a Connection Manager.
Excel Destination
Flat File Destination
OLEDB Destination
ODBC Destination
Partition Processing: loads data into and processes an SSAS partition
Dimension Processing: loads data into and processes an SSAS dimension
Raw File Destination: writes data to a raw binary file that can later be read back by the Raw File Source
Recordset Destination
SQL Server Destination


DataFlow Tasks - Data Transformation

Row Transformations - Transform each row present in the pipeline; these are synchronous (non-blocking) transformations.
  • Character Map: Converts strings to lower case / upper case / other supported character mappings, within the same column or into a new column
    • Reference: http://sqlage.blogspot.in/2013/08/ssis-how-to-use-ole-db-command.html
  • Data Conversion: Performs casting; used to convert the data type of a column. It creates a new output column with the converted data type.
  • Copy Column: Copies columns to new output columns.
  • Derived Column: Used to create a calculated column in the output; it can also be used to convert the data type of a column. Within the expression we can use system/user variables as well as the list of input columns. It can either create a new output column or replace an existing column.
  • OLE DB Command: Executes a command against a connection manager for each row. This transformation can behave as a destination.
  • Split and Join Transformations - Distribute rows into different outputs or join multiple inputs into one. These are non-blocking (synchronous) or semi-blocking (asynchronous) transformations.
  • Conditional Split: Splits data based on the different conditions configured, similar to a CASE WHEN statement. Returns multiple outputs, based on the number of conditions defined in the Conditional Split task.
    • Reference: https://www.simple-talk.com/sql/ssis/ssis-basics-using-the-conditional-split/
    • Reference: http://oakdome.com/programming/SSIS_ConditionalSplit_Multicast.php
  • Lookup: Lookup is used to access additional information in a related table, based on the values in common join columns. Three caching options are available: Full cache (the referenced dataset is generated and loaded into cache before the Lookup transformation is executed), Partial cache (the referenced dataset is generated when the Lookup transformation is executed, and the dataset is loaded into cache) and No cache (the referenced dataset is generated when the Lookup transformation is executed, but no data is loaded into cache). The lookup source can be configured with a "Cache Connection Manager" (a pre-defined cache, for example a set of hardcoded values) or an OLE DB Connection Manager (a table stored in the database). The outputs of the lookup can be "Matched Rows" (rows that are matched in the referenced table), "Unmatched Rows" (rows that are not found in the referenced table) and the error output (returned if any error occurs).
    • Reference: http://www.dwbiconcepts.com/etl/29-etl-ssis/144-ssis-lookup-transform.html
    • Reference: https://www.simple-talk.com/sql/ssis/implementing-lookup-logic-in-sql-server-integration-services/
  • Union All / Merge: Merge is similar to Union All. The main difference is that Merge is used to merge 2 sorted inputs and returns a single sorted output, whereas the Union All transformation can take more than 2 inputs, which do not need to be sorted, and returns a single unsorted output. The column lists and their data types should match for each input.
  • Merge Join: The Merge Join transformation is used to join 2 sorted result sets via an inner / left / right join, similar to joins in T-SQL.
    • Reference: https://www.simple-talk.com/sql/ssis/ssis-basics-using-the-merge-join-transformation/
  • MultiCast: The Multicast transformation distributes its input to multiple similar outputs, which means we can create many copies of the input dataset. It has one input and multiple outputs, but does not support an error output.

Business Intelligence Transformations - Basically used in data mining and data cleansing. These are only available with the Enterprise Edition.
  • Fuzzy Lookup: The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data, correcting data, and providing missing values. This transformation differs from the Lookup transformation in its use of fuzzy matching. The Lookup transformation uses an equi-join to locate matching records in the reference table. It returns either an exact match or nothing from the reference table. In contrast, the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches from the reference table.
    • Reference: http://www.codeproject.com/Tips/528243/SSIS-Fuzzy-lookup-for-cleaning-dirty-data
  • Fuzzy Grouping: http://ssis-tutorial-online.blogspot.in/2013/04/fuzzy-grouping-transformation.html
  • Term Lookup: Available only in the Enterprise Edition. Term Lookup is basically used to search a set of text records within a particular table; the input is a text column (nVarChar, nText) and the output is the occurrence count of each term on a row-by-row basis.
    • Reference: http://www.codeproject.com/Tips/574437/Term-Lookup-Transformation-in-SSIS
    • Reference: http://www.codeproject.com/Tips/528243/SSIS-Fuzzy-lookup-for-cleaning-dirty-data
  • Term Extraction: Available only in the Enterprise Edition. Term Extraction extracts a list of words from plain text and returns the number of occurrences of each word within the text. This transformation uses its own English dictionary, so if a word is not in this dictionary it will not be considered.
    • Reference: http://www.packtpub.com/article/term-extraction-tasks-in-sql-server-integration-services

Rowset Transformations – These transformations are also called asynchronous (blocking) transformations. They store all of the rows in memory before they begin the process of modifying the input data into the required output format.
  Pivot
  UnPivot
  Row Sampling
  Percentage Sampling
  Aggregate
  Sort




SSIS - Blocking - Syncronization / Asyncronization



All the data flow components can be categorized as synchronous and asynchronous.
They can be further differentiated into

  • non-blocking (synchronous)
  • semi-blocking (asynchronous)
  • fully-blocking (asynchronous).  


Synchronous

  • The number of rows input to a synchronous component is equal to the number of rows output from it
  • Components use the same buffer
  • All destination components are synchronous
  • Non-blocking transformations are synchronous


Asynchronous

  • The number of rows output from an asynchronous component may be fewer or more than the number of rows input to it (filtration inside)
  • Stores all the rows in the memory buffer before it begins the process of modifying the input data into the required output format
  • They block the data flow in the pipeline until all the input rows are read into memory, e.g. the Sort transformation, where the component has to process the complete set of rows in a single operation
  • Components use different buffers
  • All source components are asynchronous; they create two buffers, one for the success output and one for the error output
  • Semi-blocking and fully blocking transformations are asynchronous in nature



Synchronous components reuse buffers and are therefore generally faster than asynchronous components

Non-Blocking transformations

  • Audit
  • Character Map
  • Conditional Split
  • Copy Column
  • Data Conversion
  • Derived Column
  • Lookup
  • Multicast
  • Percent Sampling
  • Row Count
  • Script Component
  • Export Column
  • Import Column
  • Slowly Changing Dimension
  • OLE DB Command


Semi-blocking transformations

  • Data Mining Query
  • Merge
  • Merge Join
  • Pivot
  • Unpivot
  • Term Lookup
  • Union All


Blocking transformations

  • Aggregate
  • Fuzzy Grouping
  • Fuzzy Lookup
  • Row Sampling
  • Sort
  • Term Extraction






Monday, March 2, 2015

Part 6: Incremental Load - Using Lookup Component with Customized SCD (Slowly Changing Dimension) Type 2




Continue with my Previous Posts:


Part 4: Incremental Load - Using MERGE JOIN
Part 5: Incremental Load - Using TSQL Merge Function




Package: MSSQLTIPS_Hash_SCD2.dtsx

This approach is very similar to the incremental load using a hash; the basic difference is that it is customized for SCD Type 2.









A. Change Lookup source as below ((LKP) Match Rows With Destination)
SELECT SK_Customers,CustomerAlternateKey,DWH_Hash FROM MyDestination Where CurrentFlag = 1;
B. Add Derived Column Component as below


C. Change the update statement as below ((SQL) Update Destination Table)
UPDATE dest
SET    SCD_EndDate = GETDATE(),
       CurrentFlag = 0
FROM   [dbo].[MyDestination] dest
JOIN   [dbo].[UPD_MyDestination] upd ON dest.SK_Customers = upd.SK_Customers;
D. Add a Data Flow Task ((DFT) Load Updates into Destination Table) to insert the modified rows with a new start date and current flag 1; a rough T-SQL sketch of this insert follows.
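
As a rough T-SQL sketch of what that data flow does (the SCD_StartDate column is an assumption, the other attribute columns are omitted for brevity, and the staging table [dbo].[UPD_MyDestination] is assumed to hold the changed source rows):

-- Hypothetical sketch: re-insert the changed rows as the new "current" version
INSERT INTO [dbo].[MyDestination] (CustomerAlternateKey, DWH_Hash, SCD_StartDate, SCD_EndDate, CurrentFlag)
SELECT  upd.CustomerAlternateKey,
        upd.DWH_Hash,
        GETDATE(),   -- new version starts now
        NULL,        -- open-ended until the next change
        1            -- current row
FROM    [dbo].[UPD_MyDestination] upd;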





Part 5: Incremental Load - Using TSQL Merge Function





Continue with my Previous Posts:


Part 4: Incremental Load - Using MERGE JOIN



MERGE [dbo].[MyDestination] TT
USING (       SELECT [CustomerAlternateKey],[Title],[FirstName],[MiddleName],[LastName],[BirthDate],[MaritalStatus],[Suffix],[Gender],[EmailAddress],[YearlyIncome],[TotalChildren],[NumberChildrenAtHome],[EnglishEducation],[EnglishOccupation],[HouseOwnerFlag],[NumberCarsOwned]
                            ,HASHBYTES('SHA2_256',CONCAT([Title],[FirstName],[MiddleName],[LastName],[BirthDate],[MaritalStatus],[Suffix],[Gender],[EmailAddress],[YearlyIncome],[TotalChildren],[NumberChildrenAtHome],[EnglishEducation],[EnglishOccupation],[HouseOwnerFlag],[NumberCarsOwned])) DWH_Hash
              FROM [dbo].[DimCustomer]
         )ST
        ON (TT.[CustomerAlternateKey] = ST.[CustomerAlternateKey])
--Inserts
WHEN NOT MATCHED BY TARGET THEN
        INSERT ([CustomerAlternateKey],[Title],[FirstName],[MiddleName],[LastName],[BirthDate],[MaritalStatus],[Suffix],[Gender],[EmailAddress],[YearlyIncome],[TotalChildren],[NumberChildrenAtHome],[EnglishEducation],[EnglishOccupation],[HouseOwnerFlag],[NumberCarsOwned],[DWH_Hash])
        VALUES (ST.[CustomerAlternateKey],ST.[Title],ST.[FirstName],ST.[MiddleName],ST.[LastName],ST.[BirthDate],ST.[MaritalStatus], ST.[Suffix], ST.[Gender], ST.[EmailAddress], ST.[YearlyIncome], ST.[TotalChildren], ST.[NumberChildrenAtHome], ST.[EnglishEducation], ST.[EnglishOccupation], ST.[HouseOwnerFlag], ST.[NumberCarsOwned], ST.[DWH_Hash])
--Updates
WHEN MATCHED AND (TT.[DWH_Hash] != ST.[DWH_Hash]) THEN
        UPDATE SET BirthDate = ST.BirthDate
                     , EmailAddress = ST.EmailAddress
                     , EnglishEducation = ST.EnglishEducation
                     , EnglishOccupation = ST.EnglishOccupation
                     , FirstName = ST.FirstName
                     , Gender = ST.Gender
                     , HouseOwnerFlag =ST.HouseOwnerFlag
                     , LastName = ST.LastName
                     , MaritalStatus = ST.MaritalStatus
                     , MiddleName = ST.MiddleName
                     , NumberCarsOwned = ST.NumberCarsOwned
                     , NumberChildrenAtHome = ST.NumberChildrenAtHome
                     , Suffix = ST.Suffix
                     , Title = ST.Title
                     , TotalChildren = ST.TotalChildren
                     , YearlyIncome = ST.YearlyIncome

                     , [DWH_Hash] = ST.[DWH_Hash];



Continue with my Next Posts:

Part 6: Incremental Load - Using Lookup Component with Customized SCD (Slowly Changing Dimension) Type 2

Part 4: Incremental Load - Using MERGE JOIN

Part 1: Incremental Load




Incremental Load (Delta Load / Upsert)
In this design pattern, SSIS is used to identify which rows are new and which rows need to be updated (based on ‘Create Date’ and ‘Last Modified Date’ at the source). Rows that need to be updated are sent to a staging table, which is later joined to the destination table to perform the update. The advantage of this approach is that only the rows to be updated are written to disk twice.


Methods of Incremental Load

  1. Write the matching keys to a new table, then DELETE from the current table and re-insert: a T-SQL approach that is no longer in fashion.
  2. Write to a staging table, then do a T-SQL MERGE: only possible when the source and destination are on the same server.
  3. Using the Lookup component: the Lookup component, with Full Cache, is pretty fast. The big problem with a lookup component with full cache is that all the data sits in RAM, so you need enough RAM for it all to sit there!
  4. Using a hash function with the Lookup component: in this approach we maintain a hash value in the destination table and create the hash value during transformation of the source rows.
  5. Using MERGE JOIN: the main drawback of this approach is that we need a fully blocking (asynchronous) component, Sort, as well as a semi-blocking (asynchronous) component, Merge Join, to complete the operation. On one side we use the source table as a source and on the other side the destination table as a source; after sorting, these two datasets are joined with a left join via the Merge Join component, then the output is split into matched and unmatched rows. Unmatched rows are loaded directly into the destination, whereas matched rows are further compared using the hash values from source and destination.
  6. Using the Lookup component with customized SCD Type 2: this approach is basically used for historical dimensions or facts (SCD Type 2).






Part 3: Incremental Upload - Using Lookup Component


Using the Lookup component without a hash value is very similar. The only differences are:
A. In this approach, we do not need to calculate a hash value.
B. The Lookup component SQL would be like below, with the output columns renamed with a _Dest suffix.

Continue with my Previous Posts:

Part 1: Incremental Load


SELECT
  [CustomerAlternateKey] -- this is our business key (BK)
 ,[Title]
 ,[FirstName]
 ,[MiddleName]
 ,[LastName]
 ,[BirthDate]
 ,[MaritalStatus]
 ,[Suffix]
 ,[Gender]
 ,[EmailAddress]
 ,[YearlyIncome]
 ,[TotalChildren]
 ,[NumberChildrenAtHome]
 ,[EnglishEducation]
 ,[EnglishOccupation]
 ,[HouseOwnerFlag]
 ,[NumberCarsOwned]
FROM [dbo].[MyDestination];
C. The expression in the Conditional Split will be changed as below
(ISNULL(BirthDate) ? "" : BirthDate) != (ISNULL(BirthDate_Dest) ? "" : BirthDate_Dest)
|| (ISNULL(EmailAddress) ? "" : EmailAddress) != (ISNULL(EmailAddress_Dest) ? "" : EmailAddress_Dest)
|| (ISNULL(EnglishEducation) ? "" : EnglishEducation) != (ISNULL(EnglishEducation_Dest) ? "" : EnglishEducation_Dest)
|| (ISNULL(EnglishOccupation) ? "" : EnglishOccupation) != (ISNULL(EnglishOccupation_Dest) ? "" : EnglishOccupation_Dest)
|| (ISNULL(FirstName) ? "" : FirstName) != (ISNULL(FirstName_Dest) ? "" : FirstName_Dest)
|| (ISNULL(Gender) ? "" : Gender) != (ISNULL(Gender_Dest) ? "" : Gender_Dest)
|| (ISNULL(HouseOwnerFlag) ? "" : HouseOwnerFlag) !=(ISNULL(HouseOwnerFlag_Dest) ? "" : HouseOwnerFlag_Dest)
|| (ISNULL(LastName) ? "" : LastName) != (ISNULL(LastName_Dest) ? "" : LastName_Dest)
|| (ISNULL(MaritalStatus) ? "" : MaritalStatus) != (ISNULL(MaritalStatus_Dest) ? "" : MaritalStatus_Dest)
|| (ISNULL(MiddleName) ? "" : MiddleName) != (ISNULL(MiddleName_Dest) ? "" : MiddleName_Dest)
|| (ISNULL(NumberCarsOwned) ? "" : NumberCarsOwned) != (ISNULL(NumberCarsOwned_Dest) ? "" : NumberCarsOwned_Dest)
|| (ISNULL(NumberChildrenAtHome) ? "" : NumberChildrenAtHome) != (ISNULL(NumberChildrenAtHome_Dest) ? "" : NumberChildrenAtHome_Dest)
|| (ISNULL(Suffix) ? "" : Suffix) != (ISNULL(Suffix_Dest) ? "" : Suffix_Dest)
|| (ISNULL(Title) ? "" : Title) != (ISNULL(Title_Dest) ? "" : Title_Dest)
|| (ISNULL(TotalChildren) ? "" : TotalChildren) != (ISNULL(TotalChildren_Dest) ? "" : TotalChildren_Dest)
|| (ISNULL(YearlyIncome) ? "" : YearlyIncome) != (ISNULL(YearlyIncome_Dest) ? "" : YearlyIncome_Dest)