Best Practices for ETL
1. Analyze the source data and fix issues at the source level before starting extraction.
2. Extract only the data you need instead of everything. The ETL package should be designed for incremental extraction, based either on last-modified/submit dates or on some key.
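The watermark-driven incremental extraction described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the table and column names (`orders`, `last_modified`, `etl_watermark`) are hypothetical, and SQLite stands in for the real source database.

```python
import sqlite3

# Hypothetical source table plus a watermark table that records the
# last successful extraction point for each source table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT);
    CREATE TABLE etl_watermark (table_name TEXT, last_run TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, '2024-01-01'),
        (2, 20.0, '2024-02-01'),
        (3, 30.0, '2024-03-01');
    INSERT INTO etl_watermark VALUES ('orders', '2024-01-15');
""")

def extract_incremental(conn, new_run_date):
    """Pull only rows modified since the last successful run."""
    (last_run,) = conn.execute(
        "SELECT last_run FROM etl_watermark WHERE table_name = 'orders'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE last_modified > ?", (last_run,)
    ).fetchall()
    # Advance the watermark only after the extract has succeeded.
    conn.execute(
        "UPDATE etl_watermark SET last_run = ? WHERE table_name = 'orders'",
        (new_run_date,),
    )
    return rows

changed = extract_incremental(conn, "2024-03-31")
print(changed)  # [(2, 20.0), (3, 30.0)] — only rows modified after the watermark
```

Only the two rows modified after the stored watermark are extracted; a subsequent run with no new modifications extracts nothing.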
3. Configure parallel processing within the ETL package. This can be done either at the package level (the MaxConcurrentExecutables property, whose default of -1 allows the number of processors + 2 concurrent executables) or at the Data Flow task level (the EngineThreads property).
4. Avoid asynchronous (blocking) transformations where possible; they hold the data until their operation completes and require more than one buffer.
5. Avoid implicit type casting; use the Data Conversion transformation instead.
6. Disable non-clustered indexes before pushing data into the destination table; they can be rebuilt or reorganized afterwards. The clustered index should be dropped as well if it is not on an auto-generated surrogate key, which saves load time; note that a clustered index cannot be disabled for this purpose, only dropped and re-created.
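The drop-load-rebuild pattern above can be sketched as follows. On SQL Server you would use `ALTER INDEX ... DISABLE` and `ALTER INDEX ... REBUILD` for non-clustered indexes; SQLite, which stands in here, must drop and re-create instead. Table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_fact_sales_amount ON fact_sales(amount)")

rows = [(i, float(i) * 1.5) for i in range(10_000)]

# 1. Drop the index so the load does not maintain it row by row.
conn.execute("DROP INDEX ix_fact_sales_amount")

# 2. Bulk load inside a single transaction.
with conn:
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

# 3. Re-create the index once, after all rows are in place.
conn.execute("CREATE INDEX ix_fact_sales_amount ON fact_sales(amount)")

(count,) = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()
print(count)  # 10000
```

Maintaining the index once at the end is a single sort over the loaded data, rather than one index update per inserted row.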
7. Use fast load where possible; the fast-load option is available on the OLE DB Destination when the destination is a SQL Server.
8. Use SqlBulkCopy.WriteToServer when you need to store a large amount of data.
9. Use BULK INSERT or the bcp utility to load data from a flat file.
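The bulk-load idea behind SqlBulkCopy.WriteToServer and bcp is to push the whole file in one batched operation rather than row-by-row inserts. As a rough stand-in sketch (SQLite and an in-memory CSV replace the real server and file; names are hypothetical):

```python
import csv
import io
import sqlite3

# A small CSV standing in for the flat file to be loaded.
flat_file = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,7.25\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_sales (id INTEGER, amount REAL)")

reader = csv.reader(flat_file)
next(reader)  # skip the header row

# One transaction and one batched call for the whole file,
# instead of a commit per row.
with conn:
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", reader)

summary = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM stage_sales"
).fetchone()
print(summary)  # (3, 37.75)
```

Streaming the reader straight into `executemany` also avoids materializing the whole file in memory.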
10. Data reconciliation should be done after the ETL run, based on the CHECKSUM, BINARY_CHECKSUM, or HASHBYTES functions.
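The hash-based reconciliation above can be sketched as comparing a per-row hash between source and destination, the same idea as HASHBYTES on SQL Server. This is an illustrative sketch: the tables, the key column, and the `|` delimiter are hypothetical choices.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE dst (id INTEGER, amount REAL);
    INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dst VALUES (1, 10.0), (2, 99.0), (3, 30.0);  -- row 2 differs
""")

def row_hashes(conn, table):
    """Map each key to a hash of the row's concatenated column values."""
    return {
        key: hashlib.sha256(f"{key}|{amount}".encode()).hexdigest()
        for key, amount in conn.execute(f"SELECT id, amount FROM {table}")
    }

src_hashes = row_hashes(conn, "src")
dst_hashes = row_hashes(conn, "dst")

# Keys whose row content does not match between source and destination.
mismatched = sorted(k for k in src_hashes if src_hashes[k] != dst_hashes.get(k))
print(mismatched)  # [2]
```

Comparing one hash per row is much cheaper than comparing every column, and an aggregate of the hashes can serve as a single table-level reconciliation value.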
11. Avoid the use of event handlers.
12. Implement proper logging at the ETL level to track failures and performance issues, and use SQL Server logging.
13. Avoid using the Slowly Changing Dimension (SCD) task; it processes rows one at a time and performs poorly on large dimensions.
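A common set-based alternative to the row-by-row SCD task is to expire changed rows and insert their new versions in two bulk statements (a Type 2 sketch). The tables and columns below are hypothetical, and SQLite stands in for SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (id INTEGER, city TEXT, is_current INTEGER);
    CREATE TABLE staging (id INTEGER, city TEXT);
    INSERT INTO dim_customer VALUES (1, 'Lahore', 1), (2, 'Karachi', 1);
    INSERT INTO staging VALUES (1, 'Lahore'), (2, 'Islamabad');
""")

with conn:
    # Expire current rows whose attributes changed in staging.
    conn.execute("""
        UPDATE dim_customer SET is_current = 0
        WHERE is_current = 1 AND EXISTS (
            SELECT 1 FROM staging s
            WHERE s.id = dim_customer.id AND s.city <> dim_customer.city)
    """)
    # Insert the new versions as the current rows.
    conn.execute("""
        INSERT INTO dim_customer (id, city, is_current)
        SELECT s.id, s.city, 1 FROM staging s
        JOIN dim_customer d ON d.id = s.id AND d.is_current = 0
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer c
            WHERE c.id = s.id AND c.is_current = 1)
    """)

current = conn.execute(
    "SELECT id, city FROM dim_customer WHERE is_current = 1 ORDER BY id"
).fetchall()
print(current)  # [(1, 'Lahore'), (2, 'Islamabad')]
```

Two set-based statements over the whole staging batch replace one lookup-and-update round trip per row.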