An event was held in Calgary at the Microsoft office on February 27 about ETL Best Practices and Techniques, with a focus on SSIS. The speaker was Marc Beacom, Managing Partner at Datalere.
The main items discussed were:
- ETL Templates: How we can use ETL templates to provide a consistent, reusable framework with built-in logging and auditing.
- Standards: Using system variables, assisting troubleshooting, standard names, and checklists for code reviews and environment setup. The checklist is reviewed before promoting an ETL package from Dev to QA and covers the Control Flow (e.g. all Control Flow tasks have the proper prefix), the Data Flow (e.g. Lookup transformations are set to partial cache unless no cache is needed), and Event Handlers (e.g. the SQL Log Error task exists at the package level for the OnError event handler). A sketch of a logging table fed by SSIS system variables appears after this list.
- ETL Load Patterns: Having a blueprint for ETL development, ensuring consistency, and socializing the patterns with all ETL developers.
- Documentation: Including configuration, SCD (slowly changing dimension) method, patterns, templates, standards, checklists, etc.
- Address bad data: How we can develop a process to address bad data, document the outcome, and choose an option (e.g. ignore, insert and flag, or redirect to another table); see the bad-data sketch after this list.
- Simplify data movement: Moving as little data as possible, landing data close to its destination, shortening load cycles, and the advantages of each approach.
- Decode and performance: Challenges (e.g. long data matching and loading times) and recommendations (e.g. sorting and joining on the database rather than in SSIS); see the source-query sketch after this list.
- Format and Organize: e.g. adding comments and annotations where needed.
- Incremental Loading: e.g. loading only changed data and no more, reducing bandwidth during loads (see the watermark sketch after this list).
- Parallel processing: How we can take advantage of idle resources in Control Flow and Data Flow.
- Parallel processing case study: The existing load took 6.5 hours. After partitioning the table, processing in parallel, using better data types, and applying page-level compression, the load time dropped to 2 hours (see the partitioning sketch after this list).
- Stage data: How staging reduces overall complexity and the hit on the source system; use a separate database where possible and a separate schema where not.
- Optimize Source and destinations: e.g. pulling in only the columns and rows you need, adding indexes that support lookup queries, and distributing data sources (see the source-tuning sketch after this list).
- Blocking transformations: Which transformations are fully blocking and should be avoided or used sparingly (e.g. Sort), which are partially blocking and can be used with care (e.g. Merge Join), and which are non-blocking and safe to use (e.g. Row Count and Derived Column).
- Validation Framework: How we can test data to ensure accuracy, such as comparing source and DW data (see the reconciliation sketch after this list).
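
To make the template logging and standards items concrete, here is a minimal sketch of an audit table and a stored procedure that an Execute SQL Task could call, with SSIS system variables such as System::PackageName and System::ExecutionInstanceGUID mapped to the parameters. The table, column, and procedure names are my own assumptions, not the speaker's template.

```sql
-- Hypothetical audit table; names and widths are assumptions.
CREATE TABLE dbo.PackageExecutionLog (
    LogID         INT IDENTITY(1,1) PRIMARY KEY,
    PackageName   NVARCHAR(260)     NOT NULL,
    ExecutionGUID UNIQUEIDENTIFIER  NOT NULL,
    EventType     NVARCHAR(50)      NOT NULL,  -- e.g. OnPreExecute, OnError
    EventMessage  NVARCHAR(MAX)     NULL,
    LoggedAtUtc   DATETIME2(3)      NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Called from an Execute SQL Task; map System::PackageName and
-- System::ExecutionInstanceGUID to the first two parameters.
CREATE PROCEDURE dbo.LogPackageEvent
    @PackageName   NVARCHAR(260),
    @ExecutionGUID UNIQUEIDENTIFIER,
    @EventType     NVARCHAR(50),
    @EventMessage  NVARCHAR(MAX) = NULL
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.PackageExecutionLog (PackageName, ExecutionGUID, EventType, EventMessage)
    VALUES (@PackageName, @ExecutionGUID, @EventType, @EventMessage);
END;
```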
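
For the bad-data options, a short T-SQL sketch of "insert and flag" versus "redirect to another table", assuming hypothetical stg.Customer, dw.DimCustomer, and audit.CustomerBadRows tables and a deliberately crude email check:

```sql
-- "Insert and flag": keep every row, but mark suspect ones for later review.
INSERT INTO dw.DimCustomer (CustomerKey, CustomerName, Email, IsSuspect)
SELECT  s.CustomerKey,
        s.CustomerName,
        s.Email,
        CASE WHEN s.Email NOT LIKE '%_@_%._%' THEN 1 ELSE 0 END AS IsSuspect
FROM    stg.Customer AS s;

-- "Redirect": copy the failing rows to a separate table for follow-up instead.
INSERT INTO audit.CustomerBadRows (CustomerKey, CustomerName, Email, RejectReason)
SELECT  s.CustomerKey, s.CustomerName, s.Email, 'Invalid email format'
FROM    stg.Customer AS s
WHERE   s.Email NOT LIKE '%_@_%._%';
```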
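
For the sorting-and-joining-on-the-database recommendation, a source-query sketch that pushes the matching and ordering down to SQL Server instead of doing it in SSIS transformations; table and column names are illustrative:

```sql
-- Join and sort in the source query so the database engine does the heavy matching,
-- and the data flow receives rows already combined and ordered.
SELECT  o.OrderID,
        o.OrderDate,
        o.Amount,
        c.CustomerName
FROM    dbo.Orders    AS o
JOIN    dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
ORDER BY o.OrderDate;   -- pre-sorted for any downstream merge-style logic
```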
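
For Incremental Loading, a simple high-water-mark (watermark) sketch that pulls only rows changed since the last load; the etl.LoadWatermark table and the ModifiedDate column are assumptions:

```sql
-- Read the last watermark for this table.
DECLARE @LastLoad DATETIME2(3);

SELECT @LastLoad = LastModifiedDate
FROM   etl.LoadWatermark
WHERE  TableName = 'dbo.Orders';

-- Extract only changed data, and no more.
SELECT  OrderID, CustomerID, OrderDate, Amount, ModifiedDate
FROM    dbo.Orders
WHERE   ModifiedDate > @LastLoad;

-- Advance the watermark after a successful load.
UPDATE  etl.LoadWatermark
SET     LastModifiedDate = (SELECT MAX(ModifiedDate) FROM dbo.Orders)
WHERE   TableName = 'dbo.Orders';
```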
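
For the parallel processing case study, a partitioning sketch of the kind of table partitioning and page-level compression described; the boundary values, filegroup, and table definition are made up for illustration:

```sql
-- Partition by date so partitions can be loaded and processed independently.
CREATE PARTITION FUNCTION pfOrderDate (DATE)
    AS RANGE RIGHT FOR VALUES ('2017-01-01', '2018-01-01', '2019-01-01');

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- Page compression reduces I/O at the cost of some CPU during load.
CREATE TABLE dw.FactOrders (
    OrderKey    BIGINT        NOT NULL,
    OrderDate   DATE          NOT NULL,
    CustomerKey INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
) ON psOrderDate (OrderDate)
  WITH (DATA_COMPRESSION = PAGE);
```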
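
For optimizing sources and destinations, a source-tuning sketch: pull only the columns and rows the load needs, and add an index that supports the data flow's lookup query. The table and index names are assumptions:

```sql
-- Select only the needed columns and rows instead of SELECT * from the source.
SELECT  ProductCode, ProductName, ListPrice
FROM    dbo.Product
WHERE   IsDiscontinued = 0;

-- A covering index for the lookup query (business key in, surrogate key out),
-- so each probe is an index seek rather than a scan.
CREATE NONCLUSTERED INDEX IX_DimProduct_ProductCode
    ON dw.DimProduct (ProductCode)
    INCLUDE (ProductKey);
```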
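
For the Validation Framework, a lightweight reconciliation sketch that compares row counts and a summed measure between the source and the warehouse for the same load window; table names and the date range are illustrative:

```sql
-- The two result rows should match; a difference points to missing or duplicated data.
SELECT  'Source'    AS SideOfCompare,
        COUNT(*)    AS RowCnt,
        SUM(Amount) AS TotalAmount
FROM    src.Orders
WHERE   OrderDate >= '2019-02-01' AND OrderDate < '2019-03-01'
UNION ALL
SELECT  'Warehouse',
        COUNT(*),
        SUM(Amount)
FROM    dw.FactOrders
WHERE   OrderDate >= '2019-02-01' AND OrderDate < '2019-03-01';
```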