Thursday, January 4, 2007

ETL - Transformations and Loading

Transformation in Data Warehouses

Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

Transformation Flow

From an architectural perspective, you can transform your data in two ways:

  1. Multistage Data Transformation
  2. Pipelined Data Transformation

Multistage Data Transformation

The data transformation logic for most data warehouses consists of multiple steps. In this approach, each step is an independent transformation that performs its work and stores the result in a temporary stage table.

The advantage of this method is that the coding is simpler. Also, if the code fails at any stage during a run, the previous stage table still contains valid data, so the job can resume from that point rather than starting over. This property is called restartability.
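The staged approach can be sketched as follows. This is a minimal illustration using SQLite; the table names, columns, and transformations are hypothetical, but the pattern is the point: each step reads only from the previous stage table and writes a new one.

```python
import sqlite3

# Hypothetical example: two independent transformation steps, each writing
# its output to a temporary stage table so a failed run can restart from
# the last completed stage. Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_sales (product TEXT, amount TEXT);
    INSERT INTO source_sales VALUES ('widget', ' 10'), ('gadget', '25 ');
""")

# Stage 1: a simple data conversion (trim whitespace and cast the amount).
conn.executescript("""
    CREATE TABLE stage1_sales AS
    SELECT product, CAST(TRIM(amount) AS INTEGER) AS amount
    FROM source_sales;
""")

# Stage 2: a further transformation that reads only from stage 1.
conn.executescript("""
    CREATE TABLE stage2_sales AS
    SELECT UPPER(product) AS product, amount
    FROM stage1_sales;
""")

# If stage 2 fails, stage1_sales still holds valid data, so the job
# can be restarted from stage 2 instead of re-extracting the source.
print(conn.execute("SELECT * FROM stage2_sales").fetchall())
```

Note that each `CREATE TABLE ... AS SELECT` is an independent unit of work, which is exactly what makes restarting from an intermediate stage possible.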

Pipelined Data Transformation

When all the steps are handled in a single pipeline, so that the data flows through every transformation without being staged in between, the approach is called pipelined data transformation.

The diagram below shows the same transformations as in the previous example.
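The same two transformations from the multistage example can be sketched as a pipeline. This rough illustration uses Python generators, so each record streams through every transformation with no temporary stage tables in between; the field names are hypothetical.

```python
# Each function is one transformation; records flow through all of them
# end to end rather than being written to an intermediate stage table.
def extract():
    yield {"product": "widget", "amount": " 10"}
    yield {"product": "gadget", "amount": "25 "}

def convert_amount(rows):
    # Simple data conversion: trim and cast the amount field.
    for row in rows:
        yield {**row, "amount": int(row["amount"].strip())}

def uppercase_product(rows):
    for row in rows:
        yield {**row, "product": row["product"].upper()}

# The pipeline is composed once; rows stream through lazily.
pipeline = uppercase_product(convert_amount(extract()))
print(list(pipeline))
```

The trade-off versus the multistage approach is visible here: nothing is persisted mid-flow, so a failure means rerunning the whole pipeline.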

ETL tools like Informatica are well suited to pipelined data transformations, because the tool itself provides this functionality and handles restartability internally. These ETL tools offer a set of specialized transformations (objects that modify the incoming data as it passes through). An example is the aggregator, which aggregates records according to the requirements.
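As a rough sketch of what an aggregator transformation does, the snippet below groups incoming records by a key and rolls up a measure. Informatica's actual Aggregator is configured in the tool rather than coded like this, and the record layout here is invented for illustration.

```python
from collections import defaultdict

def aggregate(rows, key, measure):
    # Group records by `key` and sum the `measure` column,
    # mimicking what an aggregator transformation produces.
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 50},
    {"region": "east", "amount": 75},
]
print(aggregate(sales, "region", "amount"))  # {'east': 175, 'west': 50}
```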

Loading into the data warehouse:

This is a relatively simple topic: in SQL terms, loading means INSERT. There are a variety of loading methods, but which ones are available depends on the database where the target table resides.

For example, Teradata offers the following load options: FastLoad, MultiLoad, and TPump.
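At its simplest, the load step is just INSERTs into the target table, as in this generic SQLite illustration. Database-specific bulk utilities such as Teradata's FastLoad replace this row-by-row path with a much faster one, but the end result is the same: rows landing in the target table. The table and data here are made up.

```python
import sqlite3

# Generic load step: insert already-transformed rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, amount INTEGER)")

transformed_rows = [("east", 175), ("west", 50)]  # output of the transform step
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", transformed_rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 2
```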
