Thursday, January 4, 2007

ETL - Transformations and Loading

Transformation in Data Warehouses

Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

Transformation Flow

From an architectural perspective, you can transform your data in two ways:

  1. Multistage Data Transformation
  2. Pipelined Data Transformation

Multistage Data Transformation

The data transformation logic for most data warehouses consists of multiple steps. In this approach, each step is an independent transformation that performs its work and stores the result in a temporary stage table.

The advantage of this method is that the coding is simpler. Also, if the code fails at any stage during a run, the previous stage table still contains the data, so processing can resume from that point instead of starting over. This property is called restartability.
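As a minimal sketch of the idea, suppose we have a hypothetical sales_source table feeding two stage tables, stg_validated and stg_aggregated, and a final sales_fact target (all names invented for illustration). Each step materializes its output:

    -- Step 1: validate - keep only rows with a known product key
    INSERT INTO stg_validated
    SELECT sale_id, product_key, amount, sale_date
    FROM   sales_source
    WHERE  product_key IS NOT NULL;

    -- Step 2: aggregate the validated rows into a second stage table
    INSERT INTO stg_aggregated
    SELECT product_key, sale_date, SUM(amount) AS total_amount
    FROM   stg_validated
    GROUP  BY product_key, sale_date;

    -- Step 3: load the final stage table into the warehouse target
    INSERT INTO sales_fact
    SELECT product_key, sale_date, total_amount
    FROM   stg_aggregated;

If step 3 fails, stg_aggregated still holds the output of step 2, so the run can be restarted from the load rather than from the beginning.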

Pipelined Data Transformation

When all the steps are handled in a single pipeline, so that the data flows through every transformation without being written to intermediate stage tables, the approach is called pipelined data transformation.

The flow involves the same transformations as in the previous example, but executes them in a single pass.

ETL tools like Informatica are well suited to pipelined data transformation, because the tool itself manages the data flow and handles restartability internally. These tools provide a set of specialized transformations (objects that modify the incoming data); an example is the Aggregator, which aggregates records according to the requirements.
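In plain SQL, the same flow can be collapsed into a single statement: the rows stream through the filter and the aggregation (the GROUP BY playing the part of an Aggregator transformation) without ever landing in a stage table. The table names are the same hypothetical ones as above:

    -- Validate, aggregate and load in one pass: no stage tables
    INSERT INTO sales_fact
    SELECT product_key, sale_date, SUM(amount) AS total_amount
    FROM   sales_source
    WHERE  product_key IS NOT NULL
    GROUP  BY product_key, sale_date;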

Loading into the data warehouse:

This is rather a simple topic, because in SQL terms, to load essentially means to INSERT. There are a variety of loading methods, but they depend on the database where the target table resides.

For example, Teradata offers the following load utilities: FastLoad, MultiLoad and TPump.
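Beyond a plain INSERT, warehouse loads often need to update rows that already exist as well as add new ones. Here is a sketch using the standard SQL MERGE statement (supported, with minor variations, by most warehouse databases; the table names are the hypothetical ones from earlier):

    -- Upsert: update existing fact rows, insert new ones
    MERGE INTO sales_fact t
    USING stg_aggregated s
      ON (t.product_key = s.product_key AND t.sale_date = s.sale_date)
    WHEN MATCHED THEN
      UPDATE SET t.total_amount = s.total_amount
    WHEN NOT MATCHED THEN
      INSERT (product_key, sale_date, total_amount)
      VALUES (s.product_key, s.sale_date, s.total_amount);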

ETL - Transportation

Transportation in Data Warehouses

Transportation is the operation of moving data from one system to another system. In a data warehouse environment, the most common requirements for transportation are in moving data from:

  1. A source system to a staging database or a data warehouse database
  2. A staging database to a data warehouse
  3. A data warehouse to a data mart

Introduction to Transportation Mechanisms in Data Warehouses

You have three basic choices for transporting data in warehouses:

  1. Transportation Using Flat Files
  2. Transportation Through Distributed Operations
  3. Transportation Using Transportable Tablespaces

Let me discuss just one transportation method.

Transportation Using Flat Files

The most common method for transporting data is by the transfer of flat files, using mechanisms such as FTP or other remote file system access protocols. Data is unloaded or exported from the source system into flat files using techniques discussed in "Extraction in Data Warehouses", and is then transported to the target platform using FTP or similar mechanisms. Because source systems and data warehouses often use different operating systems and database systems, using flat files is often the simplest way to exchange data between heterogeneous systems with minimal transformations. However, even when transporting data between homogeneous systems, flat files are often the most efficient and easiest-to-manage mechanism for data transfer.
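As one illustration, many databases can unload a table to a delimited flat file directly. The sketch below assumes a MySQL source and an invented customers table; other systems offer analogous facilities (for example, Teradata FastExport):

    -- Unload the customers table to a comma-delimited flat file
    SELECT customer_id, name, country
    INTO OUTFILE '/tmp/customers.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    FROM customers;

The resulting file would then be pushed to the warehouse platform with FTP or a similar mechanism and loaded on the other side.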




ETL - Extraction

Extraction:

Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. The source systems for a data warehouse are typically transaction processing applications. Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process and in the entire data warehousing process. The source systems might be very complex and poorly documented, and thus determining which data needs to be extracted can be difficult. Normally, the data has to be extracted not just once, but several times in a periodic manner, to supply all changed data to the warehouse and keep it up-to-date.

Extraction Methods in Data Warehouses:

The extraction method to be used is highly dependent on the source system and also on the business needs in the target data warehouse environment. Very often, there is no possibility of adding logic to the source systems to enable incremental extraction of data, because of the performance impact or the increased workload on these systems. Sometimes the customer is not even allowed to add anything to an out-of-the-box application system. The estimated amount of data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also influence the decision of how to extract, from both a logical and a physical perspective.

Logical Extraction Methods

There are two kinds of logical extraction:

1) Full Extraction

2) Incremental Extraction

Full Extraction

The data is extracted completely from the source system. Since this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the data source since the last successful extraction. The source data is provided as is, and no additional logical information (for example, timestamps) is necessary on the source site.
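Conceptually, a full extraction is just a complete copy. As a minimal sketch (with an invented customers table), a CREATE TABLE ... AS SELECT into a staging area captures everything in one shot:

    -- Full extraction: copy the entire source table, no change tracking needed
    CREATE TABLE stg_customers AS
    SELECT * FROM customers;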

Incremental Extraction

At a specific point in time, only the data that has changed since a well-defined event back in history is extracted. This event may be the last time of extraction or a more complex business event, like the last booking day of a fiscal period. To identify this delta, there must be a way to identify all the information that has changed since that specific time event. This information can be provided either by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table, where an additional mechanism keeps track of the changes alongside the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.

Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have a significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.
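Two hedged sketches, both against an invented orders table: the first relies on an application-maintained last_updated column; the second shows the compare-full-extracts alternative using set difference (MINUS in Oracle, EXCEPT in standard SQL):

    -- Delta via a last-changed timestamp column; in practice the time of
    -- the last successful extraction would be recorded and substituted here
    SELECT *
    FROM   orders
    WHERE  last_updated > TIMESTAMP '2007-01-03 00:00:00';

    -- Delta by comparing today's full extract with the previous one
    SELECT * FROM orders_extract_current
    MINUS
    SELECT * FROM orders_extract_previous;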

Physical Extraction Methods

Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Such an offline structure might already exist or it might be generated by an extraction routine. There are the following methods of physical extraction:

1) Online Extraction

2) Offline Extraction

Online Extraction

The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system. With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.
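For example, in Oracle-style SQL an extraction process can pull rows straight from the source system over a database link (the link and table names here are invented):

    -- Online extraction: query the source tables directly over a database link
    INSERT INTO stg_orders
    SELECT * FROM orders@source_db;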

Offline Extraction

The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data either already has an existing structure (for example, flat files) or is created by an extraction routine. This approach is comparatively safer and can also improve performance, because the extraction does not compete with the live workload on the source system.
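Once the data has been staged as a flat file, the load reads that copy rather than the live system. A MySQL-style sketch with invented file and table names:

    -- Offline extraction: read the staged flat file, not the source system
    LOAD DATA INFILE '/tmp/customers.csv'
    INTO TABLE stg_customers
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';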

Wednesday, January 3, 2007

Overview of ETL

Data needs to be loaded into the data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each phase of the process is distinct; ETL should therefore not be thought of as limited to just those three steps. What happens during the ETL process?
Extraction is the operation of extracting data from a source system for further use in a data warehouse environment.

Transportation is the operation of moving data from one system to another system. In a data warehouse environment, the most common requirements for transportation are in moving data from:

  1. A source system to a staging database or a data warehouse database
  2. A staging database to a data warehouse
  3. A data warehouse to a data mart

Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

Loading is the operation of loading the transformed data into the data warehouse environment (target tables).

Issues with ETL:

  1. Consumes 70-80% of project time: This is the part that consumes most of the project time, so extremely careful planning is needed to keep project scope, time and resources under control.
  2. Heterogeneous source systems: Many different systems can act as sources, such as PeopleSoft for HR, or legacy systems that register OLTP transactions.
  3. Little or no control over the source systems.
  4. Scattered source systems working in different time zones and using different currencies.
  5. Different measurement units.
  6. Data quality.


More details of each step in the next session.