Thursday, March 8, 2007

Characteristics of Data Warehousing - Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Characteristics of Data Warehousing - Nonvolatile

This characteristic of data warehousing is very interesting. Normally, whenever data changes, those changes are made in the operational database. In the case of a data warehouse, however, a snapshot of the table is taken on an interval basis (for example, daily) and inserted into the warehouse. There is no need to change the data already present in the DW; new snapshots are simply appended.
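
As a minimal sketch, assuming a hypothetical source table src_account and a warehouse snapshot table dw_account_snapshot, the daily load only ever appends:

  -- Append today's snapshot; rows already in the warehouse
  -- are never updated or deleted (nonvolatile)
  INSERT INTO dw_account_snapshot (snapshot_date, account_id, balance)
  SELECT CURRENT_DATE, account_id, balance
  FROM   src_account;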

Characteristics of Data Warehousing - Integrated

Integration is closely related to subject orientation. A data warehouse must put data from heterogeneous sources, each with its own format, into a single consistent format, whether the data ends up in one warehouse or several. When conflicts in areas such as naming conventions and datatypes are resolved, the data is said to be integrated.
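
As an illustrative sketch (the table and column names here are hypothetical), two source systems that encode gender differently can be conformed to a single convention during the load:

  -- Source A stores gender as 'M'/'F'; source B stores it as 1/0.
  -- Both are conformed to one warehouse convention.
  INSERT INTO dw_customer (customer_id, gender)
  SELECT cust_id,
         CASE gender_code WHEN 'M' THEN 'MALE' WHEN 'F' THEN 'FEMALE' END
  FROM   src_a_customer
  UNION ALL
  SELECT customer_no,
         CASE sex_flag WHEN 1 THEN 'MALE' WHEN 0 THEN 'FEMALE' END
  FROM   src_b_customer;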

Characteristics of Data Warehousing - Subject Oriented

Data warehouses are designed to help you analyse data, so a data warehouse can be modelled according to the way the data needs to be analysed. E.g.: when a bank wants to analyse its customers and their accounts in order to provide discounts or promotions, it will have to build the data warehouse around accounts, as sketched below. This ability to define a data warehouse by subject area is what makes the data warehouse subject oriented.
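
A minimal sketch of such a subject area, with hypothetical table and column names:

  -- The "accounts" subject area for the bank: a fact table
  -- plus the dimension needed to analyse it
  CREATE TABLE customer_dim (
      customer_id INTEGER,
      name        VARCHAR(100),
      segment     VARCHAR(30)
  );

  CREATE TABLE account_fact (
      account_id  INTEGER,
      customer_id INTEGER,
      open_date   DATE,
      balance     DECIMAL(12,2)
  );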

Characteristics of Data Warehousing :

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
1) Subject Oriented
2) Integrated
3) Nonvolatile
4) Time Variant

What is a slowly changing dimension? What kind of SCD was used in your project?

Dimension attribute values may change over time. (Say, for example, a customer dimension has customer_id, name, and address; the customer's address may change over time.)
How will you handle this situation?
There are three types:
1) Type 1: overwrite the existing record.
2) Type 2: create an additional new record at the time of the change, with the new attribute values.
3) Type 3: create a new field to keep the new value in the original dimension table.
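
A minimal sketch of the three approaches, assuming a hypothetical customer_dim table; the current_flag, start_date, end_date, and previous_address columns are illustrative additions needed for Types 2 and 3:

  -- Type 1: overwrite the old value in place (history is lost)
  UPDATE customer_dim
  SET    address = '12 New Street'
  WHERE  customer_id = 100;

  -- Type 2: expire the current row, then insert a new one (full history kept)
  UPDATE customer_dim
  SET    current_flag = 'N', end_date = CURRENT_DATE
  WHERE  customer_id = 100 AND current_flag = 'Y';

  INSERT INTO customer_dim (customer_id, name, address, current_flag, start_date)
  VALUES (100, 'John Smith', '12 New Street', 'Y', CURRENT_DATE);

  -- Type 3: keep the previous value in an extra column (one level of history);
  -- the right-hand side of SET sees the pre-update value of address
  UPDATE customer_dim
  SET    previous_address = address, address = '12 New Street'
  WHERE  customer_id = 100;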

What is the difference between a data mart and a data warehouse?

A data mart is designed for a particular line of business, such as sales, marketing, or finance.

A data warehouse, on the other hand, is enterprise-wide/organizational.

Whether data flows from the warehouse into the marts or from the marts into the warehouse depends on the design approach (top-down or bottom-up).

Thursday, January 4, 2007

ETL - Transformations and Loading

Transformation in Data Warehouses

Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

Transformation Flow

From an architectural perspective, you can transform your data in two ways:

  1. Multistage Data Transformation
  2. Pipelined Data Transformation

Multistage Data Transformation

The data transformation logic for most data warehouses consists of multiple steps. In this case, each step is an independent transformation that does its part of the work and stores the result in a temporary staging table.

The advantage of this method is that the coding is simpler. Also, if the code fails at any stage of a run, the previous staging table still contains the data, so processing can resume from that point rather than starting over. This property is called restartability.
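
A minimal sketch, using hypothetical src_sales, stg_sales_clean, and sales_fact tables, with a staging table written after each step:

  -- Step 1: conversion and scrubbing into a staging table
  INSERT INTO stg_sales_clean (sale_id, sale_date, region, amount)
  SELECT sale_id, CAST(sale_date AS DATE), UPPER(region), amount
  FROM   src_sales
  WHERE  amount IS NOT NULL;

  -- Step 2: aggregation into the warehouse table; if this step fails,
  -- stg_sales_clean still holds the data (restartability)
  INSERT INTO sales_fact (sale_date, region, total_amount)
  SELECT sale_date, region, SUM(amount)
  FROM   stg_sales_clean
  GROUP BY sale_date, region;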

Pipelined Data Transformation

When all the steps are handled in a single pipeline, so that the data flows through every transformation without being written to intermediate tables, this is called pipelined data transformation.

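Using the same hypothetical tables as in the multistage sketch, the whole flow collapses into a single statement:

  -- Conversion, scrubbing, and aggregation in one pass;
  -- no intermediate staging table is written
  INSERT INTO sales_fact (sale_date, region, total_amount)
  SELECT CAST(sale_date AS DATE), UPPER(region), SUM(amount)
  FROM   src_sales
  WHERE  amount IS NOT NULL
  GROUP BY CAST(sale_date AS DATE), UPPER(region);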

In ETL tools like Informatica, pipelined data transformations are better suited because the tool itself can handle such functionality, and restartability can be taken care of internally. These ETL tools have a set of specialized transformations (objects that can modify the incoming data). An example is the Aggregator, which aggregates records according to the requirements.

Loading into the data warehouse:

This is a rather simple topic: in SQL terms, to load means to INSERT. There is a variety of loading methods, but they depend on the database where the target table resides.

For example, in Teradata there are the following load options – FastLoad, MultiLoad and TPump.
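
As a database-agnostic sketch (the bulk utilities above each have their own syntax), one common reloadable pattern, using the hypothetical tables from the transformation examples:

  -- Clear the load window first, so a failed run can simply be repeated
  DELETE FROM sales_fact
  WHERE  sale_date = CURRENT_DATE;

  INSERT INTO sales_fact (sale_date, region, total_amount)
  SELECT sale_date, region, SUM(amount)
  FROM   stg_sales_clean
  GROUP BY sale_date, region;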

ETL - Transportation

Transportation in Data Warehouses

Transportation is the operation of moving data from one system to another system. In a data warehouse environment, the most common requirements for transportation are in moving data from:

  1. A source system to a staging database or a data warehouse database
  2. A staging database to a data warehouse
  3. A data warehouse to a data mart

Introduction to Transportation Mechanisms in Data Warehouses

You have three basic choices for transporting data in warehouses:

  1. Transportation Using Flat Files
  2. Transportation Through Distributed Operations
  3. Transportation Using Transportable Tablespaces

Let me discuss just one transportation method.

Transportation Using Flat Files

The most common method for transporting data is the transfer of flat files, using mechanisms such as FTP or other remote file system access protocols. Data is unloaded or exported from the source system into flat files using techniques discussed in "Extraction in Data Warehouses", and is then transported to the target platform using FTP or similar mechanisms. Because source systems and data warehouses often use different operating systems and database systems, flat files are often the simplest way to exchange data between heterogeneous systems with minimal transformations. However, even when transporting data between homogeneous systems, flat files are often the most efficient and easiest-to-manage mechanism for data transfer.

ETL - Extraction

Extraction:

Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. The source systems for a data warehouse are typically transaction processing applications. Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process, and in the entire data warehousing process. The source systems might be very complex and poorly documented, and thus determining which data needs to be extracted can be difficult. Normally, the data has to be extracted not only once, but several times in a periodic manner, to supply all changed data to the warehouse and keep it up-to-date.

Extraction Methods in Data Warehouses:

The extraction method to be used is highly dependent on the source system and also on the business needs of the target data warehouse environment. Very often, there is no possibility of adding additional logic to the source systems to enable incremental extraction of data, because of performance concerns or the increased workload on those systems. Sometimes the customer is not even allowed to add anything to an out-of-the-box application system. The estimated amount of data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also influence the decision of how to extract, from both a logical and a physical perspective.

Logical Extraction Methods

There are two kinds of logical extraction:

1) Full Extraction

2) Incremental Extraction

Full Extraction

The data is extracted completely from the source system. Since this extraction reflects all the data currently available in the source system, there is no need to keep track of changes to the data source since the last successful extraction. The source data is provided as is, and no additional logical information (for example, timestamps) is necessary on the source site.
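
As a minimal sketch with a hypothetical source table, a full extraction simply reads everything:

  -- Full extraction: the complete table is pulled every time;
  -- no change tracking is needed on the source
  SELECT * FROM src_orders;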

Incremental Extraction

At a specific point in time, only the data that has changed since a well-defined event back in history is extracted. This event may be the last time of extraction or a more complex business event, like the last booking day of a fiscal period. To identify this delta, there must be a way to identify all the information that has changed since that specific time event. This information can be provided either by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table in which an appropriate additional mechanism keeps track of the changes alongside the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.

Many data warehouses do not use any change-capture technique as part of the extraction process. Instead, entire tables are extracted from the source systems to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have a significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.
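
Two hedged sketches of the approaches described above, with hypothetical tables and columns; the first relies on a last-changed timestamp in the source, the second compares the current full extract with the previous one:

  -- (a) Delta via an application timestamp column
  SELECT *
  FROM   src_orders
  WHERE  last_updated > TIMESTAMP '2007-01-03 00:00:00';

  -- (b) Delta by comparing full extracts (no source-side change capture;
  --     EXCEPT is called MINUS in some databases)
  SELECT * FROM stg_orders_current
  EXCEPT
  SELECT * FROM stg_orders_previous;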

Physical Extraction Methods

Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Such an offline structure might already exist or it might be generated by an extraction routine. There are the following methods of physical extraction:

1) Online Extraction

2) Offline Extraction

Online Extraction

The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system. With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.

Offline Extraction

The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data either already has an existing structure (for example, flat files) or is created by an extraction routine. This is comparatively safer and can also improve performance.

Wednesday, January 3, 2007

Overview of ETL

Data needs to be loaded into the data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. ETL should therefore not be thought of as just those three steps. What happens during the ETL process?
Extraction is the operation of extracting data from a source system for further use in a data warehouse environment.
Transportation is the operation of moving data from one system to another system. In a data warehouse environment, the most common requirements for transportation are in moving data from:
  1. A source system to a staging database or a data warehouse database
  2. A staging database to a data warehouse
  3. A data warehouse to a data mart
Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.
Loading is the operation of loading the transformed data into the data warehouse environment (target tables).

Issues with ETL:
Consumes 70-80% of project time:
This is the part that consumes most of the project time. Extremely careful steps must be taken to prevent overruns of project scope, time, and resources.
Heterogeneous source systems:
There are various systems that can act as sources, such as PeopleSoft for HR, legacy systems for registering OLTP transactions, etc.
Little or no control over source systems
Scattered source systems working in different time zones and using different currencies
Different measurement units
Data Quality


More details of each step in the next session.