Thursday, January 4, 2007

ETL - Extraction

Extraction:

Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. The source systems for a data warehouse are typically transaction processing applications. Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process and in the entire data warehousing process. The source systems might be very complex and poorly documented, and thus determining which data needs to be extracted can be difficult. The data has to be extracted normally not only once, but several times in a periodic manner to supply all changed data to the warehouse and keep it up-to-date.

Extraction Methods in Data Warehouses:

The extraction method to be used is highly dependent on the source system and also from the business needs in the target data warehouse environment. Very often, there’s no possibility to add additional logic to the source systems to enhance an incremental extraction of data due to the performance or the increased workload of these systems. Sometimes even the customer is not allowed to add anything to an out-of-the-box application system. The estimated amount of the data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also impact the decision of how to extract, from a logical and a physical perspective.

Logical Extraction Methods

There are two kinds of logical extraction:

1) Full Extraction

2) Incremental Extraction

Full Extraction

The data is extracted completely from the source system. Since this extraction reflects all the data currently available on the source system, there’s no need to keep track of changes to the data source since the last successful extraction. The source data will be provided and no additional logical information (for example, timestamps) is necessary on the source site

Incremental Extraction

At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal period. To identify this delta change there must be a possibility to identify all the changed information since this specific time event. This information can be either provided by the source data itself like an application column, reflecting the last-changed timestamp or a change table where an appropriate additional mechanism keeps track of the changes besides the originating transactions. In most cases, using the latter method means adding extraction logic to the source system. Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.

Physical Extraction Methods

Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Such an offline structure might already exist or it might be generated by an extraction routine. There are the following methods of physical extraction:

1) Online Extraction

2) Offline Extraction

Online Extraction

The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system. With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.

Offline Extraction

The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data already has an existing structure (for example, flat files) or was created by an extraction routine. This is comparatively safer and also aids in performance improvement.

2 comments:

Anonymous said...

Nice dispatch and this fill someone in on helped me alot in my college assignement. Thanks you seeking your information.

Anonymous said...

[url=http://www.payloansonline.com]payday loans[/url]
This is the best way to get all your health products online like green coffee, african mango, phen375 and others. Visit now

[url=http://www.prlog.org/12050970-zetaclear-new-year-sale-2013-on-zeta-clear-last-day-of-this-biggest-sale.html]Buy Zeta Clear[/url]