ETL stands for Extract, Transform, and Load. It is a process that enables businesses to consolidate disparate data while moving it from one system to another, regardless of the data's form or format. The data can come from virtually any source; the ETL process is built to handle such disparities.
Extract
The ETL process begins by extracting data from the source systems. Most data warehousing projects consolidate data from several different source systems, and each system may use its own data organization or format. The goal of the extraction phase is to convert this data into a single format appropriate for transformation processing. As a common best practice, data is first extracted from the source into PREP (staging) tables, where it is held in exactly the same format as the source system, although this is not always required. Extracting into PREP tables makes it straightforward to verify that all data has been pulled from the source and is ready for the business transformations that precede loading into the target system.
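As a rough illustration of this staging step, here is a minimal Python sketch that uses in-memory SQLite databases as stand-ins for the source system and the PREP area; the table and column names (orders, prep_orders, and so on) are illustrative assumptions, not references to any particular system:

```python
import sqlite3

# Stand-in source and staging databases; in practice these would be
# separate systems (e.g., an OLTP database and a warehouse staging area).
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         qty INTEGER, unit_price REAL, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 101, 2, 9.99, '2024-01-05'),
        (2, 102, 1, 24.50, '2024-01-06');
""")

# The PREP table mirrors the source layout exactly; no transformations yet.
staging.execute("""
    CREATE TABLE prep_orders (order_id INTEGER, customer_id INTEGER,
                              qty INTEGER, unit_price REAL, order_date TEXT)
""")

rows = source.execute(
    "SELECT order_id, customer_id, qty, unit_price, order_date FROM orders"
).fetchall()
staging.executemany("INSERT INTO prep_orders VALUES (?, ?, ?, ?, ?)", rows)
staging.commit()

# Simple reconciliation: row counts must match before transformation begins.
assert staging.execute("SELECT COUNT(*) FROM prep_orders").fetchone()[0] == len(rows)
```

Because the PREP table mirrors the source exactly, a simple row-count comparison is enough to confirm the extract is complete before any transformations run.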
Transform
After extraction, the next step in ETL is to apply transformations (in simple terms, the business rules) to the extracted data in order to derive the data for loading into the end target. Some data sources require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (a combined sketch follows the list):
- Restricting unwanted data from propagating into the target system (for example, discarding null values or selecting only certain columns to load)
- Automated cleansing and decoding of data (for example, the source system stores 1 for male and 2 for female, while the warehouse stores M and F)
- Deriving new calculated fields (for example, sale_amount = qty * unit_price)
- Applying filtering and sorting
- Merging data from multiple sources (for example, via lookup or merge operations)
- Aggregation (for example, a rollup that summarizes multiple rows of data, such as total sales per store and per region)
- Generating surrogate-key values
- Splitting a column into multiple columns (for example, expanding a comma-separated list stored as a string in one column into individual values in separate columns)
- Performing lookups and constraint validation against reference tables or files, for example when maintaining slowly changing dimensions
- Validating data; a validation failure may result in full, partial, or no rejection of the data, so that none, some, or all of it is handed to the next step, depending on the rule design and exception handling
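To make several of these transformation types concrete, the following Python sketch applies restriction, decoding, a derived field, surrogate-key generation, column splitting, and a simple aggregation to a handful of hypothetical PREP rows; the field names and code mappings are assumptions for illustration only:

```python
from itertools import count

# Hypothetical rows pulled from the PREP area; field names are illustrative.
prep_rows = [
    {"customer_id": 101, "gender": "1", "qty": 2, "unit_price": 9.99, "tags": "vip,repeat"},
    {"customer_id": 102, "gender": "2", "qty": 1, "unit_price": 24.50, "tags": "new"},
    {"customer_id": None, "gender": "1", "qty": 3, "unit_price": 5.00, "tags": ""},
]

GENDER_CODES = {"1": "M", "2": "F"}  # decode source codes to warehouse values
surrogate_key = count(start=1)       # generator for warehouse surrogate keys

transformed = []
for row in prep_rows:
    # Restriction: discard rows with a null business key.
    if row["customer_id"] is None:
        continue
    transformed.append({
        "customer_sk": next(surrogate_key),             # surrogate-key generation
        "customer_id": row["customer_id"],
        "gender": GENDER_CODES[row["gender"]],          # automated decoding
        "sale_amount": row["qty"] * row["unit_price"],  # derived calculated field
        "tags": row["tags"].split(","),                 # split one column into values
    })

# Aggregation: total sales across the cleansed rows.
total_sales = sum(r["sale_amount"] for r in transformed)
print(transformed, total_sales)
```

In a real pipeline the rejected rows would typically be written to an error or audit table rather than silently dropped, so that exception handling can decide whether the batch proceeds.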
Load
Once the data has been extracted and transformed, it is loaded into the target system, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses overwrite existing information with cumulative data, applying updates from the extract on a periodic schedule (daily, weekly, monthly, and so on). Other warehouses append new data in a historicized form, preserving earlier versions alongside the new ones.
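As a sketch of the two loading styles described above, the snippet below uses an in-memory SQLite database as a stand-in warehouse; the table names and columns are assumptions for illustration:

```python
import sqlite3

dw = sqlite3.connect(":memory:")

# Style 1: overwrite in place; the warehouse keeps only the latest value.
# INSERT OR REPLACE updates any existing row with the same primary key.
dw.execute("CREATE TABLE store_sales (store_id INTEGER PRIMARY KEY, total_sales REAL)")
batch = [(1, 1200.0), (2, 850.0)]
dw.executemany("INSERT OR REPLACE INTO store_sales VALUES (?, ?)", batch)

# Style 2: historicized load; every batch is appended with a load date,
# so earlier values remain queryable alongside the new ones.
dw.execute("""CREATE TABLE store_sales_history
              (store_id INTEGER, total_sales REAL, load_date TEXT)""")
dw.executemany(
    "INSERT INTO store_sales_history VALUES (?, ?, date('now'))", batch)
dw.commit()
```

The choice between the two is driven by the organization's requirements: overwriting keeps the warehouse compact, while the historicized form supports point-in-time analysis at the cost of storage.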