BI Testing and DW/SQL knowledge: DW

Showing posts with label DW. Show all posts

Monday, July 12, 2010

What is Fact and Fact table?

A Fact is a measurable value of the business for example Daily Sales, Headcount, etc.

A Fact Table contains the measurements or metrics or facts of business process. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

What is Dimension and Dimension table?

A Dimension defines the entry point to analyze the facts and enable users to answer business questions.

A Dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes. Dimension tables are nothing but a master tables through which extract the actual transactions. The dimension tables store the textual descriptions of the dimensions. Dimension tables, also known as lookup or reference tables; contain the relatively static data in the warehouse. Dimension tables store the information you normally use to contain queries.

Explain relation between dimension and fact?

Facts are the metrics that business users would use for making business decisions. Generally, facts are mere numbers. The facts cannot be used without their dimensions. Dimensions are those attributes that qualify facts. They give structure to the facts. Dimensions give different views of the facts. In example of employee expenses, the employee expense forms a fact. The Dimensions like department, employee, and location qualify it. Facts are like skeletons of a body. Skin forms the dimensions. The dimensions give structure to the facts.

What is difference between Fact table and dimension table?

Dimension table	Fact table
It provides the context /descriptive information for fact table measurements.	It provides measurement of an enterprise.
Structure of Dimension - Surrogate key, one or more other fields that compose the natural key and set of Attributes.	Structure of Fact Table - foreign key, Degenerated Dimension and Measurements.
Size of Dimension Table is smaller than Fact Table.	Size of Fact Table is larger than Dimension Table.
In a schema more no of dimensions tables are presented than Fact tables.	In a schema less no of Fact tables are presented than Dimension tables.
Values of fields are in text representation.	Values of the fields always in integer form.
We can load the dimension table directly	We can't load the fact table first. So to load the fact table we need to load the dimension table first. Also while loading the fact table we will make a lookup on the dimension table cause the fact table contains the measures/facts & the foreign keys which are primary keys in the dimension tables surrounded to that fact table.

What are the different types of Fact?

Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. If the columns of a fact table is not able in the position to aggregate then it is called non-additive facts.

What are the different types of Dimension?

Confirmed Dimension:

Confirmed Dimension is a dimension which is shared by different fact tables.
Confirmed Dimension is a dimension table which is connected to the multiple fact tables across multiple Data Marts.
Conformed Dimensions are the Dimensions which are common to two cubes.
Example: CUBE-1 contains F1 D1 D2 D3 and CUBE-2 contains F2 D1 D2 D4 are the Facts and Dimensions here D1 D2 are the Conformed Dimensions
Generally the TIME (Period, Date, etc) Dimension is a conformed Dimension.

Junk dimension:

A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
The fact would contain several metrics (principal amount, net amount, price per share etc.) and would be related to several dimensions (such as account, date etc.) This fact would also contain several codes and flags that were related to the transaction rather than any of the dimensions such as origin code (that indicates whether the trade was initiated with a phone call or via the Web), a reinvest flag (that indicates whether or not this trade as was the result of the reinvestment of a dividend payout) and a comment field for storing special instructions from the customer. These three attributes would normally be removed from the fact table and stored in a junk dimension. In this way, the number of indexes on the fact table would be reduced, and performance would be enhanced.

Degenerated dimension:

A column of the key section of the fact table that does not have the associated dimension table but used for reporting and analysis, such column is called degenerate dimension or line item dimension. For ex, we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no, date in key section and price, quantity, amount in measure section. In this fact table, bill_no from key section is a single value, it has no associated dimension table. Instead of creating a separate dimension table for that single value, we can include it in fact table to improve performance.

What is the Factless Fact table and purpose of Factless Fact Table?

Factless Fact table:

A) A Fact table without measures (numeric data) for a column is called Factless Fact table.

B) Fact table contains only id, keys and description columns and not measures are known as factless fact tables.

Purpose:

Generally factless fact table used when events that happen only at information level but not included in the calculations level. Just information about an event that happen over a period.

This type of fact table itself can be used to generate the useful reports such as count the number of occurrences with various criteria. For example, factless fact table to capture the student attendance. The following questions can be answered:

Which class has the least attendance?

What is the average number of attendance of a given course?

All the queries are based on the COUNT () with the GROUP BY queries. I think that the interesting metrics are the nested GROUP BY so you can first count and then apply other aggregate functions such as AVERAGE, MAX, MIN.

What is Surrogate key?

A) A surrogate key is a system-generated (non-meaningful from a business perspective) primary key for purposes of ensuring uniqueness within a database table.

Example: Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but ,not only can these change, indexing on a numerical value is probably better and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and as far as the client is concerned you may display only the AIRPORT_NAME.

B) Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has a turnover allocated to him on the Business Unit 'BU1' But on the 2nd of June the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the new Business Unit 'BU2' but the old one should Belong to the Business Unit 'BU1.'

If you used the natural business key 'E1' for your employee within your datawarehouse everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee 'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' + 'BU2.'

C) To generate sequence number

Generally it starts with 1

Basically it is to differentiate the primary key

The Surrogate key role is it links the Dimension and Fact table

Data type of the surrogate key is either integer or long.

What is a slowly changing dimension (SCD)?

A dimension whose attribute values change over a period of time is called as a Slowly Changing Dimension; there are many ways of SCD implementation, depending on the business requirements. The following are most commonly used SCD implementation.

Type 1: Overwriting the old values. As the old value is overwritten by the new (current) value, there is no way to find out what the old (historical) value(s) were. In other words, it holds ONLY current value.

Ex: Any changes in the Residence Address of a Customer of a retail outlet can be a SCD Type 1, where the past address of the customer can be overwritten with the new one.

Type 2: Adding an additional record in the dimension table. In this case, an additional record is added in the dimension table. This method is very useful for the ACCURATE reporting purpose as all the OLD VALUES are captured. Do not update the existing record. Create a new record (with version number or change date as part of key) of the dimension, while retaining the old one. In this type of implementation, an n-level history is possible.

Ex: Whenever the product category attribute of a product is changed, it is required to keep the history of the old and new category values, for internal audit purpose, so a product category attribute of the product dimension can be treated as SCD Type 2.

Type 3: In this implementation, a column to retain the old value is maintained in the dimension table. In this method, when the value of the dimension is updated (by the current value), the old value column of the table is updated by the most previous value of the dimension. Thus we have the CURRENT value and the MOST PREVIOUS value of the dimension, and hence only 1 level of history can be maintained in this implementation.

Ex: Any changes in the Residence Address of a Customer of a retail outlet can be a SCD Type 3 If the store decides to maintain at least 1 level of history, Here the previous address of the customer as well as the current address is maintained in two different columns.

What is the difference between aggregate table and fact table ?

1) Fact tables generally contain measures and the composite key, which is generally a combination of foreign keys from the respective dimension tables. Aggregate tables are summary tables which contains the summary information say for eg: total sales revenue (for a quarter).

2) So the difference is in the granularity. Fact tables store date in more detail level, but in aggregate tables the granularity is high.

BI Testing and DW/SQL knowledge