Data warehouse design - DataBase and Data Mining Group
Transcript
Data warehouse design - DataBase and Data Mining Group
DB MG Data warehouse design DataBase and Data Mining Group of Politecnico di Torino Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG DB MG Logical design DataBase and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino • We address the relational model (ROLAP) – inputs • • • • Logical design conceptual fact schema workload data volume system constraints – output • relational logical schema • Based on different principles with respect to traditional logical design Elena Baralis Politecnico di Torino Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 37 Copyright – All rights reserved – data redundancy – table denormalization Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 38 Copyright – All rights reserved Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Star schema • Dimensions – – – – DB MG Star schema DataBase and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino Category Type Supplier one table for each dimension surrogate (generated) primary key it contains all dimension attributes hierarchies are not explicitly represented Product Week • all attributes in a table are at the same level Salesman SALES Month Quantity Amount Shop City Country Week – totally denormalized representation Dimension table • it causes data redundancy • Facts Week_ID Week Month Product – one fact table for each fact schema – primary key composed by foreign keys of all dimensions – measures are attributes of the fact table Dimension table Product_ID Product Type Category Supplier Shop Shop_ID Week_ID Product_ID Quantity Amount Shop_ID Shop City Country Salesman Dimension table Fact table From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 39 Copyright – All rights reserved Copyright – All rights reserved Database and data mining group, Politecnico di Torino Shop_ID 1 2 3 4 Shop N1 N2 N3 N4 City RM RM MI MI Country I I I I Salesman R1 R1 R2 R2 Week Jan1 Jan2 Feb1 Feb2 Month Jan. Jan. Feb. Feb. Dimension Table Product_ID Product 1 P1 2 P2 3 P3 4 P4 DataBase and Data Mining Group of Politecnico di Torino • Some functional dependencies are separated, by partitioning dimension data in several tables Dimension Table Type A A B B DB MG Snowflake schema DataBase and Data Mining Group of Politecnico di Torino Shop_ID Week_ID Product_ID Quantity Amount 1 1 1 100 100 1 2 1 150 150 3 3 4 350 350 4 4 4 200 200 Week_ID 1 2 3 4 Database and data mining group, Politecnico di Torino DB MG Star schema Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 40 – a new table separates two branches of a dimensional hierarchy (hierarchy is cut on a given attribute) – a new foreign key correlates the dimension with the new table Fact Table • Decrease in space required for storing the dimension Category Supplier X F1 X F1 X F2 X F2 – decrease is frequently not significant • Increase in cost for reading entire dimension – one or more joins are needed From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Copyright – All rights reserved DATA WAREHOUSE: DESIGN - 41 Elena Baralis Politecnico di Torino Elena Baralis Politecnico di Torino Copyright – All rights reserved Pag. 1 DATA WAREHOUSE: DESIGN - 42 Elena Baralis Politecnico di Torino DB MG Data warehouse design DataBase and Data Mining Group of Politecnico di Torino Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Snowflake schema Category Type Supplier Product Salesman Week Quantity Amount Shop City Country Week Week_ID Week Month Shop Shop_ID Shop City_ID Salesman Shop_ID Week_ID Product_ID Quantity Amount Product Product_ID Product Type_ID Supplier City Type_ID Type Category From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 DATA WAREHOUSE: DESIGN - 43 Shop_ID Week_ID 1 1 1 2 3 3 4 4 Week_ID. 1 2 3 4 Week Jan1 Jan2 Feb1 Feb2 Month Jan. Jan. Feb. Feb. Shop_ID 1 2 3 4 Shop N1 N2 N3 N4 City_ID 1 1 2 2 Product_ID 1 1 4 4 Quantity 100 150 350 200 Amount 100 150 350 200 Foreign key City_ID City Country Type Copyright – All rights reserved DataBase and Data Mining Group of Politecnico di Torino Product_ID Product Supplier Type_ID 1 P1 F1 1 2 P2 F1 1 3 P3 F2 2 4 P4 F2 2 Type_ID Type Category 1 A X 2 B X SALES Month DB MG Snowflake schema DataBase and Data Mining Group of Politecnico di Torino Salesman R1 R1 R2 R2 City_ID 1 2 City RM MI Country I I From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Star or snowflake? Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 44 Copyright – All rights reserved Multiple edges DataBase and Data Mining Group of Politecnico di Torino • The snowflake schema is usually not recommended DB MG DataBase and Data Mining Group of Politecnico di Torino genre SALE author book – storage space decrease is rarely beneficial quantity income date month year • most storage space is consumed by the fact table (difference with dimensions is several orders of magnitude) • Implementation techniques – cost of join execution may be significant – bridge table • The snowflake schema may be useful • new table which models many to many relationship • new attribute weighting the contribution of tuples in the relationship – when part of a hierarchy is shared among dimensions (e.g., geographic hierarchy) – for materialized views, which require an aggregate representation of the corresponding dimensions • multiple edge integrated in the fact table • new corresponding dimension in the fact table Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 45 Copyright – All rights reserved – push down Copyright – All rights reserved Database and data mining group, Politecnico di Torino Multiple edges Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 46 Database and data mining group, Politecnico di Torino DB MG Multiple edges DataBase and Data Mining Group of Politecnico di Torino DB MG DataBase and Data Mining Group of Politecnico di Torino • Queries genre – Weighted query: consider the weight of the multiple edge SALE author book Books Sales Book_ID Date_ID Quantity Income Book_ID Book Genre quantity income date month year BRIDGE Book_ID Author_ID Weight Authors Author_ID Author • example: author income • by using bridge table: SUM(Income*weight) group by ID_author Books Sales Book_ID Author_ID Date_ID Quantity Income Book_ID Book Genre – Impact query: do not consider the weight of the multiple edge Authors Author_ID Author • example: book copies sold for each author • by using bridge table: SUM(Quantity) group by ID_author From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Copyright – All rights reserved DATA WAREHOUSE: DESIGN - 47 Elena Baralis Politecnico di Torino Elena Baralis Politecnico di Torino Copyright – All rights reserved Pag. 2 DATA WAREHOUSE: DESIGN - 48 Elena Baralis Politecnico di Torino DB MG Data warehouse design DataBase and Data Mining Group of Politecnico di Torino Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Multiple edges • Comparison Category • (push down) hard to perform impact queries • (push down) weight is computed when feeding the DW • (push down) weight modifications are hard Supplier Type Product Shipping Mode – push down causes significant redundancy in the fact table – query execution time for push down Return code ORDER LINE Quantity Amount Order Customer City Line Order Status • less joins • higher cardinality for the fact table Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 49 Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 50 Copyright – All rights reserved Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Degenerate dimensions DataBase and Data Mining Group of Politecnico di Torino • Dimensions with a single attribute – weight is explicited in the bridge table, but wired in the fact table for push down Copyright – All rights reserved DB MG Degenerate dimensions DataBase and Data Mining Group of Politecnico di Torino DB MG Junk dimension DataBase and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino • Implementations Order – (usually) directly integrated into the fact table • only for attributes with a (very) small size Order Line Order_ID Product_ID SRL_ID Quantity Amount – junk dimension • single dimension containing several degenerate dimensions • no functional dependencies among attributes in the junk dimension Order_ID Order Customer City_ID SRL SRL_ID ShippingMode ReturnCode LineOrderStatus – all attribute value combinations are allowed – feasible only for attribute domains with small cardinality Copyright – All rights reserved From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 51 Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Materialized views Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 52 Copyright – All rights reserved DB MG Materialized views DataBase and Data Mining Group of Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino • Materialized views may be exploited for answering several different queries • Precomputed summaries for the fact table – explicitly stored in the data warehouse – provide a performance increase for aggregate queries – not for all aggregation operators {a,c} d b c a {b,c} {a,d} v1 = {product, date, shop} {b,d} {c} {a} v2 = {type, date, city} v4 = {type, month, region} v3 = {category, month, city} {d} v5 = {quarter, region} {b} {} Multidimensional lattice From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Copyright – All rights reserved DATA WAREHOUSE: DESIGN - 53 Elena Baralis Politecnico di Torino From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino Copyright – All rights reserved Pag. 3 DATA WAREHOUSE: DESIGN - 54 Elena Baralis Politecnico di Torino DB MG Data warehouse design DataBase and Data Mining Group of Politecnico di Torino Database and data mining group, Politecnico di Torino Database and data mining group, Politecnico di Torino DB MG Materialized view selection DB MG Materialized view selection DataBase and Data Mining Group of Politecnico di Torino • Huge number of allowed aggregations DataBase and Data Mining Group of Politecnico di Torino {a,c} – most attribute combinations are eligible • Selection of the “best” materialized view set • Cost function minimization {b,c} – query execution cost – view maintainance (update) cost {a} {c} q3 {b} {d} • Constraints – – – – {a,d} {b,d} q1 available space time window for update response time data freshness Multidimensional lattice + {} = candidate views, possibly useful to increase workload query performance From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 55 Copyright – All rights reserved q2 DATA WAREHOUSE: DESIGN - 56 Copyright – All rights reserved Database and data mining group, Politecnico di Torino {a,c} Database and data mining group, Politecnico di Torino DB MG Materialized view selection Space and time minimization {b,c} Disk Update space window {a} {c} q1 {a,d} Query cost {b,d} q3 q1 {} From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 DATA WAREHOUSE: DESIGN - 57 From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Elena Baralis Politecnico di Torino Copyright – All rights reserved Database and data mining group, Politecnico di Torino DB MG Materialized view selection {a,c} DataBase and Data Mining Group of Politecnico di Torino All constraints {b,c} {a,d} {b,d} Disk Update Query space window cost {a} {c} q3 {b} q1 q2 {} From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006 Copyright – All rights reserved q2 {} Copyright – All rights reserved {d} {b} {d} q2 Disk Update Query space window cost {a} {c} q3 {b} {d} DataBase and Data Mining Group of Politecnico di Torino {a,c} {a,d} {b,d} DB MG Materialized view selection DataBase and Data Mining Group of Politecnico di Torino Cost minimization {b,c} Elena Baralis Politecnico di Torino DATA WAREHOUSE: DESIGN - 59 Elena Baralis Politecnico di Torino Elena Baralis Politecnico di Torino Pag. 4 DATA WAREHOUSE: DESIGN - 58 Elena Baralis Politecnico di Torino