Data warehouse design - DataBase and Data Mining Group

DataBase and Data Mining Group of Politecnico di Torino
Logical design
• We address the relational model (ROLAP)
  – inputs
    • conceptual fact schema
    • workload
    • data volume
    • system constraints
  – output
    • relational logical schema
• Based on different principles with respect to traditional logical design
  – data redundancy
  – table denormalization
Elena Baralis
Politecnico di Torino
Copyright – All rights reserved

Star schema
• Dimensions
  – one table for each dimension
  – surrogate (generated) primary key
  – it contains all dimension attributes
  – hierarchies are not explicitly represented
    • all attributes in a table are at the same level
  – totally denormalized representation
    • it causes data redundancy
• Facts
  – one fact table for each fact schema
  – primary key composed of the foreign keys to all dimensions
  – measures are attributes of the fact table

Star schema
[Figure: fact schema SALES with measures Quantity and Amount and dimension attributes Product, Type, Category, Supplier, Week, Month, Shop, City, Country, Salesman, mapped to the tables below]
Dimension table Week: Week_ID, Week, Month
Dimension table Product: Product_ID, Product, Type, Category, Supplier
Dimension table Shop: Shop_ID, Shop, City, Country, Salesman
Fact table: Shop_ID, Week_ID, Product_ID, Quantity, Amount
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006
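
A minimal sketch of the star schema described above, using Python's built-in sqlite3 module. The table names (dim_week, dim_product, dim_shop, fact_sales) are illustrative assumptions, not taken from the slides; the columns and the sample rows follow the SALES example of these slides (shops N1..N4, weeks Jan1..Feb2, products P1..P4). The query shows that each dimension attribute is reached with a single join, because every dimension table is fully denormalized.

```python
import sqlite3

# Table names (dim_*, fact_sales) are illustrative; columns follow the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_week (week_id INTEGER PRIMARY KEY, week TEXT, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product TEXT,
                          type TEXT, category TEXT, supplier TEXT);
CREATE TABLE dim_shop (shop_id INTEGER PRIMARY KEY, shop TEXT, city TEXT,
                       country TEXT, salesman TEXT);
-- Fact table: the primary key is made of the foreign keys to all dimensions.
CREATE TABLE fact_sales (
    shop_id    INTEGER REFERENCES dim_shop(shop_id),
    week_id    INTEGER REFERENCES dim_week(week_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    amount     INTEGER,
    PRIMARY KEY (shop_id, week_id, product_id)
);
""")

conn.executemany("INSERT INTO dim_week VALUES (?, ?, ?)", [
    (1, "Jan1", "Jan."), (2, "Jan2", "Jan."),
    (3, "Feb1", "Feb."), (4, "Feb2", "Feb."),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?, ?)", [
    (1, "P1", "A", "X", "F1"), (2, "P2", "A", "X", "F1"),
    (3, "P3", "B", "X", "F2"), (4, "P4", "B", "X", "F2"),
])
conn.executemany("INSERT INTO dim_shop VALUES (?, ?, ?, ?, ?)", [
    (1, "N1", "RM", "I", "R1"), (2, "N2", "RM", "I", "R1"),
    (3, "N3", "MI", "I", "R2"), (4, "N4", "MI", "I", "R2"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 100, 100), (1, 2, 1, 150, 150),
    (3, 3, 4, 350, 350), (4, 4, 4, 200, 200),
])

# Quantity and amount per month and product type: one join per dimension.
star_rows = conn.execute("""
    SELECT w.month, p.type, SUM(f.quantity), SUM(f.amount)
    FROM fact_sales f
    JOIN dim_week w    ON w.week_id = f.week_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY w.month, p.type
    ORDER BY w.month, p.type
""").fetchall()
print(star_rows)
```

Note the redundancy the slide mentions: Category X and Supplier F1/F2 are repeated on every product row that shares them.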

Star schema
Dimension table Shop
Shop_ID  Shop  City  Country  Salesman
1        N1    RM    I        R1
2        N2    RM    I        R1
3        N3    MI    I        R2
4        N4    MI    I        R2
Dimension table Week
Week_ID  Week  Month
1        Jan1  Jan.
2        Jan2  Jan.
3        Feb1  Feb.
4        Feb2  Feb.
Dimension table Product
Product_ID  Product  Type  Category  Supplier
1           P1       A     X         F1
2           P2       A     X         F1
3           P3       B     X         F2
4           P4       B     X         F2
Fact table
Shop_ID  Week_ID  Product_ID  Quantity  Amount
1        1        1           100       100
1        2        1           150       150
3        3        4           350       350
4        4        4           200       200
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006

Snowflake schema
• Some functional dependencies are separated, by partitioning dimension data in several tables
  – a new table separates two branches of a dimensional hierarchy (hierarchy is cut on a given attribute)
  – a new foreign key correlates the dimension with the new table
• Decrease in space required for storing the dimension
  – decrease is frequently not significant
• Increase in cost for reading entire dimension
  – one or more joins are needed

Snowflake schema
[Figure: the SALES fact schema mapped to the tables below]
Dimension table Week: Week_ID, Week, Month
Dimension table Shop: Shop_ID, Shop, City_ID, Salesman
Table City: City_ID, City, Country
Dimension table Product: Product_ID, Product, Type_ID, Supplier
Table Type: Type_ID, Type, Category
Fact table: Shop_ID, Week_ID, Product_ID, Quantity, Amount
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006

Snowflake schema
Fact table
Shop_ID  Week_ID  Product_ID  Quantity  Amount
1        1        1           100       100
1        2        1           150       150
3        3        4           350       350
4        4        4           200       200
Week
Week_ID  Week  Month
1        Jan1  Jan.
2        Jan2  Jan.
3        Feb1  Feb.
4        Feb2  Feb.
Shop
Shop_ID  Shop  City_ID  Salesman
1        N1    1        R1
2        N2    1        R1
3        N3    2        R2
4        N4    2        R2
City
City_ID  City  Country
1        RM    I
2        MI    I
Product
Product_ID  Product  Supplier  Type_ID
1           P1       F1        1
2           P2       F1        1
3           P3       F2        2
4           P4       F2        2
Type
Type_ID  Type  Category
1        A     X
2        B     X
Foreign keys: City_ID in Shop references City, Type_ID in Product references Type
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006
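
A minimal sketch of the snowflake variant of the SALES example, again with Python's sqlite3. Table names (dim_city, dim_type, and so on) are illustrative assumptions; columns and rows follow the snowflake instance of these slides. The Week dimension is left out to keep the sketch short. Reaching City or Category now takes two joins per dimension instead of one, which is the extra read cost the slides refer to.

```python
import sqlite3

# Table names are illustrative; columns and rows follow the slide instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_shop (shop_id INTEGER PRIMARY KEY, shop TEXT,
                       city_id INTEGER REFERENCES dim_city(city_id),
                       salesman TEXT);
CREATE TABLE dim_type (type_id INTEGER PRIMARY KEY, type TEXT, category TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product TEXT,
                          supplier TEXT,
                          type_id INTEGER REFERENCES dim_type(type_id));
CREATE TABLE fact_sales (shop_id INTEGER, week_id INTEGER, product_id INTEGER,
                         quantity INTEGER, amount INTEGER,
                         PRIMARY KEY (shop_id, week_id, product_id));
""")

conn.executemany("INSERT INTO dim_city VALUES (?, ?, ?)",
                 [(1, "RM", "I"), (2, "MI", "I")])
conn.executemany("INSERT INTO dim_shop VALUES (?, ?, ?, ?)", [
    (1, "N1", 1, "R1"), (2, "N2", 1, "R1"),
    (3, "N3", 2, "R2"), (4, "N4", 2, "R2"),
])
conn.executemany("INSERT INTO dim_type VALUES (?, ?, ?)",
                 [(1, "A", "X"), (2, "B", "X")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)", [
    (1, "P1", "F1", 1), (2, "P2", "F1", 1),
    (3, "P3", "F2", 2), (4, "P4", "F2", 2),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 100, 100), (1, 2, 1, 150, 150),
    (3, 3, 4, 350, 350), (4, 4, 4, 200, 200),
])

# Quantity per city and category: each hierarchy cut adds one more join.
snowflake_rows = conn.execute("""
    SELECT c.city, t.category, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_shop s    ON s.shop_id = f.shop_id
    JOIN dim_city c    ON c.city_id = s.city_id
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_type t    ON t.type_id = p.type_id
    GROUP BY c.city, t.category
    ORDER BY c.city
""").fetchall()
print(snowflake_rows)
```

The space saved is small here: City and Type shrink to two rows each, while the fact table, which dominates storage in practice, is unchanged.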

Star or snowflake?
• The snowflake schema is usually not recommended
  – storage space decrease is rarely beneficial
    • most storage space is consumed by the fact table (difference with dimensions is several orders of magnitude)
  – cost of join execution may be significant
• The snowflake schema may be useful
  – when part of a hierarchy is shared among dimensions (e.g., geographic hierarchy)
  – for materialized views, which require an aggregate representation of the corresponding dimensions

Multiple edges
[Figure: fact schema SALE with measures quantity and income and dimension attributes book, genre, author, date, month, year; book and author are connected by a multiple edge]
• Implementation techniques
  – bridge table
    • new table which models the many-to-many relationship
    • new attribute weighting the contribution of tuples in the relationship
  – push down
    • multiple edge integrated in the fact table
    • new corresponding dimension in the fact table

Multiple edges
[Figure: the SALE fact schema implemented with a bridge table and with push down]
Bridge table
  Sales: Book_ID, Date_ID, Quantity, Income
  Books: Book_ID, Book, Genre
  BRIDGE: Book_ID, Author_ID, Weight
  Authors: Author_ID, Author
Push down
  Sales: Book_ID, Author_ID, Date_ID, Quantity, Income
  Books: Book_ID, Book, Genre
  Authors: Author_ID, Author
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006

Multiple edges
• Queries
  – Weighted query: consider the weight of the multiple edge
    • example: author income
    • by using bridge table: SUM(Income * Weight) GROUP BY Author_ID
  – Impact query: do not consider the weight of the multiple edge
    • example: book copies sold for each author
    • by using bridge table: SUM(Quantity) GROUP BY Author_ID
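
A minimal sketch of both implementations of the multiple edge, using Python's sqlite3. The sample books, authors, weights, and sales figures are hypothetical, and so is the table name sales_pd for the push-down fact table. The sketch also assumes that push down stores measures already multiplied by the weight at load time; the slides only say that the weight is fixed when feeding the DW. The Books and Authors dimension tables are omitted because the queries group on keys only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales  (book_id INTEGER, date_id INTEGER,
                     quantity INTEGER, income REAL);
CREATE TABLE bridge (book_id INTEGER, author_id INTEGER, weight REAL,
                     PRIMARY KEY (book_id, author_id));
-- Push-down fact table (hypothetical name): one row per sale and author.
CREATE TABLE sales_pd (book_id INTEGER, author_id INTEGER, date_id INTEGER,
                       quantity REAL, income REAL);
""")

# Hypothetical data: book 1 has two authors (50% each), book 2 has one.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 100.0), (2, 1, 4, 40.0)])
conn.executemany("INSERT INTO bridge VALUES (?, ?, ?)",
                 [(1, 1, 0.5), (1, 2, 0.5), (2, 1, 1.0)])

# Weighted query with the bridge table: income attributed to each author.
weighted = conn.execute("""
    SELECT b.author_id, SUM(s.income * b.weight)
    FROM sales s JOIN bridge b ON b.book_id = s.book_id
    GROUP BY b.author_id ORDER BY b.author_id
""").fetchall()

# Impact query with the bridge table: copies of books each author wrote.
impact = conn.execute("""
    SELECT b.author_id, SUM(s.quantity)
    FROM sales s JOIN bridge b ON b.book_id = s.book_id
    GROUP BY b.author_id ORDER BY b.author_id
""").fetchall()

# Push down: the weight is applied once, while loading the fact table.
conn.execute("""
    INSERT INTO sales_pd
    SELECT s.book_id, b.author_id, s.date_id,
           s.quantity * b.weight, s.income * b.weight
    FROM sales s JOIN bridge b ON b.book_id = s.book_id
""")

# The weighted query needs no join, but the fact table has more rows.
weighted_pd = conn.execute("""
    SELECT author_id, SUM(income) FROM sales_pd
    GROUP BY author_id ORDER BY author_id
""").fetchall()

# Summing the stored quantity no longer gives the impact result.
impact_pd = conn.execute("""
    SELECT author_id, SUM(quantity) FROM sales_pd
    GROUP BY author_id ORDER BY author_id
""").fetchall()

print(weighted, impact, weighted_pd, impact_pd)
```

Under these assumptions the two weighted results agree, while the unweighted copy counts cannot be read directly from the push-down table, since the weight is no longer available as a separate attribute.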

Multiple edges
• Comparison
  – weight is explicit in the bridge table, but hard-wired in the fact table for push down
    • (push down) hard to perform impact queries
    • (push down) weight is computed when feeding the DW
    • (push down) weight modifications are hard
  – push down causes significant redundancy in the fact table
  – query execution time for push down
    • fewer joins
    • higher cardinality for the fact table

Degenerate dimensions
• Dimensions with a single attribute
[Figure: fact schema ORDER LINE with measures Quantity and Amount and dimension attributes Product, Type, Category, Supplier, Order, Customer, City, Shipping Mode, Return code, Line Order Status]

Degenerate dimensions
• Implementations
  – (usually) directly integrated into the fact table
    • only for attributes with a (very) small size
  – junk dimension
    • single dimension containing several degenerate dimensions
    • no functional dependencies among attributes in the junk dimension
      – all attribute value combinations are allowed
      – feasible only for attribute domains with small cardinality

Junk dimension
Fact table Order Line: Order_ID, Product_ID, SRL_ID, Quantity, Amount
Dimension table Order: Order_ID, Order, Customer, City_ID
Junk dimension table SRL: SRL_ID, ShippingMode, ReturnCode, LineOrderStatus
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006
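
A minimal sketch of how the SRL junk dimension could be populated, in plain Python. The attribute domains below are hypothetical; only the attribute names come from the slide. Because no functional dependency holds among the attributes, every combination of values is allowed, so the table is the Cartesian product of the domains. Its size is the product of the domain cardinalities, which is why the approach is feasible only for small domains.

```python
from itertools import product

# Hypothetical domains for the three degenerate dimensions.
shipping_modes = ["AIR", "TRUCK", "SHIP"]
return_codes = ["N", "R"]
line_order_statuses = ["OPEN", "FILLED"]

# One junk-dimension row per value combination, with a surrogate SRL_ID.
combinations = product(shipping_modes, return_codes, line_order_statuses)
junk_rows = [
    (srl_id, mode, code, status)
    for srl_id, (mode, code, status) in enumerate(combinations, start=1)
]

# Lookup used while loading the fact table: combination -> SRL_ID.
srl_lookup = {row[1:]: row[0] for row in junk_rows}

# A fact row (Order_ID, Product_ID, SRL_ID, Quantity, Amount) stores a single
# foreign key instead of three separate attributes.
order_line = (10, 7, srl_lookup[("TRUCK", "R", "OPEN")], 3, 30)

print(len(junk_rows), order_line)
```

With domains of 3, 2, and 2 values the table has 12 rows; adding one attribute with 100 values would already raise it to 1200.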

Materialized views
• Precomputed summaries for the fact table
  – explicitly stored in the data warehouse
  – provide a performance increase for aggregate queries
[Figure: example views v1 = {product, date, shop}, v2 = {type, date, city}, v3 = {category, month, city}, v4 = {type, month, region}, v5 = {quarter, region}]
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006

Materialized views
• Materialized views may be exploited for answering several different queries
  – not for all aggregation operators
[Figure: multidimensional lattice over attributes a, b, c, d with nodes {a,c}, {a,d}, {b,c}, {b,d}, {a}, {b}, {c}, {d}, {}]
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006
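
A minimal sketch of a precomputed summary and of its reuse for a coarser query, with Python's sqlite3. SQLite has no materialized views, so the sketch emulates one with CREATE TABLE ... AS SELECT; the table names and the view name mv_type_month are illustrative assumptions, and the rows reuse the SALES example. The coarser query (quantity per month) is answered from the summary by summing again, which works because SUM can be computed from partial sums. Operators such as a median cannot be rebuilt this way, which is one reading of "not for all aggregation operators".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_week (week_id INTEGER PRIMARY KEY, week TEXT, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product TEXT,
                          type TEXT, category TEXT);
CREATE TABLE fact_sales (week_id INTEGER, product_id INTEGER,
                         quantity INTEGER);
""")
conn.executemany("INSERT INTO dim_week VALUES (?, ?, ?)", [
    (1, "Jan1", "Jan."), (2, "Jan2", "Jan."),
    (3, "Feb1", "Feb."), (4, "Feb2", "Feb."),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)", [
    (1, "P1", "A", "X"), (2, "P2", "A", "X"),
    (3, "P3", "B", "X"), (4, "P4", "B", "X"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 100), (2, 1, 150), (3, 4, 350), (4, 4, 200)])

# Emulated materialized view at the {type, month} aggregation level.
conn.execute("""
    CREATE TABLE mv_type_month AS
    SELECT p.type, p.category, w.month, SUM(f.quantity) AS quantity
    FROM fact_sales f
    JOIN dim_week w    ON w.week_id = f.week_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.type, p.category, w.month
""")

# Coarser query answered from the summary, with no access to the fact table.
from_view = conn.execute("""
    SELECT month, SUM(quantity) FROM mv_type_month
    GROUP BY month ORDER BY month
""").fetchall()

# The same query on the base tables, for comparison.
from_base = conn.execute("""
    SELECT w.month, SUM(f.quantity)
    FROM fact_sales f JOIN dim_week w ON w.week_id = f.week_id
    GROUP BY w.month ORDER BY w.month
""").fetchall()

print(from_view, from_base)
```

In a real warehouse the summary must also be refreshed when the fact table is loaded, which is the maintenance cost weighed in view selection.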

Materialized view selection
• Huge number of allowed aggregations
  – most attribute combinations are eligible
• Selection of the "best" materialized view set
• Cost function minimization
  – query execution cost
  – view maintenance (update) cost
• Constraints
  – available space
  – time window for update
  – response time
  – data freshness

Materialized view selection
[Figure: multidimensional lattice with nodes {a,c}, {a,d}, {b,c}, {b,d}, {a}, {b}, {c}, {d}, {} and workload queries q1, q2, q3; the marked nodes are candidate views, possibly useful to increase workload query performance]
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006

Materialized view selection
[Figures: the multidimensional lattice with workload queries q1, q2, q3, showing the views selected under three different criteria and comparing disk space, update window, and query cost for each]
  – Space and time minimization
  – Cost minimization
  – All constraints
From Golfarelli, Rizzi, "Data warehouse, teoria e pratica della progettazione", McGraw Hill 2006
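
A minimal sketch of view selection under a disk-space constraint, in plain Python. The slides do not name a selection algorithm; the greedy heuristic below (pick the view with the best query-cost gain per row of space until nothing fits or helps) is an assumption chosen for illustration. The roll-up hierarchies a -> b and c -> d, the view sizes, and the group-by sets of q1, q2, q3 are all hypothetical. Update cost, response time, and freshness are not modeled.

```python
# Assumed roll-up hierarchies a -> b and c -> d (not stated in the slides):
# each attribute maps to the attributes that can be derived from it.
ROLLS_UP_TO = {"a": {"a", "b"}, "b": {"b"}, "c": {"c", "d"}, "d": {"d"}}

# Hypothetical sizes (rows) of the lattice nodes.
SIZE = {
    frozenset("ac"): 1000, frozenset("ad"): 400, frozenset("bc"): 500,
    frozenset("bd"): 100, frozenset("a"): 200, frozenset("b"): 20,
    frozenset("c"): 50, frozenset("d"): 10, frozenset(): 1,
}
BASE = frozenset("ac")  # the primary (finest) view, always available

# Hypothetical workload: the group-by set of each query.
WORKLOAD = {"q1": frozenset("bd"), "q2": frozenset("c"), "q3": frozenset("b")}


def can_answer(view, query):
    """A view answers a query if every query attribute is derivable from it."""
    return all(any(q in ROLLS_UP_TO[v] for v in view) for q in query)


def query_cost(query, materialized):
    """Cost model: rows of the smallest available view that answers the query."""
    candidates = materialized | {BASE}
    return min(SIZE[v] for v in candidates if can_answer(v, query))


def workload_cost(materialized):
    return sum(query_cost(q, materialized) for q in WORKLOAD.values())


def greedy_select(space_budget):
    """Add views by best gain per row until the budget or the gain runs out."""
    chosen, used = set(), 0
    while True:
        current = workload_cost(chosen)
        best, best_ratio = None, 0.0
        for view in SIZE:
            if view == BASE or view in chosen:
                continue
            if used + SIZE[view] > space_budget:
                continue
            gain = current - workload_cost(chosen | {view})
            ratio = gain / SIZE[view]
            if ratio > best_ratio:
                best, best_ratio = view, ratio
        if best is None:
            return chosen
        chosen.add(best)
        used += SIZE[best]


small = greedy_select(150)
large = greedy_select(200)
print(sorted("".join(sorted(v)) for v in small), workload_cost(small))
print(sorted("".join(sorted(v)) for v in large), workload_cost(large))
```

With these numbers a budget of 150 rows selects {b} and {c}, and raising it to 200 also admits {b,d}, which lowers the workload cost further. A fuller model would add the update window as a second constraint and the maintenance cost to the objective.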