Data mining - Maurizio Pighin home page

Data Warehousing
and elements of Data Mining
prof. Maurizio Pighin
e-mail: [email protected]
Dipartimento di Matematica e Informatica
Università di Udine - Italy
Motivation: “Necessity is the
Mother of Invention”
DW and
elements of DM
Maurizio Pighin
• Data explosion problem
– Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases and other information repositories
• Difficult to analyze data
– Complex query, long time of analysis
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and Data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Slide 2
Copyright © 2008 by Maurizio Pighin
Evolution of Database Technology
• 1960s: Data collection, database creation, IMS and
network DBMS
• 1970s: Relational data model, relational DBMS
implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s—2000s: Data mining and data warehousing,
multimedia databases, and Web databases
Slide 3
Evolution of data analysis
• 1960s: batch reports
– Difficult to find and analyze data
– Expensive, every request needs a new report (even today,
many systems offer only this kind of analysis)
• 1970s: First procedures to support the decision process
– Usually very limited and not integrated with office
automation tools
• 1980s: Office automation tools
– Query tools, spreadsheets, GUIs
– Access to operational data (usually very complex)
• 1990s: Data warehousing and data mining
Slide 4
Data Warehousing and
Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining
Slide 5
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained
separately from the organization’s operational
database
– Support information processing by providing a solid
platform of consolidated, historical data for analysis.
Slide 6
What is Data Warehouse?
• “A data warehouse is a subject-oriented, integrated,
time-variant, and non volatile collection of data in
support of management’s decision-making process.”
- W. H. Inmon (1985)
• “A single, complete and consistent data warehouse,
obtained by different sources, available to final users
to be immediately utilized” – IBM System Journal
(1990)
• Data warehousing:
– The process of constructing and using data
warehouses
Slide 7
Data Warehouse - Subject-Oriented
• Organized around major subjects, such as customer,
product, sales.
• Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
• Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
Slide 8
Data Warehouse - Integrated
• Constructed by integrating multiple, heterogeneous
data sources
– relational databases, flat files, on-line transaction
records
• Data cleaning and data integration techniques are
applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
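The hotel-price case above can be sketched as a small conversion step. This is only an illustration: the exchange rates, tax rate, breakfast surcharge and the `HotelPrice` type are all hypothetical, chosen to show how differing source conventions might be reconciled into one warehouse convention.

```python
from dataclasses import dataclass

# Hypothetical conversion parameters, chosen only for illustration.
EUR_PER_UNIT = {"EUR": 1.0, "USD": 0.5}
TAX_RATE = 0.10
BREAKFAST_EUR = 10.0


@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to the warehouse convention:
    EUR, tax included, breakfast included."""
    eur = p.amount * EUR_PER_UNIT[p.currency]
    if not p.tax_included:
        eur *= 1 + TAX_RATE
    if not p.breakfast_included:
        eur += BREAKFAST_EUR
    return round(eur, 2)
```

Once every source record goes through the same conversion, prices from different sources become directly comparable in the warehouse.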
Slide 9
Data Warehouse - Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”.
Slide 10
Data Warehouse - Non-Volatile
• A physically separate store of data transformed from
the operational environment.
• Operational update of data does not occur in the data
warehouse environment.
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data
• access of data.
Slide 11
Data Warehouse
• Data analysis system characteristics:
FASMI – OLAP Report 1995
– Fast
– Analytical
– Shared
– Multidimensional
– Informational
Slide 12
Why do we need all that?
• Operational databases are for On Line Transaction
Processing (OLTP)
– automate day-to-day operations (purchasing, banking
etc)
– transactions access (and modify!) a few records at a
time
– database design is application (process) oriented
– metric: transactions/sec
Slide 13
Why do we need all that?
• Data Warehouse is for On Line Analytical Processing
(OLAP)
– complex queries that access millions of records
– need historical data for trend analysis
– long scans would interfere with normal operations
– synchronizing data-intensive queries among physically
separated databases would be a nightmare!
– metric: query response time
Slide 14
Examples of OLAP
• Comparisons (this period vs. last period)
– Show me the sales per region for this year and
compare them to those of the previous year to identify
discrepancies
• Multidimensional ratios (percent to total)
– Show me the contribution to weekly profit made by all
items sold in the northeast stores between May 1 and
May 7
Slide 15
Examples of OLAP
• Ranking and statistical profiles
(top N/bottom N)
– Show me sales, profit and average call volume per day
for my 10 most profitable salespeople
• Custom consolidation
(market segments, ad hoc groups)
– Show me an abbreviated income statement by quarter
for the last four quarters for my northeast region
operations
Slide 16
Data Warehouse vs.
Heterogeneous DBMS
• Traditional heterogeneous DB integration:
– Build wrappers/mediators on top of heterogeneous
databases
– Query driven approach
• When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
• Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
– Information from heterogeneous sources is integrated
in advance and stored in warehouses for direct query
and analysis
Slide 17
Data Warehouse vs.
Operational DBMS
• OLTP (on-line transaction processing)
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– System orientation: process vs. business subject
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. Multidimensional + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
Slide 18
OLTP vs. OLAP
                     OLTP                             OLAP
users                clerk, IT professional           knowledge worker
function             day to day operations            decision support
DB design            application-oriented             subject-oriented
data                 current, up-to-date,             historical, summarized,
                     detailed, flat relational,       multidimensional,
                     isolated                         integrated, consolidated
usage                repetitive                       ad-hoc
access               read/write,                      lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction        complex query
# records accessed   tens                             millions
# users              thousands                        hundreds
DB size              100MB-GB                         100GB-TB
metric               transaction throughput           query throughput, response
Slide 19
Why Separate Data Warehouse?
• High performance for both systems
– DBMS - tuned for OLTP: access methods, indexing, concurrency
control, recovery
– Warehouse - tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
• Different functions and different data:
– missing data: Decision Support requires historical data which
operational DBs do not typically maintain
– data consolidation: Decision Support requires consolidation
(aggregation, summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Slide 20
Multidimensional model
• A data warehouse is based on a multidimensional
data model which views data in the form of a data
cube (hypercube)
• A hypercube is a multidimensional array which
represents a particular event
• We define a “fact” as a point of this multidimensional array,
obtained by crossing existing coordinates
– Dimension: fact co-ordinate
– Measure: numerical value characterizing the event
Slide 22
Multidimensional model - example
• A data cube, such as sales, allows numerical data
(measures) to be modeled and viewed in multiple
dimensions
– Measures such as transaction value (dollars_sold),
quantity (item_quantity)
– Dimension, such as item (item_name, brand, type), or
time (day, week, month, quarter, year), or customer
(customer_name, city, region, state)
Slide 23
Measures
• Every fact can contain more than one measure
• A measure may be
– Stored in the data warehouse (effective)
– Evaluated at run time from effective measures
– Implicit (presence or absence of a fact)
Slide 24
Fact aggregation
• It is possible to aggregate elementary facts to obtain
synthetic facts
• The measures of the synthetic facts can be obtained
with aggregation operators
– Sum, mean, max, min,…
• For each measure-dimension pair it is possible to
define different aggregation operators
Slide 25
Fact aggregation
• Measures can be
– Additive: can be aggregated by sum along every
dimension (for instance, total income)
– Semi-additive: can be aggregated by sum along some
dimensions but not along others (for instance, quantity can
be summed on “item” but not on “store”, where
different items are present)
– Non-additive: can never be summed; other operators
must be used (mean, median, max, min) (for
instance, unit price)
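The difference between an additive and a non-additive measure can be sketched on a few elementary facts. The facts and the choice of `max` as the operator for unit price are assumptions made for illustration; any of the non-sum operators listed above could be used instead.

```python
from collections import defaultdict

# Hypothetical elementary facts: (store, item, quantity, unit_price).
facts = [
    ("S1", "pen", 10, 2.0),
    ("S1", "book", 2, 15.0),
    ("S2", "pen", 5, 3.0),
]


def aggregate_by_store(rows):
    """Roll elementary facts up to one synthetic fact per store."""
    income = defaultdict(float)     # additive measure: aggregated with sum
    max_price = defaultdict(float)  # non-additive measure: aggregated with max
    for store, _item, qty, price in rows:
        income[store] += qty * price
        max_price[store] = max(max_price[store], price)
    return dict(income), dict(max_price)


income, max_price = aggregate_by_store(facts)
```

Summing the unit prices per store would give a number with no business meaning, which is why that measure gets a different operator.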
Slide 26
Dimension hierarchy
• Hierarchy
– Set of dimensional attributes hierarchically linked to
one dimension
– Dimensional attributes
• Are used to aggregate elementary facts
• Are uniquely determined by a dimension
• Represent a “classification” of the dimension
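A minimal sketch of how dimensional attributes classify a dimension and drive aggregation. The city, country and region names are taken from the location hierarchy example in this deck; the sales figures and the function names are hypothetical.

```python
# Each dimensional attribute is uniquely determined by the level below it.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {"Germany": "Europe", "Canada": "North_America"}


def classify(city):
    """Return the value of every hierarchy level for one city."""
    country = CITY_TO_COUNTRY[city]
    return {
        "city": city,
        "country": country,
        "region": COUNTRY_TO_REGION[country],
        "all": "all",
    }


def roll_up(sales_by_city, level):
    """Aggregate elementary facts (sales per city) to a coarser level."""
    totals = {}
    for city, amount in sales_by_city.items():
        key = classify(city)[level]
        totals[key] = totals.get(key, 0) + amount
    return totals
```

For example, `roll_up({"Frankfurt": 10, "Vancouver": 5, "Toronto": 7}, "country")` groups the two Canadian cities into a single synthetic fact.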
Slide 27
Example of dimension hierarchy
Levels: all > region > country > city > office
all
  Europe (region)
    Germany (country)
      Frankfurt (city)
      ...
    ...
    Spain
  ...
  North_America
    Canada
      Vancouver
        L. Chan (office)
        ...
        M. Wind
      ...
      Toronto
    ...
    Mexico
Slide 28
View of Warehouses and
Hierarchies
Slide 29
Multidimensional Data
• Sales volume as a function of Product, Location, and
Time
Dimensions: Product, Location, Time
Hierarchical summarization paths
– Product: Industry > Category > Item
– Location: Region > Country > City > Office
– Time: Year > Quarter > Month, Week > Day
Slide 30
OLAP Server Architectures
• Relational OLAP (ROLAP)
– Use relational or extended-relational DBMS to store
and manage warehouse data and OLAP middleware
to support missing pieces
– Include optimization of DBMS backend,
implementation of aggregation navigation logic, and
additional tools and services
– Greater scalability
Slide 32
OLAP Server Architectures
• Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse
matrix techniques)
– fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
– User flexibility, e.g., low level: relational, high-level:
array
Slide 33
Conceptual Modeling of Data
Warehouses
• Modeling data warehouses: dimensions & measures
on ROLAP Systems
– Star schema: A fact table in the middle connected to a
set of dimension tables
– Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
– Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Slide 34
Components of Star Schema
Fact tables contain factual
or quantitative data
Dimension tables are denormalized to
maximize performance
1:N relationship between
dimension tables and fact tables
Dimension tables contain descriptions
about the subjects of the business
Excellent for ad-hoc queries, but bad for online transaction processing
Slide 35
Star Schema example
Fact table provides statistics for sales
broken down by product, period and
store dimensions
Slide 36
Star Schema with sample data
Slide 37
Another example of Star Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_street, country
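A star schema of this kind can be sketched with Python's built-in `sqlite3` module. The table names (`time_dim`, `item_dim`, `sales_fact`), the reduced set of columns and the sample rows are assumptions for illustration; only two of the four dimensions are modeled to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
-- The fact table holds foreign keys to the dimensions plus the measures.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim(time_key),
                         item_key INTEGER REFERENCES item_dim(item_key),
                         units_sold INTEGER, dollars_sold REAL);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2008), (2, 9, 4, 'Q2', 2008);
INSERT INTO item_dim VALUES (1, 'TV-100', 'Acme', 'TV'), (2, 'PC-200', 'Acme', 'PC');
INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 1, 600.0), (2, 1, 3, 1200.0);
""")

# Typical star join: aggregate a measure, grouped by dimensional attributes.
rows = conn.execute("""
SELECT t.quarter, i.type, SUM(s.dollars_sold)
FROM sales_fact s
JOIN time_dim t ON t.time_key = s.time_key
JOIN item_dim i ON i.item_key = s.item_key
GROUP BY t.quarter, i.type
ORDER BY t.quarter, i.type
""").fetchall()
```

Every analytical query follows the same pattern: join the fact table to the dimensions it needs, then group by the attributes of interest.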
Slide 38
Example of Snowflake Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city_key
– city: city_key, city, province_or_street, country
Slide 39
Example of Fact Constellation
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
– Keys: time_key, item_key, shipper_key, from_location, to_location
– Measures: dollars_cost, units_shipped
Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– location: location_key, street, city, province_or_street, country
– shipper: shipper_key, shipper_name, location_key, shipper_type
Slide 40
Main Data Warehouse
Architectures
• Architectures
– Generic Two-Level Architecture
– Independent Data Mart
– Dependent Data Mart and Operational Data Store Three-Level Architecture
• All involve some form of extraction, transformation
and loading (ETL)
Slide 41
Generic Two Level
Data Warehousing Architecture
[Figure: generic two-level architecture; callouts:]
– E, T, L: extract, transform, load
– One, company-wide warehouse
– Periodic extraction → data is not completely current in warehouse
Slide 42
Independent data mart
Data Warehousing Architecture
[Figure callouts:]
– E, T, L: extract, transform, load
– Data marts: mini-warehouses, limited in scope
– Separate ETL for each independent data mart
– Data access complexity due to multiple data marts
Slide 43
Dependent data mart with operational
data store: three-level architecture
[Figure callouts:]
– E, T, L: extract, transform, load
– Single ETL for enterprise data warehouse (EDW)
– Dependent data marts loaded from EDW
– Simpler data access
Slide 44
General Architecture
[Figure: general architecture in four tiers]
– Data Sources: operational DBs, other sources
– Data Storage: extract, transform, load, refresh (monitor & integrator); metadata; data warehouse; data marts
– OLAP Engine: OLAP server
– Front-End: analysis, query, reports, data mining
Slide 45
• Enterprise warehouse
– collects all of the information about subjects spanning
the entire organization
• Data Mart
– a subset of corporate-wide data that is of value to
specific groups of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
• Independent vs. dependent (loaded directly from the warehouse) data
mart
Slide 46
ETL function
• Data extraction:
– get data from multiple, heterogeneous, and external sources
• Data cleaning:
– detect errors in the data and rectify them when possible
• Data transformation:
– convert data from legacy or host format to warehouse format
• Load:
– sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
• Refresh:
– propagate the updates from the data sources to the warehouse
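The first four ETL steps above can be sketched as a chain of small functions. The record layout (`customer`, `amount`), the cleaning rule (drop records with a missing or negative amount) and the dictionary standing in for the warehouse are all assumptions for illustration; a real cleaning step would try to rectify errors rather than simply discard records.

```python
def extract(sources):
    """Data extraction: read records from multiple sources."""
    for source in sources:
        yield from source


def clean(records):
    """Data cleaning: here, simply drop records with a missing or negative amount."""
    for r in records:
        if r.get("amount") is not None and r["amount"] >= 0:
            yield r


def transform(records):
    """Data transformation: convert to the (hypothetical) warehouse format."""
    for r in records:
        yield {"customer": r["customer"].strip().upper(), "amount": float(r["amount"])}


def load(records, warehouse):
    """Load: consolidate amounts per customer into the warehouse."""
    for r in records:
        warehouse[r["customer"]] = warehouse.get(r["customer"], 0.0) + r["amount"]
    return warehouse


def run_etl(sources):
    return load(transform(clean(extract(sources))), {})
```

Because each step is a generator, records flow through the pipeline one at a time, and a refresh can reuse the same chain on new source data.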
Slide 47
Design of a Data Warehouse:
A Business Analysis Framework
• Four views regarding the design of a data warehouse
– Top-down view
• allows selection of the relevant information necessary for the
data warehouse
– Data source view
• exposes the information being captured, stored, and managed
by operational systems
– Data warehouse view
• consists of fact tables and dimension tables
– Business query view
• sees the perspectives of data in the warehouse from the viewpoint
of the end user
Slide 49
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination
of both
– Top-down: Starts with overall design and planning
(mature)
– Bottom-up: Starts with experiments and prototypes
(rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each
step before proceeding to the next (top-down)
– Spiral: rapid generation of increasingly functional
systems, short turnaround time
(bottom-up)
Slide 50
Data Warehouse Design Process
• Typical data warehouse design process with a bottom-up
approach
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
– Design the architecture of the DW
– Design the ETL
– Install and test
• Advantages
– Results in a short time
– Not too expensive
– Gives management a clear perspective of the OLAP world
Slide 51
Exploration of Data Cubes
• OLAP
– Interactive navigation through data
• Two models
– Hypothesis-driven: exploration driven by
hypotheses formulated by the user
– Discovery-driven: pre-compute measures indicating
exceptions, guide the user in the data analysis, at all levels
of aggregation. Users then apply hypothesis-driven
exploration
Slide 53
A Sample Data Cube
[Figure: data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A, Canada, Mexico, sum); one highlighted cell holds the total annual sales of TV in U.S.A.]
Slide 54
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
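Roll-up by dimension reduction can be sketched on a tiny cube stored as a dictionary. The cell values and the function name are hypothetical; the dimension members follow the sample data cube in this deck.

```python
from collections import defaultdict

DIMENSIONS = ("product", "quarter", "country")

# Hypothetical base cells: (product, quarter, country) -> sales.
cube = {
    ("TV", "1Qtr", "U.S.A"): 100,
    ("TV", "2Qtr", "U.S.A"): 150,
    ("TV", "1Qtr", "Canada"): 40,
    ("PC", "1Qtr", "U.S.A"): 200,
}


def roll_up(cells, keep):
    """Sum out every dimension not listed in `keep` (dimension reduction)."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for coords, value in cells.items():
        out[tuple(coords[i] for i in idx)] += value
    return dict(out)
```

Calling `roll_up(cube, ("product", "country"))` removes the quarter dimension; drilling down is the reverse step, going back to the more detailed cells that the summary was computed from.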
Slide 55
Roll-up/Drill-down
[Figure: cubes over Product, Date and Country connected by roll-up and drill-down arrows, up to the “All” level of each dimension]
Slide 56
OLAP Operations
drill-down
Slide 57
OLAP Operations
drill-down
Slide 58
OLAP Operations
drill-down
Slide 59
OLAP Operations
roll-up
Slide 60
OLAP Operations
roll-up
Slide 61
OLAP Operations
roll-up
Slide 62
OLAP Operations
• Slice and Dice: select and project on one or more
dimensions
[Figure: cube with axes product, country and date, sliced on customer = “Smith”]
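Slice and dice can be sketched as two filters over a cube stored as a dictionary. The cell values and function names are hypothetical; slicing fixes one dimension to a single value, while dicing restricts several dimensions to sets of allowed values.

```python
DIMS = ("product", "country", "date")

# Hypothetical cells: (product, country, date) -> sales.
CELLS = {
    ("TV", "U.S.A", "1Qtr"): 10,
    ("TV", "Canada", "1Qtr"): 4,
    ("PC", "U.S.A", "2Qtr"): 7,
}


def slice_cube(cells, dims, dim, value):
    """Slice: fix one dimension to a single value and drop it from the result."""
    i = dims.index(dim)
    return {c[:i] + c[i + 1:]: v for c, v in cells.items() if c[i] == value}


def dice_cube(cells, dims, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets;
    the number of dimensions is unchanged."""
    return {
        c: v
        for c, v in cells.items()
        if all(c[dims.index(d)] in vals for d, vals in allowed.items())
    }
```

A slice therefore returns a cube with one dimension fewer, whereas a dice returns a smaller sub-cube with the same dimensions.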
Slide 63
Slice
[Figure: a cube with axes Product, Country and Date (4 quarters) sliced into a cube covering Date (2 quarters)]
Slide 64
OLAP Operations
slice-and-dice
Slide 65
OLAP Operations
slice-and-dice
Slide 66
OLAP Operations
slice-and-dice
Slide 67
OLAP Operations
• Pivot (rotate):
– reorient the cube visualization, 3D to
series of 2D planes.
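On a single 2D plane, pivoting amounts to swapping the row and column dimensions. A minimal sketch, assuming the view is held as a nested dictionary (a hypothetical representation):

```python
def pivot(table):
    """Swap the row and column dimensions of a 2-D view {row: {col: value}}."""
    rotated = {}
    for row, cols in table.items():
        for col, value in cols.items():
            rotated.setdefault(col, {})[row] = value
    return rotated
```

For instance, a view with products as rows and quarters as columns becomes one with quarters as rows and products as columns; the values themselves do not change.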
Slide 68
OLAP Operations
[Figure: a cube with axes Time, Store and Product shown in several orientations obtained by pivoting]
Slide 69
OLAP Operations
pivoting
Slide 70
OLAP Operations
pivoting
Slide 71
OLAP Operations
pivoting
Slide 72
OLAP Operations
• Drill across: involving (across) more
than one fact table
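Drill-across can be sketched as combining the measures of two fact tables on the dimension keys they share. The measure names follow the sales and shipping fact tables of the fact constellation example; the key layout `(time_key, item_key)` and the values are hypothetical.

```python
def drill_across(sales, shipping):
    """Combine measures of two fact tables on their shared dimension keys."""
    result = {}
    for key in sales.keys() & shipping.keys():
        result[key] = {"dollars_sold": sales[key], "dollars_cost": shipping[key]}
    return result


# Hypothetical facts keyed by (time_key, item_key).
sales_facts = {(1, 1): 800.0, (1, 2): 600.0}
shipping_facts = {(1, 1): 50.0, (2, 1): 70.0}
combined = drill_across(sales_facts, shipping_facts)
```

Only coordinates present in both fact tables appear in the result, which is why the two tables must share conformed dimensions for the operation to make sense.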
Slide 73
OLAP Operations
drill-across
Slide 74
OLAP Operations
drill-across
Slide 75
Exploration of Data Cubes
• Hypothesis-driven
– exploration by user, huge search space
• Discovery-driven
– Pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
– Exception: significantly different from the value
anticipated, based on a statistical model
– Visual cues such as background color are used to
reflect the degree of exception of each cell
– Computation of exception indicator can be overlapped
with cube construction
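The idea of an exception indicator can be sketched with a very simple statistical model. Here the anticipated value of every cell is assumed to be the mean of all cells, and a cell is flagged when it lies more than `threshold` standard deviations away. This z-score rule is a stand-in chosen for illustration, not the model used by any particular OLAP tool.

```python
from statistics import mean, pstdev


def exception_cells(cells, threshold=2.0):
    """Return {cell: z-score} for cells that deviate strongly from the mean."""
    values = list(cells.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}  # all cells equal: nothing is exceptional
    return {
        key: round((value - mu) / sigma, 2)
        for key, value in cells.items()
        if abs(value - mu) / sigma > threshold
    }


# Hypothetical monthly sales; June is far from the other months.
monthly = {"Jan": 100, "Feb": 102, "Mar": 98, "Apr": 101, "May": 99, "Jun": 300}
flagged = exception_cells(monthly)
```

A front end could then map the magnitude of each returned score to a background color, as described above.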
Slide 76
Examples: Discovery-Driven Data
Cubes
Slide 77
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
• Differences among the three tasks
Slide 79
From On-Line Analytical Processing to
On Line Analytical Mining (OLAM)
• Why online analytical mining?
– High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
– Available information processing structure surrounding
data warehouses
• ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
– OLAP-based exploratory data analysis
• mining with drilling, dicing, pivoting, etc.
– On-line selection of data mining functions
Slide 80
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large
databases
• Alternative names:
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Slide 82
What Is Data Mining?
• Other Definitions
– Non-trivial extraction of implicit, previously unknown
and potentially useful information from data
– Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data in
order to discover meaningful patterns
Slide 83
Why Mine Data?
Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– Purchases at department stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
– Provide better, customized services (e.g. in Customer
Relationship Management)
Slide 84
Mining Large Data Sets
Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information
• Much of the data is never analyzed at all
Slide 85
Why Data Mining?
Potential Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, quality control, competitive
analysis
– Fraud detection and management
• Other Applications
– Text mining (news group, email, documents) and Web
analysis.
Slide 86
Market Analysis and Management
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the
same characteristics: interest, income level, spending
habits, etc.
Slide 87
Market Analysis and Management
• Determine customer purchasing patterns over time
– Changing of customer habits with age
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
• Customer profiling
– Identifying what types of customers buy what
products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– using prediction to find what factors will attract new
customers
Slide 88
Corporate Analysis and Risk
Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)
• Resource planning
– summarize and compare the resources and spending
• Competition
– monitor competitors and market directions
– group customers into classes and apply a class-based
pricing procedure
– set pricing strategy in a highly competitive market
Slide 89
Fraud Detection and Management
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
– approach: use historical data to build models of
fraudulent behavior and use data mining to help
identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money
transactions
Slide 90
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future
values of other variables.
• Description Methods
– Find human-interpretable patterns that describe the
data.
Slide 91
Principal Data Mining Tasks.
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Slide 92
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be assigned
a class as accurately as possible.
• Methodology: a test set is used to determine the
accuracy of the model. Usually, the given collection
of known data is randomly divided into training
and test sets, with the training set used to build the
model and the test set used to validate it.
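The training/test methodology can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`holdout_split`, `train_majority`, `accuracy`), the 70/30 split, the fixed random seed, the majority-class baseline used as the "model", and the sample records are all hypothetical and do not come from the slides.

```python
import random
from collections import Counter


def holdout_split(records, test_fraction=0.3, seed=0):
    """Randomly divide labelled records into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(training_set):
    """Baseline model (an assumption): always predict the most frequent training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the known class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)


# Hypothetical records: (attributes, class) pairs.
records = [({"income": 40 + 10 * i}, "Yes" if i % 3 == 0 else "No") for i in range(10)]
training_set, test_set = holdout_split(records)
model = train_majority(training_set)   # built from the training set only
print(len(training_set), len(test_set), accuracy(model, test_set))
```

Any real classifier can replace `train_majority`; the point of the sketch is that the model never sees the test records while it is being built.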
Slide 93
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model <-- Test Set
Slide 94
Classification: Application
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
Slide 95
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them, find
clusters such that
– Data points in one cluster are more similar to one
another.
– Data points in separate clusters are less similar to one
another.
• Similarity Measures
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures
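As an illustration of clustering with Euclidean distance, the sketch below uses k-means, one common algorithm that the slides do not name. The function names (`euclidean`, `k_means`), the number of iterations, the choice of initial centroids, and the 3-D sample points are all assumptions made for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical data points in 3-D space, forming two well-separated groups.
points = [(1, 1, 1), (1.5, 2, 1), (1, 0.5, 1.2), (8, 8, 8), (9, 9.5, 8), (8.5, 9, 9)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
for cluster, centroid in zip(clusters, centroids):
    print(cluster, centroid)
```

With this data the three points near the origin end up in one cluster and the remaining three in the other, so distances within a cluster are small and distances between clusters are large.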
Slide 96
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
Slide 97
Clustering: Application
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Slide 98
Association Rule Discovery:
Definition
• Given a set of records, each of which contains some
number of items from a given collection:
– Produce dependency rules which will predict the
occurrence of an item based on occurrences of other
items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
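The strength of such rules is usually quantified with support and confidence. The slide does not define these measures; they are the standard ones, and the function names below (`count`, `support`, `confidence`) are chosen only for this sketch. The transactions are the five records listed on the slide.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support(antecedent | consequent, transactions),
          "confidence:", confidence(antecedent, consequent, transactions))
```

On this data {Milk} --> {Coke} holds in 3 of the 5 transactions and in 3 of the 4 that contain Milk; {Diaper, Milk} --> {Beer} holds in 2 of the 5 transactions and in 2 of the 3 that contain both Diaper and Milk.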
Slide 99
Association Rule Discovery:
Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels in the antecedent => Can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Slide 100
Association Rule Discovery:
Application 2
• Supermarket shelf management.
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
– A classic rule:
• If a customer buys diapers and milk, then he is very likely to buy
beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!
Slide 101
Regression
• To identify unknown values in a continuous domain
• Build trend functions by fitting the known
points (regression)
• Different models
– Linear regression (two variables)
• Y = q + m X
– Multi-linear regression (more variables)
• Y = q + m_1 X_1 + m_2 X_2 + m_3 X_3
– Non-linear regression (polynomial, exponential,
logarithmic ...)
• Y = q + m_1 X + m_2 X^2 + m_3 X^3
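For the two-variable linear model, q and m can be estimated from known points with ordinary least squares. The slide does not say which fitting criterion is intended, so least squares is an assumption here, as are the function name `fit_line` and the sample points.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of q (intercept) and m (slope)
    for the model Y = q + m X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    q = mean_y - m * mean_x
    return q, m


# Hypothetical known points lying roughly on a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
q, m = fit_line(xs, ys)

# Use the fitted function to estimate an unknown value in the continuous domain.
estimate = q + m * 6.0
print(q, m, estimate)
```

The same idea extends to the multi-linear and polynomial models by solving for more coefficients, which in practice is usually left to a numerical library.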
Slide 102
Regression
• Example
Slide 103
Deviation Detection
• The search for “outliers”
• Outlier: exception, element out of range
• The search is based on the same principles as clustering
• It concentrates on finding elements “far” from the others
• Search methods
– Statistical
• Can be used if a statistical distribution can be estimated
– Distance based
• Search for elements that maximize the distance from the other
elements of the set
– Deviation based
• Search for elements that maximize the deviation from the other
elements of the set.
• Example: fraud detection
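Two of these search methods can be sketched on one-dimensional data. Everything specific below is an assumption made for illustration: the function names, the z-score threshold of 2.0 for the statistical method, the use of mean absolute distance for the distance-based method, and the transaction amounts.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Statistical method: flag values more than `threshold` standard
    deviations from the mean (assumes a roughly normal distribution)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def most_distant(values):
    """Distance-based method: return the element whose mean distance
    to all the other elements is largest."""
    def mean_distance(i):
        others = [v for j, v in enumerate(values) if j != i]
        return sum(abs(values[i] - v) for v in others) / len(others)
    return values[max(range(len(values)), key=mean_distance)]


# Hypothetical transaction amounts with one suspicious value.
amounts = [20, 22, 19, 21, 23, 20, 250]
print(z_score_outliers(amounts))
print(most_distant(amounts))
```

Both functions single out the amount 250 in this sample; on real fraud data the threshold and the distance measure would need to be tuned to the problem.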
Slide 104
Challenges of Data Warehousing
and Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• Data Quality
Slide 105
Data Quality
• The quality of a process measures its adherence to users'
targets
• The following tables list some aspects of
“quality” (Wang-Wand (1999): quality dimensions)
Slide 106
Data Quality
Slide 107
Main Competitors in DW Systems
Vendor                          Global Revenue 2006 (Millions USD)
Microsoft Corporation           1,801
Hyperion Solutions Corporation  1,077
Cognos                          735
Business Objects                416
MicroStrategy                   416
SAP AG                          330
Cartesis SA                     210
Applix                          205
Infor                           199
Oracle Corporation              159
Others                          152
Total                           5,700
Slide 108