Data warehousing
The hype about data warehousing
Data warehouse trade materials talk about using a data
warehouse to:
Convert data into
business intelligence
Make management
decision making based on facts, not intuition
Get closer to the
customers
Gain a
competitive advantage
According to one source, In
probably 99% of the data warehousing implementations, data warehousing is only one
step out of many in the long road toward the ultimate goal of accomplishing
these objectives.
Basic reasons organizations implement data
warehouses
To perform server/disk bound tasks associated with querying
and reporting on servers not used by transaction processing systems
To use data models that speed up querying and
reporting and that are not appropriate for transaction processing
To provide an environment where a relatively small
amount of knowledge of db is required to write and maintain queries and
reports
To provide a repository of cleaned up data
from traditional system
To make it easier to query and report data from multiple
transaction processing systems
To provide a repository of data that contains data
from a longer span of time than can be efficiently held in a transaction
processing system.
To prevent persons who only need to query and report
data for business decisions from having any access whatsoever to the
transaction processing system database.
Most of these arise from the fact that the goal of
transaction processing is different than the goal of data warehousing
Operational data and business analysis
The primary concept of data warehousing is that
the data stored for business analysis can most
effectively be accesses by separating it from
the data in the operational systems for
transaction processing
The most important reason for separating them has
always been the potential performance degradation on the operational
system that can result from the analysis process
High performance
and quick response time is almost universally critical for operational systems
Warehousing vs. traditional db
Traditional databases that store the operational data
They are
transaction-oriented
Often
support OLTP (on-line transaction processing)
They
are optimized to process queries from a small part of the database
They balance the
requirements of data access with the need to ensure integrity of data
Warehousing is used to store data for business
analysis
It provide
greater storage, more functionality and faster response time than traditional
databases
Often only
read-access is required
The data often
comes from multiple databases, stored on different platforms
Many of the
analyses needed are recurrent and predictable
Reports to management are often of prime importance
Data warehousing
There is no one definition of data warehousing
It has been
developed by multiple organizations to meet particular needs
One good
definition is: A data warehouse is a
copy of transaction data specifically structured for querying and reporting.
Data warehouses are quite distinct from traditional
database in their structure, functioning , performance
and purpose
They provide data for complex analysis, knowledge
discovery and decision making
Data warehouses
They support
high-performance demands on an organizations data and information
They generally contain very large amounts of data from
multiple sources
This may include databases from different data models
and sometimes files acquired from independent systems and platforms.
Several types of
applications are used with data warehouses
OLAP
(on-line analytical processing)
The
analysis of complex data from the data warehouse
They
use distributed computing capabilities for analyses that require vast amounts
of storage and processing power
DSS (decision
support systems)
Support
an organizations leading decision makers with higher-level data for complex
and important decisions
Data mining sometimes
call knowledge discovery
Characteristics of a data warehouse
It is frequently
a store of integrated data from multiple sources
It is processed for storage in a multidimensional model
It usually
supports time-series and trend analysis
This requires
more historical data than
a traditional database
It is nonvolatile (mostly)
The data in the
warehouse changes rarely
It is much more course grained than traditional
db
Often the data must be cleaned and reformatted
before it is warehoused
Data modeling for data warehouses
Data cubes are multidimensional models that take advantage of
relationships in data
They may be
called hypercubes if they have more than three
dimensions.
An example: a
spreadsheet is a 2D matrix;
suppose it holds data of regional sales by product for a
particular time period
If we add a third
dimension, time, it would produce a 3D matrix
The
time increments could be fiscal quarters
It
could also be summary data, for a week, a month, a quarter, a yerar, ect.
The 3D matrix is
represented using a data cube.
Using a data cube
The data can be queried directly in any combination of
dimension.
Tools exist for viewing data according to the users
choice of dimensions
Pivoting (or
rotation) is the technique used to changed from one dimensional orientation to
another
The data can be thought of as rotating to show a
different orientation of the axes
Multidimensional models
Data cubes (or higher dimensions) are especially
useful for hierarchical views of data
Roll-up
display moves up the hierarchy, grouping into larger units along a dimension
The hierarchy may be time, so first data may be displayed weekly,
then monthly, the quarterly, then yearly
The hierarchy may be project categories, moving from
individual products to courser grain of product categories
Drill down moves down the hierarchy, displaying from course grained to
progressively finer grained.
Building a data warehouse
Acquiring data for the warehouse includes these steps:
It must be extracted from multiple,
heterogeneous sources
All the data an organization keeps may not be suitable
for warehousing
It must be formatted for consistency
It must be cleaned to ensure validity
It must be fitted into the data model of the
warehouse
It must be loaded into the warehouse
Questions about updating the data must be
resolved
Cleaning the data
There is nothing about cleaning the data that is
specific to data warehouses
It could be applied to a host database.
Since the data must be examined and formatted for the
warehouse anyway, usually it is also checked for validity and quality.
This is the most
labor intensive part
of warehouse construction
It is difficult to automate recognizing erroneous and
incomplete data, but some can be done
After the data is cleaned, the manager
of the host db often want to update their own db with the cleaned
data.
This is called backflushing.
Loading and updating the data
The tremendous amount of data involved makes loading a
significant task
Tools to monitor the load and methods to recover
from incomplete loads are necessary
Incremental updates is usually the only feasible way
to keep the data up-to-date
Many questions must be answered about how to do
this.
It usually involves comprises between keeping the data
up-to-date, and taking time to do the updates
Updating the data in the warehouse
Management and the warehouse administrator must answer
these questions:
Can the warehouse go off-line, and for how long?
What are the data interdependencies?
What is the storage availability?
What are the distribution requirements?
Such as for replication and partitioning
What is the loading time?
This
includes cleaning, formatting, copying transmitting and overhead such as index
building
Storing the data in the warehouse
A data warehouse is typically optimized for access
from a decision makers need.
The data must be stored according to the data model of
the warehouse.
This means the
data structures must be created and maintained
Access paths must
also be created and maintained
Access
paths usually include indexes
Provisions must be made for updating the data
The sheer volume
of data in the warehouse usually makes it impossible to simply reload the
warehouse in its entirety later on
Summary views of the data
Many of the queries and reports needed from a
warehouse depend on summaries of the data.
It may not be practical to summarize the needed data
every time it is needed.
A summary view in a data warehouse refers to an actual
table that is physically stored
Difficulties with data warehousing
The building of a
warehouse in a large organization is a major undertaking
It may take years, from the time it is decided upon
until it is implemented
The time it takes to capture the data, clean it up,
and put it in a format that is useful is too much of a cost to bear
The
administration of a data warehouse requires far broader shills than are needed
for traditional database administration..
Data warehousing
can have a learning curve that may be too long for impatient firms.
Data warehousing
can become an exercise in data for the sake of the data.
There is data to
suggest that the data warehousing failure rates are between 10% and 90%
There is no denying that data warehousing is risky
Summary of warehousing functionality
It consolidates the data.
It cleans the data.
It allows efficient aggregation
It stores summary tables
It lets us view the information along multiple
dimensions
Research topics in warehousing
Automating cleaning the data for warehousing
Indexing/access paths for multi dimensional models
Automating the reformatting of data
Application of active database functionality (e.g.
triggers) into the warehouse
Incorporation of domain and business rules into the
warehouse creation and maintenance processes.