Data warehousing

 

The hype about data warehousing

•      Data warehouse trade materials talk about using a data warehouse to:

–    Convert data into business intelligence

–    Make management decision making based on facts, not intuition

–    Get closer to the customers

–    Gain a competitive advantage

•      According to one source, “ In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these objectives.”

 

Basic reasons organizations implement data warehouses

–   To perform server/disk bound tasks associated with querying and reporting on servers not used by transaction processing systems

–   To use data models that speed up querying and reporting and that are not appropriate for transaction processing

–   To provide an environment where a relatively small amount of knowledge of db is required to write and maintain queries and reports

–   To provide a repository of “cleaned up” data from traditional system

–   To make it easier to query and report data from multiple transaction processing systems

–   To provide a repository of data that contains data from a longer span of time than can be efficiently held in a transaction processing system.

–   To prevent persons who only need to query and report data for business decisions from having any access whatsoever to the transaction processing system database.

•      Most of these arise from the fact that the goal of transaction processing is different than the goal of data warehousing

 

Operational data and business analysis

•      The primary concept of data warehousing is that

–    the data stored for business analysis can most effectively be accesses by separating it from

–    the data in the operational systems for transaction processing

•      The most important reason for separating them has always been the potential performance degradation on the operational system that can result from the analysis process

–    High performance and quick response time is almost universally critical for operational systems

 

Warehousing vs. traditional db

•      Traditional databases that store the operational data

–    They are transaction-oriented

•    Often support OLTP (on-line transaction processing)

•    They are optimized to process queries from a small part of the database

–    They balance the requirements of data access with the need to ensure integrity of data

•      Warehousing is used to store data for business analysis

–    It provide greater storage, more functionality and faster response time than traditional databases

–    Often only read-access is required

–    The data often comes from multiple databases, stored on different platforms

–    Many of the analyses needed are recurrent and predictable

–    Reports to management are often of prime importance

 

Data warehousing

•      There is no one definition of data warehousing

–    It has been developed by multiple organizations to meet particular needs

•      One good definition is:  A data warehouse is a copy of transaction data specifically structured for querying and reporting.

•      Data warehouses are quite distinct from traditional database in their structure, functioning , performance and purpose

–    They provide data for complex analysis, knowledge discovery and decision making

 

Data warehouses

•      They support high-performance demands on an organization’s data and information

–    They generally contain very large amounts of data from multiple sources

–    This may include  databases from different data models and sometimes files acquired from independent systems and platforms.

•      Several types of applications are used with data warehouses

–    OLAP (on-line analytical processing)

•    The analysis of complex data from the data warehouse

•    They use distributed computing capabilities for analyses that require vast amounts of storage and processing power

–    DSS (decision support systems)

•    Support an organization’s leading decision makers with higher-level data for complex and important decisions

–    Data mining sometimes call knowledge discovery

 

Characteristics of  a data warehouse

•      It is frequently a store of integrated data from multiple sources

•      It is processed for storage in a multidimensional model

•      It usually supports time-series and trend analysis

–    This requires more historical data than  a traditional database

•      It is nonvolatile (mostly)

–    The data in the warehouse changes rarely

•      It is much more course grained than traditional db

•      Often the data must be “cleaned” and reformatted before it is warehoused

•             

Data modeling for data warehouses

•      Data  cubes are multidimensional models that take advantage of relationships in data

–    They may be called hypercubes if they have more than three dimensions.

•      An example:  a spreadsheet is a 2D matrix;

–    suppose it holds data of regional sales by product for a particular time period

–    If we add a third dimension, time, it would produce a 3D matrix

•    The time increments could be fiscal quarters

•    It could also be summary data, for a week, a month, a quarter, a yerar, ect.

–    The 3D matrix is represented using a data cube.

 

Using a data cube

–   The data can be queried directly in any combination of dimension.

–   Tools exist for viewing data according to the user’s choice of dimensions

•      Pivoting (or rotation) is the technique used to changed from one dimensional orientation to another

–   The data can be thought of as rotating to show a different orientation of the axes

 

Multidimensional models

•      Data cubes (or higher dimensions) are especially useful for hierarchical views of data

–   Roll-up display moves up the hierarchy, grouping into larger units along a dimension

•   The hierarchy may be time, so first data may be  displayed weekly, then monthly, the quarterly, then yearly

•   The hierarchy may be project categories, moving from individual products to courser grain of product categories

–   Drill down  moves down the hierarchy, displaying from course grained to progressively finer grained.

 

Building a data warehouse

•      Acquiring data for the warehouse includes these steps:

–   It must be extracted from multiple, heterogeneous sources

•   All the data an organization keeps may not be suitable for warehousing

–   It must be formatted for consistency

–   It must be cleaned to ensure validity

–   It must be fitted into the data model of the warehouse

–   It must be loaded into the warehouse

–   Questions about updating the data must be resolved

 

Cleaning the data

•      There is nothing about cleaning the data that is specific to data warehouses

•      It could be applied to a host database.

•      Since the data must be examined and formatted for the warehouse anyway, usually it is also checked for validity and quality.

–    This is the most labor intensive  part of warehouse construction

•      It is difficult to automate recognizing erroneous and incomplete data, but some can be done

•      After the data is cleaned, the manager of the host db often want to update their own db with the cleaned data. 

–    This is called backflushing.

 

Loading and updating the data

•      The tremendous amount of data involved makes loading a significant task

–   Tools to monitor the load and methods to recover from incomplete loads are necessary

•      Incremental updates is usually the only feasible way to keep the data up-to-date

–   Many questions must be answered about how to do this. 

–   It usually involves comprises between keeping the data up-to-date, and taking time to do the updates

 

Updating the data in the warehouse

•      Management and the warehouse administrator must answer these questions:

–   Can the warehouse go off-line, and for how long?

–   What are the data interdependencies?

–   What is the storage availability?

–   What are the distribution requirements?

•   Such as for replication and partitioning

–   What is the loading time?

•   This includes cleaning, formatting, copying transmitting and overhead such as index building

 

Storing the data in the warehouse

•      A data warehouse is typically optimized for access from a decision maker’s need.

•      The data must be stored according to the data model of the warehouse.

–    This means the data structures must be created and maintained

–    Access paths must also be created and maintained

•    Access paths usually include indexes

•      Provisions must be made for updating the data

–    The sheer volume of data in the warehouse usually makes it impossible to simply reload the warehouse in its entirety later on

 

Summary views of the data

•      Many of the queries and reports needed from a warehouse depend on summaries of the data.

•      It may not be practical to summarize the needed data every time it is needed.

•      A summary view in a data warehouse refers to an actual table that is physically stored

 

Difficulties with data warehousing

•      The building of a warehouse in a large organization is a major undertaking

–    It may take years, from the time it is decided upon until it is implemented

–    The time it takes to capture the data, clean it up, and put it in a format that is useful is too much of a cost to bear

•      The administration of a data warehouse requires far broader shills than are needed for traditional database administration..

•      Data warehousing can have a learning curve that may be too long for impatient firms.

•      Data warehousing can become an exercise in data for the sake of the data.

•      There is data to suggest that the data warehousing failure rates are between 10% and 90%

–    There is no denying that data warehousing is risky

 

Summary of warehousing functionality

•      It consolidates the data.

•      It cleans the data.

•      It allows efficient aggregation

•      It stores summary tables

•      It lets us view the information along multiple dimensions

 

Research topics in warehousing

•      Automating cleaning the data for warehousing

•      Indexing/access paths for multi dimensional models

•      Automating the reformatting of data

•      Application of active database functionality (e.g. triggers) into the warehouse

•      Incorporation of domain and business rules into the warehouse creation and maintenance processes.