OLAP Multidimensional Database Concept

Info: 5365 words (21 pages) Dissertation
Published: 6th Dec 2019

Share this: Facebook Twitter Reddit LinkedIn WhatsApp

CHAPTER 2 LITERATURE REVIEW

2.1 INTRODUCTION

This chapter is designed to provide background information and reviewing the characteristics of data warehouse, OLAP multidimensional database Concept, data mining model and the application of data mining. Within this research, the concept, design and implementation approaches in developing a complete data warehouse technology framework for deploying a successful model with the integration of OLAP Multidimensional Database and data mining model.

Section 2.2 discussed about the fundamental of data warehouse, data warehouse model and also the Extract, Transform and Loading (ETL) of raw database to data warehouse. It includes research and study on existing data warehouse models authored by William Inmon, Ralph Kimball and various scholars venturing into data warehouse models. Section 2.3 introduces background information of OLAP. It includes the studies and research on various OLAP models, OLAP architecture and concept on processing multidimensional databases, multidimensional database schemas design and implementation in this research. It includes studies and research on schema design and implementation. Section 2.4 introduces fundamental information of data mining. It includes studies and research on the available techniques, method and process for OLAP Data Mining. Section 2.5 discussed the product comparisons for data warehouse, data mining and OLAP by Mitch Kramer. It includes the reason why Microsoft is used to design and implement the new proposed model.

In this literature review, introduction to the relationships between data warehouse, OLAP multidimensional database and data mining model for deploying four experimental applications for benchmarking. This research also proves that the “new proposed model” data warehouse technology framework is ready to transform any type of raw data into useful information. It will also help us to review the new proposed model of each existing data warehouse & OLAP Multidimensional database framework.

2.2 DATA WAREHOUSE

According to William Inmon (1999), known as the “Father of Data Warehousing”, data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the management’s decision-making process. Data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit.

Data Warehouse is a type of database system aimed at effective integration of operational databases into an environment that enables strategic use of data (G. Zhou et al., 1995). These technologies include relational and multidimensional database management systems, client/server architecture, meta-data modelling and repositories, graphical user interface and much more (J. Hammer et al., 1995; V. Harinarayan et al., 1996).

Data warehouse currently are much a subject of researched is not only commonly used in business or finance sector but can be applied appropriately in various sectors. Data warehouse are designed for analyzing or processing of data into useful information using data mining tools for critical decision-making. Data warehouse provides access to difficult environments of an enterprise data

In these literature studies, two important authors are identified as the main contributors and co-founder in the area of Data Warehouse, William Inmon (1999; 2005) and Ralph Kimball (1996, 2000). Both author perceptions on data warehouse design and architecture differ from one another. According to Inmon (1996), data warehouse is a dependent data mart structure, whereas Kimball (1999) defined data warehouse as a bus structure which is a combination of data mart populated together as a data warehouse. Table 2.1 discussed the differences in data warehouse ideology between William Inmon and Ralph Kimball.

Table 2.1 William Inmon and Ralph Kimball Data Warehouse Differences

	William Inmon	Ralph Kimball
Paradigm	Inmon’s Paradigm: An enterprise has one data warehouse, and data marts source their information from the data warehouse. Information is stored in 3rd normal form.	Kimball’s Paradigm: Data warehouse is the collection of heterogeneous data marts within the enterprise. Information is always stored in the dimensional model.
Architecture	Architecture: Using TOP-DOWN approach	Architecture: Using Bottom-up approach
Concept	Data’s integration from various systems to centralized repository	Concept of dimensional modelling (Bridging between Relational and multidimensional DB)
Design	The design pattern dependent on 3rd normalization form, purpose is for data granularity.	Data’s marts are connected in a bus structure. Data’s marts are the “union” of data warehouse. This approach is known also as Virtual Data Warehouse.
ETL Methods	Data’s extraction from operational data sources. Data are feed in staging database area. Data are then transformed, integrate, and consolidate and transfer to Operational Data Store database. Data are then load to data mart.	Data extracted from legacy system and then consolidated and verified in staging database. Data feed into ODS and more data us added/updated. Operational Data Store contains fresh copy data that is integrated and transformed to the data mart structure.
Data mart	Data Marts are available as a subset of the data warehouse.	Data Marts can be placed at different at different servers or in geographical locations.

Based on this Data Warehouse literature, both Inmon (2005) and Kimball (2000) have different philosophies, but they do have similar agreement on a successful design and implementation of data warehouse and data marts are mainly depending on the effective collection of operational data and validation of data mart. Both approaches having the same database staging concepts and ETL process of data from a database source. They also have a common understanding that independent data marts or data warehouses cannot fulfil the requirements of end users on an enterprise level for precise, timed and relevant data.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture is a wide research area. It has many different sub-areas and it can be treated with different approaches in terms or analysis, design and implementation by different enterprise. In this research studies, the aim is to provide a complete view on data warehouse architecture. Two important scholars Thilini (2005) and Eckerson (2003) from TDWI will discussed in more details on the topic on data warehouse architecture.

According to Eckerson (2003), before implementing a successful business intelligence systems where users can use programs like specialized reporting tools, OLAP tools and data mining tools upfront, a data warehouse architecture model mainly concentrate on the database staging process from different integrated OLTP systems is responsible for the ETL to the whole process workable. Thilini (2005) conducted a two phase study survey on investigating which factors may influence the selection of data warehouse architecture. In Thilini literature study, there are five data warehouse architectures that are practice today as shown in Table 2.2.

Table 2.2 Data Warehouse Architectures (Adapted from Thilini, 2005)

Data Warehouse Architecture

Independent Data Marts

Independent data marts also known as localized and small sized data warehouses. It is mainly used by departments, divisions or regions of company to provide own operational databases. The data marts are different as the structures are different from different location with inconsistent database design which makes it difficult to analyze across the data marts. Thilini (2005) cited the work of Winsberg (1996) and Hoss (2002) that “It is common for organizational units to develop their own data marts”. Data marts are best used as a prototype for adhoc data warehouse and as for evaluation before building a real data warehouse.

Data Mart Bus Architecture

Kimball (1996) pioneered the designed and architecture of data warehouse with unions of data marts which are known as the bus architecture. Bus architecture Data Warehouse is derived from the unions of the data marts which are also known as Virtual Data Warehouse.

Bus architecture allows data marts not only located in one server but it can be also being located on different server. This allows the data warehouse to functions more as virtual reality mode and gathered all data marts and process as one data warehouse.

Hub-and-spoke architecture

Inmon (2005) developed Hub and Spoke architecture. The hub is the central server taking care of information exchange and the spoke handle data transformation for all regional operation data stores. Hub and Spoke mainly focused on building a scalable and maintainable infrastructure for data warehouse.

Centralized Data Warehouse Architecture

Central data warehouse architecture almost similar to hub-and-spoke architecture without the dependent data marts. This architecture copies and stores heterogeneous operational and external data to a single and consistent data warehouse. This architecture has only one data model which are consistent and complete from all data sources.

According to Inmon (1999) and Kimball (2000), central data warehouse should have Database staging or known as Operational Data Store as an intermediate stage for operational processing of data integration before transform into the data warehouse.

Federated Architecture

According to Hackney (2000), Federated Data Warehouse is a integration of multiple heterogeneous data marts, database staging or Operational data store, combination of analytical application and reporting systems. The concept of federated focus on framework of integration to make data warehouse as greatest as possible. Jindal (2004) conclude that federated data warehouse approach are a practical approach for a data warehouse architecture as it is focus on higher reliability and provide excellent value if it is “well defined, documented and integrated business rules”.

Thilini (2005) conclude that hub and spoke and centralized data warehouse architectures are similar and the survey scores are almost the same. Hub and spoke is faster and easier to implement because no data mart are required. For centralized data warehouse architecture scored higher than hub and spoke as for urgency needs for relatively fast implementation approach.

A data warehouse is a read-only data source where end-users are not allow to change the values or data elements. Inmon’s (1999) data warehouse architecture strategy are different from Kimball’s (1996). Inmon’s data warehouse model splits data marts as a copy and distributed as an interface between data warehouse and end users. Kimball’s views data warehouse as a unions of data marts. The data warehouse is the collections of data marts combine into one central repository. Diagram 2.1 illustrates the differences between Inmon’s and Kimball’s data warehouse architecture adapted from Mailvaganam, H. (2007)

Diagram 2.1 Inmon’s and Kimball’s Data Warehouse Architecture (adapted from Mailvaganam, 2007)

In this work, it is very important to identify which data warehouse architecture that is robust and scalable in terms of building and deploying enterprise wide systems. According to Laney (2000) and Watson, H. (2005), it is important to understand and select the appropriate data warehouse architecture and “the success of the various architectures” acclaimed by Watson. Analysis of this research proved that the most popular data warehouse architecture is hub-and-spoke proposed by Inmon as it is a centralized data warehouse with dependant data marts and second is the data mart bus architecture with dimensional data marts proposed by Kimball. The selection of the new proposed model will use the combination data warehouse architecture of hub-and-spoke and data mart bus architecture as the new proposed model data warehouse architecture are designed with centralized data warehouse and with data marts that can are used for multidimensional database modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

Data warehouse architecture begins with extract, transform, loading (ETL) process to ensure the data passes the quality threshold. According to Evin (2001), it is essential that right data are important and critical for the success on an enterprise. ETL are an important tool in data warehouse environment to ensure data in the data warehouse are cleansed from various systems and locations. ETLs are also responsible for running scheduled tasks that extract data from OLTPs. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, C. J et al., 2001). The complete process descriptions of ETL are discussed in table 2.3.

Table 2.3 Extract, Transform, and Load Process in Data Warehouse architecture

Process	Descriptions
Extract	Extract are the first process which involve in moving data from operational databases into database staging area or operational data store before populating into the data warehouse. In this stage, operational databases data need to be examined by extracting into the staging area for handling exceptions and fix all errors before it enters into data warehouse as this will save lots of time when loading into the data warehouse.
Transform	In completion of data extraction in database staging area, it is then transform to ensure data integrity within the data warehouse. Transformation of data can be done in several methods such as filed mapping and algorithm comparisons.
Load	After extract and transform of data, it is finally loaded into data warehouse (in Inmon’s model) or into data marts (in Kimball’s model). Data loaded into data warehouse are quality data after the process of extraction where erroneous data are removed and data are transform to ensure integrity of the data.

Calvanese, D. et al. (2001) highlight an enterprise data warehouse database tables may be populated with a wide variety of data sources from different locations and often including data providing information concerning a competitor business. Collecting all the different data and storing it in one central location is an extremely challenging task where ETL can make it possible. ETL process as depicts in Diagram 2.2 begins with data extract from operational databases where data cleansing and scrubbing are done, to ensure all data’s are validated. Then it is transformed to meet the data warehouse standards before it is loaded into data warehouse.

Diagram 2.2Extract, Transport, Load Process

G. Zhou et al.(1995) emphasise on data integration in data warehousing stress that ETL can assist in import and export of operational data between heterogeneous data sources using OLE-DB (Object linking and embedding database) based architecture where the data are transform to populate all quality data into data warehouse. This is important to ensure that there are no restrictions on the size of the data warehouse with this approach.

In Kimball (2000) data warehouse architecture model depict in Diagram 2.3, the model focus in two important modules, “the back room” “presentation server” and “the front room”. In the back room process, where the data staging services in charge of gathering all source systems operational databases to perform extraction of data from source systems from different file format from different systems and platforms. Second step is to run the transformation process to ensure all inconsistency are removed to ensure data integrity. Finally, it is loaded into data marts. The ETL processes are commonly executed from a job control via scheduling task. The presentation server is the data warehouse where data marts are stored and process here. Data stored in star schema consist of dimension and fact tables. This is where data are then process of in the front room where it is access by query services such as reporting tools, desktop tools, OLAP and data mining tools.

Diagram 2.3 Data Warehouse Architecture (adapted from Kimball, 2000)

Nicola, M (2000) explains the process of retrieving data from the warehouse can vary greatly depending on the desired results. There are many form of possible retrieval from a data warehouses and it is flexibility that will drive how this retrieving process can be implemented. There are many tools for retrieving the data warehouse, such as building simple query and reporting through SQL statements. The tools may expand to OLAP and data mining, where the structure includes many more third party tools. There are many inherent problems associated with data, which includes the limited amount of portability, and the often-vast amount of data that must be sifted through for each query.

Essentially, ETL are mandatory for data warehouse to ensure data integrity. There are many factors to be considered such as complexity and scalability are among the two major issues that most enterprise faces by integrating information from different sources in order to have a clean and reliable source of data for mission critical business decisions. One way to achieve a scalable, non-complex solution is to adopt a “hub-and-spoke” architecture for the ETL process. According to Evin (2001), ETL best operates in hub-and-spoke architecture because of its flexibility and efficiency. Because of its centralized data warehouse design, it can influence the maintaining full access control of ETL processes. Also, empowers the usage of analytical and data mining tools by knowledge workers.

In this study on ETL for effective data warehouse architecture, it is known that Hub-and-spoke is best for data integration as it has the similarity in Inmon and Kimball architecture. The hub is the data warehouse after processing data from operational database to staging database and the spoke(s) are the data marts for distributing data. Inmon and Kimball also recommend same ETL processes to enable hub-and-spoke architecture. Sherman, R (2005) state that hub-and-spoke approach uses one-to-many interfaces from Data warehouse to many data marts. One-to-many are simpler to implement, cost effective in a long run and ensure consistent dimensions. Compared to many-to-many approach it is more complicated and costly.

In this work on the new proposed model, hub-and-spoke architecture are use as “Central repository service”, as many scholars including Inmon, Kimball, Evin, Sherman and Nicola adopt to this data warehouse architecture. This approach allows locating the hub (data warehouse) and spokes (data marts) centrally and can be distributed across local or wide area network depending on business requirement. In designing the new proposed model, the hub-and-spoke architecture clearly identifies six important data warehouse components that a data warehouse should have, which includes ETL, Staging Database or operational database store, Data marts, multidimensional database, OLAP and data mining end users applications such as Data query, reporting, analysis, statistical tools. However, this process may differ from organization to organization. Depending on the ETL setup, some data warehouse may overwrite old data with new data and in some data warehouse may only maintain history and audit trial of all changes of the data. Diagram 2.4 depicts the concept of the new proposed model data warehouse architecture.

Diagram 2.4 New Proposed Model Data Warehouse Architecture

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is indeed challenging as data warehouse project inheriting a unique characteristic that may impact the overall setup if the analysis, design and implementation phase are not properly done. In this research effort, it discusses the studies on failure and success factors in data warehouse project. Section 2.2.3.1 focuses on the investigation on data warehouse project failure and section 2.2.3.2 discuss and investigate mainly on the success factors by implementing the correct model to support a successful data warehouse project implementation.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

Hayen, R.L. (2007) studies shows that implementing a data warehouse project is costly and risky as a data warehouse project can cost over $1 million in the first year. It is estimated that one-half ad two-thirds of the effort of setting up the data warehouse projects attempt will fail eventually. Hayen R.L. (2007) citied on the work of Briggs (2002) and noticed three factors for the failure of data warehouse project that is Environment, Project and Technical factors as shown in Diagram 2.5 and table 2.4 discussed the factors in more details.

Diagram 2.5 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Table 2.4 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Factors	Descriptions
Environment	This leads to organization changes in business, politics, mergers, takeovers and lack of top management support. Also, including human error, corporate culture, decision making and change management.
Technical	Technical factors of a data warehouse project complexity and workload are taken too lightly where high expenses involving in hardware/software and people. Problems occurred when assigning a Project manager with lack of knowledge and project experience in data warehouse costing may lead to impediment of quantifying the return on investment (ROI). Also, failure of managing a data warehouse projects also includes: · Challenge in setting up a competent operational and development team plus not having a data warehouse manager or expert that is politically sound. · Having an extended timeframe for development and delivery of data warehouse system may due to lack of experience and knowledge for selection of data warehouse products and end-user tools. * Failure to manage the scope of data warehouse project.
Project	Poor knowledge on the requirements of data definitions and data quality on different organization business departments. Also, Running a data warehouse projects with incompetent and insufficient knowledge in what technology to use may lead into problems later on data integration, data warehouse model and data warehouse applications.

Vassiliadis (2004) studies shows that data warehouse project failures are an enormous threat and threatened by factors such as design, technical, procedural and socio-technical as illustrated in Diagram 2.6. These factors of failures are vital in finding any unwanted action for success. Each factor group is described in table 2.5.

Diagram 2.6 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Table 2.5 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Factors	Descriptions
Design	Design factors in data warehouse project can put up with “No Standard” techniques or design methodologies. A data warehouse project when doing the analysis and design phase may accept ideas on metadata techniques or languages and data engineering techniques. Also, a proprietary solutions and also recommendations from vendors or in-house experts may define the design of the data warehouse blueprint landscape.
Technical	Technical factors associate to the lack of know-how experience in evaluation and choices of hardware setup for data warehouse systems
Procedural	Procedural factors concerning on the imperfection of data warehouse deployment. This factor focuses on training the end-users extensively on the new technology and the design of data warehouse which are completely different than the conventional IT solutions. users communities plays a vital role and are crucial in this factor.
Socio-Technical	Socio-technical factors in a data warehouse project may lead into problems on violation of organization modus operandi where the data warehouse systems will lead into restructuring or reorganization on the way organization operates by introducing changes to the user community.

According to Vassiliadis (2007) also, another potential factors for the failure of data warehouse projects are the “data ownership and access”. This is considered vulnerable within the organization and one mustn’t share nor acquire someone else data as this is comparable with losing authority on the data ownership and access. Also, restrict any departmental declaration or request to own a total ownership of pure clean and error free data as this might cause potential problem on ownership data rights.

Watson (2004) stress that the general factors for the failures in data warehouse project success comprises of “weak sponsorship” and top management support, inadequate funding and users participation and organizational politic.

2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

Data Warehouse Failures can lead into disastrous implementation if careful factors or measures are not taken into serious considerations as discussed in section 2.2.3.1 based on Briggs (2002) and Vassiliadis (2004) studies that may lead into data warehouse failures. According to Hwang M.I. (2007), data warehouse implementations are an important area of research and industrial practices but only few researches made an assessment in the critical success factors for data warehouse implementations. No doubt there is procedure for data warehouse design and implementation but only certain guidelines are subjected for experimental testing. So, it is best to decide and choose the proper data warehouse model for implementation success.

In this study on identifying and filling the gap analysis of the data warehouse success factors, a number of success factors are gathered from data warehouse scholars and professionals (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) to validate their experimental work and research strength individually on various characteristics of data warehouse success. This study beneficial in planning and implementing data warehouse projects and direct into the success of designing and implementing the new proposed model in this research.

There are several success factors in designing and implementing data warehouse solutions and the most important success factors depend on the data warehouse model selection, as different organization may have different scope and road maps in the development of data warehouse. The results of building a successful data warehouse are then used to quantify the factors that are used and also prioritize those factors that are beneficial for continued research purpose to improve and enhanced the data warehouse model success.

According to Hayen, R.L. (2007), data warehouse is a complex system which can complicate business procedures. The complexity of data warehouse prevents companies from changing data or transaction which are necessary. It is important then to analyze on which data warehouse model to be used for such complex systems that are sound critical to an organization. Hwang M.I. (2007) conducted a study on data warehousing model and success factors as a critical area of practice and research but only a few studies have been accomplish to measure the data warehouse projects and success factors.

Many scholars had conducted a profound research in the area of data warehouse and may have succeeded or failed due to possible reasons based on each scholars outcomes on the research. It is useful inspect a few case studies on a selected companied data warehouse implementation and to experiment the failure and success factors through survey. (Winter, 2001; Watson et al., 2004)

Hwang M. I. (2007) conducted a survey study on six data warehouse scholars (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors in a data warehouse project. Each scholar has different success factors that are measures in a project. Table 2.6 shows the mentioned six scholars survey study on data warehouse, Watson (1997) measures data warehouse success factors, Chen et al. (2000), Watson et al. (2001) and Shin (2003) measures data warehouse implementation factors and Hwang (2002) measures through development and management practices. Only Wixom (2001) as shown in diagram 2.7 measures both Data warehouse implementation and success factors which can be used as a model for a successful data warehouse implementation. Study shown in all 6 scholars’ review, without having data warehouse implementation and success factors, the consequences of any factors on a data warehouse success cannot be validated.

Table 2.6 Factors for Data Warehouse Success (adapted from Hwang M.I., 2007)

Study	Data Warehouse Success Factors	Data Warehouse Implementation Factors	Results Reported
Watson & Haley (1997)	Focus on user involvement and support by having a clear and understandable business needs. Using methodology and modelling methods in data warehouse by targeting on “clean data”. Thus, support from upper management to contribute on the success.	N/A	Ordered list of success
Chen et al. (2000)	N/A	Focused on exactness and preciseness of User satisfaction by Support and realization of end users needs.	Support for end users affects user satisfaction
Wixom & Watson (2001)	Implementation factors include management support, resources, User participation, team skills, Source systems aand development technology which contribute to the implementatio