web hit counter DCI - Ken Rudin: What's New in Data Warehousing
DCI Logo DCI Header Logo

DCI Home
Event Info
Sign-Up
Exhibitors
I.T. News
Press Room
Find It
Help
 
DCI's Publication Date: January 20, 1997

What's New in Data Warehousing

By Ken Rudin
Managing Director, Emergent Corporation

[Please note: This article has been divided into two sections to assist you in downloading it more quickly. A link to the second half of the article appears at the end of this segment.]

Despite the fact that data warehouses are immensely popular, the truth of the matter is that in a perfect world, data warehouses wouldn't exist. We wouldn't need them. In such a world, all our operational systems would have consistent data. We wouldn't have a point-of-sale system storing data as "Male" and "Female," while the inventory system stores that information as "M" and "F." Nor would we have our General Ledger application storing "26-Nov-1995" while our American inventory system stores "11/26/95" and its European counterpart stores "26/11/95." And we wouldn't be bothered with such annoyances as having "delivery date" mean the promised delivery date in our order entry system, while it means actual delivery date in our package tracking application.

Additionally, our operational systems would all have enough performance capacity to enable users to perform large, ad-hoc queries directly against the data without destroying the performance for the simultaneously occurring OLTP transactions. In a perfect world, this would be possible because all systems would be infinitely fast. And cheap, too.

Those of you waiting for such a world will be waiting a long, long time. Data warehouses exist because none of the above is true. In the real world, data warehouses solve a very real problem. Because they solve a very real problem, they are becoming immensely popular as the basis for sophisticated Decision Support and Strategic Analysis applications in all types of companies, across all industries. And, as their popularity grows exponentially, the market's characteristics change rapidly.

Countless words have been written describing the specific data warehouse products that exist on the market, comparing the features of each. My goal for this article is different. Rather than focus on individual products, it is useful to take a snapshot of the market to discuss where we are now, highlight important issues, and identify trends to define where the data warehouse market will likely be going.

State of the Union

Lies, Damn Lies, and Statistics

The first step is to understand where we are today. Probably the best way to determine the current status of a market is to go talk to that market. For the data warehousing market, survey after survey of end-users all tell the same tale. Data warehousing is big. And growing. According to one recent survey, META Group found that 95 percent of companies surveyed intend to build a data warehouse. Granted, many of these companies have unique ideas about what constitutes a data warehouse, but what's most interesting is that in 1994, that number was only 15 percent.

Various estimates put the size of the data warehouse market at $2 billion (which includes all hardware, database software, and access tools). This number will grow to $8 billion by 1998, which translates into 60 percent average annual growth rate. Additionally, this $8 billion will be complemented by another $5 billion in integration services.

Not only is the market growing, but individual data warehouse applications are also growing. Again drawing on the recent META survey, 60 percent of those surveyed expected their data warehouses to grow beyond 50 GB by mid-1996, and at the same time they will need to support in excess of 50 users.

Still, those pondering building a data warehouse should exercise a healthy dose of caution -- less than 15 percent are currently in full production. There are many reasons for this. One reason is simply that the exponential growth rate of new data warehouse projects implies that the majority of these projects are only recently started. However, another set of reasons is a little more troubling. Corporations are learning that large data warehouse projects cost more dollars and take more time than originally thought. On average, corporations spend $2 million in hardware and $1 million in software and services, and the project takes 12 to 18 months.

Hot Industries

Although data warehousing is a highly horizontal application relevant to all industries, certain industries are farther along in their deployment of these applications than are others. For example, the retail industry is completely overrun with data warehouse applications. Because it's such a nice fit for that industry, multidimensional modeling and the start schema have become the design of choice for these retail systems. This design has one large main "fact" table which holds information on such things as product shipments and product revenue, and includes foreign key indexes into smaller "dimension" tables that hold information on the various products, sales regions, sales people, etc.

Leveraging the successes of the retail industry, other "data rich" industries such as telecommunications, transportation and finance dove head first into data warehousing. Most recently, health care has also joined the pack. These industries have relied heavily on the design techniques that became prominent in retail, most notably the use of the star schema. In fact, many organizations across all industries are under the impression that data warehousing means that a star schema must be involved. But, this is a mistaken impression, and it confuses the conceptual notion of a data warehouse with a particular physical implementation. Some organizations are discovering that dimensional modeling and the star schema don't suit their particular needs very well. These organizations are now experimenting with going back to the more traditional relational design techniques such as logical data modeling.

This shouldn't come as a surprise to anyone. The pundits who wave the star schema banner certainly have provided the world with a valuable approach, but it is not nirvana for everyone. Different organizations have different structures. It would be remarkable if a single schema architecture were optimal for everyone, but it's not possible. Don't be afraid to be a maverick and explore alternatives to the star schema if you find your organization working very hard to force-fit your business into a dimensional model.

Hardware and Software Vendor: Focus Is on Bundled Services

Over the past few years, vendors certainly can't be criticized as having ignored the data warehousing market. The prodigious efforts of their product developers have been outshined only by the efforts of their marketing departments. There has been, and will continue to be, a frenzy of new product releases and acquisitions as each vendor tries to provide the best solution for data warehousing.

When the concept of data warehousing was becoming popular, the early adopters were those companies that had large amounts of resources to expend on having the latest technological solutions. However, we're well past this early adopter phase, and the majority of companies don't have the previously stated $3 million to spend, nor can they afford to wait 12 to18 months.

To address the needs of their new user base, vendors over the past year expanded the focus of their data warehouse offerings. Rather than simply flooding the already confused market with more products, vendors took a step back and looked at what has become possibly the most critical component to selling their solutions: services. Now they all provide services which can help users build small systems on a small budget, but which still provide real value in a short time frame, thereby allowing the users to ask management for further resources to expand the project.

These "Quick Start" (or Fast Start, or Smart Start, or...) programs usually bundle hardware, software and services. They are typically low-risk, fixed-price, fixed-time projects that last anywhere from 15 to 90 days, and anyone contemplating building a data warehouse should take advantage of these programs.

Scalable Data Warehouses -- the Coming Wave

But what good is it to build a small proof-of-concept system if you have to throw it away when you really want to build your full production warehouse? Sure, now you have at least demonstrated the potential value, but still no one wants to try to convince management that they'll now have to wait many months while you throw away the prototype and start work on the real data warehouse. Wouldn't it be better to take what you've built, and scale it up as your needs grow and as resources (slowly) become available?

Even if you decide to start with a relatively large data warehouse (in the 100 GB range), you are still by no means immune to the need to scale up your warehouse. Any self-respecting data warehouse grows in multiple directions simultaneously. When it comes to data warehouses, data begets data, and usage begets usage. As end-users begin to see the value of the system, they realize how adding incremental amounts of data can increase the usefulness of the system. Similarly, word-of-mouth spreads about the warehouse, and the number of users who want access to the system can soar. Because of this, a conservative rule of thumb is that the amount of data in your warehouse will double in a year. Similarly, count on the number of users growing by an order of magnitude within two years, during which time the overall CPU cycles required to process all the queries will grow at least 100 times. A system that can handle this growth with be viable for a longer period of time, thereby increasing your critically important ROI.

To achieve this goal of building a scalable data warehouse, corporations are rapidly turning to scalable/parallel hardware and database platforms as the new foundation of choice. Conventional systems, though they can sometimes handle the large amounts of data, usually cannot meet these scalability requirements, because their processing power is limited to a single CPU, and their I/O capabilities have fixed limits. However, scalable/parallel systems, which includes both hardware and database software components, address these issues directly by allowing users to add multiple processors, additional disks, and additional I/O bandwidth to the system.

Scalable hardware platforms fall into three categories. "Symmetric Multiprocessors" (SMPs) have multiple processors (usually up to a few dozen) all sharing the same physical memory and the same memory bus. However, after a few dozen processors, this single memory bus becomes a bottleneck, so adding additional processors will not increase performance. "Clusters" were designed to address this problem. Once an SMP has reached its limits of scalability, the concept of clustering allows users to connect multiple SMP machines (referred to as "nodes") together with a very fast interconnect which allows the machines to work together, as if they were a single larger machine. However, after clustering a few machines together, this single interconnect eventually also becomes a bottleneck. "Massively Parallel Processors" (MPPs) solve this problem by introducing the concept of a scalable interconnect, meaning that the bandwidth of this interconnect grows each time an additional node is added to the system.

Scalable database software is the second critical component of a scalable data warehouse. The database must be able to take advantage of the multiple processors and multiple nodes within a scalable system. Fortunately, there's no lack of parallel databases. All the major vendors (for example, Oracle, Informix, IBM, Sybase, Teradata and Tandem) have been developing and selling versions of their databases that leverage the underlying scalable hardware.

Once a corporation has decided to use scalable technologies for their warehouse, a question arises: Which hardware and software platforms are right for their needs? Regarding hardware, where possible it is usually desirable to use the most mature technology to solve a problem, since it will be the most stable. However, the more mature hardware technologies are the least scalable. So, the general guideline is to find the most mature technology that will still meet your scalability needs. For example, if you need very little scalability, then maybe you don't need scalable hardware or software at all, and you can build your solution using conventional, very mature mainframe technology. However, few data warehouses are static, so if you will need vast amounts of processing power and scalability, then first look at SMPs. If they meet your needs, then use that platform. If not, look at clusters. If even clusters can't meet your needs, then look at MPPs, which give you maximum performance and scalability, but which are the least mature of the technologies.

Choosing a scalable parallel database requires a different approach. Don't spend all your energy analyzing the architectures of the various vendor databases (i.e. "shared nothing" vs. "shared disk" database architectures). Currently, it is not the architecture that is the most important. Rather, it is the actual implementation that's more important. That is, did the vendor's developers do a good job at building the RDBMS features that your application requires? The only way to discover this is to test your own application. It is highly advisable to build prototypes and to create and run benchmarks as a proof-of-concept for various facets of your data warehouse application. (Often, the vendor's "Quick Start" programs will help you do exactly that.) This process takes some time and money, but will ultimately save you from making painful mistakes.

Next page: Getting Data Out of Your Data Warehouse; Data Marts and Operational Data Warehouses; and Wrapping It All Up.

Ken Rudin is a featured speaker at The DCI Data Warehouse Conference and DCI's Data Warehouse World. He can be reached at krudin@emergent.com.

 
  [home] [event info] [sign up] [exhibit now] [i.t. news] [press room] [find it] [help]

© Copyright 1997 by Digital Consulting, Inc. (508) 470-3880
All event names are trademarks of DCI or its clients.
Comments?
webmaster@dciexpo.com












GPS - Global Positioning System
Free VoIP Calls
Spyware Removal