| |
DCI's
Publication Date: January 20, 1997
What's New in Data
Warehousing
By Ken
Rudin
Managing Director, Emergent Corporation
[Please
note: This article has been divided into
two sections to assist you in downloading it more
quickly. A link to the second half of the article
appears at the end of this segment.]
Despite the fact
that data warehouses are immensely popular, the
truth of the matter is that in a perfect world,
data warehouses wouldn't exist. We wouldn't need
them. In such a world, all our operational
systems would have consistent data. We wouldn't
have a point-of-sale system storing data as
"Male" and "Female," while
the inventory system stores that information as
"M" and "F." Nor would we
have our General Ledger application storing
"26-Nov-1995" while our American
inventory system stores "11/26/95" and
its European counterpart stores
"26/11/95." And we wouldn't be bothered
with such annoyances as having "delivery
date" mean the promised delivery date in our
order entry system, while it means actual
delivery date in our package tracking
application.
Additionally, our
operational systems would all have enough
performance capacity to enable users to perform
large, ad-hoc queries directly against the data
without destroying the performance for the
simultaneously occurring OLTP transactions. In a
perfect world, this would be possible because all
systems would be infinitely fast. And cheap, too.
Those of you
waiting for such a world will be waiting a long,
long time. Data warehouses exist because none of
the above is true. In the real world, data
warehouses solve a very real problem. Because
they solve a very real problem, they are becoming
immensely popular as the basis for sophisticated
Decision Support and Strategic Analysis
applications in all types of companies, across
all industries. And, as their popularity grows
exponentially, the market's characteristics
change rapidly.
Countless words
have been written describing the specific data
warehouse products that exist on the market,
comparing the features of each. My goal for this
article is different. Rather than focus on
individual products, it is useful to take a
snapshot of the market to discuss where we are
now, highlight important issues, and identify
trends to define where the data warehouse market
will likely be going.
State of the
Union
Lies, Damn
Lies, and Statistics
The first step is
to understand where we are today. Probably the
best way to determine the current status of a
market is to go talk to that market. For the data
warehousing market, survey after survey of
end-users all tell the same tale. Data
warehousing is big. And growing. According to one
recent survey, META Group found that 95 percent
of companies surveyed intend to build a data
warehouse. Granted, many of these companies have
unique ideas about what constitutes a data
warehouse, but what's most interesting is that in
1994, that number was only 15 percent.
Various estimates
put the size of the data warehouse market at $2
billion (which includes all hardware, database
software, and access tools). This number will
grow to $8 billion by 1998, which translates into
60 percent average annual growth rate.
Additionally, this $8 billion will be
complemented by another $5 billion in integration
services.
Not only is the
market growing, but individual data warehouse
applications are also growing. Again drawing on
the recent META survey, 60 percent of those
surveyed expected their data warehouses to grow
beyond 50 GB by mid-1996, and at the same time
they will need to support in excess of 50 users.
Still, those
pondering building a data warehouse should
exercise a healthy dose of caution -- less than
15 percent are currently in full production.
There are many reasons for this. One reason is
simply that the exponential growth rate of new
data warehouse projects implies that the majority
of these projects are only recently started.
However, another set of reasons is a little more
troubling. Corporations are learning that large
data warehouse projects cost more dollars and
take more time than originally thought. On
average, corporations spend $2 million in
hardware and $1 million in software and services,
and the project takes 12 to 18 months.
Hot Industries
Although data
warehousing is a highly horizontal application
relevant to all industries, certain industries
are farther along in their deployment of these
applications than are others. For example, the
retail industry is completely overrun with data
warehouse applications. Because it's such a nice
fit for that industry, multidimensional modeling
and the start schema have become the design of
choice for these retail systems. This design has
one large main "fact" table which holds
information on such things as product shipments
and product revenue, and includes foreign key
indexes into smaller "dimension" tables
that hold information on the various products,
sales regions, sales people, etc.
Leveraging the
successes of the retail industry, other
"data rich" industries such as
telecommunications, transportation and finance
dove head first into data warehousing. Most
recently, health care has also joined the pack.
These industries have relied heavily on the
design techniques that became prominent in
retail, most notably the use of the star schema.
In fact, many organizations across all industries
are under the impression that data warehousing
means that a star schema must be involved. But,
this is a mistaken impression, and it confuses
the conceptual notion of a data warehouse with a
particular physical implementation. Some
organizations are discovering that dimensional
modeling and the star schema don't suit their
particular needs very well. These organizations
are now experimenting with going back to the more
traditional relational design techniques such as
logical data modeling.
This shouldn't
come as a surprise to anyone. The pundits who
wave the star schema banner certainly have
provided the world with a valuable approach, but
it is not nirvana for everyone. Different
organizations have different structures. It would
be remarkable if a single schema architecture
were optimal for everyone, but it's not possible.
Don't be afraid to be a maverick and explore
alternatives to the star schema if you find your
organization working very hard to force-fit your
business into a dimensional model.
Hardware and
Software Vendor: Focus Is on Bundled Services
Over the past few
years, vendors certainly can't be criticized as
having ignored the data warehousing market. The
prodigious efforts of their product developers
have been outshined only by the efforts of their
marketing departments. There has been, and will
continue to be, a frenzy of new product releases
and acquisitions as each vendor tries to provide
the best solution for data warehousing.
When the concept
of data warehousing was becoming popular, the
early adopters were those companies that had
large amounts of resources to expend on having
the latest technological solutions. However,
we're well past this early adopter phase, and the
majority of companies don't have the previously
stated $3 million to spend, nor can they afford
to wait 12 to18 months.
To address the
needs of their new user base, vendors over the
past year expanded the focus of their data
warehouse offerings. Rather than simply flooding
the already confused market with more products,
vendors took a step back and looked at what has
become possibly the most critical component to
selling their solutions: services. Now they all
provide services which can help users build small
systems on a small budget, but which still
provide real value in a short time frame, thereby
allowing the users to ask management for further
resources to expand the project.
These "Quick
Start" (or Fast Start, or Smart Start,
or...) programs usually bundle hardware, software
and services. They are typically low-risk,
fixed-price, fixed-time projects that last
anywhere from 15 to 90 days, and anyone
contemplating building a data warehouse should
take advantage of these programs.
Scalable Data
Warehouses -- the Coming Wave
But what good is
it to build a small proof-of-concept system if
you have to throw it away when you really want to
build your full production warehouse? Sure, now
you have at least demonstrated the potential
value, but still no one wants to try to convince
management that they'll now have to wait many
months while you throw away the prototype and
start work on the real data warehouse. Wouldn't
it be better to take what you've built, and scale
it up as your needs grow and as resources
(slowly) become available?
Even if you decide
to start with a relatively large data warehouse
(in the 100 GB range), you are still by no means
immune to the need to scale up your warehouse.
Any self-respecting data warehouse grows in
multiple directions simultaneously. When it comes
to data warehouses, data begets data, and usage
begets usage. As end-users begin to see the value
of the system, they realize how adding
incremental amounts of data can increase the
usefulness of the system. Similarly,
word-of-mouth spreads about the warehouse, and
the number of users who want access to the system
can soar. Because of this, a conservative rule of
thumb is that the amount of data in your
warehouse will double in a year. Similarly, count
on the number of users growing by an order of
magnitude within two years, during which time the
overall CPU cycles required to process all the
queries will grow at least 100 times. A system
that can handle this growth with be viable for a
longer period of time, thereby increasing your
critically important ROI.
To achieve this
goal of building a scalable data warehouse,
corporations are rapidly turning to
scalable/parallel hardware and database platforms
as the new foundation of choice. Conventional
systems, though they can sometimes handle the
large amounts of data, usually cannot meet these
scalability requirements, because their
processing power is limited to a single CPU, and
their I/O capabilities have fixed limits.
However, scalable/parallel systems, which
includes both hardware and database software
components, address these issues directly by
allowing users to add multiple processors,
additional disks, and additional I/O bandwidth to
the system.
Scalable hardware
platforms fall into three categories.
"Symmetric Multiprocessors" (SMPs) have
multiple processors (usually up to a few dozen)
all sharing the same physical memory and the same
memory bus. However, after a few dozen
processors, this single memory bus becomes a
bottleneck, so adding additional processors will
not increase performance. "Clusters"
were designed to address this problem. Once an
SMP has reached its limits of scalability, the
concept of clustering allows users to connect
multiple SMP machines (referred to as
"nodes") together with a very fast
interconnect which allows the machines to work
together, as if they were a single larger
machine. However, after clustering a few machines
together, this single interconnect eventually
also becomes a bottleneck. "Massively
Parallel Processors" (MPPs) solve this
problem by introducing the concept of a scalable
interconnect, meaning that the bandwidth of this
interconnect grows each time an additional node
is added to the system.
Scalable database
software is the second critical component of a
scalable data warehouse. The database must be
able to take advantage of the multiple processors
and multiple nodes within a scalable system.
Fortunately, there's no lack of parallel
databases. All the major vendors (for example,
Oracle, Informix, IBM, Sybase, Teradata and
Tandem) have been developing and selling versions
of their databases that leverage the underlying
scalable hardware.
Once a corporation
has decided to use scalable technologies for
their warehouse, a question arises: Which
hardware and software platforms are right for
their needs? Regarding hardware, where possible
it is usually desirable to use the most mature
technology to solve a problem, since it will be
the most stable. However, the more mature
hardware technologies are the least scalable. So,
the general guideline is to find the most mature
technology that will still meet your scalability
needs. For example, if you need very little
scalability, then maybe you don't need scalable
hardware or software at all, and you can build
your solution using conventional, very mature
mainframe technology. However, few data
warehouses are static, so if you will need vast
amounts of processing power and scalability, then
first look at SMPs. If they meet your needs, then
use that platform. If not, look at clusters. If
even clusters can't meet your needs, then look at
MPPs, which give you maximum performance and
scalability, but which are the least mature of
the technologies.
Choosing a
scalable parallel database requires a different
approach. Don't spend all your energy analyzing
the architectures of the various vendor databases
(i.e. "shared nothing" vs. "shared
disk" database architectures). Currently, it
is not the architecture that is the most
important. Rather, it is the actual
implementation that's more important. That is,
did the vendor's developers do a good job at
building the RDBMS features that your application
requires? The only way to discover this is to
test your own application. It is highly advisable
to build prototypes and to create and run
benchmarks as a proof-of-concept for various
facets of your data warehouse application.
(Often, the vendor's "Quick Start"
programs will help you do exactly that.) This
process takes some time and money, but will
ultimately save you from making painful mistakes.
Next page: Getting Data Out of Your Data
Warehouse; Data Marts and Operational Data
Warehouses; and Wrapping It All Up.
Ken Rudin
is a featured speaker at The
DCI Data Warehouse Conference and DCI's
Data Warehouse World. He can be reached at krudin@emergent.com.
|