| |
What's New
in Data Warehousing - Part 2
By Ken
Rudin
Managing Director, Emergent Corporation
[Please
note: This article has been divided into
two sections to assist you in downloading it more
quickly. A link back to the first half of the
article is available here and at the end of this
segment.]
Getting Data
Out of Your Data Warehouse
Once your data
warehouse is built, all you really have is an
environment that is potentially very valuable.
Until you actually get your data back out of your
system in some meaningful way, that potential
remains untapped. Three classes of end-user
access tools have emerged that help you unlock
the potential value: Query and Reporting tools,
OLAP tools, and Automatic Data Mining.
Query Reporting
Tools
The first class of
tools are the query reporting tools. Often, they
are not particularly targeted at the data
warehouse environment, but are instead generic
tools that allow you to ask fairly structured
questions. These tools are usually uncomplicated,
and are optimal for asking questions such as how
many items you currently have in inventory, or
how many customers have taken advantage of a
sales promotion coupon. The output of these tools
is usually in the form of a report.
OLAP Tools
OLAP tools can
take you a step beyond the power of query
reporting tools. Admittedly, the lines here are
fairly fuzzy, but generally the distinguishing
factor is that with OLAP tools, the data is
represented using a multidimensional model rather
than the more traditional logical data model. The
traditional model defines a database schema that
focuses on modeling a process or a function, and
the information is viewed from the perspective of
being a set of transactions that occurred over
time. In comparison, the multidimensional model
usually defines a star schema (described earlier)
and views data not from the perspective of a
single event, but rather from the perspective of
the cumulative effect of these events over some
period of time, such as weeks, months and years.
Also, with OLAP
tools, the user generally doesn't view the data
in standard report formats, but rather in grids
or cross-tabs that can be pivoted to quickly look
at the data from different perspectives.
Additionally, OLAP involves interactive querying
of the data. The user is able to follow a train
of thought by being able to look at information
at one aggregation level (such as a sales
region), and then drill down into successively
more detail, such as state, city, store and
department.
Be careful not to
get confused here. OLAP tools simply give you an
interesting way to look at your data. They do not
imply how the data is actually stored. Given
that, it's not surprising that there are multiple
ways to store the data. One popular choice
involves storing the data in a dedicated
Multidimensional Databases (MDD). Examples
include Arbor's Essbase, Oracle's Express Server
and Planning Science's Gentia. The other popular
choice, of course, involves storing the data in
relational databases and having an OLAP tool work
directly against the data. This is referred to as
Relational OLAP, or R/OLAP. Examples include
MicroStrategy's DSS Agent, Informix-MetaCube,
Information Advantage's AXSYS, and Platinum's
InfoBeacon. (Some also include Red Brick's
Warehouse in this category, but it isn't really
an OLAP tool. It is a relational database
optimized for performing the types of operations
that R/OLAP tools need.)
Both have numerous
strengths and weaknesses, and many articles have
been written on that topic. Possibly more
importantly, the weaknesses are being rapidly
addressed by the respective vendors. However,
there is one fact that must be taken into
consideration. Currently, data warehouses are
predominantly built using relational databases.
In fact, the association is so strong that for
many people, a data warehouse implies a
relational database. And, if you have a warehouse
built on a relational database, and you want to
perform OLAP analysis against it, then Relational
OLAP is a natural fit. Yes, you'll forego many of
the useful features that currently exist only in
MDDs, and you'll miss out on some of the business
modeling capabilities, but you'll be able to work
directly with your relational warehouse.
This isn't to say
that MDDs can't be a part of your data warehouse
solution. It's just that MDDs currently aren't
well suited for large amounts of data (10 GB is
fine, but 100 GB is stretching their
capabilities). If you really want the
functionality benefits that come with MDD, then
I'd suggest subsetting the data into smaller MDD
based "data marts."
Automatic Data
Mining Tools
As I mentioned
earlier, the potential value of a data warehouse
is only unlocked if you can extract the
information out of it. Essentially, the promise
of data warehousing is twofold: easier access to
consistent data, and the ability to discover
previously hidden information, patterns and
trends about your business. Query reporting tools
and OLAP tools address the first promise, but
because of a subtle issue, do not do very well
with the second promise. The subtle issue I'm
referring to is the fact that using query
reporting or OLAP tools, the answers you get are
only as good as the questions you ask. You will
only find interesting patterns if you're looking
for them.
Think of it this
way -- your data warehouse is a gold mine, but
you'll only find the gold if you know where to
dig. Wouldn't it be nice to have a tool that
automatically looked in all possible locations,
and in doing so uncovered all the hidden gold?
That's exactly what automatic data mining tools
do. And that is the most compelling notion behind
data mining. Previously, the paradigm for getting
information out of your data warehouse was that
an end-user would start with an intuition about
something they wanted to find, and then they
would use various tools to ask questions that
would either verify or contradict their
intuition. Data mining is fundamentally
different: Rather than requiring user intuition
and user initiative, the initiative automatically
comes from the data mining tool itself.
Automatic data
mining is a sophisticated technology that
combines multiple advanced data analysis
techniques, including such things as statistical
analysis, decision trees and neural nets. It
looks through all your data to find trends,
patterns and relationships that you may not have
otherwise noticed. Data mining itself is not new.
For example, for many years, data mining
techniques have been used in retailing to help
identify which items frequently sell together,
which helps retailers identify additional selling
opportunities, or even help them determine where
products should be placed on shelves. Credit card
companies use data mining to detect purchasing
patterns that indicate that a credit card has
been stolen. The financial industry uses data
mining to identify trends in stock prices, or
associations between the movements of the prices
between two or more stocks. Banks are using data
mining techniques to identify which factors are
the most accurate indicators to determine whether
a customer is a good candidate for a mortgage
loan. Telecommunication companies use data mining
to pinpoint customers that are most likely to be
interested in new calling plans, or new types of
services. Insurance companies have used data
mining to detect fraudulent activities by
flagging activities that are outside the norm.
Health care has used data mining to determine
which combinations of treatments are most
successful for a particular health condition. The
list goes on.
If data mining is
not new, then why is interest in it surging?
There are two reasons. First, as data warehouses
get larger, users are being buried by the sheer
volumes of data. Trying to find all the buried
gold by knowing which questions to ask becomes
impossible. Automatically searching for the
nuggets of knowledge allows companies to increase
the ROI of their data warehousing efforts.
Second, the rapidly increasing acceptance of
parallel and scalable machines is contributing to
the data mining surge. Searching through and
performing complex calculations on mountains of
data requires large computational resources.
Parallel platforms, with their ability to support
large numbers of processors and vast amounts of
I/O, provide the necessary resources.
Since this field
is just becoming popular, the market is still too
small to determine who are the
"dominant" tool suppliers. In addition,
the technology is not yet very mature. However,
expect this market to heat up. The potential
returns from using an automatic data mining tool
are just too compelling to ignore.
Data
Visualization
Data warehouses
are also causing a surge in the popularity of
data visualization techniques for looking at
data. Data visualization is not actually a
separate class of tools; rather, it is a method
for displaying the data that is output by any of
the previously mentioned tools. These
visualization techniques are particularly
valuable for viewing data that is queried from a
data warehouse because the end-users of a data
warehouse are usually looking for trends or
patterns, and the human mind is far better
equipped to notice visual trends and patterns
than it is at noticing the same information in a
standard report or cross-tab.
At the most simple
level, one could claim that the various line,
bar, and pie charts with which we are already
familiar constitute data visualization. In
reality, however, data visualization goes far
beyond simple charts. It is a complex technique
that is currently an area of hot research, as
developers try to determine how to best display
complex relationships on a two-dimensional
computer monitor. For example, multidimensional
analysis may look at sales of products by
category by region by month. How can you best
display four dimensions graphically? How about
five? Current techniques are experimenting with
using different colors and different shapes to
signify different dimensional values. For
example, a scatter plot may be enhanced to show
additional dimensions by not just plotting simple
dots on an X, Y (and possibly Z) axis, but by
changing the dots into objects of different
shapes and colors. Additionally, animation can be
used to show how values change over the time
dimension. As you might imagine, data
visualization graphics have the potential to be
incomprehensible (and therefore useless) if not
done well. However, when done properly, an
enormous amount of inter-related data can be
consolidated into an intuitive graphical display.
As an aid to comprehension, the results can be
astonishing.
The Web
A section on
accessing warehouse data would not be complete
without a mention of how the World Wide Web is
changing the landscape for data warehousing as
much as it is changing the landscape for just
about everything these days. At a high level, the
goals of data warehousing and the goal of the Web
are the same: easy access to data. The value of a
data warehouse is maximized when the right data
is in the hands of those who need it, regardless
of when and where they need it. However,
corporations have typically had to struggle with
complex client/server architectures to give
end-users the access they need. Mix in the
additional issues that arise when you try to give
remote users access to the warehouse, and the
result is often a truly complex environment.
Additionally, all users need training on how to
use the client applications.
The Web removes
these issues. While it doesn't necessarily give
you more ways to look at your data, it does give
you more access to your data. Last year, more Web
servers were sold for internal corporate use
(known as intranets) than were sold for external
Internet use, because corporations discovered
that the Web provides an efficient mechanism to
realize the benefits of client/server
architectures. Additionally, the same
applications will work over the Internet,
removing the complexity previously associated
with supporting remote access. And finally, the
client application is the same Web browser that
is used for every other Web application, meaning
that millions of people already know how to use
it. Vendors are quickly providing new Web
development tools that allow Web browsers to
access data warehouses, thereby addressing the
previously discussed needs of the end user:
access to the right data, truly regardless of
when and where they need it.
Divergent
Paths: Data Marts and Operational Data Warehouses
When something as
popular as data warehousing arrives on the scene,
it's human nature to tweak the concept to see how
it can be made even more useful. Though not
everyone will agree, for the most part, data
warehouses were initially conceived as a single
consolidated database, and as read-only. Two
movements are challenging each of those aspects:
Data Marts and Operational Data Warehouses. What
is most interesting is that these trends are
essentially at odds with each other.
Data marts have
many definitions, but the type of data mart I'm
referring to is the one that is created as a
subset of a larger data warehouse. There are many
reasons for creating a data mart. One reason is
simply performance. A marketing user might not
want to have to use a machine which is always
brought to its knees because the people in
finance always run complex queries against the
warehouse. Another reason is simplicity. You
limit the scope of what you're trying to achieve
with a data mart, thereby making the resulting
solution simpler to understand and maintain.
Finally, there is the autonomy issue. Data marts
break up the potentially monolithic data
warehouse, allowing each group of users to
autonomously manipulate just the data they are
most interested in, thereby decentralizing the
data access.
Operational data
warehouses move in the opposite direction. Rather
than portioning off smaller pieces of the data
warehouse, they combine the data warehouse with
the traditional OLTP operational systems. These
systems allow users to analyze the data, and then
take actions based on the analysis in real-time,
all within the same system. For example, you can
look at the purchasing patterns of recent sales,
and based on that analysis, enter transactions to
ship various products to your various stores. Or,
you might analyze stock price trends, and then
issue buy and sell orders all within the same
system. Additionally, the resources expended on
extracting data from the operational system can
be reduced because the warehouse and the
operational system are no longer two separate
systems.
Essentially, data
marts further replicate data and decentralize
access, and operational data warehouses further
consolidate data and centralize access. In the
near future, data marts will flourish, because
people like to have decentralized control over
their own data (for example, people have always
found ways to copy data out of a database and
place it in a local spreadsheet so they can
manipulate the data locally). Automatic tools are
being developed that will help keep the
information in the data marts current and
consistent with the data in the enterprise level
data warehouse. On the other hand, operational
data warehouses will only play a niche role in
the near future. The operational users are
different from the analysis users, and it will
probably stay that way for some time. In
addition, even though this might sound like the
"perfect world" solution, the
challenges of resolving the performance
implications of having critical OLTP applications
share the same platform as resource-hungry DSS
applications are daunting.
Wrapping It All
Up
Data warehouses
are clearly a viable solution to a real problem.
With the database software vendors scrambling to
incorporate specific data warehouse features into
their products, the hardware vendors
incorporating scalable technologies into their
products, the access tools developers adding a
plethora of power features into their products,
and all of them introducing quick start programs,
the data warehouse market will continue to grow
rapidly. It may not be the vision of utopia
described at the start of this article, but at
least we can now get our jobs done a little more
easily.
Return to First Half of Article
Ken Rudin
is a featured speaker at The
DCI Data Warehouse Conference and DCI's
Data Warehouse World. He can be reached at krudin@emergent.com.
|