web hit counter DCI - Ken Rudin: What's New in Data Warehousing
DCI Logo DCI Header Logo

DCI Home
Event Info
Sign-Up
Exhibitors
I.T. News
Press Room
Find It
Help
 

What's New in Data Warehousing - Part 2

By Ken Rudin
Managing Director, Emergent Corporation

[Please note: This article has been divided into two sections to assist you in downloading it more quickly. A link back to the first half of the article is available here and at the end of this segment.]

Getting Data Out of Your Data Warehouse

Once your data warehouse is built, all you really have is an environment that is potentially very valuable. Until you actually get your data back out of your system in some meaningful way, that potential remains untapped. Three classes of end-user access tools have emerged that help you unlock the potential value: Query and Reporting tools, OLAP tools, and Automatic Data Mining.

Query Reporting Tools

The first class of tools are the query reporting tools. Often, they are not particularly targeted at the data warehouse environment, but are instead generic tools that allow you to ask fairly structured questions. These tools are usually uncomplicated, and are optimal for asking questions such as how many items you currently have in inventory, or how many customers have taken advantage of a sales promotion coupon. The output of these tools is usually in the form of a report.

OLAP Tools

OLAP tools can take you a step beyond the power of query reporting tools. Admittedly, the lines here are fairly fuzzy, but generally the distinguishing factor is that with OLAP tools, the data is represented using a multidimensional model rather than the more traditional logical data model. The traditional model defines a database schema that focuses on modeling a process or a function, and the information is viewed from the perspective of being a set of transactions that occurred over time. In comparison, the multidimensional model usually defines a star schema (described earlier) and views data not from the perspective of a single event, but rather from the perspective of the cumulative effect of these events over some period of time, such as weeks, months and years.

Also, with OLAP tools, the user generally doesn't view the data in standard report formats, but rather in grids or cross-tabs that can be pivoted to quickly look at the data from different perspectives. Additionally, OLAP involves interactive querying of the data. The user is able to follow a train of thought by being able to look at information at one aggregation level (such as a sales region), and then drill down into successively more detail, such as state, city, store and department.

Be careful not to get confused here. OLAP tools simply give you an interesting way to look at your data. They do not imply how the data is actually stored. Given that, it's not surprising that there are multiple ways to store the data. One popular choice involves storing the data in a dedicated Multidimensional Databases (MDD). Examples include Arbor's Essbase, Oracle's Express Server and Planning Science's Gentia. The other popular choice, of course, involves storing the data in relational databases and having an OLAP tool work directly against the data. This is referred to as Relational OLAP, or R/OLAP. Examples include MicroStrategy's DSS Agent, Informix-MetaCube, Information Advantage's AXSYS, and Platinum's InfoBeacon. (Some also include Red Brick's Warehouse in this category, but it isn't really an OLAP tool. It is a relational database optimized for performing the types of operations that R/OLAP tools need.)

Both have numerous strengths and weaknesses, and many articles have been written on that topic. Possibly more importantly, the weaknesses are being rapidly addressed by the respective vendors. However, there is one fact that must be taken into consideration. Currently, data warehouses are predominantly built using relational databases. In fact, the association is so strong that for many people, a data warehouse implies a relational database. And, if you have a warehouse built on a relational database, and you want to perform OLAP analysis against it, then Relational OLAP is a natural fit. Yes, you'll forego many of the useful features that currently exist only in MDDs, and you'll miss out on some of the business modeling capabilities, but you'll be able to work directly with your relational warehouse.

This isn't to say that MDDs can't be a part of your data warehouse solution. It's just that MDDs currently aren't well suited for large amounts of data (10 GB is fine, but 100 GB is stretching their capabilities). If you really want the functionality benefits that come with MDD, then I'd suggest subsetting the data into smaller MDD based "data marts."

Automatic Data Mining Tools

As I mentioned earlier, the potential value of a data warehouse is only unlocked if you can extract the information out of it. Essentially, the promise of data warehousing is twofold: easier access to consistent data, and the ability to discover previously hidden information, patterns and trends about your business. Query reporting tools and OLAP tools address the first promise, but because of a subtle issue, do not do very well with the second promise. The subtle issue I'm referring to is the fact that using query reporting or OLAP tools, the answers you get are only as good as the questions you ask. You will only find interesting patterns if you're looking for them.

Think of it this way -- your data warehouse is a gold mine, but you'll only find the gold if you know where to dig. Wouldn't it be nice to have a tool that automatically looked in all possible locations, and in doing so uncovered all the hidden gold? That's exactly what automatic data mining tools do. And that is the most compelling notion behind data mining. Previously, the paradigm for getting information out of your data warehouse was that an end-user would start with an intuition about something they wanted to find, and then they would use various tools to ask questions that would either verify or contradict their intuition. Data mining is fundamentally different: Rather than requiring user intuition and user initiative, the initiative automatically comes from the data mining tool itself.

Automatic data mining is a sophisticated technology that combines multiple advanced data analysis techniques, including such things as statistical analysis, decision trees and neural nets. It looks through all your data to find trends, patterns and relationships that you may not have otherwise noticed. Data mining itself is not new. For example, for many years, data mining techniques have been used in retailing to help identify which items frequently sell together, which helps retailers identify additional selling opportunities, or even help them determine where products should be placed on shelves. Credit card companies use data mining to detect purchasing patterns that indicate that a credit card has been stolen. The financial industry uses data mining to identify trends in stock prices, or associations between the movements of the prices between two or more stocks. Banks are using data mining techniques to identify which factors are the most accurate indicators to determine whether a customer is a good candidate for a mortgage loan. Telecommunication companies use data mining to pinpoint customers that are most likely to be interested in new calling plans, or new types of services. Insurance companies have used data mining to detect fraudulent activities by flagging activities that are outside the norm. Health care has used data mining to determine which combinations of treatments are most successful for a particular health condition. The list goes on.

If data mining is not new, then why is interest in it surging? There are two reasons. First, as data warehouses get larger, users are being buried by the sheer volumes of data. Trying to find all the buried gold by knowing which questions to ask becomes impossible. Automatically searching for the nuggets of knowledge allows companies to increase the ROI of their data warehousing efforts. Second, the rapidly increasing acceptance of parallel and scalable machines is contributing to the data mining surge. Searching through and performing complex calculations on mountains of data requires large computational resources. Parallel platforms, with their ability to support large numbers of processors and vast amounts of I/O, provide the necessary resources.

Since this field is just becoming popular, the market is still too small to determine who are the "dominant" tool suppliers. In addition, the technology is not yet very mature. However, expect this market to heat up. The potential returns from using an automatic data mining tool are just too compelling to ignore.

Data Visualization

Data warehouses are also causing a surge in the popularity of data visualization techniques for looking at data. Data visualization is not actually a separate class of tools; rather, it is a method for displaying the data that is output by any of the previously mentioned tools. These visualization techniques are particularly valuable for viewing data that is queried from a data warehouse because the end-users of a data warehouse are usually looking for trends or patterns, and the human mind is far better equipped to notice visual trends and patterns than it is at noticing the same information in a standard report or cross-tab.

At the most simple level, one could claim that the various line, bar, and pie charts with which we are already familiar constitute data visualization. In reality, however, data visualization goes far beyond simple charts. It is a complex technique that is currently an area of hot research, as developers try to determine how to best display complex relationships on a two-dimensional computer monitor. For example, multidimensional analysis may look at sales of products by category by region by month. How can you best display four dimensions graphically? How about five? Current techniques are experimenting with using different colors and different shapes to signify different dimensional values. For example, a scatter plot may be enhanced to show additional dimensions by not just plotting simple dots on an X, Y (and possibly Z) axis, but by changing the dots into objects of different shapes and colors. Additionally, animation can be used to show how values change over the time dimension. As you might imagine, data visualization graphics have the potential to be incomprehensible (and therefore useless) if not done well. However, when done properly, an enormous amount of inter-related data can be consolidated into an intuitive graphical display. As an aid to comprehension, the results can be astonishing.

The Web

A section on accessing warehouse data would not be complete without a mention of how the World Wide Web is changing the landscape for data warehousing as much as it is changing the landscape for just about everything these days. At a high level, the goals of data warehousing and the goal of the Web are the same: easy access to data. The value of a data warehouse is maximized when the right data is in the hands of those who need it, regardless of when and where they need it. However, corporations have typically had to struggle with complex client/server architectures to give end-users the access they need. Mix in the additional issues that arise when you try to give remote users access to the warehouse, and the result is often a truly complex environment. Additionally, all users need training on how to use the client applications.

The Web removes these issues. While it doesn't necessarily give you more ways to look at your data, it does give you more access to your data. Last year, more Web servers were sold for internal corporate use (known as intranets) than were sold for external Internet use, because corporations discovered that the Web provides an efficient mechanism to realize the benefits of client/server architectures. Additionally, the same applications will work over the Internet, removing the complexity previously associated with supporting remote access. And finally, the client application is the same Web browser that is used for every other Web application, meaning that millions of people already know how to use it. Vendors are quickly providing new Web development tools that allow Web browsers to access data warehouses, thereby addressing the previously discussed needs of the end user: access to the right data, truly regardless of when and where they need it.

Divergent Paths: Data Marts and Operational Data Warehouses

When something as popular as data warehousing arrives on the scene, it's human nature to tweak the concept to see how it can be made even more useful. Though not everyone will agree, for the most part, data warehouses were initially conceived as a single consolidated database, and as read-only. Two movements are challenging each of those aspects: Data Marts and Operational Data Warehouses. What is most interesting is that these trends are essentially at odds with each other.

Data marts have many definitions, but the type of data mart I'm referring to is the one that is created as a subset of a larger data warehouse. There are many reasons for creating a data mart. One reason is simply performance. A marketing user might not want to have to use a machine which is always brought to its knees because the people in finance always run complex queries against the warehouse. Another reason is simplicity. You limit the scope of what you're trying to achieve with a data mart, thereby making the resulting solution simpler to understand and maintain. Finally, there is the autonomy issue. Data marts break up the potentially monolithic data warehouse, allowing each group of users to autonomously manipulate just the data they are most interested in, thereby decentralizing the data access.

Operational data warehouses move in the opposite direction. Rather than portioning off smaller pieces of the data warehouse, they combine the data warehouse with the traditional OLTP operational systems. These systems allow users to analyze the data, and then take actions based on the analysis in real-time, all within the same system. For example, you can look at the purchasing patterns of recent sales, and based on that analysis, enter transactions to ship various products to your various stores. Or, you might analyze stock price trends, and then issue buy and sell orders all within the same system. Additionally, the resources expended on extracting data from the operational system can be reduced because the warehouse and the operational system are no longer two separate systems.

Essentially, data marts further replicate data and decentralize access, and operational data warehouses further consolidate data and centralize access. In the near future, data marts will flourish, because people like to have decentralized control over their own data (for example, people have always found ways to copy data out of a database and place it in a local spreadsheet so they can manipulate the data locally). Automatic tools are being developed that will help keep the information in the data marts current and consistent with the data in the enterprise level data warehouse. On the other hand, operational data warehouses will only play a niche role in the near future. The operational users are different from the analysis users, and it will probably stay that way for some time. In addition, even though this might sound like the "perfect world" solution, the challenges of resolving the performance implications of having critical OLTP applications share the same platform as resource-hungry DSS applications are daunting.

Wrapping It All Up

Data warehouses are clearly a viable solution to a real problem. With the database software vendors scrambling to incorporate specific data warehouse features into their products, the hardware vendors incorporating scalable technologies into their products, the access tools developers adding a plethora of power features into their products, and all of them introducing quick start programs, the data warehouse market will continue to grow rapidly. It may not be the vision of utopia described at the start of this article, but at least we can now get our jobs done a little more easily.

Return to First Half of Article

Ken Rudin is a featured speaker at The DCI Data Warehouse Conference and DCI's Data Warehouse World. He can be reached at krudin@emergent.com.

 
  [home] [event info] [sign up] [exhibit now] [i.t. news] [press room] [find it] [help]

© Copyright 1997 by Digital Consulting, Inc. (508) 470-3880
All event names are trademarks of DCI or its clients.
Comments?
webmaster@dciexpo.com