
(Please Note: The following article is quite lengthy.
It has been divided into three sections to assist you in
downloading the article more quickly. To help you navigate the
article more easily, the links in the Table of Contents below
will transport you to different segments of the article. You will
also find links back to this Table of Contents after each
segment. It is suggested that you print the entire article out in
order to read it more easily.)
About the Author
Introduction
Two Different Approaches: Warehousing versus
Replication for Transactions
Distributed DBMS
TP-R as Compared with Distributed DBMS
The Crucial Role of a 2 Phase Commit Definition
Replication and 2-Phase Commit
Introduction
to DSS-R Approaches and Table Copying
DSS-R
Schema
Cascading
Replicates
Optimization
- Push & Pull
TP-R
Replication: Peer to Peer & Master/Slave Approaches
Collisions
with a Peer to Peer Architecture
TP-R
and Fault Tolerance
Transparency
& Richness of Function
The
TP-R Schema
Replication
Timing
Database
Configuration & d.b.a. Utilities
Replication
into Heterogeneous DBMS
Summary
of Replication Benefits
Better
Response Time from Local Data
Replication
for Hot Standby Backup
Data
Availability Such as Separate Servers for Separate Functions
Splitting
the Workload for Capacity Relief
Non-Stop
Processing & System Fault Tolerance
Conclusion
It's
a Complex Environment
Your
Database Administrator is a Key Resource
Your
Approach Should be Cost and Benefit Based
Keep
it Simple, Especially at First
About the Author
Dr. George Schussel, DCI's founder, is Chairman of Database &
Client/Server World and a world-renowned authority on information
systems and client/server technology. He is a lecturer, writer
and consultant to the computer industry and major user companies
on data processing issues.
Replication, or the copying of data in databases to multiple
locations to support distributed applications, is an important
new tool for businesses in building competitive service
advantages. New replicator facilities from several vendors are
making this technology much more useful and practical than it's
been in the past. In this article we will go into enough detail
on replication for the reader to understand the importance of
replication, its benefits and some of the related technical
issues.
Buying trends today clearly indicate that companies want their
applications to be open and distributed closer to the line of
business. This means that databases supporting those companies
have to migrate to this same open, distributed world. As
distributed operational applications become more widely used
across large enterprises there is going to be a requirement for
increasing numbers of data copies to support timely local
response. This is because the propagation uncertainties and costs
associated with real time networks and/or distributed DBMS
solutions are a headache to deal with.
Replication provides users with their own local copies of data.
These local, updatable data copies can support increased
localized processing, reduced network traffic, easy scalability
and cheaper approaches for distributed, non-stop processing.
While replication or data copying can clearly provide users with
local and therefore much quicker access to data, the challenge is
to provide these copies to users so that the overall systems
operate with the same integrity and management capacity that is
available with a monolithic, central model. For example, if the
same inventory records exist on two different systems in two
different locations, say New York and Chicago, the system needs
to insure that the same product isn't sold to two separate
customers.
Replication is the best current solution for many applications
because it can be cheaper and more reliable than the alternative
of a distributed DBMS engine. A distributed DBMS uses a 2-phase
commit to couple together all updates to all locations
participating in an update. This becomes difficult as the number
of participating nodes increases. With 50 or more nodes a tightly
coupled 2-phase commit process for updating is probably
impractical. A replication approach uncouples the applications
from the underlying multiple data copies and allows the
applications to proceed while behind the scene the replication
server handles the coordination of multiple updates. The
difference in approaches between replication and distributed DBMS
approaches will be discussed in detail below.
There are many different approaches to replication, each well suited to solving certain classes of problems. The different types of technologies, in fact, span a scale of approaches as is illustrated as in Figure 1. On the right side are classes of technologies that are appropriate for supporting operational systems whose principal role is allowing real time transaction processing in widely distributed locations. On the left side of this scale are approaches that are well suited for supporting decision making, browsing and research on LAN based PC's or other platforms.
The type of
replication strategy that is appropriate is very problem or
application dependent. Decision support applications are often
well supported by technologies that employ table copying or
snapshot technologies. These technologies (the term
"warehouse" is often applied here) can support multiple
schema's or data views and are normally set up so that the copies
are "read only". For simplicity in this paper, these
approaches will be referred to as warehouse or (DSS-R, Decision
Support Systems - Replication).
Warehousing applications are usually characterized by a need for
data copies that are consistent for a single point in time; that
point doesn't necessarily have to be the current time and
sometimes it's preferable that it's not. In period accounting or
trend analysis, for example, a stable data source is essential.
In decision support systems (DSS) one usually needs history for a
series of data values. Multidimensional or matrix representation
of the data with one of the axes being time is frequently used.
The data, when it's presented to the user, cannot be encoded but
has to be in a form that is comfortable and familiar to users.
GUI forms of presentation are becoming a requirement for
warehouse applications. In order to make the data most useful to
the end user, it frequently requires the system to perform
derivation, aggregation and transformation functions to the raw
data.
Database copies that support warehouse applications are usually
read only. Updates, as they occur, are performed to the source
production system database from which the target warehouse
database copy was created. It is possible, however, to have an
environment where updates can be processed against both
production and warehouse databases. This is done by keeping the
two in a synchronous state with a 2-phase commit update against
both source and target data. Normally this is not a good idea
because of specialized tuning for the read only copy that allows
it to perform better in decision support. Transaction processing
updates will likely interfere with its job efficiency.
IBM is probably the leader in offering technology to support data
warehousing. DEC, Hewlett Packard and Information Builders are
other companies that offer important technology for supporting
data warehouse approaches.
At the other end of the replication technology continuum are
replication approaches that are designed to replace and improve
on distributed on-line or distributed DBMS technologies. These
approaches have to offer real time updating against copies of
data that may be located in many locations. The basic approach
used by replication servers here is to uncouple the distribution
of data copies from the originating application. The application
is allowed to update its local copy and when that is completed it
proceeds to the next transaction. Asynchronously, the replication
server, then propagates the changed data to its other locations.
This type of replication is appropriate for a production system.
It normally requires a single global schema. These systems will
be referred to as (TP-R, Transaction Processing - Replication).
TP-R requires a very different technology than warehousing.
Production systems need the current state of data, not its
history. For efficiency purposes, at input and in processing, the
data is frequently encoded. Each production location does not
necessarily need access to all of the global data. Subsetting by
region, for example, is common. Any node must allow updates to
production data. The propagation of update copies to secondary
locations should be done as soon as possible. That propagation,
then, is done in near real time with a separate 2-phase commit to
each target copy location. Such a system can maintain transaction
consistency for updates that span multiple tables at one (or
more) target sites.
The leaders in TP-R approaches are Sybase and CA-Ingres. Sybase's
architecture is built around a master/slave concept. CA-Ingres is
based on a peer to peer model. Both of these approaches will be
discussed below.
Between the two extremes of DSS-R and TP-R there are many
possibilities of combining features and functions for a
customized distributed solution. When considering replication
options the user needs to consider requirements for currency,
local updates, data enhancement and history maintenance, among
other considerations. In the interest of keeping this article of
readable length we will concentrate on replication and
distributed DBMS issues from the two ends of the continuum scale
shown in Figure 1.
It's useful to understand something about distributed DBMS
technologies before analyzing replication, because the approaches
are very closely related. The concepts behind distributed DBMS
were pioneered during the late 1970's in the IBM research project
R*Star. IBM's subsequent delivery of distributed DBMS products
has been part of a 10 year evolving technology known as DRDA
(distributed relational data architecture). DRDA at this time is
largely an approach for integrating data sets across the
different versions of DB2 that run on AIX, OS/2, OS/400, VM and
MVS. DRDA has been published and IBM encourages other DBMS
vendors to participate as client or server sites.
The first well-publicized distributed DBMS product was
INGRES/Star, announced in 1987. Oracle also announced distributed
DBMS capabilities in 1987, but largely as a marketing ploy. The
first Oracle product to reasonably support distributed database
processing is Oracle 7, which has been in the market since 1993.
A true distributed DBMS, as defined by most industry consultants,
requires the system to support updates at any node on the
network. A short summary of Chris Date's requirements for the
functions that should be supported by a distributed DBMS is
provided in Figure 2.
If one compares TP-R to distributed DBMS, the main difference
is in the relationship of the application to the various
distributed updates. With distributed DBMS there is a single
2-phase commit that synchronously locks all of the data copies
until all locations respond with a "committed" message.
With TP-R replication this is replaced by "n" separate
two phase commits, where "n" is the number of separate
data locations.
Replicated database nodes are less synchronized than data copies
maintained by a distributed DBMS, of course. Conversely, they
offer faster overall system processing, faster local commits of
transactions and the potential for significantly reduced network
traffic. All of this, however, is at the risk of collisions
between different data servers.
A collision is when different physical copies are updated by
different applications during the same interval of time. A DBMS
replication server needs to provide software to aid in the
resolution of such collisions. Software can detect a collision
and provide notification. It also can follow any business rules
that have been set up to resolve such an occurrence. Collision
resolution is discussed later in this article.
The replication approach is more fault tolerant than distributed
database and therefore more appropriate for many applications. In
a replication approach the timing between the changes at the
different nodes is managed through mail or store and forward
approaches rather than through locked multi-site transactions.
Once the application updates its local data, it is de coupled
from the replication engine which has the responsibility for
propagating the copies of the changed data to other locations. A
transaction managed through a replication approach is considered
successful if it is committed at one site (in a peer to peer
system) or at the master site (in a master/slave approach).
Replication cannot be used where absolute data synchronization is
required for the application. Examples of such applications would
be financial trading and banking funds transfer. Other
applications such as order processing, and hotel or airline
reservations might be handled with replication approaches. After
all, airlines and hotels overbook intentionally. If the
application can deal with some inconsistency among the different
data nodes for short periods of time, then replication should be
considered as an alternative.
The essence (and the bane) of distributed database is the 2-phase commit. What the 2-phase commit accomplishes is a synchronized locking of all pieces of a transaction. The result, then is an atomic action when the transaction is spread over multiple locations and processors. A 2-phase commit allows a distributed transaction to be processed with the same data integrity as a transaction that is processed entirely within a single computer database.
In Figure 3, an example is provided showing the need for a
technology like 2-phase commit. Things start off when application
1 updates the lower left database. It does that by reading the
before image of the data to be modified, changing it and holding
locks on the data, plus preparing a log entry in the lower left
database. That application then goes on to successfully
accomplish the same process with the center lower database. But
as application 1 tries to complete its updates by updating the
database at the top right, it finds out that another application
(2) has already modified part of the data that is to be updated
by application 1. In other words the data read now doesn't agree
with the values that were read in the first two updates. This is
a "collision" and the end result for a distributed DBMS
is that the first two pending updates have failed, the locks are
released and the transaction is rejected.
Figure 4 details the procedure followed by a distributed DBMS in
a 2-phase commit. When a synchronized 2-phase commit is combined
with data locking, logging and recovery, the necessary
ingredients for building a distributed database with absolute
data synchronization are in place. However, because any failure
in the network or any of the local participating databases causes
the entire transaction to fail, this approach to distributed
computing is very intolerant to failure.
Because of this intolerance, distributed DBMS are not typically
used to create and manage replicates. The distributed DBMS is
more useful where data integrity across multiple sites must be
guaranteed. In these environments the real failure would be to
permit updating some nodes in the presence of outages of others.
All modern distributed DBMS products offer methods for
implementing a 2-phase commit. However, the degree of automation
support is different from different vendors. For example, IBM,
Oracle and CA-Ingres offer high level (transparent to the
application) approaches to implementing 2-phase commit. The
Sybase replicator takes more programming to implement in that it
requires the user to handle some of the "handshaking"
issues, by, for example, coding DBLib or RPC calls into the
application. If the application environment requires transaction
rollback from time to time, this additional programming can be
difficult to handle properly.
Distributed DBMS 2-phase commit procedures and implementations
are proprietary as there is no standard established for how it
should work. There is an XA standard from X/Open which has been
implemented in several transaction monitors, but it hasn't been
implemented as part of any vendor's DBMS technology. IBM and
Sybase have published their proprietary protocols and procedures
for 2-phase commit. Some other vendors, like CA-Ingres and
Digital, have taken advantage of this information by including
support for a 2-phase commit process to extend from their own
systems to those of IBM and Sybase also.
Replication offers an asynchronous approach to updating
copies. Asynchronous as defined above means that the distributing
of the updates to secondary sites has been uncoupled from the
primary update. The process which transmits the updates has to be
reliable (insuring that the copies get to the targets) and valid
(insuring that the necessary integrity is maintained at the
target). DSS-R and TP-R approaches can use the same approach for
reliability, but typically use different technologies for
validity/integrity.
In both approaches, the process for insuring that no copy
information is lost is to use a 2-phase commit protocol for the
changed data transmission.
In TP-R environments the integrity of data at the target site
must be maintained by applying copy updates one transaction at a
time. The changed data from one user transaction can span
multiple tables at each update location. At each site, then, all
or none of the updates should be applied. This way the data stays
consistent across all tables at all times.
In contrast, DSS-R approaches apply updates table by table. All
tables that may have been affected by one transaction aren't
committed in the same unit of work under DSS-R.
The DSS-R approach is usually far more efficient in computer and
network resources, especially since it allows for the net result
of a series of updates to be transmitted rather than the
propagation of all the individual changes themselves. However,
this "netting out" isn't appropriate for transaction
based environments.
In both replication approaches, unlike for a distributed DBMS, it
is not necessary for the 2 phase commits that distribute the
copies to be part of the original (application driven)
transaction. An example of a TP-R implementation approach for
this is:
It is interesting to consider the above scenario and how it
compares to a true distributed DBMS solution based on synchronous
updating approaches. The first key point is that the application
isn't blocked by a problem that is related to distributed
transactions. It performs a local update, which is quick, while
the DBMS manages the distribution asynchronously.
The second key point, of course, is that there is a latency
between the updates performed at the primary and subsequent
copies. This raises application issues which the user needs to be
able to live with. These points will be discussed below under the
topic of collisions.
A third key point is that in larger and/or less reliable
environments a distributed DBMS approach just wouldn't work,
while replication's architecture can. Imagine a situation with
100 target database nodes, only 90 of which are available. A
replication server would perform 100 separate 2 phase commit
transactions, each with two branches. A distributed DBMS would
attempt to perform one 2 phase commit transaction with 100
branches. Even if all 100 nodes were on-line, the distributed
DBMS would hold locks on all 100 targets until all 100 were
willing to commit. In an unreliable WAN scenario (e.g.,
developing countries) or any situation with many nodes, clearly
the distributed DBMS solution just flat doesn't work.
If you use a replication server to support operational systems,
the application view of data at the different locations should be
1) each logically consistent within self, but 2) possibly out of
phase with each other for some period of time. The differences in
data among different nodes are all transitory and get reconciled
over time.
©Copyright 1996 by Digital Consulting, Inc.
All Event names are trademarks of DCI or their clients.
Comments? webmaster@dciexpo.com