DCI: Sue Mellen - Applying a Filter to the Information Stream
 
 

Publication Date: November 8, 1996
Related articles - Keeping an Electronic Eye on Business and 10 Web Sites for Business-Related Information

Applying a Filter to the Information Stream

By Sue Mellen

It has been said that the Internet is a huge parking lot where someone has dumped all of the books from all of the libraries in the world. How do you find the information you need in the middle of those stacks without squandering time? Some companies are turning to news and information filtering products that turn all that data into neat little packages of information.

It only makes sense that, in this age of "info-glut," information filtering has become big business, with the leaders in the industry relying on a combination of database technology and human expertise. Two companies that have built their reputations in the area are OneSource Information Services of Cambridge, Mass., and Individual, Inc. of Burlington, Mass.

From the Lotus Position

An independent company since 1993, OneSource claims more than 1,000 corporate customers, each with multiple users. The company first saw light as Lotus One Source, born out of Lotus Development Corp.’s 1987 acquisitions of two early innovators in CD-ROM information storage: Datatext Inc., which was dedicated to gathering and compressing text-related information, and Isis Corp., where the focus was on harvesting and organizing numeric data.

Dan Schimmel, OneSource president and CEO, says his company’s family tree provides a competitive edge in the number-crunching '90s. Thanks to the marriage of two such disparate entities as Isis and Datatext—a highly unusual move at the time—the company has long experience in integrating text and numerical data from many different sources, he says.

"A OneSource user has the numbers along with text-related analysis, often from an entirely different source, to help him understand what the figures mean," Schimmel says.

Information reaches the company through a series of licensing agreements with "brand-name providers," according to vice president Jimmy Becker. Data sources include Moody’s Investors Service; Dow Jones & Co., Inc.; Newsline; and Prompt, a database of news abstracts on new products, acquisitions, users, technology and business ventures.

OneSource uses a filtering system it calls Master Entity Vocabulary (MEV) that has at its core an Oracle database of 150,000-plus company names. When information comes into the system, it is first converted to a common format using a program called a "loader," then matched against the database of corporate entities to determine relevancy. At this point, the company’s three database drivers come into play. These consist of a specialized numeric engine capable of pulling figures from articles and reports; a full-text engine that indexes text; and a real-time news processor that takes constant news feeds from providers and matches them with appropriate corporate entities.
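
The load-and-match step described above can be pictured with a minimal sketch. This is not OneSource's implementation: the record fields, the alias table and the function names are hypothetical, and a small in-memory dictionary stands in for the Oracle database of company names.

```python
import re

# Hypothetical stand-in for the master vocabulary: normalized alias -> entity.
MASTER_VOCABULARY = {
    "lotus development corp": "Lotus Development Corp.",
    "lotus development": "Lotus Development Corp.",
    "dow jones & co": "Dow Jones & Co., Inc.",
    "dow jones": "Dow Jones & Co., Inc.",
}


def normalize(name):
    """Lowercase, drop punctuation other than '&', and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9& ]", " ", name.lower())
    return " ".join(cleaned.split())


def load(raw_record, name_field, text_field):
    """'Loader' step: map a provider-specific record to one common format."""
    return {"company": raw_record[name_field], "text": raw_record[text_field]}


def match_entities(records):
    """Split records into matched ones and orphans left for human editors."""
    matched, orphans = [], []
    for record in records:
        entity = MASTER_VOCABULARY.get(normalize(record["company"]))
        if entity is None:
            orphans.append(record)
        else:
            matched.append({**record, "entity": entity})
    return matched, orphans


feed = [
    load({"co": "Lotus Development Corp.", "body": "Quarterly results."}, "co", "body"),
    load({"co": "Acme Widgets", "body": "New product line."}, "co", "body"),
]
matched, orphans = match_entities(feed)
```

In this sketch the unrecognized name ends up in `orphans`, which corresponds to the records a human editor would match by hand.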

Human intervention also plays a part in the MEV process, with five full-time editors and a number of consultants bolstering the technical component of the system. Editors check incoming data for new companies or changes in the way data providers identify companies, then match them against the MEV to be sure the system recognizes the revised data. In the case of new information sources, human editors wait to see how many entities their computerized colleague will recognize, manually matching any orphans left in the data.

The biggest news at OneSource is its addition of a Web-based delivery system. On Oct. 15 the company announced deployment of OneSource.com, a line of products employing Internet tools to function on corporate intranets. Initially, the company is offering two Web-ready commodities: Account Manager for sales professionals and Business Browser directed toward corporate researchers. Two other products—Insurance Analyst for the data-hungry insurance industry, and UK Business Browser—are scheduled for launch by the end of the year.

The company continues to offer its products in other formats including CD-ROM (hearkening back to its roots as a composite of the two CD-ROM pioneers) and Lotus Notes. But Becker says the Web format offers significant advantages, including timeliness and portability. "CD-ROMs just don’t work very well when you're out on the road making sales calls," he says. But he adds that CD-ROM is still the format of choice for some users. "We’re absolutely committed to the CD-ROM platform. Some people need the in-depth, custom reporting CDs allow."

A SMART Use of Technology

Individual, Inc., founded in 1989, claims 280,000 readers worldwide. The company already has a significant presence on the Web in NewsPage, an online information service boasting more than 25,000 pages of news related to various topics and industries. The service gets more than four million hits a week and feeds information to more than 200,000 users. Two other Individual products, First! and First! Alert, send packages of breaking news on pre-selected topics or companies to a user’s system. First! subscribers get their information by 8 every morning, while First! Alert customers get bulletins throughout the day.

Individual employs a proprietary sorting system it calls SMART (System for Manipulation and Retrieval of Text) technology, developed by the late Dr. Gerard Salton of Cornell University. The system has three key components used to filter incoming data: a thesaurus, the core SMART engine and the Post Processor. A staff of 30-plus editorial managers, or "domain experts," oversees the entire process.

"We’ve hired experts in telecommunications, information technology, aerospace, health care, energy, finance, and automotive, just to name a few key industries," says Richard C. Vancil, Individual’s vice president of marketing. He explains that the domain experts manage customer profiles. An expert in health care, for example, makes sure that hospitals and physicians’ practices have the right recipe of news and information.

Individual gets daily information feeds via leased telephone lines, satellite dish reception and dial-up modem, then formats each story into the universal format required by the core SMART filtering engine. After a built-in Story Editor eliminates long-winded or error-filled articles, information goes on to the system’s thesaurus, which adds semantic equivalents of important words. It is also designed to recognize critical words that may be used infrequently in a story.

"If, for instance, a user is interested in local area networks, he’ll get information about LANs. Using traditional Boolean search methods, the word or phrase in the query would actually have to appear in the article," says Vancil.

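Vancil's example can be sketched in a few lines. The thesaurus entries and function names below are hypothetical, and for brevity the sketch expands the query phrase rather than the story text; it only illustrates why a literal Boolean test misses a story that a thesaurus-backed match picks up.

```python
import re

# Hypothetical thesaurus: phrase -> semantic equivalents.
THESAURUS = {
    "local area network": ["lan", "lans", "local area networks"],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def literal_match(query, story):
    """Boolean-style test: the query phrase itself must appear in the story."""
    return contains_phrase(tokenize(story), tokenize(query))


def expanded_match(query, story):
    """Match on the query phrase or any of its thesaurus equivalents."""
    phrases = [query] + THESAURUS.get(query.lower(), [])
    story_tokens = tokenize(story)
    return any(contains_phrase(story_tokens, tokenize(p)) for p in phrases)


story = "Vendors cut prices on LANs for small offices."
literal_hit = literal_match("local area network", story)
expanded_hit = expanded_match("local area network", story)
```

Here the literal test fails because the phrase never appears in the story, while the expanded test succeeds on "LANs".
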
The core SMART technology basically assigns values to text, creating algorithms based on the frequency, placement and relative importance of terms. The system uses a similar process to assign values to a user’s query, and selects stories that have values falling within a pre-determined distance from those in a user’s query.
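
A rough sketch of this kind of scoring appears below. It assumes a simple term-frequency vector with an extra weight for headline terms and uses cosine distance as the measure; the article does not say which weights or distance measure Individual actually uses, so the constant, the function names and the sample stories are illustrative only.

```python
import math
import re
from collections import Counter

HEADLINE_WEIGHT = 2.0  # assumed boost for terms that appear in the headline


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weigh(body, headline=""):
    """Assign a value to each term from its frequency and placement."""
    weights = Counter()
    for term in tokenize(body):
        weights[term] += 1.0
    for term in tokenize(headline):
        weights[term] += HEADLINE_WEIGHT
    return weights


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 means the two vectors share no terms."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_stories(query, stories, max_distance):
    """Return sorted (distance, headline) pairs within max_distance."""
    query_vec = weigh(query)
    hits = []
    for headline, body in stories:
        distance = cosine_distance(query_vec, weigh(body, headline))
        if distance <= max_distance:
            hits.append((distance, headline))
    return sorted(hits)


stories = [
    ("Network vendors cut prices", "Network hardware prices fell as vendors competed."),
    ("Airline profits rise", "Carriers reported higher profits this quarter."),
]
hits = select_stories("network prices", stories, max_distance=0.8)
```

Raising `max_distance` lets more stories through and lowering it lets fewer through.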

Whenever a story falls within the specified distance, it is picked out and passed on to the Post Processor to make sure it really fits the customer's area of interest. At that point, an editor can intervene to either widen or narrow the distance based on a customer profile, so that more or fewer stories are passed on for final filtering. This closing process employs pseudo-Boolean (the company calls it "fuzzy Boolean") technology to cull any mis-hits.
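
The article does not describe how the "fuzzy Boolean" step works. One textbook reading, offered here only as an assumption, treats AND as the minimum and OR as the maximum of per-term degrees between 0 and 1. The function names, the membership formula and the sample weights below are all hypothetical.

```python
def fuzzy_and(*degrees):
    """Fuzzy AND: the weakest term decides (1.0 if no terms are given)."""
    return min(degrees, default=1.0)


def fuzzy_or(*degrees):
    """Fuzzy OR: the strongest term decides (0.0 if no terms are given)."""
    return max(degrees, default=0.0)


def membership(term, term_weights):
    """Degree in [0, 1] to which a story is 'about' a term (assumed formula)."""
    top = max(term_weights.values(), default=0.0)
    return term_weights.get(term, 0.0) / top if top else 0.0


def post_process(candidates, profile_terms, min_degree):
    """Keep candidates whose fuzzy AND over the profile terms is high enough."""
    kept = []
    for headline, term_weights in candidates:
        degree = fuzzy_and(*(membership(t, term_weights) for t in profile_terms))
        if degree >= min_degree:
            kept.append(headline)
    return kept


candidates = [
    ("LAN prices fall", {"lan": 3.0, "prices": 2.0}),
    ("LAN party guide", {"lan": 3.0, "party": 3.0}),
]
kept = post_process(candidates, ["lan", "prices"], min_degree=0.5)
```

Unlike a strict Boolean AND, this test keeps a story in which one profile term is present but weaker than the others, and it still culls a story that lacks a profile term entirely.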

Finally, the information is delivered in a variety of formats including fax, e-mail, intranet or enterprise-wide feed for groupware platforms such as Lotus Notes. An acquisition in June 1996 added FreeLoader, Inc. to Individual's mix. FreeLoader offers an off-line browsing application that enables users to retrieve and store Web pages on their hard disk for later viewing (see related story).

Growing Market

The business world's continuing demand for information is certain to promote growth in companies like OneSource and Individual. According to the research company SIMBA Information, Inc., business information earnings reached $25.4 billion in 1994. And according to IDC/Link Resources, news filtering services represent a growing segment of that market, with earnings expected to hit $85 million in 1996 and $185 million by 1999.

Sue Mellen writes from Tyngsboro, Mass.



©Copyright 1997 by Digital Consulting, Inc. (508) 470-3880
All event names are trademarks of DCI or its clients.