This research provides a systematic approach for integrating Web systems through linking interrelated elements and functions. The infrastructure generates the vast majority of link anchors and links automatically through the use of structural relationship rules, in addition to lexical analysis.
digital library, service integration, automatic link generation, collaborative filtering, lexical analysis
This research provides a general method for integrating Web systems through linking the interrelated elements and functions. While our approach is a general one, we shall illustrate it using digital libraries as our sample domain.
The purpose of the Digital Library Service Integration project (DLSI) is to automatically generate links from digital library collections to related collections and services. Collections are libraries of computerized documents. Services include searching, annotation, and peer review. Figure 1 presents an example of what users would see.
DLSI supplements collections by linking them automatically to relevant services and related collections. DLSI supplements services by automatically giving relevant objects in collections (and other services) direct access to these services. Users see a fully integrated environment and use their system just as before. However, they will see additional link anchors, and when they click on one, DLSI will present a list of supplemental links. DLSI will filter and rank order this set of generated links according to user preferences and tasks.
The DLSI infrastructure provides a systematic approach for integrating digital library systems, and by extension, any other information system with a Web interface. Systems generally require no changes to integrate with DLSI.
Figure 1: Mockup of a document with DLSI support. DLSI automatically adds link anchors, including an icon in the top right-hand corner for the document as a whole. Choosing one prompts DLSI to generate a list of links. The figure superimposes two possible sets of links for different elements: the concept "Plant Pathology" and the document as a whole. Each link shows a descriptive label, and the system to which it leads.
Figure 2: DLSI Architecture. DLSI is within the shaded area. The dashed paths indicate that once integrated, collections and services can share features through DLSI links automatically. Integrated systems also continue to operate independently of DLSI.
Figure 2 presents the DLSI integration infrastructure. To integrate a system, an analyst must write a wrapper, define relationship rules, and initiate communications between the system and its wrapper. (The DLSI Integration Manager module manages the relationship rules.)
(1) Develop a Wrapper: The wrapper's main task is to parse the display screens that appear on the user's Web browser to identify the "elements of interest" that DLSI will make into link anchors. First, wrappers will parse the display based on an understanding of the structure of its content. Second, DLSI will parse the display content using lexical analysis to identify additional elements of interest. If a service can operate on an element, DLSI will generate a link anchor over the element. Among the links generated for that anchor will be a link leading directly to that service's feature.
(2) Develop Relationship Rules: Relationship rules specify the "structural relationships" for automatically generating links for recognized object types within the system being integrated.
(3) Initiate Communications: Several possible ways exist to ensure information passes between the system being integrated and the wrapper.
Most other kinds of information systems could be integrated in the same manner as digital library collections and services.
We emphasize that DLSI generates the vast majority of link anchors and links automatically. If a system can operate on an element, DLSI will generate a link leading directly to that system's feature. For example, if a discussion thread exists about a course, then whenever that course's identifier appears in a screen or document, DLSI automatically detects it and adds an anchor over the course identifier.
DLSI typically generates link anchors in two ways. First, "wrappers" parse screens and documents based on an understanding of the structure of the system's displays (i.e., using form templates, XML markup or parsing rules). Most anchors are identified in this manner.
Second, DLSI parses the screen and document content using lexical analysis to identify additional anchors. DLSI generates links automatically based on relationship rules.
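The two ways of identifying anchors described above can be illustrated with a minimal Python sketch. It is not DLSI's wrapper code: the course identifier pattern, the thesaurus contents, and the names Element, structural_pass, lexical_pass and find_elements are all assumptions introduced for illustration, and a regular expression stands in for the form templates, XML markup or parsing rules a real wrapper would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    kind: str   # element class, e.g. "course" or "concept"
    text: str   # matched text as it appears in the display
    start: int  # offsets of the match in the display text
    end: int


# Assumption: course identifiers look like "BIO 210".
COURSE_ID = re.compile(r"\b[A-Z]{2,4} \d{3}\b")

# Assumption: a tiny stand-in for the master thesaurus file.
THESAURUS = {"plant pathology", "soil science"}


def structural_pass(display):
    """First pass: elements found from knowledge of the display's structure."""
    return [Element("course", m.group(), m.start(), m.end())
            for m in COURSE_ID.finditer(display)]


def lexical_pass(display):
    """Second pass: additional elements found by matching thesaurus phrases."""
    found = []
    for phrase in sorted(THESAURUS):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, display, re.IGNORECASE):
            found.append(Element("concept", m.group(), m.start(), m.end()))
    return found


def find_elements(display):
    """Combine both passes, ordered by position in the display."""
    return sorted(structural_pass(display) + lexical_pass(display),
                  key=lambda e: e.start)


display = "Discussion for BIO 210: an introduction to Plant Pathology."
elements = find_elements(display)
for e in elements:
    print(e.kind, repr(e.text))
```

Each element found this way is a candidate link anchor; whether it actually receives links depends on the relationship rules defined for its class.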
Relationship rules define which relationships (links) should be available for which kinds of elements. For example, in Figure 1, the relationship rule underlying the first concept link would include the following parameters:
Because they operate at the "class" or "kind of element" level, each relationship rule works for every element of that class. E.g., the rule above applies to any "concept" found in any document displayed.
Each relationship rule represents a single relationship for a single element class. As elements can have many relationships, each element class can have several relationship rules. Each element instance triggers the same set of relationship rules, assuming conditions are satisfied for each. In Figure 1, nine relationship rules were triggered for the "concept" element (or more rules were triggered, and the filtering mechanism reduced them to this customized list).
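The class-level behavior of relationship rules can be sketched as a small data structure. The sketch below is illustrative only and does not reproduce DLSI's actual rule parameters: the field names, the example rules, the example.org URLs and the links_for helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.parse import quote


@dataclass(frozen=True)
class RelationshipRule:
    element_class: str   # class of element the rule applies to
    label: str           # descriptive label shown for the link
    target_system: str   # system to which the link leads
    url_template: str    # hypothetical template for invoking the feature
    # Optional condition that must hold for the rule to trigger.
    condition: Callable[[str], bool] = lambda value: True


# Assumption: terms for which a glossary entry exists.
GLOSSARY_TERMS = {"plant pathology"}

# Hypothetical rules: several for "concept", one for "course".
RULES = [
    RelationshipRule("concept", "Search for this concept", "Search Service",
                     "https://search.example.org/?q={value}"),
    RelationshipRule("concept", "View discussion threads", "Discussion Service",
                     "https://discuss.example.org/threads?topic={value}"),
    RelationshipRule("concept", "Look up definition", "Glossary Service",
                     "https://glossary.example.org/term/{value}",
                     condition=lambda value: value.lower() in GLOSSARY_TERMS),
    RelationshipRule("course", "Open course discussion thread", "Discussion Service",
                     "https://discuss.example.org/course/{value}"),
]


def links_for(element_class, value):
    """Return (label, system, url) for every rule of this class whose condition holds."""
    return [(r.label, r.target_system, r.url_template.format(value=quote(value)))
            for r in RULES
            if r.element_class == element_class and r.condition(value)]


for label, system, url in links_for("concept", "Plant Pathology"):
    print(f"{label} [{system}] -> {url}")
```

Because the rules are keyed on the element class rather than on individual elements, adding one rule makes the corresponding link available for every element of that class, wherever it is displayed.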
The DLSI Integration Manager uses the relationship rules to determine which elements in a display will have links. The Integration Manager then creates an integrated HTML or XML document consisting of the original display output together with DLSI's anchors, which it will send to the user's browser. When the user selects an anchor, DLSI will use the relationship rules to generate a list of relevant links. When the user selects one, the Integration Manager passes the appropriate information to the appropriate collection or service for that link.
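The step of combining the original display output with DLSI's anchors can be sketched as follows. This is a simplified illustration under stated assumptions: the display is treated as plain text rather than existing HTML or XML, elements arrive as (class, start, end) tuples, and the add_anchors function, the dlsi-anchor class name and the dlsi.example.org URL are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def add_anchors(display, elements, base_url="https://dlsi.example.org/links"):
    """Wrap each element of interest in an anchor that requests its link list.

    elements: iterable of (element_class, start, end) offsets into display.
    Overlapping elements are skipped so the resulting markup stays well formed.
    """
    parts = []
    cursor = 0
    for kind, start, end in sorted(elements, key=lambda e: e[1]):
        if start < cursor:
            continue  # overlaps the previous anchor
        text = display[start:end]
        href = base_url + "?" + urlencode({"class": kind, "value": text})
        parts.append(escape(display[cursor:start]))
        parts.append('<a class="dlsi-anchor" href="' + escape(href) + '">'
                     + escape(text) + "</a>")
        cursor = end
    parts.append(escape(display[cursor:]))
    return "".join(parts)


page = add_anchors("Notes on Plant Pathology & soils", [("concept", 9, 24)])
print(page)
```

In this sketch, selecting the anchor sends the element's class and value back to the server, which is the information needed to look up the matching relationship rules and build the list of links.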
The Integration Manager is built upon the Dynamic Hypermedia Engine project [1, 2, 3].
DLSI wrappers perform lexical analysis when they parse documents and display screens to determine additional "elements of interest," which the Integration Manager will supplement with DLSI link anchors. Our Noun Phrase Extractor works as follows. First, it tokenizes the document or display screen. It then uses the WordNet lexical database [http://www.cogsci.princeton.edu/~wn/] to assign part-of-speech tags to the tokens. Finally, it applies a morphological and syntactic rule base to parse sentences and extract noun phrases. The Noun Phrase Extractor extracts noun phrases from returned documents in their root forms, which accounts for morphological variation. These root-form noun phrases are then separated into two lists: those that are in the master thesaurus file and those that are not. Any phrase found in the master thesaurus will be made into a supplemental link anchor. Keywords and key phrases from participating collections and services will also be added to this integrated master file.
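The stages of this pipeline (tokenization, part-of-speech tagging, rule-based noun phrase extraction, reduction to root forms, and thesaurus lookup) can be illustrated with a toy Python sketch. It is not the Noun Phrase Extractor itself: a small hand-written dictionary stands in for WordNet, a single chunking rule (adjectives and nouns ending in a noun) stands in for the morphological and syntactic rule base, and all names and data are assumptions.

```python
import re

# Assumption: toy part-of-speech lexicon standing in for WordNet lookups.
POS = {
    "plant": "NOUN", "pathology": "NOUN", "disease": "NOUN", "diseases": "NOUN",
    "crop": "NOUN", "crops": "NOUN", "fungal": "ADJ", "studies": "VERB", "in": "PREP",
}

# Assumption: toy master thesaurus of root-form phrases.
THESAURUS = {"plant pathology", "fungal disease"}


def tokenize(text):
    """Split into lowercase words and individual punctuation marks."""
    return re.findall(r"[a-z]+|[^a-z\s]", text.lower())


def root_form(noun):
    """Crude singularization: drop a final 's' if the result is a known word."""
    if noun.endswith("s") and noun[:-1] in POS:
        return noun[:-1]
    return noun


def extract_noun_phrases(text):
    """Collect maximal runs of adjectives and nouns that end in a noun."""
    phrases, current = [], []
    for token in tokenize(text) + [""]:  # empty sentinel flushes the last run
        tag = POS.get(token)
        if tag in ("ADJ", "NOUN"):
            current.append((token, tag))
            continue
        while current and current[-1][1] != "NOUN":
            current.pop()  # a phrase must end in a noun
        if current:
            words = [word for word, _ in current]
            words[-1] = root_form(words[-1])
            phrases.append(" ".join(words))
        current = []
    return phrases


def split_by_thesaurus(phrases):
    """Separate phrases found in the thesaurus from those that are not."""
    known = [p for p in phrases if p in THESAURUS]
    unknown = [p for p in phrases if p not in THESAURUS]
    return known, unknown


phrases = extract_noun_phrases("Plant pathology studies fungal diseases in crops.")
known, unknown = split_by_thesaurus(phrases)
print(known)    # phrases that would become supplemental link anchors
print(unknown)  # phrases not in the thesaurus
```

Reducing the head noun to its root form before the lookup is what lets "fungal diseases" in the text match the thesaurus entry "fungal disease".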
The number of potential links that DLSI could generate for a particular element on a screen could vary from several to well over a hundred, resulting in the well-known hypermedia problem of cognitive overload. With a large number of links, filtering and ordering them is critical for effective use. Filtering and rank ordering in DLSI pose several challenges. First, they should be customized to each user's needs. Second, they should adapt dynamically as users advance through the system. Third, they must support multiple needs for the same user: a user may have several different tasks (needs), and the links should be reorganized depending on the user's current task.
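The three requirements above (per-user customization, dynamic reordering, and task dependence) can be made concrete with a minimal sketch. It is a deliberately simple frequency-based ranker and is not the collaborative filtering approach used in DLSI; the LinkRanker class, the user and task names, and the link labels are hypothetical.

```python
from collections import defaultdict


class LinkRanker:
    """Rank candidate links by how often a user chose them for a given task."""

    def __init__(self):
        # (user, task) -> link label -> number of times selected
        self.selections = defaultdict(lambda: defaultdict(int))

    def record_selection(self, user, task, label):
        """Update the history, so later rankings reorganize dynamically."""
        self.selections[(user, task)][label] += 1

    def rank(self, user, task, labels, limit=5):
        """Order labels by past selections for this user and task, keep the top few.

        sorted() is stable, so links with equal counts keep their original order.
        """
        counts = self.selections.get((user, task), {})
        ordered = sorted(labels, key=lambda label: -counts.get(label, 0))
        return ordered[:limit]


ranker = LinkRanker()
ranker.record_selection("ana", "grading", "Peer reviews")
ranker.record_selection("ana", "grading", "Peer reviews")
ranker.record_selection("ana", "grading", "Annotations")
ranker.record_selection("ana", "browsing", "Related collections")

labels = ["Related collections", "Annotations", "Peer reviews", "Search"]
grading_view = ranker.rank("ana", "grading", labels, limit=3)
browsing_view = ranker.rank("ana", "browsing", labels, limit=3)
print(grading_view)
print(browsing_view)
```

The same user sees a different ordering under each task, and every recorded selection can change the next ranking, which is the behavior the requirements call for regardless of the scoring method used.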
DLSI incorporates collaborative filtering to filter information based on people's evaluations or behaviors. It generates recommendations using the following algorithm [4, 5, 6]:
This research's primary contribution is providing a relatively straightforward, sustainable infrastructure for integrating information systems. Other contributions include:
We gratefully acknowledge support by the NSF under grants EISA-9818309, EIA-0083758, IIS-0135531 and DUE-0226075. DLSI is part of the National Science Digital Library project (http://www.nsdl.org).