A set of services is defined to facilitate data access and semantic negotiation between data sources based on independent vocabularies. The services separate three functions from one another: the mapping between vocabularies, the retrieval of values from a data source, and the actual transformation between sets of data elements from the vocabularies. The services are designed to leverage existing efforts and products generated as part of a typical integration, with the goal of maximizing the visibility of semantic relationships and the flexibility to substitute between them while minimizing the additional work needed to implement such a system.
data access, semantic mapping, service architecture
An application developer creates the algorithms and data processing necessary to accomplish a task; the next-level developer or user must then satisfy the implied or documented interfaces through which the capabilities of the application can be used. This paper concentrates on the data access process, although the concepts can be extended to data write and to direct communications between components. We refer to the data user as the target, and to the resource that supplies information which, when processed, will satisfy the target as the source. To access information in support of an application, a data user must define
o the data needs in the target vocabulary,
o a source which can provide values to meet the data needs,
o the specific source data elements (in the source vocabulary) whose values are needed,
o the means to retrieve these data element values from the source,
o the processing which the source values must undergo to generate the required target values.
While in theory this process is made easier by a common vocabulary, in practice if the source and target were developed independently, the negotiation process is initially done manually to ensure that the semantic details are consistent at both ends. The following describes an architecture which is designed to enable integration of components and data sources which are described in terms of multiple, overlapping vocabularies and to demonstrate a modular use within a service-based paradigm. The goals of the architecture are (1) to enable transparent access to content without requiring user processing of semantic interchange or the specifics of content access, (2) to accomplish the access using mechanisms which are externally visible to inference engines, and (3) to implement the access in a way which provides value to the user with a minimum amount of additional effort.
The following assumes that the integration steps outlined above are routinely performed. The goals of an improved process are to minimize the effort needed to make data connections while maximizing the amount of interface detail that can be consistently reused and captured for use by external inference engines. The components and their corresponding data flows are shown in Figure 1 and described in the following subsections.
Figure 1. Process flow for simple data access
As a precursor to component descriptions, several terms are defined to establish concepts which are used to describe the workings of the basic system shown in Figure 1.
Data user: an entity (human, machine, software) which has a need for data and a context in which that data will be used. Data users may share common vocabularies or portions of vocabularies or may have independent vocabularies whose data elements are related through semantic associations.
Semantic Association: a relationship between named data elements, especially data elements from different vocabularies. In its simplest form (and the one illustrated in Figure 1), a semantic association can define arithmetic and/or logical operations to be performed on the values corresponding to one set of names in order to generate values corresponding to a second set of names. The processing which makes up a semantic association is not constrained to being one-to-one equivalences between data elements, and may include algebraic, conditional, and/or probabilistic relationships. The association is the single place within the architecture where domain semantic relationships are explicitly defined. Note, in the following, the use of "association" refers to semantic association unless otherwise specified.
Metadata: the set of descriptive properties which (1) uniquely characterize an object and allow a user (human or machine) to discriminate between one object and another, and (2) describe how the object and its contents can be accessed in either a read or write mode. Metadata includes what the object is, where it is located, and how to make use of it. It may include the calling arguments to methods which act on the content of an instance of the object, including accessing it from its native storage format.
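To make the Semantic Association definition above concrete, the following is a minimal Python sketch and not the prototype's implementation. The class name `Association`, the element names (`length_cm`, `width_cm`, `area_m2`, `size_class`), and the size threshold are all hypothetical; they were chosen only to illustrate a many-to-one arithmetic relationship combined with a conditional one.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Values = Dict[str, Any]


@dataclass(frozen=True)
class Association:
    """Hypothetical container for one semantic association."""
    source_elements: Tuple[str, ...]   # names in the source vocabulary
    target_elements: Tuple[str, ...]   # names in the target vocabulary
    transform: Callable[[Values], Values]

    def apply(self, source_values: Values) -> Values:
        # Refuse to run if any declared source element has no value.
        missing = [n for n in self.source_elements if n not in source_values]
        if missing:
            raise KeyError(f"missing source values: {missing}")
        result = self.transform(source_values)
        # Return only the declared target elements.
        return {name: result[name] for name in self.target_elements}


def _dimensions_to_area(values: Values) -> Values:
    # Arithmetic, many-to-one: two source values yield one target value.
    area = values["length_cm"] * values["width_cm"] / 10_000.0
    # Conditional: a second target value derived from the first
    # (the 1.0 m^2 threshold is an illustrative assumption).
    size_class = "large" if area >= 1.0 else "small"
    return {"area_m2": area, "size_class": size_class}


area_association = Association(
    source_elements=("length_cm", "width_cm"),
    target_elements=("area_m2", "size_class"),
    transform=_dimensions_to_area,
)

print(area_association.apply({"length_cm": 200.0, "width_cm": 100.0}))
```

Keeping the domain logic inside `transform` mirrors the statement that the association is the single place where semantic relationships are explicitly defined.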
The Data Access Service is the central component in mediating and coordinating a data access request. The service (1) takes as input a list of names corresponding to target entities for which values are needed, (2) calls upon the Semantic Mapping Service to identify (one or more) associations to (one or more) data sources and the corresponding data source elements needed to generate these values, (3) retrieves data element values from the data source, (4) invokes the processing of the associations (including passing of data element values retrieved from the data source) to generate values for each name on the target request list, and (5) returns the target values to the entity which invoked the Data Access Service. The Data Access Service will invoke Data Access Methods to extract values from corresponding data sources for use as input to the associations.
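The five steps above can be sketched as a single function. This is an illustrative outline under stated assumptions, not the prototype's code: the function name `data_access_service`, the callable signatures, and the stub components (`mapping_stub`, `warehouse_access`, and the identifiers `lab_app` and `warehouse_db`) are hypothetical, and only the single-source, single-association case is handled.

```python
from typing import Any, Callable, Dict, List, Tuple

Values = Dict[str, Any]
Association = Callable[[Values], Values]
# Assumed shape: (target id, target names) -> (source id, source names, association)
MappingService = Callable[[str, List[str]], Tuple[str, List[str], Association]]
AccessMethod = Callable[[List[str]], Values]


def data_access_service(
    target_id: str,
    target_names: List[str],              # (1) names for which values are needed
    mapping_service: MappingService,
    access_methods: Dict[str, AccessMethod],
) -> Values:
    # (2) Ask the mapping service which source, source elements,
    #     and association can produce the requested target values.
    source_id, source_names, association = mapping_service(target_id, target_names)
    # (3) Retrieve the source values through that source's access method.
    source_values = access_methods[source_id](source_names)
    # (4) Invoke the association on the retrieved values.
    target_values = association(source_values)
    # (5) Return values for exactly the names that were requested.
    return {name: target_values[name] for name in target_names}


# --- Hypothetical stubs, for demonstration only ---
def mapping_stub(target_id: str, target_names: List[str]):
    def fahrenheit_to_celsius(values: Values) -> Values:
        return {"temp_c": (values["temp_f"] - 32.0) * 5.0 / 9.0}
    return "warehouse_db", ["temp_f"], fahrenheit_to_celsius


_STORE = {"temp_f": 212.0, "pressure_psi": 14.7}


def warehouse_access(names: List[str]) -> Values:
    return {name: _STORE[name] for name in names}


result = data_access_service(
    "lab_app", ["temp_c"], mapping_stub, {"warehouse_db": warehouse_access}
)
print(result)  # {'temp_c': 100.0}
```

Note that the service itself contains no domain knowledge: the vocabulary mapping, the retrieval, and the transformation are all passed in, which is the separation the architecture aims for.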
The Semantic Map is a repository of semantic associations between named data elements. It is assumed that values from an information source (identified by a source identifier and the appropriate data elements at that source) are processed by an association to generate values for some target set of data elements (identified by a target identifier and the appropriate data elements of the target). In general, this can be represented as
                           association
(source:source_elements) --------------> (target:target_elements)
For the simple data access case currently being described, the target and target_elements are the search criteria and the source, source_elements, and the association are the return values. In other uses of the Semantic Map information, other combinations of entities in this relation can be supplied as search criteria with the remaining entities comprising the return information. The data element names themselves are not required to convey any semantic content and the Semantic Map knows no details of any association, other than that one exists. Associations (or pointers to associations) are information inputs to the Semantic Map, not a part of its structure or implementation. Thus, associations and the associated data sources can be created or modified without requiring modification of the Semantic Map infrastructure itself.
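The relation described above can be sketched as a small in-memory repository. This is a hedged illustration, not the prototype's design: the names `MapEntry`, `SemanticMap`, `register`, and `find`, as well as the sample identifiers, are hypothetical. The association is stored only as an opaque reference, reflecting the statement that the Semantic Map knows nothing about an association except that it exists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class MapEntry:
    """One (source:source_elements) --association--> (target:target_elements) row."""
    source: str
    source_elements: Tuple[str, ...]
    association_ref: str               # opaque pointer; never interpreted here
    target: str
    target_elements: Tuple[str, ...]


class SemanticMap:
    def __init__(self) -> None:
        self._entries: List[MapEntry] = []

    def register(self, entry: MapEntry) -> None:
        # Adding an association changes the data, not the infrastructure.
        self._entries.append(entry)

    def find(
        self,
        *,
        source: Optional[str] = None,
        target: Optional[str] = None,
        target_elements: Optional[Iterable[str]] = None,
    ) -> List[MapEntry]:
        """Any supplied field is a search criterion; the rest are returned."""
        wanted = set(target_elements) if target_elements is not None else None
        hits = []
        for entry in self._entries:
            if source is not None and entry.source != source:
                continue
            if target is not None and entry.target != target:
                continue
            if wanted is not None and not wanted <= set(entry.target_elements):
                continue
            hits.append(entry)
        return hits


semantic_map = SemanticMap()
semantic_map.register(MapEntry("warehouse_db", ("temp_f",), "assoc:f_to_c",
                               "lab_app", ("temp_c",)))
semantic_map.register(MapEntry("warehouse_db", ("pressure_psi",), "assoc:psi_to_kpa",
                               "plant_app", ("pressure_kpa",)))

# Simple data access case: target and target elements are the search criteria.
for hit in semantic_map.find(target="lab_app", target_elements=["temp_c"]):
    print(hit.source, hit.source_elements, hit.association_ref)
```

Calling `find(source="warehouse_db")` instead illustrates the other direction mentioned above, where a different combination of entities serves as the search criteria.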
The use of the Semantic Map allows each data user and each data source to maintain its own vocabulary as best suited to its own objectives. By identifying "who I am" (see Figure 1), the data user specifies which vocabulary (or, for XML purposes, which namespace) is of interest, and by specifying "what I need", the data user identifies specific data elements within that vocabulary (namespace). At this point, the Semantic Map will not contain information about a data source's structure, only the names of the data elements which map to other data sources and the corresponding names of data elements at those other sources. (See below for discussion of the Data Access Method as it relates to knowledge of the data source structure.) For the Semantic Map, it makes no difference if the mapping is to a single object model or a dozen. The association maintains considerable flexibility in relating different vocabularies and encapsulates the changes needed to capture modifications to vocabularies or data field relationships.
A Data Access Method is an executable that corresponds to a data source and extracts data values from it. A Data Access Method understands the means by which data values are stored and/or generated at the data source and can retrieve values for a given list of data elements known to the data source. It is assumed that the Data Access Method takes as its arguments a list of data element names and returns values corresponding to those names. The actual processing may be a simple SQL call to a database, or it may involve significant processing and authentication to satisfy access privileges. In any case, the access is accomplished without any knowledge of how the accessed values will eventually be used, unless this is required as part of user authorization.
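As an illustration of the "simple SQL call" case, the sketch below builds a Data Access Method over an in-memory SQLite database using Python's standard `sqlite3` module. It is an assumption-laden example rather than the prototype's code: the factory name `make_access_method`, the table `readings`, and its columns are hypothetical, and the method simply reads the first row because selecting a specific record is outside the simple case described here.

```python
import sqlite3
from typing import Any, Callable, Dict, List, Set


def make_access_method(
    conn: sqlite3.Connection, table: str, known_elements: Set[str]
) -> Callable[[List[str]], Dict[str, Any]]:
    """Return a callable: list of element names -> {name: value}."""

    def access(names: List[str]) -> Dict[str, Any]:
        # Only element names known to this data source are accepted. This also
        # keeps arbitrary text out of the SQL string built below.
        unknown = [n for n in names if n not in known_elements]
        if unknown:
            raise KeyError(f"unknown data elements: {unknown}")
        if not names:
            return {}
        columns = ", ".join(names)
        # Table name is fixed by the developer, not by the caller of access().
        row = conn.execute(f"SELECT {columns} FROM {table} LIMIT 1").fetchone()
        if row is None:
            raise LookupError("data source holds no values")
        return dict(zip(names, row))

    return access


# Hypothetical data source for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (temp_f REAL, pressure_psi REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", (212.0, 14.7))

warehouse_access = make_access_method(conn, "readings", {"temp_f", "pressure_psi"})
print(warehouse_access(["temp_f"]))
```

The returned callable knows how the source stores its values but nothing about how they will be used, which matches the separation of concerns described above.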
An initial prototype of the Data Access Service and Semantic Map has been built and it adequately demonstrates feasibility of the concepts. Design has proceeded for dealing with multiple mappings, multiple Data Access Methods, and multiple associations, and the concepts have also been extended to data write scenarios. A next generation prototype is being planned which incorporates independent variables for use in differentiating specific data element values (e.g. which database record) within the data source and use of IDs to indicate user preferences in data sources, semantic mappings and Data Access Methods. Work is also planned for investigating whether the architecture has any limitations when dealing with complex data types. Finally, preliminary design has looked at using the Semantic Map to facilitate component-to-component communications when the components have their own respective vocabularies.