Development of Algorithms and Tools to Enable Semi-automatic Discovery and Extraction of Enterprise Knowledge in Legacy Information Sources

The goal of the computer science effort is to develop algorithms and tools that semi-automatically discover and extract enterprise knowledge from legacy sources in order to support the efficient generation of source wrappers. Wrappers are essential for exchanging data between heterogeneous systems, which may employ different data models, representations, and query languages to manage their data. Existing wrapper development tools require significant programmatic set-up and offer limited code reusability. Given the diversity and number of information sources available today, the time and investment needed to establish connections to legacy sources have severely limited the scalability and maintainability of wrappers. These requirements have generally acted as a significant barrier to information integration in enterprise development (e.g., supply chain automation). Efforts are under way to develop languages and tools for describing resources in a way that can be processed by computers (e.g., Semantic Web). However, they do not address the problem of how to collect this enterprise knowledge, or how to maintain it efficiently for the continuously increasing number of legacy sources.

Our approach to wrapper development significantly extends current techniques by reducing the dependency on human input. Specifically, our three-pronged approach produces a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, and constraints. First, schema information is extracted using a data reverse engineering algorithm. The schema is then semantically enhanced based on clues extracted from application code using code mining and pattern matching techniques.
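
The first two steps can be illustrated with a small sketch. It is not the SEEK algorithm, whose details are not given here. It assumes, purely for illustration, that foreign keys can be guessed from inclusion dependencies between columns and that application code prints a descriptive label next to each column variable. The table names, column names, and code fragment are all hypothetical.

```python
import re

# Hypothetical legacy tables, represented as column name -> list of values.
TABLES = {
    "proj": {"p_id": [1, 2, 3], "p_nm": ["Bridge", "Tunnel", "Depot"]},
    "task": {"t_id": [10, 11, 12, 13], "t_proj": [1, 1, 2, 3]},
}

# Hypothetical application code that touches the legacy source.
APP_CODE = r'''
printf("Project name: %s\n", p_nm);
printf("Owning project: %d\n", t_proj);
'''

# Assumed coding convention: printf("<label>: <format>", <variable>)
LABEL_PATTERN = re.compile(r'printf\("([^:"]+):[^"]*",\s*(\w+)\)')


def candidate_foreign_keys(tables):
    """Guess foreign keys: every value of the child column must appear in a
    parent column (in another table) whose values are all distinct."""
    found = []
    for child_table, child_cols in tables.items():
        for child_col, child_vals in child_cols.items():
            for parent_table, parent_cols in tables.items():
                if parent_table == child_table:
                    continue
                for parent_col, parent_vals in parent_cols.items():
                    is_key = len(set(parent_vals)) == len(parent_vals)
                    if is_key and set(child_vals) <= set(parent_vals):
                        found.append(
                            (child_table, child_col, parent_table, parent_col)
                        )
    return found


def mine_labels(code):
    """Map each variable to the label printed next to it in the code."""
    return {var: label.strip() for label, var in LABEL_PATTERN.findall(code)}


if __name__ == "__main__":
    print(candidate_foreign_keys(TABLES))
    print(mine_labels(APP_CODE))
```

On this toy input the sketch proposes `task.t_proj` as a reference to `proj.p_id` and attaches the labels "Project name" and "Owning project" to the cryptic column names. Such candidates would still need validation by a domain expert, since a coincidental value overlap can produce a false match.
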
Additional business rules, which may be encoded in the data stored in the legacy source, can be discovered with the help of our situational knowledge miner. We collectively refer to the extracted information as enterprise knowledge, which forms a knowledge base that serves as input for subsequent schema matching and wrapper development processes. These components are currently under development.

The extracted business rules, the available source data, and information about source capabilities are used to configure a value-added decision support and analysis component inside the wrapper. This component performs knowledge composition, including basic mediation tasks and post-processing of the extracted data. Although knowledge extraction proceeds automatically, extraction results are monitored and validated by domain experts, particularly when extracting knowledge from the poorly formed database specifications often found in older legacy systems. Our extraction algorithms support step-wise refinement of the extracted knowledge and the wrapper configuration to improve the quality of the generated wrapper.

We have built an initial SEEK prototype to demonstrate the feasibility of our approach. Extensions to the core SEEK technologies include research in evolutionary algorithms to enable nearly fully automatic connection to potentially thousands of sources in near real-time. Further extensions include bootstrapping of the wrapper configuration process to enhance the scope and quality of data extraction. Bootstrapping is directed by end-users, who aid discovery of source data through improved methods of knowledge representation.
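
As an illustration of how a business rule might be recovered from stored data, the role assigned above to the situational knowledge miner, the following sketch looks for ordering rules that hold in every row. The miner's actual method is not described here, so both the technique and the sample rows (with their column names) are assumptions made for this example.

```python
from itertools import permutations

# Hypothetical rows from a legacy scheduling table.
ROWS = [
    {"start_day": 1, "end_day": 5, "budget": 900},
    {"start_day": 3, "end_day": 3, "budget": 40},
    {"start_day": 7, "end_day": 12, "budget": 2},
]


def ordering_rules(rows):
    """Return column pairs (a, b) such that a <= b holds in every row."""
    columns = sorted(rows[0])
    return [
        (a, b)
        for a, b in permutations(columns, 2)
        if all(row[a] <= row[b] for row in rows)
    ]


if __name__ == "__main__":
    for a, b in ordering_rules(ROWS):
        print(f"candidate rule: {a} <= {b}")
```

A rule found this way is only a hypothesis supported by the current data. It fits the step-wise refinement described above: a domain expert confirms or rejects each candidate before it is used to configure the wrapper.
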
Copyright © 2002