Intelligent data analysis
The Intelligent Data Analysis and Graphical Models Research Unit investigates and develops methods for intelligent data analysis and probabilistic and fuzzy reasoning. It focuses its research on probabilistic methods (for example, probabilistic graphical models and Bayesian and expectation maximization clustering), possibilistic and fuzzy methods (for example, possibilistic graphical models and fuzzy clustering), and frequent pattern mining methods (for example, frequent item set and sequence mining as well as frequent subgraph mining).
Strong emphasis is placed on open-source implementations of the developed methods, with the objective of integrating them under a graphical user interface. This user interface also provides preprocessing and visualization modules, so that the methods are readily available for industrial applications and can easily be extended with user-programmed functionality.
Mining Graph Databases and Molecular Data Sets
The computer-aided analysis of molecular databases plays an increasingly important role in drug discovery as well as compound synthesis prediction. One of its most prominent goals is to find discriminative substructures of molecules, which are frequent in the set of (already known) active molecules, but rare in the set of (already known) inactive molecules and thus discriminate between the two classes. The rationale underlying this approach is that the discriminative fragments may be the key structures that determine whether a molecule is active or not (or can be synthesized or not).
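The selection criterion described above (frequent among active molecules, rare among inactive ones) can be illustrated with a minimal Python sketch. It makes a strong simplifying assumption: each molecule is already given as a set of fragment labels, so the hard part of the real task, enumerating substructures by subgraph matching, is skipped entirely. The function names, fragment labels and threshold values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def relative_support(molecules):
    """Return, for each fragment, the fraction of molecules containing it."""
    if not molecules:
        return {}
    counts = Counter()
    for fragments in molecules:
        counts.update(set(fragments))  # count a fragment once per molecule
    n = len(molecules)
    return {frag: c / n for frag, c in counts.items()}


def discriminative_fragments(active, inactive, min_active=0.5, max_inactive=0.2):
    """Keep fragments that are frequent in `active` and rare in `inactive`.

    The two thresholds are illustrative assumptions, not values from the text.
    """
    sup_active = relative_support(active)
    sup_inactive = relative_support(inactive)
    result = []
    for frag, s_act in sup_active.items():
        s_inact = sup_inactive.get(frag, 0.0)
        if s_act >= min_active and s_inact <= max_inactive:
            result.append((frag, s_act, s_inact))
    return sorted(result)


# Toy data: every molecule is a set of (hypothetical) fragment labels.
active = [{"C-N", "C=O", "ring6"}, {"C-N", "ring6"}, {"C-N", "C-Cl"}, {"C=O", "C-N"}]
inactive = [{"C=O", "ring6"}, {"C=O"}, {"ring6", "C-Cl"}, {"C=O", "C-N"}, {"C-Cl"}]

for frag, s_act, s_inact in discriminative_fragments(active, inactive):
    print(frag, s_act, s_inact)
```

In this toy data only the fragment "C-N" passes both thresholds: it occurs in all four active molecules but in only one of the five inactive ones.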
The research unit works on methods to find frequent (approximate) substructures in molecules in order to help biochemists identify promising drug candidates and effective substructures. Earlier work in this direction led to the MoSS/MoFa algorithm and its various extensions, which have been implemented in Java and applied successfully to several freely available molecular data sets.
Future research is planned to extend the handling of wildcard atoms and thus to allow for more flexible approximate matching. In addition, properties of molecules other than just the connection structure (for example, the 3D structure and binding angles, charge distribution, solubility, etc.) have to be taken into account in order to make the output more useful for (bio)chemists.
Mining (Approximate) Frequent Item Sets and Sequences
Frequent item set mining is an active area of research in which a large number of algorithms have been developed. The research unit focuses on finding approximate frequent item sets and sequences in noisy and unreliable data. Algorithms for this task have applications in analyzing alarm sequences in telecommunication networks.
The core idea of finding approximate frequent item sets is to allow for certain editing operations on the transactions of the database to be mined (for example, insertion, replacement, reordering, etc.). In this way frequent patterns can be found that would otherwise be lost due to noise and lossy transmission of the transaction data. Earlier work along these lines already yielded the relx algorithm, which allows for insertions at user-specified costs.
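To make the idea of insertions at user-specified costs more concrete, the following Python sketch computes an approximate support value. It is not the relx algorithm, whose details are not given here, but one plausible scheme under stated assumptions: every item that has to be inserted into a transaction multiplies that transaction's weight by a penalty factor between 0 and 1, and a transaction only contributes while its weight stays at or above a minimum. The function name, the penalty table and the `min_weight` parameter are hypothetical.

```python
def approximate_support(item_set, transactions, insertion_penalty, min_weight=0.1):
    """Sum of transaction weights after inserting the missing items.

    A transaction that contains the whole item set counts with weight 1.
    Each missing item is "inserted" at a cost: the weight is multiplied by
    that item's penalty factor (0 if no penalty is specified, which forbids
    the insertion). Transactions whose weight drops below `min_weight` are
    ignored.
    """
    total = 0.0
    for transaction in transactions:
        weight = 1.0
        for item in item_set:
            if item not in transaction:
                weight *= insertion_penalty.get(item, 0.0)
        if weight >= min_weight:
            total += weight
    return total


transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"c"}]
penalties = {"a": 0.5, "b": 0.5, "c": 0.5}  # illustrative, user-specified

# Exact support: only the first transaction contains all three items.
print(approximate_support({"a", "b", "c"}, transactions, {}))
# Approximate support: transactions missing one item count with weight 0.5.
print(approximate_support({"a", "b", "c"}, transactions, penalties, min_weight=0.5))
```

With an empty penalty table the function reduces to ordinary support counting, which makes it easy to compare exact and approximate results on the same data.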
Prototype-based Classification and Clustering
Fuzzy clustering and expectation maximization often show superior performance compared to classical crisp clustering algorithms. Especially the more sophisticated variants, which allow for shape and size parameters, can find cluster structures that are difficult to capture with classical methods, which are restricted to spherical and equally sized clusters. However, this comes at the price of higher execution times and lower robustness.
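As a point of reference for the discussion above, here is a minimal sketch of standard fuzzy c-means in plain Python. It is the basic variant with Euclidean distance, so it is itself limited to spherical clusters of similar size; the more sophisticated variants with shape and size parameters are not shown. The function names, the fuzzifier value `m = 2` and the toy data are assumptions made for illustration.

```python
import math


def memberships_for(points, centers, m=2.0, eps=1e-9):
    """Membership degrees: u_ij proportional to d_ij^(-2/(m-1)), rows sum to 1."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clip distances away from zero to avoid division by zero.
        inv = [max(math.dist(p, c), eps) ** -exponent for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def update_centers(points, memberships, m=2.0):
    """Each center is the mean of all points, weighted with u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        wsum = sum(weights)
        centers.append(tuple(
            sum(w * p[k] for w, p in zip(weights, points)) / wsum
            for k in range(dim)
        ))
    return centers


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of steps."""
    for _ in range(iterations):
        centers = update_centers(points, memberships_for(points, centers, m), m)
    return centers, memberships_for(points, centers, m)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
print([round(row[0], 3) for row in u])  # membership of each point in cluster 0
```

Unlike crisp clustering, every point receives a membership degree for every cluster; a hard assignment can be obtained afterwards by picking the cluster with the largest degree.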
The research unit works on accelerating the clustering process, on improving the robustness of the more sophisticated algorithms (while still allowing cluster shapes and sizes, but constraining them with different regularization approaches), and on methods to determine the number of clusters (especially resampling-based methods, as they are currently the most promising approach).
Graphical Models for Planning
Graphical models provide excellent means to structure and represent the knowledge necessary for planning purposes, for example, for planning the demand for production parts when there are technical interactions between the parts. In this area the research unit is working with ISC Gebhardt, which is responsible for developing and implementing the demand planning system for Volkswagen.
The research unit works on learning (parts of) graphical models from historical data (subject to restrictions imposed by technical rules and marketing) as well as on the revision of knowledge and the identification and elimination of inconsistencies in graphical models.
Graphical Models for Diagnosis
Graphical models have a long tradition of being used for diagnosis purposes, because they are one of the best-founded and most consistent approaches to handling uncertainty about system states and their dependences. However, methods and tools for the (semi-)automatic construction of a diagnosis system based on a graphical model from a technical description of the system are still missing.
The research unit focuses on applying graphical models to identify so-called "soft faults" (that is, deviations from nominal values) in analog electrical circuits and other technical devices. Earlier work in this direction produced some initial ideas for the diagnosis of analog electrical circuits, but several challenging problems remain to be solved.
InfoMiner - An Interactive Tool for Data Analysis
In order to make the developed methods easily usable for non-experts, the research unit strives to implement all of its data analysis methods under a graphical user interface that is based on data streams (pipes-and-filters architecture). A prototype implementation already exists and will be improved and extended by the members of the research team.
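The pipes-and-filters idea can be sketched in a few lines of Python using generators. This is a generic illustration of the architecture, not a description of how InfoMiner is implemented; the filter names `drop_incomplete` and `rescale`, the `pipeline` helper and the record layout are hypothetical.

```python
from functools import partial


def drop_incomplete(stream):
    """Filter: pass on only records without missing (None) values."""
    for record in stream:
        if None not in record.values():
            yield record


def rescale(stream, field, factor):
    """Filter: multiply one numeric field of every record by a factor."""
    for record in stream:
        yield {**record, field: record[field] * factor}


def pipeline(stream, *filters):
    """Connect filters with pipes: each filter consumes the previous output."""
    for apply_filter in filters:
        stream = apply_filter(stream)
    return stream


records = [
    {"x": 1.0, "y": 2.0},
    {"x": None, "y": 3.0},  # incomplete record, removed by the first filter
    {"x": 4.0, "y": 5.0},
]

result = pipeline(
    iter(records),
    drop_incomplete,
    partial(rescale, field="x", factor=10.0),
)
print(list(result))
```

Because every filter only reads from its input stream and writes to its output stream, filters can be recombined freely, which is the property that makes such an architecture easy to extend with user-programmed modules.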