Multi-objective Graph-based Data Mining. Design and Mining of Visual Science Maps
In spite of the fast and huge development experienced by the data mining and knowledge discovery field in the last few years, current tools and techniques to examine the content of large databases are still hampered by their inability to support searches based on criteria that are meaningful to users of those repositories. These shortcomings are particularly evident in data banks storing representations of structural objects, such as biological networks, scientific data, satellite maps, organizations’ technical documentation, semantic web, CAD circuits, and many others.
Recently, conceptual clustering techniques have been successfully applied to structural data to uncover objects or concepts that relates objects (i.e., subgraphs that represent associations of features), thus creating the area of graph-based data mining and knowledge discovery. However, they still carry the problem that the formulation of the search problem in a graph-based structure would result in the generation of many substructures with small extent as it is easier to explain or model match smaller data subsets than those that constitute a significant portion of the dataset. For this reason, any successful methodology should also consider additional, conflicting criteria to extract better defined concepts based on the size of the substructure being explained, the number of retrieved substructures, and their diversity. To deal with this problem, we have proposed a methodology called Evolutionary Multi-Objective Conceptual Clustering (EMO-CC) that uses multi-objective and multimodal EAs (specifically, genetic programming) to evaluate concepts or substructures based on the former conflicting criteria, and thus, to retrieve a Pareto set of non-dominated, meaningful substructures from structural databases in a single run. This approach allows us to uncover cohesive clusters comprising even an small number of observations (and not only the most frequent ones, as usual) that describe the underlying phenomena from different angles, revealing novel information that otherwise would be concealed by uninformative frequent descriptions.
EMO-CC was proposed for Bioinformatics problems but it is easily applicable to different domains by customizing the optimization objectives used to evaluate interesting substructures. Our current main application field is the mining of visual science maps (scientograms), a very novel, useful tool for the analysis of scientific information, which is built from co-citation information using classical methods from the field of bibliometrics such as citation analysis, and social networks analysis and information visualization techniques. We are currently performing scientogram mining in order to analyze and compare the structure of scientific fields and research fronts in maps of the same (taken at different periods of time) or different domains (looking for similarities between different countries scientific productions) using EMO-CC and other graph-based data mining techniques.
This work is done in cooperation with the Scimago research group (http://www.scimago.es/) headed by Prof. Felix de Moya at the University of Granada, which have developed two very ambitious projects called The Atlas of Science (http://www.atlasofscience.net/), to create a web-based information system achieving a graphic representation of all the IberoAmerican Science Research; and the SCImago Journal & Country Rank (http://www.scimagojr.com/), which provides new scientific journal quality indicators to assess and analyze scientific domains. Apart from the scientogram mining approach, we have developed several methods for scientogram design in the latter two web information systems, such as new network pruning algorithms significantly reducing the Pathfinder run time in order to allow us to generate scientograms of very large scientific domains (even of the whole World production) in an on-line fashion, and new network visualization approaches achieving closer representations to human beings’ understanding.
On the other hand, we are proposing new graph-based data mining methods based on the use of multi-objective ant colony optimization algorithms, well suited to explore graph structures. Besides, we are starting to apply these methods to other domains and problems such as the customization of services to users through mobile devices using context information (within a Spain’s Ministry of Science and Innovation CENIT research project).
Fuzzy Classifier Derivation for High Dimensional Problems with Good Interpretability-Accuracy Trade-off
System identification involves the use of mathematical tools and algorithms to build dynamical models describing the behaviour of real-world systems from measured data. There are always two conflicting requirements in the modelling process: the model capability to express the behaviour of the real system in an understandable way (interpretability) and its capability to faithfully represent the real system (accuracy). To obtain high degrees of interpretability and accuracy is a contradictory purpose and, in practice, one of the two properties prevails over the other.
FSs have demonstrated their outstanding capability as system identification and control tools. FL has proven its ability to generate different kinds of fuzzy models/classifiers/controllers, with a different accuracy-comprehensibility trade-off, and to permit the incorporation of human expert knowledge; as well as to integrate numerical and symbolic processing into a common scheme.
We are world-wide recognized experts on the design of FSs by means of EAs, the so called genetic fuzzy systems. Among other real-world applications, we have used them to build fuzzy models for the estimation of maintenance costs of electricity distribution networks in Asturias (outperforming other approaches such as neural networks and classical and symbolic regression), in collaboration with the head of the “Metrología y Modelos” research group (http://www.uniovi.es/vicinves/Web_investigacion/unidades/gruposInv/DptoInformatica/Metrologia/main.htm) from the University of Oviedo, Dr. Luciano Sánchez. Moreover, they were also applied to derive fuzzy logic controllers for HVAC systems for large buildings, simultaneously optimizing several design criteria.
We are developing a new approach to design ensembles of fuzzy classifiers able to deal with high dimensional problems, by considering data re-sampling and feature selection techniques, and multi-criteria EAs for the individual classifier selection to get an appropriate accuracy-interpretability trade-off. Our next step will be to apply it to the classification of medical images.
Soft Computing for Image Processing. Forensic Identification and Medical Image Registration
In the last few years there is an increasing interest on using soft computing (SC) techniques to solve image processing real-world problems covering a wide range of domains. In particular, one of the application fields that has suffered a large development is that of image registration (IR), which is a fundamental task in computer vision used to achieving the best fitting/overlaying between two (or more) different images taken under different conditions (at different times, using different sensors, from different viewpoints, or a combination of them). Over the years, it has been applied to a broad range of situations from remote sensing to medical images or artificial vision and CAD systems, and different techniques have been independently studied resulting in a large body of research.
In this way, evolutionary IR is a very promising application area nowadays. Thanks to their global optimization techniques nature, EAs aim at solving the drawbacks presented by traditional IR methods, which usually get stuck in local optima when dealing with large misalignments between the images to be registered. Our team has developed a large number of robust evolutionary IR approaches able to overcome the latter problems based on the use of advanced EAs including domain knowledge. They have achieved a successful performance on both medical IR (human MR and CT images) and on 3D model reconstruction (range images).
Specifically, we are currently extending the latter methods, hybridized with FL, to deal with a challenging real-world problem on the field of forensic medicine, in cooperation with the Physical Anthropology Lab of the University of Granada headed by the prestigious forensic anthropologist Dr. Miguel Botella. Our main objective is to develop an intelligent system to assist the forensic anthropologist in the identification of a missing person by a usual technique called photographic supra-projection, based on overlaying a scanned 3D model of the skull found against a face photo to try to establish whether this is the same person through the partial matching of two sets of radiometric points. While EAs and image processing techniques will be used to automatically build the skull 3D model and perform the skull-face superimposition, FL will be considered for the landmark matching and to suggest the final identification decision to the forensic expert.
Link to the project website
We also coordinate a Marie Curie International Training Network entitled “MIBISOC: Medical Imaging Using Bio-inspired and Soft Computing” which has been recent granted by the European Commission within the Seventh Framework Programme (FP7-PEOPLE-ITN-2008).
Real-World Applications of Single and Multi-objective Metaheuristics
Many complex combinatorial and numerical optimization problems arise in human activities, such as Economics (e.g., portfolio selection), Industry (e.g., scheduling or logistics), or Engineering (e.g., routing). The impracticability to get optimal solutions for these kinds of problems in reasonable time using classical algorithmic techniques has caused the successful development of different approximate algorithm methodologies -called metaheuristics- in the last two decades, able to quickly provide high quality solutions to them. Their success when solving a large number of real-world optimization problems is due both to the powerful heuristic search they apply in complex, ill-defined solution spaces of huge dimension, and to their flexibility, which allows them to handle problem restrictions in an easier way or to be able to simultaneously optimize multiple, conflicting objectives, which are usually present in these problems.
Metaheuristics constitute a very diverse family of optimization algorithms. Our staff owns a large expertise on the single- and multi-objective variants of many of them, mainly on EAs, ant colony optimization (ACO), scatter search, simulated annealing, tabu search, GRASP and iterated local search. We have both used them in different applications such as medical IR, bioinformatics (genetic regulatory networks knowledge discovery), or information retrieval, as well as we designed new hybrid designs in the field of ACO aiming to obtain better performing algorithms.
Currently, we are applying multi-objective ACO algorithms to solve a challenging real-world problem, the time and space assembly line problem (TSALBP), which involves to achieve optimal assignments of a subset of tasks to each station of the assembly line of a plant with respect to two or three conflicting objectives to be minimised: its cycle time, its number of stations and their area. This framework emerged thanks to the observation of a real automotive industry plant belonging to Nissan and located in Barcelona, Spain, as this research is being performed in collaboration with the Nissan Endowed Chair of the Technical University of Catalonia (http://www.nissanchair.com/), headed by Prof. Joaquín Bautista.
In the short future, we aim to use multi-objective ACO algorithms and EAs to solve some other applications such as a real-world problem from CajAstur (the Asturias’ savings bank) related to the logistics of cash distribution to the different bank offices, in collaboration with Prof. Rafael Martí from the University of Valencia (http://www.uv.es/rmarti/), as well as vehicle routing problems with an Asturias’ small and medium enterprise.