Data mining – why KEM?


What is KEM®?

KEM® (Knowledge Extraction and Management) is a proprietary data mining tool based on Galois Lattices (also known as Formal Concept Analysis or FCA). KEM® uses association rules to fully explore complex datasets in order to reveal hidden relationships and to derive new hypotheses[1]. The technology originated in the laboratory of Professor Jean Sallantin, of the Artificial Intelligence Group at the CNRS (Centre National de la Recherche Scientifique) in Montpellier, France. KEM® is the result of 10 years of further development for the life sciences at Ariana by experts in logic, drug development, biomarker research and development, statistics, chemistry, and biology[2,3].

 


Figure 1. Nonlinear, “oriented” relations (implications)

KEM® can identify all relevant relationships between variables

In contrast to statistical approaches, KEM® can identify all relevant relationships between variables, even when the variables are only weakly correlated.

Traditional analytical approaches for identifying multivariate classifiers often start with a univariate analysis of all features to identify markers that discriminate between classes, and then use optimization algorithms such as Random Forest, SVM or Neural Networks to find the optimal combination of these markers.

KEM uses FCA to extract significant associations between data in the form of uni-directional implications. All attributes (descriptors) in the data matrix are first transformed (discretized) into binary variables. The discretization is based either on the data distribution or on semantic content (i.e. the thresholds are medically meaningful and actionable). Multiple binning schemes are considered for each variable.
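The discretization step can be illustrated with a minimal sketch. KEM itself is proprietary, so this is not its implementation: the function names `tertile_cuts` and `binarize` are hypothetical, and tertiles are used only as one example of a distribution-based binning scheme. Medically defined thresholds could be passed to `binarize` in place of the computed cut points, and further binning schemes would simply add more binary attributes per variable.

```python
from statistics import quantiles

def tertile_cuts(values):
    """Return the two cut points that split the values into three
    roughly equal-frequency bins (a distribution-based scheme)."""
    low_cut, high_cut = quantiles(values, n=3)
    return low_cut, high_cut

def binarize(name, values, low_cut, high_cut):
    """Turn one numeric descriptor into three binary attributes per sample.
    The cut points may come from tertile_cuts() or from domain knowledge."""
    return [
        {
            f"{name}_low": v <= low_cut,
            f"{name}_mid": low_cut < v <= high_cut,
            f"{name}_high": v > high_cut,
        }
        for v in values
    ]

# Toy descriptor "A" measured on nine samples (illustrative values only).
a_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low_cut, high_cut = tertile_cuts(a_values)
a_binary = binarize("A", a_values, low_cut, high_cut)
```

Each sample ends up with exactly one true attribute per binning scheme, which is the binary data matrix that implication mining operates on.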

In many cases, relations between variables exist only within specific ranges rather than globally. More traditional approaches often dismiss these relationships as weak correlations when, in fact, they are both relevant and meaningful.

Figure 1 shows a weak correlation between A and B (R² of 0.51), whereas KEM identifies a strong, one-way relation that links high values of B (top tertile) with high values of A: if B_high then A_high. This relation is uni-directional, since the opposite relation (if A_high then B_high) does not hold.
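The asymmetry of such a relation can be checked with the standard association-rule measures of support and confidence. The sketch below is an illustration under assumptions, not KEM's algorithm: the helper `implication_stats` and the toy rows are hypothetical, with only the attribute names `A_high` and `B_high` taken from the text above.

```python
def implication_stats(rows, premise, conclusion):
    """Return (support, confidence) of the rule 'premise -> conclusion'.
    Support counts the rows where both attributes are true; confidence is
    the share of premise rows in which the conclusion is also true."""
    premise_rows = [r for r in rows if r[premise]]
    support = sum(1 for r in premise_rows if r[conclusion])
    confidence = support / len(premise_rows) if premise_rows else 0.0
    return support, confidence

# Toy data: every sample with high B also has high A,
# but only half of the samples with high A have high B.
rows = (
    [{"A_high": True, "B_high": True}] * 3
    + [{"A_high": True, "B_high": False}] * 3
    + [{"A_high": False, "B_high": False}] * 4
)

forward = implication_stats(rows, "B_high", "A_high")   # B_high -> A_high
backward = implication_stats(rows, "A_high", "B_high")  # A_high -> B_high
```

In this toy dataset the forward rule has no counterexample (confidence 1.0), while the reverse rule holds for only half of its premise rows (confidence 0.5), which is what makes the relation one-way.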

 

KEM® delivers unsupervised, unbiased, total data exploration


Figure 2. Unsupervised, unbiased, total data exploration.

A second important difference between KEM and statistical approaches is its ability to identify relationships between descriptors, as opposed to relationships between descriptors and endpoints only. Figure 2 shows the distribution of three parameters across a set of 1,000 patients. The distribution of each individual parameter is normal. However, plotting Cholesterol against Age shows that the lower right quadrant contains only Males (represented by the red circles). This is a hidden relationship between descriptors that KEM identifies systematically. The support for the relation is 219 patients, i.e. almost ¼ of the dataset.

This strong implication is likely to be missed by standard, supervised methods. In addition, the analysis does not optimize the binning boundaries of descriptors towards a given endpoint (as a supervised method would). Because there is no arbitrary optimization, many over-fitting issues are avoided. Compare this with Random Forest, for example, which may produce rules indicating that BMI should be < 17.003, a threshold that is unlikely to hold any medical meaning.
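An unsupervised search of this kind can be sketched as a brute-force enumeration over a binary data matrix with no designated endpoint. This is a simplified stand-in for a lattice-based method, not KEM's implementation. The function `exact_implications`, the attribute names, and the toy rows are hypothetical, and the mapping of the quadrant to "high Age and low Cholesterol" is an assumption that the text does not specify.

```python
from itertools import combinations

def exact_implications(rows, max_premise=2, min_support=2):
    """List every rule 'premise -> conclusion' that holds without any
    counterexample, for premises of up to max_premise attributes.
    Every attribute may act as a conclusion: there is no endpoint."""
    attributes = sorted(rows[0])
    found = []
    for size in range(1, max_premise + 1):
        for premise in combinations(attributes, size):
            covered = [r for r in rows if all(r[a] for a in premise)]
            if len(covered) < min_support:
                continue  # too few samples to call the rule supported
            for conclusion in attributes:
                if conclusion in premise:
                    continue
                if all(r[conclusion] for r in covered):
                    found.append((premise, conclusion, len(covered)))
    return found

def row(age_high, chol_low, male):
    return {"age_high": age_high, "chol_low": chol_low, "male": male}

# Toy cohort: the (age_high, chol_low) "quadrant" contains only males,
# while no single descriptor predicts any other descriptor.
rows = (
    [row(True, True, True)] * 3
    + [row(True, False, False)] * 2
    + [row(False, True, False)] * 2
    + [row(False, False, True)] * 2
    + [row(True, False, True)] * 2
    + [row(False, True, True)] * 2
)

rules = exact_implications(rows)
```

On this toy cohort the only rule found is the two-descriptor one, with a support of 3 samples; none of the single-descriptor rules survives, which mirrors the point that each parameter looks unremarkable on its own.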

 
 

KEM® generates compact, interpretable signatures with high specificity and sensitivity


Figure 3. Example of a signature identified with KEM.

To generate signatures from the uncovered logical associations and perform class prediction, all equivalent combinations are first identified as “AND” clauses, which increase the specificity of the signature. Multiple ANDs are then combined through OR clauses, which increase its sensitivity. In the FCA paradigm, there is no need for global convergence; hence the system can identify multiple local minima that are connected via ORs, enabling the effective and efficient analysis of heterogeneous data. An example of a signature is shown in Figure 3.
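The effect of AND and OR clauses on specificity and sensitivity can be shown with a small sketch. It assumes a signature is represented as a list of AND clauses joined by OR (a disjunctive normal form); this representation, the function names, and the toy data are hypothetical and are not taken from KEM.

```python
def signature_predicts(signature, row):
    """A signature fires when at least one of its AND clauses is fully satisfied."""
    return any(all(row[attr] for attr in clause) for clause in signature)

def sensitivity_specificity(signature, rows, labels):
    """Return (sensitivity, specificity) of a signature on labelled samples."""
    tp = fn = tn = fp = 0
    for row, positive in zip(rows, labels):
        predicted = signature_predicts(signature, row)
        if positive:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def make_row(a, b, c, d):
    return {"a": a, "b": b, "c": c, "d": d}

# Toy cohort: two positive subgroups, (a AND b) and (c AND d),
# plus one positive sample that neither clause covers.
rows = [
    make_row(True, True, False, False),
    make_row(True, True, False, False),
    make_row(False, False, True, True),
    make_row(False, False, False, False),
    make_row(True, False, False, False),
    make_row(False, False, True, False),
]
labels = [True, True, True, True, False, False]

one_clause = [("a", "b")]
two_clauses = [("a", "b"), ("c", "d")]

single = sensitivity_specificity(one_clause, rows, labels)
combined = sensitivity_specificity(two_clauses, rows, labels)
```

Here the single AND clause reaches a sensitivity of 0.5 at a specificity of 1.0; adding the second clause through OR raises sensitivity to 0.75 without losing specificity, because each clause captures a different subgroup of positives.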

The signatures identified by KEM consist of a non-linear combination of AND and OR functions. The result is usually a much more compact signature.

Furthermore, the signatures are easily interpretable.

Compared with other methods, KEM is less sensitive to dataset imbalance (positives vs. negatives) and to situations in which the number of descriptors exceeds the number of samples by orders of magnitude (a common problem in clinical and biomarker datasets).

The KEM approach identifies all possible signatures. Ranking and filtering are then performed explicitly, based on classification performance, signature simplicity (length), and any additional criteria such as biological knowledge (pathway enrichment) or clinical relevance (targeting specific patient subgroups).
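Explicit ranking and filtering of this kind might look like the sketch below. The scoring rule is an assumption chosen for illustration (mean of sensitivity and specificity, with shorter signatures winning ties), as are the function names, the candidate records, and the thresholds. Criteria such as pathway enrichment or clinical relevance are not modelled.

```python
def signature_length(signature):
    """Total number of attributes across all AND clauses of a signature."""
    return sum(len(clause) for clause in signature)

def rank_signatures(candidates, min_sensitivity=0.0, min_specificity=0.0):
    """Drop candidates below the performance thresholds, then sort by
    mean of sensitivity and specificity (best first) and, on ties,
    by signature length (shortest first)."""
    kept = [
        c for c in candidates
        if c["sensitivity"] >= min_sensitivity
        and c["specificity"] >= min_specificity
    ]
    return sorted(
        kept,
        key=lambda c: (
            -(c["sensitivity"] + c["specificity"]) / 2,
            signature_length(c["signature"]),
        ),
    )

# Hypothetical candidates with made-up performance figures.
candidates = [
    {"name": "long", "signature": [("a", "b", "c"), ("d",)],
     "sensitivity": 0.8, "specificity": 0.9},
    {"name": "short", "signature": [("a", "b")],
     "sensitivity": 0.8, "specificity": 0.9},
    {"name": "weak", "signature": [("e",)],
     "sensitivity": 0.6, "specificity": 0.7},
    {"name": "unspecific", "signature": [("f",)],
     "sensitivity": 0.95, "specificity": 0.4},
]

ranked = rank_signatures(candidates, min_specificity=0.5)
ranked_names = [c["name"] for c in ranked]
```

Keeping the criteria in an explicit sort key, rather than inside an optimizer, makes it straightforward to see why one signature was preferred over another.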

 

References

[1] Afshar M, Lanoue A, Sallantin J. Multiobjective/Multicriteria Optimization and Decision Support in Drug Discovery. Comprehensive Medical Chemistry II, Vol. 4, pp. 767-774 (2006).

[2] Dartnell C, Martin E, Hagège H, Sallantin J. Human Discovery and Machine Learning. International Journal of Cognitive Informatics and Natural Intelligence, Vol. N/A, pp. N/A (2008).

[3] For a recent review see Domenach F, Ignatov DI, Poelmans J (Eds.). Formal Concept Analysis, 10th International Conference, ICFCA 2012, Leuven, Belgium, May 7-10, 2012. Proceedings. Series: Lecture Notes in Computer Science, Vol. 7278. Subseries: Lecture Notes in Artificial Intelligence (2012).