Decision tree insights analytics (DTIA) tool: an analytic framework to identify insights from large data records across fields of science
Research output: Contribution to journal › Article › peer-review
In: Machine Learning: Science and Technology, Vol. 5, No. 4, 045004, 07.10.2024.
RIS
TY - JOUR
T1 - Decision tree insights analytics (DTIA) tool: an analytic framework to identify insights from large data records across fields of science
AU - Hossny, Karim
AU - Hossny, Mohammed
AU - Cougnoux, Antony
AU - Mahmoud, Loay
AU - Villanueva, Walter
PY - 2024/10/7
Y1 - 2024/10/7
AB - Supervised machine learning (SML) techniques have been developed since the 1960s. Most of their applications were oriented towards developing models capable of predicting numerical values or categorical output based on a set of input variables (input features). Recently, SML models' interpretability and explainability were extensively studied to have confidence in the models' decisions. In this work, we propose a new deployment method named Decision Tree Insights Analytics (DTIA) that shifts the purpose of using decision tree classification from having a model capable of differentiating the different categorical outputs based on the input features to systematically finding the associations between inputs and outputs. DTIA can reveal interesting areas in the feature space, leading to the development of research questions and the discovery of new associations that might have been overlooked earlier. We applied the method to three case studies: (1) nuclear reactor accident propagation, (2) single-cell RNA sequencing of Niemann-Pick disease type C1 in mice, and (3) bulk RNA sequencing for breast cancer staging in humans. The developed method provided insights into the first two. On the other hand, it showed some of the method's limitations in the third case study. Finally, we presented how the DTIA's insights are more agreeable with the abstract information gain calculations and provide more in-depth information that can help derive more profound physical meaning compared to the random forest's feature importance attribute and K-means clustering for feature ranking.
U2 - 10.1088/2632-2153/ad7f23
DO - 10.1088/2632-2153/ad7f23
M3 - Article
VL - 5
JO - Machine Learning: Science and Technology
JF - Machine Learning: Science and Technology
SN - 2632-2153
IS - 4
M1 - 045004
ER -