We are happy to announce that our paper, "Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP", has been published in the journal Empirical Software Engineering.
The study investigates correlations between the correctness of predictions and the quality of explanations produced by two explainable AI methods for the decision-making process of a seBERT model that classifies whether an issue describes a bug. It is based on data from a thorough qualitative rating process, during which the three authors each rated the LIME and SHAP explanations for predictions on 3090 issues using four categories.
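To give a sense of how such explanations are produced, the sketch below shows how LIME and SHAP explanations could be generated for a transformer-based issue classifier. This is an illustration under assumptions, not the pipeline from the paper: the model identifier, label names, and example issue text are placeholders, and while the paper evaluates the deep SHAP variant, the sketch uses SHAP's generic text explainer for brevity.

```python
# Illustrative sketch (not the authors' pipeline): LIME and SHAP explanations
# for a transformer-based bug/non-bug issue classifier.
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Hypothetical fine-tuned classifier; "path/to/issue-classifier" is a placeholder.
clf = pipeline("text-classification", model="path/to/issue-classifier",
               return_all_scores=True)

def predict_proba(texts):
    """Return class probabilities in the (n_samples, n_classes) shape LIME expects."""
    outputs = clf(list(texts))
    return np.array([[label["score"] for label in out] for out in outputs])

issue_text = "The app crashes with a NullPointerException when saving a file."

# LIME: perturbs the input text and fits a local surrogate model, yielding
# per-word weights (class order assumed to match the model's output order).
lime_explainer = LimeTextExplainer(class_names=["non-bug", "bug"])
lime_exp = lime_explainer.explain_instance(issue_text, predict_proba, num_features=10)
print(lime_exp.as_list())

# SHAP: the generic explainer wraps the pipeline with a text masker and
# attributes the prediction to input tokens (the paper's deep SHAP variant
# instead requires direct access to the model internals).
shap_explainer = shap.Explainer(clf)
shap_values = shap_explainer([issue_text])
print(shap_values)
```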
While we hypothesized that explanations of correct bug predictions would be of higher quality than explanations of correct non-bug predictions, this was not the case. Instead, the model found explainable signals for both categories. Furthermore, the project from which an issue was sampled did not act as a confounder.
Finally, an investigation into the difference in quality between the two explainable AI methods showed that SHAP (in its deep SHAP variant) outperformed LIME due to its lower ambiguity and higher contextuality. We also conclude that rating explanations of issue classifications is a highly subjective task, as the raters often had different perspectives on the matter.