Deep Learning

What Are We Really Testing in Mutation Testing for Machine Learning? A Critical Reflection
Mutation testing is a well-established technique for assessing a test suite's effectiveness by injecting artificial faults into production code. In recent years, mutation testing has been extended to machine learning (ML) systems, and deep learning (DL) in particular. Researchers have proposed approaches, tools, and statistically sound heuristics to determine whether mutants in DL systems are killed or not. However, as we will argue in this work, questions can be raised as to what extent currently used mutation testing techniques in DL actually align with the classical interpretation of mutation testing. As we will discuss, in current approaches the distinction between production and test code is blurry, the realism of mutation operators can be challenged, and, in general, the degree to which the hypotheses underlying classical mutation testing (the competent programmer hypothesis and the coupling effect hypothesis) are followed lacks explicit attention and mapping. In this paper, we observe that ML model development follows a test-driven development (TDD) process, in which data points (test data) with labels (implicit assertions) correspond to test cases in traditional software. Based on this perspective, we critically revisit existing mutation operators for ML, the mutation testing paradigm for ML, and its fundamental hypotheses. Based on our observations, we propose several action points for better aligning mutation testing techniques for ML with the paradigms and vocabulary of classical mutation testing.
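To make the killed/not-killed decision discussed in the abstract concrete, the following is a minimal sketch, not any particular tool's implementation, of how a DL mutant is commonly judged: a mutation operator perturbs the trained model, and the mutant counts as killed when the labeled test data (the "implicit assertions") exposes a behavioral difference, e.g., an accuracy drop beyond a threshold. All names here (mutate_weights, is_killed, the 0.05 threshold) are illustrative assumptions.

```python
import numpy as np

def mutate_weights(weights, scale=0.1, seed=0):
    """Illustrative 'Gaussian fuzzing' mutation operator:
    perturb a trained model's weight tensors with Gaussian noise."""
    rng = np.random.default_rng(seed)
    return [w + rng.normal(0.0, scale * np.abs(w).mean(), w.shape)
            for w in weights]

def accuracy(predict, x_test, y_test):
    """The labels y_test play the role of implicit assertions:
    each labeled data point acts as one test case."""
    return np.mean(predict(x_test) == y_test)

def is_killed(orig_predict, mutant_predict, x_test, y_test, threshold=0.05):
    """Simplified kill criterion: the mutant is killed if test accuracy
    drops by more than a threshold. Statistically sound variants compare
    accuracy distributions over repeated training runs instead of a
    single point estimate."""
    drop = (accuracy(orig_predict, x_test, y_test)
            - accuracy(mutant_predict, x_test, y_test))
    return drop > threshold
```

Note how this sketch already illustrates the paper's concern: the "test suite" being assessed is the labeled dataset plus a threshold, not test code in the classical sense.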
DeepTC-Enhancer: Improving the Readability of Automatically Generated Tests
Automated test case generation tools have been successfully proposed to reduce the amount of human and infrastructure resources required to write and run test cases. However, recent studies demonstrate that the readability of generated tests is very limited due to (i) uninformative identifiers and (ii) a lack of proper documentation. Prior studies proposed techniques to improve test readability by generating either natural language summaries or meaningful method names. While these approaches are shown to improve test readability, they are also affected by two limitations: (1) generated summaries are often perceived as too verbose and redundant by developers, and (2) readable tests require not only proper method names but also meaningful identifiers (within-method readability). In this work, we combine template-based methods and Deep Learning (DL) approaches to automatically generate test case scenarios (elicited from natural language patterns of test case statements) as well as to train DL models on path-based representations of source code to generate meaningful identifier names. Our approach, called DeepTC-Enhancer, recommends documentation and identifier names with the ultimate goal of enhancing the readability of automatically generated test cases. An empirical evaluation with 36 external and internal developers shows that (1) DeepTC-Enhancer significantly outperforms the baseline approach for generating summaries and performs on par with the baseline approach for test case renaming, (2) the transformations proposed by DeepTC-Enhancer result in a significant increase in the readability of automatically generated test cases, and (3) there is a significant difference in feature preferences between external and internal developers.
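The following toy sketch illustrates the template side of such an approach: turning opaque generated test statements into a scenario comment, with identifier renaming supplied separately (in DeepTC-Enhancer, by a DL model trained on path-based code representations; here simply mocked). The patterns, statement encoding, and names below are illustrative assumptions, not the tool's actual templates or API.

```python
# Mocked output of the identifier-renaming model: maps the generator's
# uninformative names (var0, var1, ...) to meaningful ones.
RENAMES = {"var0": "stack", "var1": "poppedValue"}

# Natural-language patterns, one per kind of test statement.
TEMPLATES = {
    "new":    "creates a new {cls}",
    "call":   "calls {method} on {recv} with argument(s) {args}",
    "assert": "checks that {actual} equals {expected}",
}

def summarize(statements):
    """Render a scenario summary, one comment line per test statement,
    after substituting meaningful identifier names."""
    lines = []
    for kind, fields in statements:
        fields = {k: RENAMES.get(v, v) for k, v in fields.items()}
        lines.append("// " + TEMPLATES[kind].format(**fields))
    return "\n".join(lines)

# A generated test, pre-parsed into (statement kind, fields) pairs.
generated_test = [
    ("new",    {"cls": "Stack"}),
    ("call",   {"method": "push", "recv": "var0", "args": "5"}),
    ("assert", {"actual": "var1", "expected": "5"}),
]
print(summarize(generated_test))
```

Printed as comments above the corresponding statements, such summaries address limitation (1) by staying one line per statement, while the renaming map addresses limitation (2), within-method readability.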