2 | CISELab

Cyber-Physical Systems (CPSs) have gained traction in recent years. A major non-functional quality of CPS is performance since it affects both usability and security. This critical quality attribute depends on the specialized hardware, simulation engines, and environmental factors that characterize the system under analysis. While a large body of research exists on performance issues in general, studies focusing on performance-related issues for CPSs are scarce. The goal of this paper is to build a taxonomy of performance issues in CPSs. To this aim, we present two empirical studies aimed at categorizing common performance issues (Study I) and helping developers detect them (Study II). In the first study, we examined commit messages and code changes in the history of 14 GitHub-hosted open-source CPS projects to identify commits that report and fix self-admitted performance issues. We manually analyzed 2699 commits, labeled them, and grouped the reported performance issues into antipatterns. We detected instances of three previously reported Software Performance Antipatterns (SPAs) for CPSs. Importantly, we also identified new SPAs for CPSs not described earlier in the literature. Furthermore, most performance issues identified in this study fall into two new antipattern categories: Hard Coded Fine Tuning (399 of 646) and Magical Waiting Number (150 of 646). In the second study, we introduce static analysis techniques for automatically detecting these two new antipatterns; we implemented them in a tool called AP-Spotter. We analyzed 9 open-source CPS projects not utilized to build the SPAs taxonomy to benchmark AP-Spotter. Our results show that AP-Spotter achieves 62.04% precision in detecting the antipatterns

Imara van Dinten, Pouria Derakhshanfar, Annibale Panichella, Andy Zaidman

JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

Xavier Devroey, Alessio Gambi, Juan Pablo Galeotti, René Just, Fitsum Kifetew, Annibale Panichella, Sebastiano Panichella

Continuous Integration and Delivery practices for Cyber- Physical systems: An interview-based study

Fiorella Zampetti, Damian Tamburri, Sebastiano Panichella, Annibale Panichella, Gerardo Canfora, Massimiliano di Penta

Generating Class-Level Integration Tests Using Call Site Information

Abstract: Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated tests with high code coverage could be ineffective, i.

Pouria Derakhshanfar, Xavier Devroey, Annibale Panichella, Andy Zaidman, Arie van Deursen

Test Smells 20 Years Later: Detectability, Validity, and Reliability

Test smells aim to capture design issues in test code that reduces its maintainability. These have been extensively studied and generally found quite prevalent in both human-written and automatically generated test-cases. However, most evidence of prevalence is based on specific static detection rules. Although those are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess test smell detection tools’ detection accuracy and investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites both written by developers and generated by two test generation tools (EvoSuite and JTExpert) and performed a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in these. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools – one widely used in prior work and one recently introduced with the express goal to match developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous on developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests actually often scored better, but in reality, suffered from a host of problems not well-captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool’s detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as of yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice, (ii) more accurate detection strategies to be evaluated primarily in industrial contexts.

Annibale Panichella, Sebastiano Panichella, Gordon Fraser, Anand Ashok Sawant, Vincent Hellendoorn

Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments

Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, Annibale Panichella

Large scale inverse design of planar on-chip mode sorter

Giuseppe Di Domenico, Dror Weisman, Annibale Panichella, Dolev Roitman, Ady Arie

How to Kill Them All: An Exploratory Study on the Impact of Code Observability on Mutation Testing

Mutation testing is well-known for its efficacy in assessing test quality, and starting to be applied in the industry. However, what should a developer do when confronted with a low mutation score? Should the test suite be plainly reinforced to increase the mutation score, or should the production code be improved as well, to make the creation of better tests possible? In this paper, we aim to provide a new perspective to developers that enables them to understand and reason about the mutation score in the light of testability and observability. First, we investigate whether testability and observability metrics are correlated with the mutation score on six open-source Java projects. We observe a correlation between observability metrics and the mutation score, e.g., test directness, which measures the extent to which the production code is tested directly, seems to be an essential factor. Based on our insights from the correlation study, we propose a number of ‘‘mutation score anti-patterns’’’, enabling software engineers to refactor their existing code or add tests to improve the mutation score. In doing so, we observe that relatively simple refactoring operations enable an improvement or increase in the mutation score.

Qianqian Zhou, Andy Zaidman, Annibale Panichella

Serverless Testing: Tool Vendors' and Experts' Point of View

Serverless architecture is an emerging design style for cloud-based software systems. Testing serverless applications plays an important role in software quality assurance. However, currently, there is no consensus on how to test and debug such systems properly. Moreover, the current lack of mature tooling is a central challenge. We designed and conducted three interviews among two tools vendor leaders in the serverless domain (Epsagon and Thundra) and one expert in the field (Yan Cui), investigating the good and bad practices and several open issues. The current status of testing and debugging in serverless-based applications depicted by the experts helped us to highlight issues and challenges that need to be deeply investigated.

Valentina Lenarduzzi, Annibale Panichella