Research
The research at our chair rests on three pillars: 1) understanding software engineering through data from software repositories; 2) understanding and improving the quality of machine learning models; and 3) the application of machine learning to software engineering and other domains.
Mining Software Repositories
Our chair is interested in the collection and analysis of data from software repositories. Within our research, we care about data quality and often combine automated data collection with manual validation and qualitative analysis. A major aspect of our research is understanding bugs: how they are introduced, how they are reported, and how they are fixed. Analyzing such patterns helps us understand how development processes may be improved to deal with the unavoidable problem of bugs more efficiently. Beyond bugs, we also consider the impact of static analysis on code quality and the usage of software tests, and we investigate how data protection and ethical issues are reported, discussed, and resolved. Moreover, our work intersects directly with our interests in AI, both by triangulating ethical and privacy issues with AI aspects and through the collection of best practices from Q&A sites.
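As an illustration of how such a mining pipeline might begin, the following minimal sketch uses the PyDriller library to flag likely bug-fixing commits via keyword matching. The repository URL and keyword list are illustrative assumptions; in our studies, such automated heuristics would be followed by manual validation.

```python
from pydriller import Repository

# Hypothetical example repository; any Git URL or local path works.
REPO = "https://github.com/apache/commons-lang"

# Simple keyword heuristic for candidate bug-fixing commits; real studies
# would validate such candidates manually before further analysis.
BUG_KEYWORDS = ("fix", "bug", "defect", "patch")

for commit in Repository(REPO).traverse_commits():
    message = commit.msg.lower()
    if any(keyword in message for keyword in BUG_KEYWORDS):
        changed = [m.filename for m in commit.modified_files]
        print(commit.hash[:8], "-", commit.msg.splitlines()[0])
        print("  touched files:", ", ".join(changed))
```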
Quality Assurance for Machine Learning / Artificial Intelligence
The rise of machine learning from a research topic to an enabling technology for innovative products means that we need to view machine learning software differently: instead of a tool that researchers use to study phenomena, it is now widely used to develop software products. This raises the bar for the required quality: whereas a crash within a research project might be annoying, it can be catastrophic for businesses or even threaten human lives. Depending on the application, non-functional properties like safety, security, performance, and robustness are vital. In our past work, we focused mostly on the quality of machine learning libraries and found them to be surprisingly brittle (e.g., vulnerable to very large input values).
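As a hedged illustration of the kind of brittleness we mean, the sketch below probes a scikit-learn classifier with extremely large input values and checks whether the predicted probabilities remain finite. The model and data are stand-ins for illustration, not the libraries or tests from our studies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model on ordinary, well-behaved data.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Probe with inputs far outside the training distribution.
extreme = np.full((1, X.shape[1]), 1e300)
try:
    proba = model.predict_proba(extreme)
    if not np.all(np.isfinite(proba)):
        print("non-finite probabilities:", proba)
    else:
        print("probabilities:", proba)
except Exception as exc:  # a crash is also a robustness finding
    print("library raised:", exc)
```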
Within our current work, we want to validate the capabilities of large language models. We already investigated how well such models handle knowledge in different domains to understand the impact of domain-specific pre-training, and how well ChatGPT generates essays in comparison to students, in the first large-scale study on this topic. Our current goal is as important as it is ambitious: develop techniques to validate capabilities (e.g., types of reasoning, encoding of language aspects), identify the parts of the neural networks that encode these capabilities, and validate their causal responsibility for this functionality. This will pave the way for segmenting large general-purpose neural networks into smaller networks with specific capabilities, and it will deepen our understanding of how such technologies work.
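A minimal sketch of the kind of causal probing this agenda implies is shown below: it ablates a single MLP block in GPT-2 via a PyTorch forward hook and measures how the next-token distribution shifts. The model, layer choice, and prompt are illustrative assumptions, not our actual experimental setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Berlin is the capital of"  # illustrative probe prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    base = model(**inputs).logits[0, -1]

# Ablate one component: zero the output of the MLP in layer 6.
# Returning a tensor from a forward hook replaces the module output.
def zero_output(module, args, output):
    return torch.zeros_like(output)

handle = model.transformer.h[6].mlp.register_forward_hook(zero_output)
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]
handle.remove()

# A large shift suggests the ablated component contributes causally
# to the behavior probed by this prompt.
shift = torch.dist(base.softmax(-1), ablated.softmax(-1))
print("distribution shift after ablation:", shift.item())
```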
AI Applications
At the intersection of repository mining and machine learning lies the application of machine learning (and other AI) techniques to software engineering problems. We have already worked on topics such as usage-based testing and the prediction of buggy code based on past bug patterns. Currently, our focus is on aiding software developers through Natural Language Processing (NLP) methods. For example, we analyze reported issues (bugs, feature requests) and requirements in a proprietary context and determine if (and how) they can be classified, e.g., to determine their type or to assign them to teams.
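As a simple, hedged sketch of what such issue classification can look like, consider a TF-IDF baseline with a linear classifier on toy issue titles. The texts and labels below are invented for illustration; our actual studies work with proprietary data and stronger models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data; real issue reports are longer and noisier.
titles = [
    "App crashes when saving a file",
    "NullPointerException on startup",
    "Add dark mode to the settings page",
    "Support export to CSV",
    "Login button unresponsive after update",
    "Allow sorting results by date",
]
labels = ["bug", "bug", "feature", "feature", "bug", "feature"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(titles, labels)

for title in ["Crash when opening large projects",
              "Please add a keyboard shortcut"]:
    print(title, "->", classifier.predict([title])[0])
```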
Our work on AI applications is not limited to software engineering: while we understand and drive software engineering use cases ourselves, we also often collaborate with others to solve problems from other domains (e.g., aeroacoustics, bioinformatics). In such settings, our collaborators provide the domain knowledge and we provide the data analysis and tool-building know-how. Our work on such problems is not benchmark-driven but application-driven: we start by considering what a use case requires, and our goal is not to achieve a certain level of accuracy (although a good model is obviously important), but to understand which model qualities are important for the use case and how to measure whether a model is fit for purpose. Consequently, the techniques we apply are diverse, ranging from density-based clustering and random forests to large language models like RoBERTa. Such collaborations are crucial for us, as they help us shape our work on machine learning quality: we learn directly what is important to others and what they expect from machine learning tools, and we can derive suitable research goals from that.
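To make this notion of fit-for-purpose evaluation concrete, one might report several complementary metrics rather than a single accuracy number, as in the hedged sketch below with a random forest on a stand-in dataset. Which metrics actually matter would be decided together with the domain collaborators.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in dataset; in collaborations the data comes from the domain experts.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Accuracy alone can hide per-class weaknesses that matter in practice,
# so we also report per-class precision/recall and ranking quality.
print(classification_report(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```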