Please note: This PhD seminar will take place online.
Yiwen Dong, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Chengnian Sun
Type inference is a crucial task for reusing online code snippets from platforms like StackOverflow, because such snippets frequently lack essential type information such as fully qualified names (FQNs) and the libraries they require. Recent studies have leveraged Large Language Models (LLMs) to perform type inference for such code snippets, demonstrating promising performance. However, these results are potentially affected by a data leakage issue, as the benchmark suite (StatType-SO) used for evaluation has been publicly available on GitHub since 2017. Thus, it remains uncertain whether the LLMs' strong performance stems from their ability to understand the semantics of the code snippets or merely from retrieving the ground truth from their training data.
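For instance, a snippet along the following lines (an illustrative example using the Gson library, not drawn from StatType-SO) contains only simple type names, so the FQNs and the library they belong to must be inferred before the code can be compiled or reused. The snippet is intentionally incomplete, which is exactly what makes type inference necessary:

    // Typical StackOverflow-style fragment: no import statements, no build configuration.
    Gson gson = new Gson();                                   // FQN to infer: com.google.gson.Gson
    JsonObject obj = gson.fromJson(json, JsonObject.class);   // FQN to infer: com.google.gson.JsonObject
    System.out.println(obj.get("name"));                      // library to infer: Gson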
To comprehensively assess the type inference capabilities of LLMs on Java code snippets and to identify their potential limitations, we conducted a detailed evaluation. Using Thalia, a program synthesis technique, we created ThaliaType, a new, previously unreleased dataset for type inference evaluation comprising 300 unseen code snippets. As a baseline, we used StarCoder2:15b, an LLM with an open-source training set, which we confirmed includes the code snippets in StatType-SO. Our results revealed that all popular state-of-the-art LLMs exhibit significant performance drops similar to StarCoder2:15b when generalizing to unseen code snippets, with decreases of up to 59% in precision and up to 72% in recall. To further investigate the limitations of LLMs in understanding the execution semantics of code snippets, we developed semantic-preserving code transformations (a sketch of one such transformation follows below). Our results showed that LLMs maintain consistent performance under simple transformations on both StatType-SO and ThaliaType. However, when transformations are combined, the effects diverge between the two benchmarks: all LLMs degrade on StatType-SO, whereas performance on ThaliaType remains stable and sometimes even improves compared to the original, untransformed snippets. A potential explanation is that the combined transformations make it harder for LLMs to recall data leaked from StatType-SO during training, forcing them to rely instead on analyzing the execution semantics of the code snippets. Thus, the performance on StatType-SO with all transformations applied may be closer to the true type inference performance of LLMs.
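As a rough sketch of what a semantic-preserving transformation can look like (the variable renaming shown here is illustrative; the specific transformations used in the evaluation may differ), renaming local variables changes the surface text of a snippet without changing its execution semantics or the FQNs that must be inferred:

    // Original snippet (simple names; FQNs still to be inferred).
    Gson gson = new Gson();
    JsonObject obj = gson.fromJson(json, JsonObject.class);

    // After a semantics-preserving renaming: identical behavior and identical FQNs to infer,
    // but the surface text no longer matches anything memorized from training data.
    Gson v0 = new Gson();
    JsonObject v1 = v0.fromJson(json, JsonObject.class);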
Our findings highlight the crucial need for carefully designed and rigorously evaluated benchmarks to accurately assess the true capabilities of LLMs for type inference tasks. To ensure reliable evaluations of LLMs' type inference capabilities, future research should prioritize unseen benchmarks rather than relying on StatType-SO alone.