May 25 2023
Code-based Language Models (LMs) have shown very promising results in the
field of software engineering with applications such as code refinement, code
completion and generation. However, the task of time and space complexity
classification from code has not been extensively explored due to a lack of
datasets, with prior endeavors being limited to Java. In this project, we aim
to address these gaps by creating a labelled dataset of code snippets spanning
multiple languages (Python and C++ datasets currently, with C, C#, and
JavaScript datasets being released shortly). We find that existing time
complexity calculation libraries and tools only apply to a limited number of
use cases. The lack of a well-defined rule-based system motivates the
application of several recently proposed code-based LMs. We demonstrate the
effectiveness of dead code elimination and increasing the maximum sequence
length of LMs. In addition to time complexity, we propose to use LMs to find
space complexities from code, and to the best of our knowledge, this is the
first attempt to do so. Furthermore, we introduce a novel code comprehension
task, called cross-language transfer, where we fine-tune the LM on one language
and run inference on another. Finally, we visualize the activation of the
attention-fed classification head of our LMs using Non-negative Matrix
Factorization (NMF) to interpret our results.
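
As an illustration of the dead code elimination step mentioned above, the following is a minimal sketch for Python snippets, not the preprocessing used in this work. It assumes two simple notions of dead code: statements that follow a `return`, `raise`, `break`, or `continue` in the same block, and top-level functions that are never referenced. The names `eliminate_dead_code`, `_PruneUnreachable`, and the `keep` parameter are hypothetical and introduced only for this example.

```python
import ast

# Statements after which nothing else in the same block can execute.
TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


class _PruneUnreachable(ast.NodeTransformer):
    """Drop statements that follow a terminator within the same block."""

    def generic_visit(self, node):
        super().generic_visit(node)  # prune nested blocks first
        for field in ("body", "orelse", "finalbody"):
            stmts = getattr(node, field, None)
            if isinstance(stmts, list):  # skip expression-valued fields
                for i, stmt in enumerate(stmts):
                    if isinstance(stmt, TERMINATORS):
                        del stmts[i + 1:]
                        break
        return node


def eliminate_dead_code(source, keep=("main",)):
    """Return `source` without unreachable statements and unused functions.

    Assumption: a top-level function is dead if its name is never referenced
    and is not listed in `keep`. This is conservative; for example, a
    recursive function that is only called by itself is retained.
    """
    tree = _PruneUnreachable().visit(ast.parse(source))
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    used |= set(keep)
    tree.body = [
        stmt for stmt in tree.body
        if not (isinstance(stmt, ast.FunctionDef) and stmt.name not in used)
    ]
    return ast.unparse(tree)  # requires Python 3.9+


SAMPLE = '''
def helper(n):
    return n * 2
    print("never runs")

def unused(n):
    for i in range(n):
        pass

print(helper(3))
'''

cleaned = eliminate_dead_code(SAMPLE)
```

Shortening a snippet in this way leaves more of the tokens that determine complexity inside a fixed maximum sequence length, which is one plausible reason the two techniques interact.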