On using distributed representations of source code for the detection of C security vulnerabilities.

David Coimbra, Sofia Reis, Rui Abreu, Corina S. Pasareanu, Hakan Erdogmus

2021

This paper evaluates the code representation model Code2vec on the task of detecting security vulnerabilities in C source code. We use the open-source library astminer to extract path-contexts from the abstract syntax trees of a corpus of labeled C functions, and train Code2vec on the resulting path-contexts to classify each function as vulnerable or non-vulnerable. On the CodeXGLUE benchmark, Code2vec achieves accuracy comparable to that of simple transformer-based methods such as pre-trained RoBERTa and outperforms more naive NLP-based methods. It reaches an accuracy of 61.43% while keeping computational requirements low relative to larger models.
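
To make the notion of a path-context concrete, the following is a minimal sketch of the idea, not astminer's implementation and not the authors' pipeline. It assumes a hand-built toy syntax tree for the statement `x = y + 1` and pairs every two leaves through their lowest common ancestor. The `Node` class, the node kind names, the `max_length` parameter, and the `^` (up) and `_` (down) separators are all illustrative assumptions.

```python
from itertools import combinations


class Node:
    """Toy AST node (hypothetical; real extractors work on parser output)."""

    def __init__(self, kind, children=None, token=None):
        self.kind = kind                  # syntactic category, e.g. "BinOp"
        self.children = children or []    # empty list for leaves
        self.token = token                # source token, set on leaves only


def leaf_trails(root):
    """Return, for each leaf, the list of nodes from the root down to it."""
    trails = []

    def walk(node, trail):
        trail = trail + [node]
        if not node.children:
            trails.append(trail)
        for child in node.children:
            walk(child, trail)

    walk(root, [])
    return trails


def path_contexts(root, max_length=8):
    """Build (start_token, path, end_token) triples for every pair of leaves.

    max_length (an assumed cutoff) bounds the number of nodes on a path.
    """
    contexts = []
    for a, b in combinations(leaf_trails(root), 2):
        # Skip the shared prefix; a[i - 1] is the lowest common ancestor.
        i = 0
        while i < min(len(a), len(b)) and a[i] is b[i]:
            i += 1
        up = [n.kind for n in reversed(a[i:])]   # start leaf up to below the LCA
        down = [n.kind for n in b[i:]]           # below the LCA down to end leaf
        if len(up) + 1 + len(down) > max_length:
            continue
        path = "^".join(up + [a[i - 1].kind]) + "_" + "_".join(down)
        contexts.append((a[-1].token, path, b[-1].token))
    return contexts


# Toy tree for the statement: x = y + 1
tree = Node("Assign", [
    Node("Name", token="x"),
    Node("BinOp", [Node("Name", token="y"), Node("Const", token="1")]),
])

for context in path_contexts(tree):
    print(context)
```

Each triple pairs two tokens with the syntactic route between them. A model in the style of Code2vec embeds such triples and aggregates them into a single vector per function, which can then be fed to a binary classifier.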

Keywords:

Code (cryptography)
Computer science
Transformer
Task
Machine learning
Function (engineering)
Leverage (statistics)
Abstract syntax
Benchmark (computing)
Artificial intelligence
Source code
