C# or Java? TypeScript or JavaScript? Classifying programming languages with machine learning

Original author: Kavita Ganesan, Romano Foti
GitHub hosts more than 300 programming languages, ranging from well-known languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, known only to small groups of people.

Top 10 programming languages hosted on GitHub by number of repositories

One of the problems GitHub faces is recognizing different programming languages. When code is pushed to a repository, identifying its language matters for search, vulnerability alerts, syntax highlighting, and presenting repository content to users in a structured way.

At first glance, language recognition looks like a simple task, but it is not. Linguist is the tool we currently use to identify programming languages on GitHub. Linguist is a Ruby application that uses several language-recognition strategies, including filename information and file extensions. It also takes Vim and Emacs modelines into account, as well as the shebang line at the top of a file. Linguist handles linguistic ambiguity heuristically and, when that fails, falls back to a naive Bayesian classifier trained on a small data sample.
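The layered strategy described above can be sketched as a chain of cheap, reliable signals with a classifier fallback. The maps and function below are illustrative stand-ins, not Linguist's actual implementation:

```python
import os
import re

# Hypothetical sketch of a layered detection strategy like Linguist's:
# try cheap, reliable signals first, then fall back to a classifier.
EXTENSION_MAP = {".py": "Python", ".rb": "Ruby", ".js": "JavaScript"}
SHEBANG_MAP = {"python": "Python", "ruby": "Ruby", "node": "JavaScript"}

def detect_language(filename, content, classifier_fallback=None):
    # 1. File extension
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    # 2. Shebang line at the top of the file
    first_line = content.splitlines()[0] if content else ""
    if first_line.startswith("#!"):
        for interpreter, lang in SHEBANG_MAP.items():
            if interpreter in first_line:
                return lang
    # 3. Vim modeline, e.g. "# vim: ft=python"
    match = re.search(r"vim:\s*(?:set\s+)?ft=(\w+)", content)
    if match:
        return match.group(1).capitalize()
    # 4. Fall back to a trained classifier (naive Bayes in Linguist)
    if classifier_fallback is not None:
        return classifier_fallback(content)
    return None

lang = detect_language("script", "#!/usr/bin/env python\nprint('hi')")
```

The ordering matters: the extension is checked first because it is both cheapest and usually correct, and the expensive statistical fallback only runs when every deterministic signal is absent.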

Although Linguist predicts quite well at the file level (84% accuracy), everything falls apart when files are named unusually, and even more so when they have no extension at all. This makes Linguist useless for content such as GitHub Gists or code snippets in READMEs, issues, and pull requests.

To make language detection more accurate in the long run, we developed a machine learning classifier called OctoLingua. It is based on an Artificial Neural Network (ANN) architecture that can handle language prediction in non-trivial scenarios. The current version of the model can predict the top 50 programming languages on GitHub and surpasses Linguist in accuracy.

More details about OctoLingua

OctoLingua was written from scratch in Python using Keras with a TensorFlow backend; it was designed to be accurate, reliable, and easy to maintain. In this section we discuss our data sources, model architecture, and OctoLingua's performance tests. We also describe the process of adding support for a new language.

Data sources

The current version of OctoLingua was trained on files obtained from Rosetta Code and from a set of internal crowdsourced repositories. We limited our set of languages to the 50 most popular on GitHub.

Rosetta Code was an excellent starting dataset because it contains source code that performs the same task in different programming languages. For example, code for generating Fibonacci numbers is provided in C, C++, CoffeeScript, D, Java, Julia, and others. However, language coverage was uneven: for some programming languages there were only a few code files; for others, the files contained too little code. We therefore had to supplement our training dataset with additional sources, which significantly improved language coverage and the performance of the final model.

Our process for adding a new language is not fully automated. We programmatically collect source code from public repositories on GitHub, selecting only repositories that meet minimum qualification criteria, such as a minimum number of forks, coverage of the target language, and coverage of specific file extensions. At this data-collection stage we determine a repository's primary language using Linguist's classification.
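A qualification filter of this kind can be sketched as a simple predicate over repository metadata. The thresholds and field names below are illustrative assumptions, not GitHub's actual criteria:

```python
# Sketch of a minimum-qualification filter like the one described above.
# MIN_FORKS and the repo fields are made-up examples.
MIN_FORKS = 10
TARGET_EXTENSIONS = {".py", ".java", ".rb"}

def qualifies(repo):
    """Keep only repositories that pass all minimum criteria."""
    return (
        repo["forks"] >= MIN_FORKS
        and repo["language"] == repo["target_language"]  # per Linguist
        and any(ext in TARGET_EXTENSIONS for ext in repo["extensions"])
    )

repos = [
    {"forks": 25, "language": "Python", "target_language": "Python",
     "extensions": [".py"]},
    {"forks": 2, "language": "Python", "target_language": "Python",
     "extensions": [".py"]},
]
selected = [r for r in repos if qualifies(r)]
```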

Features: leveraging prior knowledge

Traditionally, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM) are used for text classification with neural networks. However, programming languages differ in vocabulary, file extensions, structure, library import style, and other details, which led us to a different approach: we use all of this information by extracting relevant features in tabular form to train our classifier. The features are extracted as follows:

  1. Top 5 special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters common in source code, such as colons, curly braces, and semicolons
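The four feature groups above can be sketched in a few lines. The character set, token regex, and feature names below are illustrative assumptions rather than OctoLingua's actual implementation:

```python
from collections import Counter
import os
import re

# Hypothetical sketch of the feature-extraction step described above.
SPECIAL_CHARS = set("{}();:#$%&*@!<>[]|\\/=+-")

def extract_features(filename, source, keep_extension=True):
    features = {}
    # 1. Top 5 special characters in the file
    special_counts = Counter(c for c in source if c in SPECIAL_CHARS)
    for i, (char, _) in enumerate(special_counts.most_common(5)):
        features[f"special_{i}"] = char
    # 2. Top 20 tokens in the file
    tokens = Counter(re.findall(r"\w+", source))
    for i, (tok, _) in enumerate(tokens.most_common(20)):
        features[f"token_{i}"] = tok
    # 3. File extension (optionally dropped, as described later)
    features["extension"] = (
        os.path.splitext(filename)[1] if keep_extension else ""
    )
    # 4. Presence of specific special characters
    for char, name in [(":", "colon"), ("{", "brace"), (";", "semicolon")]:
        features[f"has_{name}"] = char in source
    return features

feats = extract_features(
    "fib.py",
    "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)",
)
```

The result is a flat dictionary per file, which maps naturally onto the tabular input the classifier expects.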

The Artificial Neural Network (ANN) model

We use the features described above as input to a two-layer neural network built with Keras on a TensorFlow backend.

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As information moves through the layers of our network, it is regularized with dropout, and the result is a 51-dimensional output representing the probability that the code is written in each of the top 50 GitHub languages, plus the probability that it is not written in any of them.

ANN structure of the source model (50 languages + 1 for "other")
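The shape of this architecture can be illustrated with a minimal NumPy forward pass: a feature vector through two dense layers with dropout, ending in a 51-way softmax. Layer sizes and weights here are made up for the sketch and are not OctoLingua's actual parameters:

```python
import numpy as np

# Illustrative forward pass for the architecture described above:
# feature vector -> dense + ReLU -> dropout -> dense -> 51-way softmax.
rng = np.random.default_rng(0)
N_FEATURES, HIDDEN, N_CLASSES = 128, 256, 51  # 50 languages + "other"

W1 = rng.normal(0, 0.1, (N_FEATURES, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, N_CLASSES))

def forward(x, train=False, dropout_rate=0.5):
    h = np.maximum(x @ W1, 0)                  # dense layer + ReLU
    if train:                                  # dropout only during training
        mask = rng.binomial(1, 1 - dropout_rate, h.shape)
        h = h * mask / (1 - dropout_rate)      # inverted dropout scaling
    logits = h @ W2                            # dense output layer
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

probs = forward(rng.normal(size=N_FEATURES))   # 51 class probabilities
```

In the real model the same structure is expressed declaratively in Keras, and the softmax output is read as "probability per language, plus one slot for none of the above".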

We used 90% of our source dataset for training. During training, a portion of the file extensions was also removed so that the model would learn from the files' vocabulary rather than from their extensions, which are highly predictive of the programming language.
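This extension-dropping augmentation can be sketched as a simple pass over the training samples. The 50% rate and the feature layout are illustrative assumptions:

```python
import random

# Sketch of the augmentation described above: blank out the extension
# feature for a fraction of training samples so the model must rely on
# the file's vocabulary instead.
def drop_extensions(samples, fraction=0.5, seed=42):
    rnd = random.Random(seed)
    augmented = []
    for features in samples:
        features = dict(features)          # leave the original untouched
        if rnd.random() < fraction:
            features["extension"] = ""     # remove the tell-tale signal
        augmented.append(features)
    return augmented

train = [{"extension": ".py"}, {"extension": ".java"},
         {"extension": ".rb"}, {"extension": ".c"}]
augmented = drop_extensions(train)
```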

Performance test

OctoLingua vs Linguist

In the table below, we show the F1 score (the harmonic mean of precision and recall) for OctoLingua and Linguist, computed on the same test set (10% of our original data).
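For reference, the F1 score combines precision and recall into a single number that is only high when both are high:

```python
# F1 is the harmonic mean of precision and recall.
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f1_score(0.9, 0.8)  # a classifier with 90% precision, 80% recall
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot score well by trading recall away for precision or vice versa.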

Three tests are shown. In the first, the dataset was left untouched; in the second, file extensions were removed; in the third, file extensions were shuffled to confuse the classifier (for example, a Java file could have the extension ".txt" and a Python file the extension ".java").
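The third test condition can be sketched by permuting extensions across the test files so each file may carry a misleading one. The function below is an illustrative sketch, not the actual test harness:

```python
import random

# Sketch of the "shuffled extensions" test condition described above:
# redistribute the extensions randomly across the test files.
def shuffle_extensions(files, seed=0):
    rnd = random.Random(seed)
    extensions = [name.rsplit(".", 1)[1] for name, _ in files]
    rnd.shuffle(extensions)
    return [
        (name.rsplit(".", 1)[0] + "." + ext, code)
        for (name, code), ext in zip(files, extensions)
    ]

test_files = [("a.java", "..."), ("b.py", "..."), ("c.rb", "...")]
mixed = shuffle_extensions(test_files)
```

The file contents are untouched; only the extension signal is corrupted, which is exactly the situation the test is meant to probe.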

The intuition behind shuffling or removing file extensions in our test suite is to evaluate OctoLingua's robustness in classifying files when a key feature is removed or misleading. A classifier that does not depend heavily on the extension would be extremely useful for classifying logs and code snippets, since in those cases people usually do not provide accurate extension information (for example, many code-related logs have a .txt extension).

The table below shows that OctoLingua maintains good performance under all of these conditions, confirming our assumption that the model learns mainly from the vocabulary of the code rather than from meta-information such as the file extension. Linguist, by contrast, misidentifies the language as soon as correct file-extension information is missing.

OctoLingua vs. Linguist performance on the same test suite

The effect of removing file extensions when training a model

As mentioned earlier, during training we removed a certain percentage of file extensions from the data to encourage the model to learn from the files' vocabulary. The table below shows our model's performance with different proportions of file extensions removed during training.

OctoLingua performance with different percentage of deleted file extensions

Note that a model trained with all file extensions intact is significantly less effective on test files without extensions or with shuffled extensions than on unmodified test data. On the other hand, when the model is trained on a dataset in which some file extensions have been removed, its performance barely drops on the modified test sets. This confirms that removing extensions from part of the training files encourages our classifier to learn more from the vocabulary of the code. It also shows that the file-extension feature otherwise tends to dominate and outweigh content-based features.

New language support

Adding a new language to OctoLingua is a fairly simple process. It starts with finding and collecting a large number of files in the new language (we can do this programmatically, as described in the "Data sources" section). These files are split into training and test sets and then passed through our preprocessor and feature extractor. The new dataset is added to the existing pool, and the test set lets us verify that the model's accuracy remains acceptable.
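The steps above can be sketched as a small pipeline. All function names here are hypothetical stand-ins for OctoLingua's internal stages:

```python
import random

# Illustrative pipeline for adding a new language, as described above.
def preprocess(source):                 # stand-in preprocessor
    return source.lower()

def featurize(source):                  # stand-in feature extractor
    return {"length": len(source)}

def add_language(new_files, training_pool, test_pool,
                 test_fraction=0.1, seed=0):
    files = list(new_files)
    random.Random(seed).shuffle(files)
    split = int(len(files) * (1 - test_fraction))
    train, test = files[:split], files[split:]
    # Preprocess, featurize, and merge into the existing pools.
    training_pool.extend(featurize(preprocess(f)) for f in train)
    test_pool.extend(featurize(preprocess(f)) for f in test)
    # After retraining, the held-out test pool verifies accuracy.
    return len(train), len(test)

train_pool, test_pool = [], []
n_train, n_test = add_language(
    ["src%d" % i for i in range(10)], train_pool, test_pool
)
```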

Adding a new language to OctoLingua

Our plans

OctoLingua is currently at the "advanced prototype" stage. Our language classification mechanism is already reliable, but it does not yet support all of the programming languages available on GitHub. Beyond expanding language support, which is not particularly difficult, we aim to provide language detection at various levels of code granularity. Our current implementation, with a slight modification of the machine learning pipeline, already lets us classify code snippets. It also seems feasible to bring the model to a stage where it can reliably detect and classify embedded languages.

We are also considering publishing our model's source code; let us know if the community is interested.


Our goal in developing OctoLingua is to create a service that reliably detects the language of source code at different levels of granularity: from files and code snippets down to potential line-level language detection and classification. All of our work on this service is aimed at supporting developers in their daily work and creating the conditions for writing high-quality code.

If you are interested in contributing to our work, please feel free to contact us on Twitter at @github!
