Which programming language should you choose for working with data?
A novice data scientist can choose from many programming languages that will help them get up to speed in the field.
However, no one can tell you exactly which language is best suited for this purpose. Your success as a specialist depends on many factors, and today we will try to go through them. At the end of the article you can vote for the language you think is most suitable for working with data.
Be prepared for the fact that as you go deeper into data science, you will have to "reinvent the wheel" again and again. In addition, you will need to master the various packages and modules available for your chosen language. How well you manage that depends, first of all, on the availability of domain-specific packages for that language.
A leading data specialist has good all-round programming skills, as well as the ability to carry out calculations and analysis. Most day-to-day work in data science is aimed at finding and processing source data, or cleaning it up. Unfortunately, no fancy machine learning package will help you with that.
In the fast-paced world of commercial data science, there are many opportunities to quickly get the job you want. However, precisely because the field develops so rapidly, it is constantly accompanied by technical problems, and only persistent practice can keep them to a minimum.
In some cases, it is very important to optimize the performance of your code, especially when working with large volumes of critical data. Compiled languages are usually much faster than interpreted ones. Similarly, statically typed languages are considerably more fail-safe than dynamically typed ones. The price for both is usually convenience: such code takes longer to write.
To some extent, each of the programming languages presented below sits somewhere on two axes: generality versus specificity, and performance versus convenience.
Given these basic principles, let's look at some of the most popular programming languages that are used in data science. All information about the programming languages below is based on my own observations and experience, as well as the experience of my friends and colleagues.
R, which is a direct descendant of the older programming language S, was released back in 1995 and has since become more sophisticated. Written in languages such as C and Fortran, this project is today supported by the R Foundation for Statistical Computing.
- An excellent set of high-quality, open source, domain-specific packages. R has packages for almost any quantitative and statistical application you can imagine, including neural networks, nonlinear regression, phylogenetics, advanced plotting, and much more.
- The base installation comes with an extensive set of built-in functions and methods. In addition, R handles matrix algebra very well.
- Data visualization is an important strength, backed by libraries such as ggplot2.
- Low performance. There is no getting around it: R is not a fast language.
- Specificity. R is great for statistical research and data science, but it is not so good for general-purpose programming.
- Other quirks. R has several unusual features that can confuse programmers who are used to other languages: indexing starts at 1, there are several assignment operators, and the data structures are unconventional.
Our verdict: ideal for its original purpose.
R is a powerful language with a huge selection of applications for statistics and data visualization, and being open source has earned it a large following among developers. Its effectiveness at what it was originally designed for has made it widely popular.
In 1991, Guido van Rossum introduced the Python programming language. Since then, it has become an extremely popular general-purpose language and is widely used in the data science community. Currently, the major versions are Python 3.6 and Python 2.7.
- Python is a very popular, widely used general-purpose programming language. It has an extensive set of purpose-built modules and a large developer community. Many online services provide an API for Python.
- Python is very easy to learn. The low barrier to entry makes it an ideal first language for those new to programming.
- Packages such as pandas, scikit-learn, and TensorFlow make Python a solid option for modern machine learning applications.
- Type safety. Python is a dynamically typed language, which means you have to be careful when working with it. Type mismatch errors (for example, passing a string as an argument to a method that expects an integer) may occur from time to time.
- For specialized statistical and data analysis tasks, R's extensive set of packages gives it an edge over Python. In addition, there are faster and safer general-purpose languages than Python.
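The type-safety point in the list above can be illustrated with a small sketch. The function name total_price and its arguments are hypothetical and chosen only for this example. It shows two ways a string passed in place of an integer can go wrong in a dynamically typed language: silently, and with an error that appears only at run time.

```python
def total_price(quantity, unit_price):
    """Multiply a quantity by a unit price; both are expected to be numbers."""
    return quantity * unit_price


def demo():
    # Intended use: two numbers.
    correct = total_price(3, 2)

    # The quantity arrives as a string (for example, read from a text file).
    # Python raises no error here: str * int repeats the string instead.
    silent_bug = total_price("3", 2)

    # With a float price the same mistake fails, but only when the line runs.
    try:
        total_price("3", 2.5)
        error_name = None
    except TypeError as exc:
        error_name = type(exc).__name__

    return correct, silent_bug, error_name


if __name__ == "__main__":
    print(demo())
```

A statically typed language would reject both mistaken calls at compile time. In Python, optional type hints checked by an external tool are one common way to catch such errors earlier.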
Our verdict: convenient in every respect.
Python is a good option for data science, and that holds for both entry-level and advanced work. Much of data science revolves around the ETL process (extract, transform, load), which makes a general-purpose language like Python a very good fit. Libraries such as Google's TensorFlow also make Python a very interesting language for machine learning.
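As a rough illustration of the ETL process mentioned above, here is a minimal sketch that uses only the Python standard library. The sample data, the table name people, and the function names extract, transform, and load are assumptions made for this example. A real pipeline would usually read from files or services and might rely on a library such as pandas.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; one row has a missing age.
RAW_CSV = """name,age
Alice,34
Bob,
Carol,27
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an age and convert ages to integers."""
    cleaned = []
    for row in rows:
        if not row["age"]:
            continue  # skip incomplete records
        cleaned.append((row["name"].strip(), int(row["age"])))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)"
    )
    connection.executemany("INSERT INTO people VALUES (?, ?)", records)
    connection.commit()


def run_etl():
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    load(transform(extract(RAW_CSV)), connection)
    query = "SELECT name, age FROM people ORDER BY name"
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own.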
SQL
SQL ("structured query language") defines, manages and queries relational databases. The language appeared in 1974 and has since undergone many modifications, but its basic principles remain unchanged.
Licensing: both free and paid implementations are available.
- It is very efficient at querying, updating, and manipulating relational databases.
- The declarative syntax makes SQL a very readable language. There is no ambiguity about what SELECT name FROM users WHERE age > 18 is supposed to do!
- SQL is used in a great many applications, so familiarity with it can be very useful. Modules such as SQLAlchemy simplify SQL integration with other languages.
- SQL syntax can be quite daunting for anyone accustomed to imperative programming.
- There are many different SQL implementations, such as PostgreSQL, SQLite, and MariaDB. Their dialects differ enough that queries are not always portable between them.
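To make the contrast between declarative and imperative styles from the list above concrete, the sketch below runs the query from the text against an in-memory SQLite database through Python's built-in sqlite3 module, next to an equivalent hand-written loop. The sample rows and function names are invented for this example.

```python
import sqlite3

# Hypothetical sample data: (name, age) pairs.
USERS = [("Ann", 17), ("Boris", 18), ("Chen", 25), ("Dana", 42)]


def adults_imperative(users):
    """Imperative style: spell out how to walk the data and filter it."""
    result = []
    for name, age in users:
        if age > 18:
            result.append(name)
    return result


def adults_declarative(users):
    """Declarative style: state what is wanted and let the database do it."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    connection.executemany("INSERT INTO users VALUES (?, ?)", users)
    rows = connection.execute(
        "SELECT name FROM users WHERE age > 18"
    ).fetchall()
    connection.close()
    # Each row is a one-element tuple; unpack it into a plain list of names.
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(sorted(adults_imperative(USERS)))
    print(sorted(adults_declarative(USERS)))
```

Both functions select the same users. Without an ORDER BY clause the row order of the SQL result is not guaranteed, which is why the usage lines sort before printing.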
Our verdict: effective, despite the fact that ...
SQL is more useful as a data processing language than as an advanced analytical tool. Nevertheless, so much of data science depends on ETL, and SQL's longevity and efficiency show that every data scientist should know it.
Java is an extremely popular general-purpose language that runs on the Java Virtual Machine (JVM). The JVM is an abstract computing machine that provides seamless portability between platforms. Java is currently maintained by Oracle.
License: version 8 is free.
- Ubiquity. Many modern systems and applications are built in Java. A huge advantage of the language is the ability to integrate data science methods directly into an existing code base.
- Strong typing. Type safety is not an empty phrase for Java, and for mission-critical big data applications it matters more than ever.
- Java is a high-performance, compiled general-purpose language. This makes it suitable for writing efficient production ETL code, as well as computationally intensive machine learning algorithms.
- Java's verbosity makes it a poor choice for ad hoc analyses and for more specialized statistical applications.
- Java does not have a large number of libraries for advanced statistical methods compared to some domain-specific languages, such as R.
Our verdict: a serious contender for the title of best language for data science.
There is much to be said for learning Java as a data science language. Many companies will appreciate being able to integrate production data science code seamlessly into their own code base, and Java's performance and type safety are indisputable advantages. Its main disadvantage is that it lacks the specialized packages available for other languages. Despite this drawback, Java is a language you should definitely pay attention to, especially if you already know R or Python.
The JVM-based Scala programming language was developed by Martin Odersky and released in 2004. It is a multi-paradigm language that supports both object-oriented and functional approaches. In addition, the Apache Spark cluster computing framework is written in Scala.
- Using Scala and Spark, you have the opportunity to work with high-performance cluster computing. Scala is the perfect choice for those who work with large amounts of data.
- Multi-paradigm. Scala programmers can use both object-oriented and functional programming styles.
- Scala compiles to Java bytecode and runs on the JVM. This allows it to interact with the Java language, making Scala a very powerful general-purpose language. In addition, it is also well suited for work in the field of data science.
- If you are just starting out with Scala, be prepared to rack your brains. It is best to download sbt and set up an IDE such as Eclipse or IntelliJ with the Scala plugin.
- Scala's syntax and type system are generally considered quite complex, so programmers who are used to dynamic languages such as Python will have a hard time.
Our verdict: ideal for working with big data.
If you decide to use cluster computing to work with big data, the Scala + Spark pair is an ideal solution. Moreover, if you already have experience with Java and other statically typed languages, you will certainly appreciate these features of Scala. However, if your application does not deal with volumes of data large enough to justify Scala's added complexity, you will probably be more productive with other languages such as R or Python.
Released just over 5 years ago, Julia impressed the world of computational methods. The language achieved such popularity due to the fact that several large organizations, including some in the financial industry, almost immediately began to use it for their own purposes.
- Julia is a JIT-compiled (just-in-time) language, which allows it to achieve good performance. It is also fairly simple and offers the dynamic typing and scripting capabilities of an interpreted language such as Python.
- Julia was designed for numerical analysis, but it can also be used as a general-purpose programming language.
- Readability. Many programmers working with this language believe that this feature is its greatest advantage.
- Immaturity. Since Julia is a fairly new language, some developers run into instability when working with its packages. However, the core of the language is considered stable.
- Another sign of immaturity is the limited number of packages and the small developer community. Unlike the well-established R and Python, Julia does not have a large selection of packages (yet).
Our verdict: a language that has yet to show what it can do.
Yes, Julia's main problem is its youth, and you cannot blame it for that. Because it was created only recently, it cannot yet compete with its main rivals, Python and R. Be patient, though: there are many reasons to keep a close eye on this language, which will surely make great strides in the near future.
MATLAB is an established language for numerical computing, used both in academia and in industry. It is developed and licensed by MathWorks, a company founded in 1984 to commercialize the software.
License: prices vary depending on the edition you choose.
- MATLAB, designed for numerical computing, is well suited to quantitative applications with demanding mathematical requirements, such as signal processing, Fourier transforms, matrix algebra, and image processing.
- Data visualization. MATLAB has a number of built-in graphing and charting capabilities.
- MATLAB is often found in many undergraduate courses in the exact sciences, such as physics, engineering, and applied mathematics. Therefore, it is widely used in these areas.
- Paid license. Whichever edition you choose (academic, personal, or commercial), you will have to fork out for an expensive license. Our tip: consider the free alternative, Octave.
- MATLAB is not the best general-purpose programming language.
Our verdict: the best option for tasks that require heavy mathematical computation.
Thanks to its wide use in quantitative work in both academia and industry, MATLAB is a worthy option for data science. It will come in handy if your daily work needs intensive, advanced mathematical functionality, which is exactly what MATLAB was designed for.
There are other popular languages that may be of interest to data professionals. This section provides a brief overview.
C++ is not often used in data science, even though it offers lightning-fast performance and is widely popular. The main reason it has not caught on in this field is how much developer time it demands.
As one of the forum participants wrote:
“Suppose you are writing code for some ad hoc analysis that will probably be run only once. Would you rather spend 30 minutes writing a program that runs in 10 seconds, or 10 minutes writing a program that runs in 1 minute?”
And this guy is right! However, C++ is a great choice for implementing low-level machine learning algorithms.
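The arithmetic behind the quote can be worked through in a few lines. The sketch below takes the numbers straight from the quote and computes how many runs it would take before the slower-to-write but faster-to-run program pays off. The function names are made up for this illustration.

```python
def total_seconds(dev_minutes, run_seconds, runs):
    """Total cost in seconds: time spent writing plus time spent running."""
    return dev_minutes * 60 + run_seconds * runs


def break_even_runs(fast_dev_min, fast_run_s, quick_dev_min, quick_run_s):
    """Number of runs at which both programs cost the same total time."""
    extra_dev_seconds = (fast_dev_min - quick_dev_min) * 60
    seconds_saved_per_run = quick_run_s - fast_run_s
    return extra_dev_seconds / seconds_saved_per_run


if __name__ == "__main__":
    # Program A: 30 minutes to write, 10 seconds per run.
    # Program B: 10 minutes to write, 60 seconds per run.
    print(total_seconds(30, 10, runs=1))    # A, run once: 1810 seconds
    print(total_seconds(10, 60, runs=1))    # B, run once: 660 seconds
    print(break_even_runs(30, 10, 10, 60))  # 24.0 runs
```

For a one-off analysis the quickly written program wins by a wide margin. With these numbers, the extra development effort only pays for itself after 24 runs.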
Our verdict: not the best choice for everyday work, but when it comes to performance ...
- JavaScript: too early for it to enter the game (Node.js is only 8 years old!) ...
Perl is known as the “Swiss Army knife of programming languages” because of its versatility as a general-purpose scripting language. It has a lot in common with Python, being a dynamically typed scripting language. But it is still very far from the popularity Python enjoys in data science.
This is a bit surprising, given its use in quantitative fields such as bioinformatics. As far as data science goes, Perl has several drawbacks: it is unlikely to gain popularity in this area quickly, and its syntax is considered unfriendly. In addition, its developers have made no real effort to create libraries for data science. And as we all know, the right actions at the right time often decide everything.
Our verdict: a useful general-purpose scripting language, but it certainly will not land you a job as a data specialist ...
Ruby is another dynamically typed, interpreted general-purpose language. However, its creators do not seem eager to make it as suitable for data science as Python is.
This may seem strange, but much of it comes down to Python's dominant position in scientific research and to the positive word of mouth from people who use it. The more people choose Python, the more modules and frameworks are developed for it, and the more programmers prefer it in turn. The SciRuby project was created to bring scientific computing functionality, such as matrix algebra, to Ruby. But despite these efforts, Python is still the leader for now.
Our verdict: not quite the right choice for data science, but knowing Ruby will not hurt your résumé.
So, that was a short guide to the programming languages most relevant to data science. The key is to understand what you need more: specificity or generality, convenience or efficiency.
I regularly use R, Python, and SQL, as my current job is mainly focused on developing existing data pipelines and ETL processes. These languages strike the right balance of generality and efficiency for this work, with the option of using R's more advanced statistical packages when necessary.
However, you may already be pretty good at Java, or eager to try out Scala for big data, or maybe you are crazy about the Julia project.
Or maybe you crammed MATLAB in classes at university, or you would not mind giving SciRuby a chance to show what it can do? You can have hundreds of different reasons! If so, leave a comment below, because it is really important for us to know what each of you thinks.
Thanks for your attention!
Which of the programming languages presented in the article is the most suitable for working with data?
- R: 29% (202 votes)
- Python: 62.3% (433 votes)
- SQL: 22.4% (156 votes)
- Java: 9.9% (69 votes)
- Scala: 10% (70 votes)
- Julia: 4.3% (30 votes)
- MATLAB: 7% (49 votes)
- C++: 9.4% (66 votes)
- Perl: 1.7% (12 votes)
- Ruby: 1.7% (12 votes)
- Other: 10% (70 votes)