
Interpreter bottlenecks
This note is intended for young programmers who are just beginning to use interpreted programming languages, or have been using them for some time, but have not yet studied how the language itself works.
Nowadays, thanks to potentially decent salaries and office-type work, programming has become quite popular among young people. Languages that are easy enough to get started with are in particular demand: JavaScript, PHP, Perl, Python, Java, C#, Basic, ... (as you can see, they all belong, broadly speaking, to the same family: interpreters). As a result, the industry has gained a fairly large number of workers who never formally studied programming anywhere. A programmer in language “X” is needed, so you buy the book “X in 2 weeks”, and 3 weeks later you are already writing some kind of project in “X”. Then, after a few thousand lines of code, or once the database fills up with real data, the project begins to slow down mercilessly. You can, of course, “go play the drums” until the hardware catches up with your project.
What is usually the main problem? Usually it is a lack of understanding of what actually happens when command “Y” is executed. Programming languages are like languages of communication: the same thing can be explained in different words. But in the case of computers, it is better if the explanation is as concise as possible, “better” in the sense of execution speed. Moreover, the conciseness must be at the level of the language the central processor understands, not the one you understand. I mean that the brevity of the name of a function you call in a particular programming language does not affect performance (there are exceptions); performance is affected by what that function does internally. And to judge that, it is worth understanding how the computer actually works with numbers, strings, arrays, functions, and so on.
To keep this note from growing too long, I will not describe here what works how at a low level. The explanation can differ from language to language. You can look for detailed information on the Internet or in books, or figure it out yourself. If there is interest, I can answer questions in the comments and collect them all in a separate note. For now, I will only explain where the common oversights come from and what you should pay attention to.
First, let's look at the classes of programming languages. I would divide them into 3 groups (perhaps that is how they are usually divided anyway):
- Assembler
- Compilers
- Interpreters
Assembler is, by and large, the language of the central processor itself. What the programmer writes is in a human-readable syntax, but the output is the same thing, only in a syntax the processor understands. Some assemblers, in addition to this simple translation, also analyze your code and try to optimize it, but that does not change the essence. If you know how to program in assembler, you know the nuances of the hardware and therefore have the opportunity to implement any task in the most optimal way.
Compilers are languages that are friendlier to the programmer: C, C++, Pascal, ... Writing anything in them is much easier because conditional jumps, loops, variables, and functions are expressed directly in the syntax of the language. As a result, you no longer need to write many CPU instructions to implement a complex loop. On top of that, they provide various constructs that the CPU (central processing unit) knows nothing about, but which make it much easier to structure the logic of a program (classes, objects, records, arrays, ...). During compilation the program is translated into the language that a particular CPU understands. Since assembler and the language the CPU understands are essentially the same thing, a compiled program can always be converted back to assembler (disassembly). Translating a compiled program back into a compiled language is a much harder task, since some machine-code constructs cannot always be mapped cleanly onto the simplified syntax of compiled languages. In addition, the names of variables and functions are generally lost during compilation and cannot be restored (unless the program was compiled in debug mode).
Interpreters: these languages represent the highest evolutionary stage: JavaScript, PHP, Perl, Python, Java, C#, Basic, ... Their distinguishing feature is their potential independence from the platform on which the application runs.
Programs written in assembler can run only on the type of processor they were written for, since they consist of that processor's instructions and other processors simply will not understand them.
Programs written in compiled languages run only on the platforms they were compiled for. In theory a program can be compiled for different platforms, but in practice, if that is required, the features of every target platform have to be taken into account already at the programming stage.
Programs written in interpreted languages are executed by an intermediate layer program, which reads your code at run time and translates it into a language the CPU understands. As a result, the portability of your application has become the concern of the interpreter's developers: it is now their job to build this layer for different systems so that your program works on all of them. But since systems differ considerably, complete independence is not always achievable. If a function “Z” exists under Linux but not under Windows, you will either have to do without it or your program will work only under Linux (functions for working with the file system are an example).
The main disadvantage of interpreted languages is their execution speed. Clearly, a program compiled into a language the CPU understands is executed by the CPU right away, while a program written in an interpreted language must first be parsed and translated into a language the CPU understands, and only then can the CPU begin executing it. Modern interpreters have adopted a number of measures to combat this shortcoming. In addition to a reasonably good optimizer and a caching system, they translate your program into bytecode (either on the fly or in a separate step that resembles compilation). The layer program then no longer needs to parse your “handwritten text” every time: it does so either only once or not at all (if the program has already been converted to bytecode). Instead of your “manuscript”, it works with the bytecode of your program. Bytecode is very similar to the CPU's language, but it is not the CPU's language (it is more platform independent), so it still has to be translated into the language of the CPU. It is therefore clear that the rumors about Java running faster than C++ are noticeably exaggerated, and this will remain so until processors learn to understand Java bytecode.
Now, after this short general description of how interpreters work, I would like to mention 3 topics that you can ignore when writing a small project, but which can at times give a significant performance boost when understood and used correctly.
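As a small illustration of a function that exists on one system but not another, here is a sketch in Python. The choice of `os.fork` as the stand-in for function “Z” is mine, not the author's, and the helper names `can_fork` and `describe_platform` are hypothetical.

```python
import os


def can_fork() -> bool:
    """Return True if this interpreter exposes os.fork.

    os.fork is only provided on systems that support it (POSIX systems
    such as Linux); on Windows the attribute is simply absent.
    """
    return hasattr(os, "fork")


def describe_platform() -> str:
    """Pick a code path based on what the current platform offers."""
    if can_fork():
        return "fork available: the platform-specific code path can be used"
    return "fork missing: fall back to a portable alternative"


if __name__ == "__main__":
    print(os.name, "->", describe_platform())
```

Checking for the feature at run time is one way to "do without it" on systems where it is missing, instead of tying the whole program to a single platform.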
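To make the idea of bytecode more tangible, here is a minimal sketch using CPython, where the standard `dis` module can show the bytecode the interpreter produced for a function. The function `add_tax` is a made-up example, and the exact instructions listed differ between Python versions, so no particular output is shown.

```python
import dis


def add_tax(price, rate):
    """A tiny made-up function whose bytecode we want to inspect."""
    return price + price * rate


# The compiled form lives on the function object: a bytes string that the
# interpreter loop executes instead of re-reading the source text.
raw_bytecode = add_tax.__code__.co_code

# Human-readable names of the individual bytecode instructions.
opnames = [instruction.opname for instruction in dis.get_instructions(add_tax)]

if __name__ == "__main__":
    print(len(raw_bytecode), "bytes of bytecode")
    dis.dis(add_tax)  # prints a listing; the contents depend on the Python version
```

Even for a one-line function the listing contains several instructions, each of which the interpreter still has to decode and carry out with native code of its own.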
Language features that are already compiled
A programming language is not only syntax. It is also a set of ready-made function libraries for working with various data and devices. In compiled languages these libraries do not differ much, but in interpreters there is a difference. In some languages, such as Java, these functions are written in the interpreted language itself. In others, such as JavaScript or PHP, they are written in compiled languages, that is, by the time your program runs they are already compiled. Calling them therefore requires no additional processing, and they execute much faster than the same thing written in the interpreted language. Therefore, if a built-in function of this kind can solve your problem, even if it does something superfluous along the way, try to use it instead of writing your own constructs, complex or not. For example, to split a string into substrings by a complicated condition, it is better to use a regular expression than to write your own loop that processes the same string manually.
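The string-splitting example can be sketched in Python, where the `re` module's matching engine is implemented in C in CPython. The delimiter set and the function names `split_manual` and `split_builtin` are assumptions made for illustration; actual timings depend on the interpreter, its version, and the input, so measure rather than trust any fixed ratio.

```python
import re
import timeit

DELIMITERS = ",; \t"


def split_manual(text):
    """Split on runs of delimiters with a character-by-character loop."""
    parts, current = [], []
    for ch in text:
        if ch in DELIMITERS:
            if current:  # close the token collected so far
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # flush the last token
        parts.append("".join(current))
    return parts


_SPLITTER = re.compile(r"[,; \t]+")


def split_builtin(text):
    """Same result, but the scanning happens inside the regex engine."""
    # re.split yields empty strings at leading/trailing delimiters; drop them.
    return [part for part in _SPLITTER.split(text) if part]


if __name__ == "__main__":
    sample = "alpha, beta;gamma  delta\t" * 1000
    assert split_manual(sample) == split_builtin(sample)
    for fn in (split_manual, split_builtin):
        seconds = timeit.timeit(lambda: fn(sample), number=20)
        print(f"{fn.__name__}: {seconds:.4f} s")
```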
Complex but easy to use frameworks
In addition to function libraries, some enthusiasts also try to transform the syntax and logic of a language, introducing ideas that simplify working with structures and/or data (jQuery, LINQ, ORM, ...). If the language is compiled, this is not so scary. But in interpreters, blindly diving into third-party abstractions is harmful. Yes, such wrappers are often genuinely more convenient, but that convenience is almost always paid for with speed. Just look at the source code of these “helpers” and you will see that it is sometimes much more efficient to call a couple of built-in functions that do exactly what you need than one universal third-party function, which executes a “ton” of code internally before it figures out what you want from it and finally does it. For example, to get all DIVs in JavaScript you can call the built-in function “document.getElementsByTagName("DIV")” directly, which returns what you need right away, or you can call the beautiful jQuery function “$("DIV")”, which runs a couple of regular expressions, several checks, and a “manual” merging of arrays, and only then returns the result.
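The same trade-off can be sketched outside the browser. The Python example below is purely hypothetical: `ELEMENTS`, `by_tag`, and `select` are invented for illustration and are not how jQuery or any real library is implemented. It only shows the general shape of a universal helper that must parse and validate its argument before it can do the actual work.

```python
import re

# Hypothetical stand-in for a page: a flat list of elements.
ELEMENTS = [
    {"tag": "div", "id": "header"},
    {"tag": "p", "id": "intro"},
    {"tag": "div", "id": "footer"},
]


def by_tag(tag):
    """Direct, single-purpose lookup: one pass, one comparison per element."""
    return [el for el in ELEMENTS if el["tag"] == tag]


_SELECTOR = re.compile(r"^(?P<kind>#?)(?P<name>[\w-]+)$")


def select(selector):
    """Hypothetical universal helper.

    It type-checks the argument, parses it with a regular expression,
    decides what kind of query it is, and only then does the lookup.
    """
    if not isinstance(selector, str):
        raise TypeError("selector must be a string")
    match = _SELECTOR.match(selector.strip())
    if match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    name = match.group("name")
    if match.group("kind") == "#":
        return [el for el in ELEMENTS if el["id"] == name]
    return by_tag(name.lower())


if __name__ == "__main__":
    # Both calls return the same elements; select() just does more work first.
    print(by_tag("div") == select("DIV"))
```

Every step inside `select` runs as interpreted code on each call, which is the overhead the paragraph above warns about when the direct call would have sufficed.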
Work with strings
And finally, the last thing I wanted to focus on is working with strings. In interpreted languages, working with strings has become so transparent that it is not at all obvious that these are some of the most resource-intensive operations. This is usually known only to those who have handled strings manually, at least in compiled languages (which also have functions to ease this work). The problem is that almost any string operation (creating a string, concatenating strings, splitting into substrings, deleting a substring, replacing a substring) involves finding free memory of the required length for the new string and copying the resulting data to the new location. Even operations that look simple at first glance, such as searching within a string, are not particularly fast compared to working with ASCII now that complex encodings such as UTF-8 are in use. Therefore, do not overuse strings where you can do without them. Take associative arrays, for example: if you can get by with a numerically indexed array, do so!
It is worth noting that, with modern processors, in a function that does almost nothing you may not feel any performance difference between optimized code and code written in a hurry. The difference becomes more obvious in places where the hastily written code is executed many times (in a loop, in a frequently called function) or where there is a lot of such code.
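A few of these points can be made concrete with a short Python sketch. The function and variable names are made up for illustration, and how much each variant costs depends on the interpreter (CPython, for instance, can sometimes optimize repeated `+=` on strings), so treat this as a demonstration of the idea rather than a benchmark.

```python
def build_with_concat(pieces):
    """Conceptually, every += produces a new string: memory for the longer
    result is found and the existing characters are copied into it."""
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    """str.join builds the final string in one call instead of creating
    an intermediate string per piece."""
    return "".join(pieces)


# In UTF-8 a character is not always one byte, which is one reason why
# string operations are more involved than with plain ASCII.
word = "привет"
char_count = len(word)                  # 6 characters
byte_count = len(word.encode("utf-8"))  # 12 bytes, 2 per Cyrillic letter

# The same data keyed by strings (associative array) and by position.
scores_by_name = {"first": 10, "second": 20, "third": 30}
scores_by_index = [10, 20, 30]

if __name__ == "__main__":
    pieces = ["ab", "cd", "ef"]
    print(build_with_concat(pieces) == build_with_join(pieces))
    print(char_count, byte_count)
    print(scores_by_name["second"] == scores_by_index[1])
```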
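One way to find out whether a difference matters in your own code is to measure it. The sketch below uses Python's standard `timeit` module; `scale_naive`, `scale_hoisted`, and `measure` are hypothetical names, and the numbers you get will vary by machine and interpreter.

```python
import timeit


def scale_naive(values, factors):
    """Recomputes max(factors) on every iteration, although it never changes."""
    out = []
    for v in values:
        out.append(v * max(factors))
    return out


def scale_hoisted(values, factors):
    """Computes the invariant once, outside the loop."""
    top = max(factors)
    return [v * top for v in values]


def measure(fn, *args, repeats=5):
    """Best wall-clock time in seconds of several single runs of fn(*args)."""
    return min(timeit.repeat(lambda: fn(*args), number=1, repeat=repeats))


if __name__ == "__main__":
    values = list(range(20_000))
    factors = list(range(200))
    assert scale_naive(values, factors) == scale_hoisted(values, factors)
    for fn in (scale_naive, scale_hoisted):
        print(f"{fn.__name__}: {measure(fn, values, factors):.4f} s")
```

With a handful of values both versions finish instantly; the gap only becomes visible as the loop count grows, which is the situation the paragraph above describes.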
Good luck!