A little about strings in C, or several ways to optimize the unoptimizable
Hello, Habr!
Not long ago I had a rather interesting encounter involving a teacher at a computer science college. A conversation about Linux programming gradually turned into him arguing that the complexity of systems programming is greatly exaggerated: that the C language is as simple as a match, and so, in his words, is the Linux kernel.
I had a Linux laptop with me, with the gentleman's set of C development tools on it (gcc, vim, make, valgrind, gdb). I no longer remember what task we set ourselves, but a couple of minutes later my opponent was sitting at the laptop, fully prepared to solve it.
And literally in the very first lines he made a serious mistake when allocating memory for... a string.
char *str = (char *)malloc(sizeof(char) * strlen(buffer));
Here buffer is a stack variable that holds data entered from the keyboard.
I think there will certainly be people who ask: "What could possibly be wrong with that?"
Believe me, plenty.
Read on to find out what exactly.
A little theory: a quick primer
If you already know all this, skip to the next heading.
A string in C is an array of characters that must always end with '\0', the end-of-string character. Strings on the stack (static) are declared like this:
char str[n] = { 0 };
Here n is the size of the character array, a compile-time constant. The string stored in it can be at most n - 1 characters long, because one element is taken by the terminating '\0'.
The initializer { 0 } zeroes the string (it is optional; you can declare the array without it). The result is the same as calling memset(str, 0, sizeof(str)) or bzero(str, sizeof(str)). It keeps garbage out of uninitialized variables.
You can also initialize a string on the stack right away:
char buf[BUFSIZE] = "default buffer text\n";
In addition, you can declare a string as a pointer and allocate memory for it on the heap:
char *str = malloc(size);
Here size is the number of bytes we allocate for the string. Such strings are called dynamic, because the required size is computed at run time and the allocated block can be enlarged at any moment with the realloc() function.
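To illustrate the remark about realloc(), here is a minimal sketch of growing a dynamic string. The helper name append_string is my own invention for this example and does not come from any library; the sketch assumes the caller owns the string and frees it afterwards.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends suffix to a heap string, growing it with realloc().
 * dst may be NULL, in which case a new string is created.
 * Returns the (possibly moved) string, or NULL on failure;
 * on failure dst is left untouched and still belongs to the caller. */
char *append_string(char *dst, const char *suffix)
{
    size_t old_len = (dst != NULL) ? strlen(dst) : 0;
    size_t add_len = strlen(suffix);

    /* +1 leaves room for the terminating '\0' */
    char *grown = realloc(dst, old_len + add_len + 1);
    if (grown == NULL)
        return NULL;

    /* copy the suffix together with its terminator */
    memcpy(grown + old_len, suffix, add_len + 1);
    return grown;
}
```

Calling append_string(NULL, "Hello, ") and then append_string(s, "Habr!") on the result builds the string in two steps; the result is stored in a separate pointer first so that the old block is not lost if realloc() fails.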
For the stack variable I used the name n for the array size; for the heap variable I used the name size. This neatly reflects the real difference between a declaration on the stack and an allocation on the heap: n usually denotes a number of elements, whereas size is a number of bytes, and that is a completely different story...
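A small sketch of how the capacity of a stack array relates to the length of the text stored in it. The value of BUFSIZE is an assumption made for this example (the article never defines it), and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define BUFSIZE 64 /* assumed value, chosen only for this example */

/* Number of elements in the stack array: sizeof sees the whole array. */
size_t buffer_capacity(void)
{
    char buf[BUFSIZE] = "default buffer text\n";
    return sizeof(buf) / sizeof(buf[0]);
}

/* Length of the text stored in the array, not counting the '\0'. */
size_t buffer_text_length(void)
{
    char buf[BUFSIZE] = "default buffer text\n";
    return strlen(buf);
}
```

The capacity is fixed by the declaration, while the length depends on where the first '\0' happens to be. For a char * pointing into the heap, sizeof would give only the size of the pointer, which is why the size of a dynamic string has to be tracked separately.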
I think that is enough for now. Let's move on.
Valgrind will help us
I already mentioned it in my previous article. Valgrind (see the wiki article and a short how-to) is a very useful program that helps a programmer track down memory leaks and context errors, the very things that crop up most often when working with strings.
Let's look at a short listing that implements something similar to the program I mentioned, and run it through valgrind:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HELLO_STRING "Hello, Habr!\n"

int main(void) {
    /* Deliberate bug: no room is allocated for the terminating '\0' */
    char *str = malloc(sizeof(char) * strlen(HELLO_STRING));
    strcpy(str, HELLO_STRING);
    printf("->\t%s", str);
    free(str);
    return 0;
}
And here is the program's output:
[indever@localhost public]$ gcc main.c
[indever@localhost public]$ ./a.out
-> Hello, Habr!
So far, nothing unusual. Now let's run this program with valgrind!
[indever@localhost public]$ valgrind --tool=memcheck ./a.out
==3892== Memcheck, a memory error detector
==3892== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3892== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==3892== Command: ./a.out
==3892==
==3892== Invalid write of size 2
==3892== at 0x4005B4: main (in /home/indever/prg/C/public/a.out)
==3892== Address 0x520004c is 12 bytes inside a block of size 13 alloc'd
==3892== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==3892== by 0x400597: main (in /home/indever/prg/C/public/a.out)
==3892==
==3892== Invalid read of size 1
==3892== at 0x4C30BC4: strlen (vg_replace_strmem.c:454)
==3892== by 0x4E89AD0: vfprintf (in /usr/lib64/libc-2.24.so)
==3892== by 0x4E90718: printf (in /usr/lib64/libc-2.24.so)
==3892== by 0x4005CF: main (in /home/indever/prg/C/public/a.out)
==3892== Address 0x520004d is 0 bytes after a block of size 13 alloc'd
==3892== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==3892== by 0x400597: main (in /home/indever/prg/C/public/a.out)
==3892==
-> Hello, Habr!
==3892==
==3892== HEAP SUMMARY:
==3892== in use at exit: 0 bytes in 0 blocks
==3892== total heap usage: 2 allocs, 2 frees, 1,037 bytes allocated
==3892==
==3892== All heap blocks were freed -- no leaks are possible
==3892==
==3892== For counts of detected and suppressed errors, rerun with: -v
==3892== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
==3892== All heap blocks were freed -- no leaks are possible
There are no leaks, which is nice. But look a little lower (although, mind you, this is only the summary; the essential information is elsewhere):
==3892== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)
3 errors. In 2 contexts. In such a simple program. How!?
Very simply. The whole "trick" is that the strlen function does not count the end-of-string character, '\0'. Even if it is written explicitly in the input string (#define HELLO_STRING "Hello, Habr!\n\0"), it is ignored.
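This is easy to check with a small sketch. The second macro and the function names below are hypothetical and exist only for this example; the first macro is the one from the listing above.

```c
#include <stddef.h>
#include <string.h>

#define HELLO_STRING     "Hello, Habr!\n"
#define HELLO_STRING_NUL "Hello, Habr!\n\0" /* explicit terminator added */

/* strlen() stops at the first '\0' and does not count it. */
size_t hello_length(void)     { return strlen(HELLO_STRING); }
size_t hello_nul_length(void) { return strlen(HELLO_STRING_NUL); }

/* sizeof on a string literal does count the implicit terminator. */
size_t hello_bytes(void)      { return sizeof(HELLO_STRING); }
size_t hello_nul_bytes(void)  { return sizeof(HELLO_STRING_NUL); }
```

Both lengths come out as 13, which matches the "block of size 13" in the valgrind report, while the literals occupy 14 and 15 bytes respectively.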
Just above the program's output (the line -> Hello, Habr!) there is a detailed report of what our precious valgrind disliked and where. I suggest you look at those lines yourself and draw your own conclusions.
The correct version of the program looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HELLO_STRING "Hello, Habr!\n"

int main(void) {
    /* + 1 reserves a byte for the terminating '\0' */
    char *str = malloc(sizeof(char) * (strlen(HELLO_STRING) + 1));
    strcpy(str, HELLO_STRING);
    printf("->\t%s", str);
    free(str);
    return 0;
}
Run it through valgrind:
[indever@localhost public]$ valgrind --tool=memcheck ./a.out
-> Hello, Habr!
==3435==
==3435== HEAP SUMMARY:
==3435== in use at exit: 0 bytes in 0 blocks
==3435== total heap usage: 2 allocs, 2 frees, 1,038 bytes allocated
==3435==
==3435== All heap blocks were freed -- no leaks are possible
==3435==
==3435== For counts of detected and suppressed errors, rerun with: -v
==3435== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Excellent. No errors: one extra byte of allocated memory solved the problem.
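The same pattern can be wrapped in a small helper so that the + 1 lives in one place. This is only a sketch; copy_string is a hypothetical name, and on POSIX systems strdup() already does the same job.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src, or NULL if malloc() fails.
 * The caller is responsible for calling free() on the result. */
char *copy_string(const char *src)
{
    size_t size = strlen(src) + 1; /* text plus the terminating '\0' */
    char *dst = malloc(size);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, size); /* copies the terminator as well */
    return dst;
}
```

Unlike the listings above, the sketch also checks the result of malloc(), which costs two lines and saves a crash on allocation failure.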
Interestingly, in most cases the first and the second program behave identically, although formally the first one has undefined behavior: the end-of-string character is written past the end of the allocated block. If that byte ends up holding something other than zero, printf() will print the string together with all the garbage after it, and keep going until it runs into a terminating character.
However, (strlen(str) + 1) is a so-so solution. It leaves us with two problems:
- What if we need to allocate memory for a string built with, say, s(n)printf(..)? Arguments are not supported.
- Appearance. The line with the variable declaration looks simply awful. Some people even manage to bolt a (char *) cast onto malloc, as if they were writing C++. In a program that processes strings regularly, it makes sense to find a more elegant solution.
Let's come up with a solution that satisfies both us and valgrind.
snprintf()
int snprintf(char *str, size_t size, const char *format, ...);
This function is an extension of sprintf(): it formats a string and writes it to the buffer passed as the first argument. It differs from sprintf() in that no more than size bytes are written to str, including the terminating '\0'. The function has one interesting property: it always returns the length the complete formatted string would have (not counting the end-of-string character), even if the output had to be truncated. For an empty string it returns 0, and on an output error it returns a negative value.
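A minimal sketch of that property: the buffer below is deliberately too small, yet the return value still reports the full length. The function name format_truncated is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats the greeting into a caller-supplied buffer
 * and returns whatever snprintf() reports. */
int format_truncated(char *out, size_t out_size)
{
    /* At most out_size - 1 characters are stored, followed by '\0'. */
    return snprintf(out, out_size, "Hello, %s!\n", "Habr");
}
```

With an 8-byte buffer the stored text is "Hello, " (7 characters plus the terminator), while the return value is 13, the length of the complete string. Comparing the return value with the buffer size is therefore the usual way to detect truncation.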
One of the problems with strlen that I described concerns the sprintf() and snprintf() functions. Suppose we need to write something into the string str, and the resulting string includes the values of other variables. The code should look something like this:
char *str = /* allocate memory here */;
sprintf(str, "Hello, %s\n", "Habr!");
The question is: how do we determine how much memory to allocate for str?
char *str = malloc(sizeof(char) * (strlen(str, "Hello, %s\n", "Habr!") + 1));
This will not work. The prototype of strlen() looks like this:
#include <string.h>
size_t strlen(const char *s);
The single parameter const char *s leaves no room for a format string with a variable number of arguments.
This is where the useful property of snprintf() mentioned above comes to the rescue. Consider the following program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* snprintf() does not count the end-of-string character, so we add its size to the result.
       Note: in C, '\0' has type int, so sizeof('\0') equals sizeof(int), not 1; a plain + 1 is enough. */
    size_t needed_mem = snprintf(NULL, 0, "Hello, %s!\n", "Habr") + sizeof('\0');
    char *str = malloc(needed_mem);
    snprintf(str, needed_mem, "Hello, %s!\n", "Habr");
    printf("->\t%s", str);
    free(str);
    return 0;
}
Run the program under valgrind:
[indever@localhost public]$ valgrind --tool=memcheck ./a.out
-> Hello, Habr!
==4132==
==4132== HEAP SUMMARY:
==4132== in use at exit: 0 bytes in 0 blocks
==4132== total heap usage: 2 allocs, 2 frees, 1,041 bytes allocated
==4132==
==4132== All heap blocks were freed -- no leaks are possible
==4132==
==4132== For counts of detected and suppressed errors, rerun with: -v
==4132== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[indever@localhost public]$
Excellent. We now have support for arguments. Because we pass zero as the second argument to snprintf(), nothing is ever written through the null pointer, so it cannot cause a segfault. Even so, the function still returns the length required for the string.
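The measure-then-allocate idea can be sketched as a helper with error checks added. The name make_greeting is hypothetical; the sketch assumes a C99 snprintf(), which is what allows the NULL buffer with size 0.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: builds "Hello, <name>!\n" on the heap.
 * Returns NULL on a formatting or allocation failure;
 * the caller frees the result. */
char *make_greeting(const char *name)
{
    /* First pass: nothing is written, we only learn the length. */
    int len = snprintf(NULL, 0, "Hello, %s!\n", name);
    if (len < 0)
        return NULL;

    size_t size = (size_t)len + 1; /* + 1 for the terminating '\0' */
    char *str = malloc(size);
    if (str == NULL)
        return NULL;

    /* Second pass: the same format and arguments, now into real memory. */
    snprintf(str, size, "Hello, %s!\n", name);
    return str;
}
```

The negative-return check matters because snprintf() returns int: adding a size to -1 and passing the result to malloc() would silently produce a wrong allocation size.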
On the other hand, we had to introduce an extra variable, and the construct
size_t needed_mem = snprintf(NULL, 0, "Hello, %s!\n", "Habr") + sizeof('\0');
looks even worse than the strlen() version.
It might seem that + sizeof('\0') can be dropped by explicitly putting '\0' at the end of the format string (size_t needed_mem = snprintf(NULL, 0, "Hello, %s!\n\0", "Habr");), but that does not help: just like strlen(), snprintf() stops reading the format at the first '\0', so the returned length does not change and the byte for the terminator still has to be added by hand.
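One subtlety of the sizeof('\0') idiom is worth a quick check: in C (unlike C++) a character constant has type int. The sketch below uses hypothetical function names.

```c
#include <stddef.h>

/* In C, '\0' is an int, so this is sizeof(int), typically 4. */
size_t size_of_char_constant(void)
{
    return sizeof('\0');
}

/* sizeof(char) is 1 by definition. */
size_t size_of_char(void)
{
    return sizeof(char);
}
```

This is consistent with the valgrind log above: 1,041 bytes were allocated instead of the 1,038 from the strlen() version, which is what you would expect if the block is 17 bytes rather than 14. Nothing breaks, but a plain + 1 allocates exactly what is needed.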
Something has to be done. I thought a little and decided that now is the time to call upon the wisdom of the ancients. Let's define a function-like macro that calls snprintf() with a null pointer as the first argument and zero as the second. And let's not forget about the end of the string!
#define strsize(args...) (snprintf(NULL, 0, args) + sizeof('\0'))
Yes, this may be news to some, but macros in C support a variable number of arguments: the ellipsis tells the preprocessor that the named macro parameter (args in this case) stands for several actual arguments. The named form args... is a GNU extension; standard C99 spells it as ... in the parameter list and __VA_ARGS__ in the body.
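For compilers without the GNU extension, roughly the same macro can be written with the standard C99 syntax. This is a sketch, not a drop-in replacement: STRSIZE and greeting_size are hypothetical names, and the macro assumes snprintf() does not fail (a negative return would turn into a huge size_t).

```c
#include <stddef.h>
#include <stdio.h>

/* C99 variadic macro: __VA_ARGS__ stands for everything passed in "...".
 * The outer parentheses keep the expansion safe inside larger expressions. */
#define STRSIZE(...) ((size_t)snprintf(NULL, 0, __VA_ARGS__) + 1)

/* Hypothetical usage: bytes needed for "Hello, <name>\n" plus the terminator. */
size_t greeting_size(const char *name)
{
    return STRSIZE("Hello, %s\n", name);
}
```

For the name "Habr!" this gives 14: the 13 characters of the formatted string plus one byte for '\0'.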
Let's test our solution in practice:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parenthesized so that the macro expands safely inside larger expressions */
#define strsize(args...) (snprintf(NULL, 0, args) + sizeof('\0'))

int main(void) {
    char *str = malloc(strsize("Hello, %s\n", "Habr!"));
    sprintf(str, "Hello, %s\n", "Habr!");
    printf("->\t%s", str);
    free(str);
    return 0;
}
Run it under valgrind:
[indever@localhost public]$ valgrind --tool=memcheck ./a.out
-> Hello, Habr!
==6432==
==6432== HEAP SUMMARY:
==6432== in use at exit: 0 bytes in 0 blocks
==6432== total heap usage: 2 allocs, 2 frees, 1,041 bytes allocated
==6432==
==6432== All heap blocks were freed -- no leaks are possible
==6432==
==6432== For counts of detected and suppressed errors, rerun with: -v
==6432== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Yes, there are no errors. Everything is correct. Valgrind is happy, and the programmer can finally go to sleep.
But before we finish, one more thing. If we need to allocate memory for an arbitrary string (even one with arguments), there is already a fully working ready-made solution.
It is the asprintf function:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
int asprintf(char **strp, const char *fmt, ...);
As its first argument it takes a pointer to a string pointer (char **strp); it allocates the memory itself and stores the address of the new string in *strp.
Our program rewritten with asprintf() looks like this:
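#define _GNU_SOURCE /* must come before the includes for asprintf() to be declared */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *str;
    /* asprintf() returns -1 on failure, and str is then not guaranteed to be valid */
    if (asprintf(&str, "Hello, %s!\n", "Habr") < 0)
        return 1;
    printf("->\t%s", str);
    free(str);
    return 0;
}
And here it is under valgrind: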
[indever@localhost public]$ valgrind --tool=memcheck ./a.out
-> Hello, Habr!
==6674==
==6674== HEAP SUMMARY:
==6674== in use at exit: 0 bytes in 0 blocks
==6674== total heap usage: 3 allocs, 3 frees, 1,138 bytes allocated
==6674==
==6674== All heap blocks were freed -- no leaks are possible
==6674==
==6674== For counts of detected and suppressed errors, rerun with: -v
==6674== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Everything works, but as you can see, more memory was allocated in total, and there are now three allocs instead of two. On weak embedded systems using this function is undesirable.
In addition, if we type man asprintf in the console, we will see:
CONFORMING TO
These functions are GNU extensions, not in C or POSIX. They are also available under *BSD. The FreeBSD implementation sets strp to NULL on error.
This makes it clear that the function is not part of standard C or POSIX: it is a GNU extension, so portable code cannot rely on it.
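Where asprintf() is unavailable, a similar helper can be sketched on top of the standard vsnprintf(). The name format_alloc is hypothetical and this is not how glibc implements asprintf(); it simply combines the two-pass snprintf() trick from above with a va_list.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical portable stand-in for asprintf(), using only C99.
 * On success stores a heap string in *strp and returns its length;
 * on failure sets *strp to NULL and returns -1. */
int format_alloc(char **strp, const char *fmt, ...)
{
    va_list args, args_copy;
    va_start(args, fmt);
    va_copy(args_copy, args); /* a va_list can be traversed only once */

    /* First pass: measure the formatted length without writing anything. */
    int len = vsnprintf(NULL, 0, fmt, args_copy);
    va_end(args_copy);

    char *buf = (len < 0) ? NULL : malloc((size_t)len + 1);
    if (buf == NULL) {
        va_end(args);
        *strp = NULL;
        return -1;
    }

    /* Second pass: format into the freshly allocated block. */
    vsnprintf(buf, (size_t)len + 1, fmt, args);
    va_end(args);

    *strp = buf;
    return len;
}
```

The va_copy() call is the important detail: the argument list is consumed by the measuring pass, so the second pass needs its own copy.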
Conclusion
In conclusion, I want to say that working with strings in C is a complicated topic with a number of nuances. For example, to write "safe" code when allocating memory dynamically, it is recommended to use calloc() instead of malloc(), because calloc fills the allocated memory with zeros. Alternatively, call memset() right after allocating. Otherwise the garbage that was lying in the allocated memory may raise questions during debugging, and sometimes while working with the string itself.
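A minimal sketch of the calloc() advice; new_zeroed_string is a hypothetical name used only here.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates a string buffer of the given size
 * with every byte already set to zero. Returns NULL on failure;
 * the caller frees the result. */
char *new_zeroed_string(size_t size)
{
    /* calloc(count, element_size) zero-fills the whole block */
    return calloc(size, sizeof(char));
}
```

A buffer obtained this way is a valid empty string from the start, so any prefix written into it stays terminated as long as the last byte is left alone.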
More than half of the C programmers I know (most of them beginners) who solved the problem of allocating memory for strings at my request did it in a way that eventually led to context errors. In one case it even led to a memory leak (well, the person forgot to call free(str); it happens to everyone). In fact, that is what prompted me to write the piece you have just read.
I hope someone finds this article useful. Why did I go to all this trouble? Because no language is simple. Every language has its own subtleties, and the more of them you know, the better your code.
I believe that after reading this article your code will become a little better :)
Good luck, Habr!