
Ruby processes strings of up to 23 characters 1.92 times faster
An interesting fact: in Ruby 1.9.3 with a 64-bit interpreter, processing strings of 23 characters or fewer is almost twice as fast as processing strings of 24 characters or more. In other words, this Ruby code:
str = "1234567890123456789012" + "x"
... will be processed 1.92 times faster than this:
str = "12345678901234567890123" + "x"
For the 32-bit Ruby interpreter, the threshold lies between 11 and 12 characters.
Of course, it would be pretty silly to comb through your code and trim every string to 11 or 23 characters. The difference in performance only shows up across hundreds of thousands of strings. However, those who like to dig into the insides of the wonderful Ruby language may wonder why this happens.
The difference in performance can be seen with a simple benchmark:
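require 'benchmark'

ITERATIONS = 1_000_000

# Times ITERATIONS concatenations of str with a single extra character.
def run(str, bench)
  bench.report("#{str.length + 1} chars") do
    ITERATIONS.times do
      new_string = str + 'x'
    end
  end
end

# Driver reconstructed for completeness; the input strings are an assumption.
# It produces result strings of 21 to 27 characters.
Benchmark.bm do |bench|
  (20..26).each { |n| run('1' * n, bench) }
end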
Here are the results for strings of different lengths.
              user     system      total       real
21 chars  0.250000   0.000000   0.250000 (0.247459)
22 chars  0.250000   0.000000   0.250000 (0.246954)
23 chars  0.250000   0.000000   0.250000 (0.248440)
24 chars  0.480000   0.000000   0.480000 (0.478391)
25 chars  0.480000   0.000000   0.480000 (0.479662)
26 chars  0.480000   0.000000   0.480000 (0.481211)
27 chars  0.490000   0.000000   0.490000 (0.490404)
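The 1.92 figure from the title can be recovered from the "total" column of these results. The short sketch below redoes that arithmetic and also estimates the extra cost per string; the variable names are illustrative only and the inputs are copied from the 23-character and 24-character rows.

```ruby
# "total" column from the benchmark results, in seconds per 1,000,000 concatenations
total_23_chars = 0.25
total_24_chars = 0.48

# How many times longer the 24-character case takes
ratio = (total_24_chars / total_23_chars).round(2)
puts "24-character strings take #{ratio} times as long"

# Extra cost per single concatenation, in nanoseconds
extra_ns = ((total_24_chars - total_23_chars) / 1_000_000 * 1e9).round
puts "extra cost per string: about #{extra_ns} ns"
```

A couple of hundred nanoseconds per string is why the effect only matters when very large numbers of strings are created.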
The chart covers a few more string lengths, but the trend is clear.

Time to create 1 million strings (ms) as a function of string length (characters).
Note that this trick only works with the Ruby 1.9.3 interpreter, not with 1.8.
To figure this out, Ruby developer Pat Shaughnessy studied the Ruby Hacking Guide, including Chapter 2, which covers Ruby's basic data types, strings among them. After that, he dug into the source code of ruby.h (data type definitions) and string.c (the string implementation). The C code held the clue.
It all comes down to malloc, a standard C function that allocates memory dynamically. It is a fairly expensive operation, because it has to find a free block of the right size in the heap and then keep track of that block until it is released. The Ruby interpreter distinguishes between three kinds of strings, which can be called:
- Heap Strings
- Shared Strings
- Embedded Strings
All three are represented by the same structure, RString, but malloc is called only for the first kind (heap strings). Shared strings and embedded strings avoid it, which saves resources and improves performance. How does this optimization work? The Ruby interpreter first checks whether the string is a copy of an existing string; if it is, there is no need to allocate new memory for it. This variant of RString is the fastest to create:
struct RString {
    long len;
    char *ptr;
    VALUE shared;
};
Next, the interpreter checks the length of the string. If it is 23 characters or fewer, then once again no heap memory is allocated and malloc is not called; instead, the value is embedded directly in the RString structure via char ary[]:
struct RString {
    char ary[RSTRING_EMBED_LEN_MAX + 1];
};
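The order of checks described so far can be summarized in a toy Ruby model. This is only a sketch of the decision as the article describes it, not the interpreter's actual code from string.c; the method name string_kind, its parameters and the constant EMBED_LEN_MAX are hypothetical, and the limit of 23 assumes a 64-bit Ruby 1.9.3.

```ruby
# 64-bit limit quoted in the article (assumption: Ruby 1.9.3, 64-bit build)
EMBED_LEN_MAX = 23

# Hypothetical toy model of the decision; the real logic lives in string.c.
# length           - number of characters in the string
# copy_of_existing - true when the string is a copy of an existing string
def string_kind(length, copy_of_existing)
  return :shared if copy_of_existing          # reuses existing data, no malloc
  return :embedded if length <= EMBED_LEN_MAX # fits inside RString, no malloc
  :heap                                       # needs malloc
end

puts string_kind(30, true)   # prints "shared"
puts string_kind(23, false)  # prints "embedded"
puts string_kind(24, false)  # prints "heap"
```

Only the last branch pays for a heap allocation, which is the case the benchmark shows to be slower.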
Here lies the clue. The full definition of the RString structure looks like this:
struct RString {
    struct RBasic basic;
    union {
        struct {
            long len;
            char *ptr;
            union {
                long capa;
                VALUE shared;
            } aux;
        } heap;
        char ary[RSTRING_EMBED_LEN_MAX + 1];
    } as;
};
Here the size of the array, RSTRING_EMBED_LEN_MAX + 1, equals the combined size of the len, ptr and capa fields, that is, exactly 24 bytes. Here is the line from ruby.h that defines RSTRING_EMBED_LEN_MAX:
#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))
On a 64-bit machine, sizeof(VALUE) is 8, which gives a limit of 8 × 3 − 1 = 23 characters. On a 32-bit machine, sizeof(VALUE) is 4, which gives 4 × 3 − 1 = 11.
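The macro's arithmetic is easy to replay in Ruby. The sketch below mirrors the RSTRING_EMBED_LEN_MAX formula from ruby.h; the method name embed_len_max and the constant SIZEOF_CHAR are illustrative, and the two sizeof(VALUE) values (8 and 4 bytes) are the usual ones for 64-bit and 32-bit builds.

```ruby
# sizeof(char) is 1 by definition in C
SIZEOF_CHAR = 1

# Hypothetical helper mirroring the RSTRING_EMBED_LEN_MAX macro from ruby.h:
# ((sizeof(VALUE) * 3) / sizeof(char)) - 1
def embed_len_max(sizeof_value)
  (sizeof_value * 3) / SIZEOF_CHAR - 1
end

{ '64-bit' => 8, '32-bit' => 4 }.each do |label, sizeof_value|
  puts "#{label}: up to #{embed_len_max(sizeof_value)} characters embedded"
end
# prints:
# 64-bit: up to 23 characters embedded
# 32-bit: up to 11 characters embedded
```

These two results match the 23/24 and 11/12 character thresholds seen in the benchmarks.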
So only 23 characters of a string's value can fit directly inside the RString structure. Only when the string exceeds this length is its data placed on the heap, which means calling malloc and going through the expensive bookkeeping that comes with it. That is why "long" strings are processed more slowly.
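One way to probe this on your own interpreter is to ask Ruby how much memory it attributes to strings on either side of the boundary. The sketch below assumes the objspace extension is available and uses its ObjectSpace.memsize_of method; the helper name string_footprint is hypothetical. No expected numbers are shown because they depend on the Ruby version and platform, and later versions may lay strings out differently from 1.9.3, so the jump may not fall between 23 and 24 characters.

```ruby
require 'objspace'

# Hypothetical helper: bytes the interpreter reports for a fresh string
# of the given length. What is counted varies between Ruby versions.
def string_footprint(length)
  ObjectSpace.memsize_of('x' * length)
end

[22, 23, 24, 25].each do |n|
  puts format('%2d chars: %d bytes', n, string_footprint(n))
end
```

If the reported size changes between two adjacent lengths, that is a hint of where the interpreter switches from embedding the data to allocating it separately.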