alizar January 8, 2012 at 01:20

Strings up to 23 characters in Ruby process 1.92 times faster

An interesting fact: in Ruby 1.9.3 with a 64-bit interpreter, strings of 23 characters or fewer are processed almost twice as fast as strings of 24 characters or more. In other words, this Ruby code:

str = "1234567890123456789012" + "x"

... will be processed 1.92 times faster than this:

str = "12345678901234567890123" + "x"

For the 32-bit Ruby interpreter, the threshold lies between 11 and 12 characters.

Of course, it would be silly to comb through your code and trim every string to 11 or 23 characters. The difference only becomes noticeable over hundreds of thousands of string operations. Still, anyone who likes to dig into the internals of the wonderful Ruby language may wonder why this happens.

The difference in performance can be seen with a simple benchmark:

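require 'benchmark'

ITERATIONS = 1_000_000

# Times ITERATIONS concatenations, each producing a string one character longer than str.
def run(str, bench)
  bench.report("#{str.length + 1} chars") do
    ITERATIONS.times do
      new_string = str + 'x'
    end
  end
end

# Driver (not shown in the original snippet): inputs of 20..26 characters
# produce results of 21..27 characters.
Benchmark.bm(8) do |bench|
  (20..26).each { |n| run('1' * n, bench) }
end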

Here are the results for strings of different lengths.

              user    system     total        real
21 chars  0.250000  0.000000  0.250000  (0.247459)
22 chars  0.250000  0.000000  0.250000  (0.246954)
23 chars  0.250000  0.000000  0.250000  (0.248440)
24 chars  0.480000  0.000000  0.480000  (0.478391)
25 chars  0.480000  0.000000  0.480000  (0.479662)
26 chars  0.480000  0.000000  0.480000  (0.481211)
27 chars  0.490000  0.000000  0.490000  (0.490404)

The full table contains a few more data points, but the trend is clear.

[Figure: time to create 1 million strings (ms) as a function of string length (characters).]
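As a quick check, the 1.92 figure in the title can be reproduced from the benchmark table above: it is the ratio of the user times on either side of the threshold. The sketch below copies two rows of that table; TIMINGS and slowdown are hypothetical names chosen only for this illustration.

```ruby
# Timings copied from the benchmark table above
# (seconds per 1,000,000 concatenations).
# TIMINGS and slowdown are names invented for this illustration.
TIMINGS = {
  23 => { user: 0.25, real: 0.248440 },
  24 => { user: 0.48, real: 0.478391 }
}.freeze

# Ratio of the 24-character timing to the 23-character timing for one column.
def slowdown(column)
  TIMINGS[24][column] / TIMINGS[23][column]
end

puts format('user: %.2f', slowdown(:user))
puts format('real: %.2f', slowdown(:real))
```

The user column gives 0.48 / 0.25 = 1.92; the wall-clock column gives a slightly higher ratio.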

Note that this effect is specific to the Ruby 1.9.3 interpreter; it does not appear in 1.8.

To figure this out, Ruby developer Pat Shaughnessy studied Ruby's internals with the help of the Ruby Hacking Guide, including Chapter 2, which covers the basic Ruby data types, strings among them. He then dug into the source code: ruby.h (data type definitions) and string.c (the string implementation). The clue was in the C code.

It all comes down to malloc, the standard C function for dynamic memory allocation. It is a fairly expensive operation: a free block of the right size has to be found in the heap, and the block has to be tracked so that it can be released later.

The Ruby interpreter distinguishes three kinds of strings, which can be called:

Heap strings
Shared strings
Embedded strings

An RString C structure is created for every kind of string, but malloc is called only for the first kind (heap strings). Shared and embedded strings avoid it, which saves resources and improves performance. How does this optimization work? The interpreter first checks whether the string is a copy of an existing string: if it is, there is no need to allocate new memory for it. This kind of RString structure is the fastest to create.
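From Ruby code, a shared string looks exactly like an independent copy. The sketch below only illustrates that observable behaviour: duplicating a long string and then modifying the duplicate leaves the original untouched. Whether dup actually produces a shared string inside the interpreter is an assumption here and cannot be seen from this code; append_to_copy is a hypothetical helper name.

```ruby
# Duplicates a string and appends to the duplicate.
# append_to_copy is a name invented for this illustration.
def append_to_copy(original, suffix)
  # Per the description above, the copy may initially reuse the original's
  # data instead of allocating a new buffer (internal detail, assumed here).
  copy = original.dup

  # Modifying the copy must not affect the original, so the copy needs its
  # own data from this point on at the latest.
  copy << suffix
  copy
end

# 30 characters: too long to be embedded on a 64-bit interpreter.
original = '1234567890' * 3
copy = append_to_copy(original, 'x')

puts original.length
puts copy.length
```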

struct RString {
    long len;
    char *ptr;
    VALUE shared;
};

Next, the interpreter checks the length of the string. If it is 23 characters or fewer, no heap memory is allocated and malloc is not called. Instead, the value is embedded directly in the RString structure, in the char ary[] array.

struct RString {
  char ary[RSTRING_EMBED_LEN_MAX + 1];
};

Here lies the clue. The full definition of the RString structure looks like this:

struct RString {
  struct RBasic basic;
  union {
    struct {
      long len;
      char *ptr;
      union {
        long capa;
        VALUE shared;
      } aux;
    } heap;
    char ary[RSTRING_EMBED_LEN_MAX + 1];
  } as;
};

Here the ary array is sized to occupy the same space as the len, ptr and capa fields combined, that is, exactly 24 bytes. This is the line from ruby.h that defines RSTRING_EMBED_LEN_MAX:

#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

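On a 64-bit machine, sizeof(VALUE) is 8, so the limit is 8 * 3 - 1 = 23 characters. On a 32-bit machine, sizeof(VALUE) is 4, which gives the limit of 11 characters mentioned earlier.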
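The arithmetic in the macro can be replayed in Ruby for both word sizes. This is only a transcription of the #define above, not a call into the interpreter; embed_len_max is a hypothetical name, and its argument stands in for C's sizeof(VALUE).

```ruby
# Ruby transcription of the RSTRING_EMBED_LEN_MAX macro from ruby.h.
# sizeof_value stands in for C's sizeof(VALUE); sizeof(char) is always 1.
# embed_len_max is a name invented for this illustration.
def embed_len_max(sizeof_value)
  sizeof_char = 1
  (sizeof_value * 3) / sizeof_char - 1
end

puts embed_len_max(8) # 64-bit interpreter: 23
puts embed_len_max(4) # 32-bit interpreter: 11
```

The results match the 23-character threshold seen in the benchmark and the 11/12-character boundary reported for 32-bit Ruby.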

So only 23 characters of string data fit directly inside the RString structure. Only when a string exceeds this limit is its data placed on the heap, which requires calling malloc and the resource-intensive bookkeeping that goes with it. That is why "long" strings are processed more slowly.
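The decision described in this article can be summarised as a toy model. The sketch below does not inspect real interpreter state: string_kind and its keyword arguments are hypothetical, the order of the checks follows the description in this article rather than string.c itself, and the limit is applied to bytes because the macro counts C chars (for the ASCII strings used here, bytes and characters coincide).

```ruby
# Toy model of how Ruby 1.9.3 is described as choosing a string kind.
# string_kind and its keyword arguments are names invented for this illustration.
def string_kind(str, copy_of_existing: false, sizeof_value: 8)
  embed_limit = sizeof_value * 3 - 1 # RSTRING_EMBED_LEN_MAX

  # A copy of an existing string needs no new allocation.
  return :shared if copy_of_existing
  # Short enough to live inside the RString structure itself.
  return :embedded if str.bytesize <= embed_limit

  # Everything else needs a malloc'ed buffer.
  :heap
end

puts string_kind('1234567890123456789012' + 'x')  # 23 characters
puts string_kind('12345678901234567890123' + 'x') # 24 characters
puts string_kind('123456789012', sizeof_value: 4) # 12 characters, 32-bit
```

With the two strings from the start of the article, the model puts the 23-character result in the embedded category and the 24-character result on the heap, which is where the benchmark shows the jump.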
