Splitting a string by separator. A little about CONNECT BY

From the sandbox

Hello!

I work as a PL / SQL developer. There is a task to collect some data for metrics in order to track the loading of systems. There is some function that is called with a parameter consisting of a list of IDs.

1,2,3,4…1001,1002,1003…

The task is as follows. It is necessary to break such a string into elements and write them into an integer collection.

Let's get started

First you need the data to work. We write a function that generates a string with numbers separated by a comma. We will pass an integer argument N to the function - the number of numbers in the sequence.

We will not be wise, the sequence will be done with the type VARCHAR2, not CLOB. Further I will explain why VARCHAR2.

Function code for sequence generation:

CREATEORREPLACEFUNCTION createNumbers(v_N IN PLS_INTEGER) 
RETURN VARCHAR2
IS
  v_str     VARCHAR2(4000) := '0';
  too_small EXCEPTION;
  PRAGMA EXCEPTION_INIT(too_small, -6502);
BEGINFOR i IN1..v_N
  LOOPBEGIN
      v_str := v_str || ',' || TO_CHAR(i);
    EXCEPTION
      WHEN too_small
      THEN
        EXIT;
    END;
  ENDLOOP;  
  RETURN v_str || ',';
END;

Let's return to our task.

The first thing that comes to mind is to loop through the line, with a decrease in the length of the line at each iteration. Since, by the condition of the problem, the result needs to be placed in the collection, we will create the corresponding collection.

DECLARE 
  v_N PLS_INTEGER := 1000;
  v_str VARCHAR2(4000) := createNumbers(v_N => v_N);
  TYPE tab_number IS TABLE OF NUMBER(5);
  t_str tab_number := tab_number();
BEGINWHILE v_str ISNOTNULLLOOP
    t_str.EXTEND;
    t_str(t_str.COUNT) := SUBSTR(v_str, 1, INSTR(v_str, ',', 1) - 1);
    v_str := SUBSTR(v_str, INSTR(v_str, ',', 1) + 1);
  ENDLOOP;
  FOR i IN t_str.FIRST..t_str.LAST
  LOOP
    dbms_output.put_line(t_str(i));
  ENDLOOP;
  t_str.DELETE;
END;

Result: The createNumber () function takes the argument v_N = 1000. In the createNumber () function, you can see the handling of the v_str variable overflow. By a simple calculation one can find out that 4000 bytes is enough for 1021 numbers. Our 1000 easily fit into this size. As you can see, the result is the one that was needed. The string is divided. Even if there is no built-in split () function in Oracle, such as in Java or Python, this option does not suit me, since I think that too much code is written for such a simple task as breaking up a string. At this stage, I wondered, is it possible to break a string only with SQL? I mean not classical SQL, but that SQL which is offered by Oracle. I remembered the construct for constructing hierarchical CONNECT BY queries.

The optional START WITH statement tells Oracle where to start the loop, i.e. which line will be the root. The condition can be almost anything. The condition after CONNECT BY must be specified. Here you need to tell Oracle how long to continue the cycle.

It can be seen that the only important condition for building a hierarchical query is the CONNECT BY operator, the rest is “strung” as needed.

Also, this construct has a pseudo-column level, which returns the nesting level at the current iteration.

At first glance it may seem that this construction is not suitable for splitting a line. This is not quite true. If the condition is correctly set, the recursive traversal can be turned into a cyclic one, as in a while or for loop.

Before writing a query, let's think about a string traversal algorithm. It is necessary, starting from the beginning of the line, to cut off a certain number of characters, to the delimiter character. Above, I wrote about the pseudo-column level. We will use it as the current iteration number.
It turns out something like this:

SELECTSUBSTR(str,
              INSTR(str, ',', 1, level - 1) + 1,
              INSTR(str, ',', 1, level) - INSTR(str, ',', 1, level - 1) + 1))
FROM DUAL;

But if you look closely, you can see that this algorithm does not work at the very first iteration, since the third argument of the INSTR () function cannot equal 0.

Therefore, we add a small condition using the DECODE () function.

SELECTSUBSTR(str, 
              DECODE(level, 1, 1, INSTR(str, ',', 1, level - 1) + 1),
              INSTR(str, ',', 1, level) 
              - DECODE(level, 1, 1, INSTR(str, ',', 1, level - 1) + 1))
FROM DUAL;

Now the very first iteration will work correctly.

It's time to apply the CONNECT BY construct. Plus we will take our line to the top.

WITH 
  TMain AS (SELECT'100,200,300,400,500' || ','ASstrFROM DUAL)
SELECTSUBSTR(str, 
              DECODE(level, 1, 1, INSTR(str, ',', 1, level - 1) + 1),
              INSTR(str, ',', 1, level) 
              - DECODE(level, 1, 1, INSTR(str, ',', 1, level - 1) + 1))
FROM TMain
CONNECTBY NVL(INSTR(str, ',', 1, level), 0) <> 0;

I already wrote that with the right condition, the CONNECT BY construct can behave like a loop. The condition is fulfilled as long as the INSTR () function can find the nth position of the delimiter character, where n is the number of the current iteration, and as we remember the iteration number is the pseudo-column level.

It seems that the problem is solved? Not.

The code may work, but its readability is zero. I was already thinking about going back to the cycle option, but I figured out how to improve the CONNECT BY version.

Oracle has such a powerful tool as regular expressions. Specifically, the regexp_instr () and regexp_substr () functions.

regexp_instr(исходная_строка, шаблон[, начальная_позиция [, вхождение ] ])- the function returns the position of the character at the beginning or end of the match for the template, as well as its counterpart INSTR ().

regexp_substr(исходная_строка, шаблон[, позиция [, вхождение ]])- the function returns a substring that matches the pattern.

Rewrite the query using regular expressions:

WITH 
  TMain AS (SELECT'100,200,300,400,500' || ','ASstrFROM DUAL)
SELECT regexp_substr(str, '[^,]+', 1, level)
FROM TMain
CONNECTBY NVL(regexp_instr(str, '[^,]+', 1, level), 0) <> 0;

The code reads much better than in the previous example. This option suits me.

In the end, it would be logical to give a comparison of the time to parse the line for the three options. Above, I promised to explain why instead of the CLOB type, we will use the VARCHAR2 type. This is needed just to compare the execution time. Because Oracle processes the CLOB type differently than VARCHAR2, which can skew the results.

The code for evaluating the three options:

DECLARE 
  v_N         PLS_INTEGER := 1000;
  v_str       VARCHAR2(32767);
  v_startTime TIMESTAMP(9);
  v_endTime   TIMESTAMP(9);
  TYPE tab_number IS TABLE OF NUMBER(5);
  t_str tab_number := tab_number();
BEGIN
  v_str := createNumbers(v_N => v_N);
  v_startTime := SYSTIMESTAMP; 
  WHILE v_str IS NOT NULL
  LOOP
    t_str.EXTEND;
    t_str(t_str.COUNT) := SUBSTR(v_str, 1, INSTR(v_str, ',', 1) - 1);
    v_str := SUBSTR(v_str, INSTR(v_str, ',', 1) + 1);
  ENDLOOP;
  v_endTime := SYSTIMESTAMP; 
  dbms_output.put_line(v_endTime - v_startTime);
  t_str.DELETE;
  /*---------------------------------------------------------*/
  v_str := createNumbers(v_N => v_N);
  v_startTime := SYSTIMESTAMP; 
  SELECT TO_NUMBER(SUBSTR(v_str, 
                          DECODE(level, 1, 1, INSTR(v_str, ',', 1, level - 1) + 1),
                          INSTR(v_str, ',', 1, level) 
                          - DECODE(level, 1, 1, INSTR(v_str, ',', 1, level - 1) + 1)))
  BULKCOLLECTINTO t_str
  FROM DUAL
  CONNECTBY NVL(INSTR(v_str, ',', 1, level), 0) <> 0;
  v_endTime := SYSTIMESTAMP;
  dbms_output.put_line(v_endTime - v_startTime);
  t_str.DELETE;
  /*---------------------------------------------------------*/
  v_str := createNumbers(v_N => v_N);
  v_startTime := SYSTIMESTAMP;
  SELECT TO_NUMBER(regexp_substr(v_str, '[^,]+', 1, level))
  BULKCOLLECTINTO t_str
  FROM DUAL
  CONNECTBY NVL(regexp_instr(v_str, '[^,]+', 1, level), 0) <> 0;
  v_endTime := SYSTIMESTAMP;
  dbms_output.put_line(v_endTime - v_startTime);
  t_str.DELETE;
END;

The results are summarized in the table:

Splitting option / Number of numbers	ten	100	1000
Cycle	0.05ms	0.15ms	1.52ms
CONNECT BY	0.18ms	0.68ms	18.1ms
CONNECT BY + regular expressions	0.25ms	12.1ms	1s 137ms

Honestly, the result is unexpected for me. I assumed that the third option would be the fastest. Well, it will be a lesson.

Thanks for reading!

List of sources used:

1. Hierarchical (recursive) queries

Tags:

Splitting a string by separator. A little about CONNECT BY

Also popular now: