Clojure - sequences
Clojure is a dialect of Lisp, so it is not at all surprising that working with lists occupies a key place in the language. However, unlike the traditional dialects (CL or Scheme), Clojure builds not on the classic two-slot cons lists but on the Sequence abstraction - the "logical list". It is essentially an interface that provides methods for working with immutable, lazy, and possibly infinite sequences. This article describes the internal structure of these entities.
The essence of sequences
Let's start with the Java interfaces. All collection types in Clojure implement clojure.lang.Seqable:
public interface Seqable {
ISeq seq();
}
In turn, clojure.lang.ISeq is defined as:
public interface ISeq extends IPersistentCollection {
Object first();
ISeq next();
ISeq more();
ISeq cons(Object o);
}
Here you can draw an analogy with Iterable; ISeq, then, is a kind of analogue of Iterator. The key difference is that ISeq objects are immutable: their methods return a new instance (possibly even of a different type) instead of mutating the object's internal state. All sequences in Clojure are also collections at the same time - they implement the inherited interface clojure.lang.IPersistentCollection, and through it Seqable:
public interface IPersistentCollection extends Seqable {
int count();
IPersistentCollection cons(Object o);
IPersistentCollection empty();
boolean equiv(Object o);
}
Here we see count, which returns the number of elements; for sequences its complexity is obviously O(n). The empty method returns an empty collection of the same type; for sequences it returns (). Finally, the cons method adds a new element.
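A quick REPL check of these three methods (a sketch; .cons is called via Java interop here, so a reflection warning is possible):

```clojure
;; count, empty and cons, called on a seq over a vector
(count (seq [1 2 3]))  ;=> 3
(empty (seq [1 2 3]))  ;=> ()
(.cons (seq [2 3]) 1)  ;=> (1 2 3)
```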
To obtain ISeq objects, the language provides the seq function. If the collection is empty, nil is returned; if we have a Seqable object, its seq() method is called; otherwise a special wrapper (adapter) is created. Wrappers exist for arrays, iterators, Java collections (including Map), and CharSequence objects (strings). You cannot add your own type to this list - it is hard-wired into the Java code (even protocols will not help). Naturally, you should not mutate arrays (or Java collections) while a sequence over them exists; a ConcurrentModificationException will simply propagate up.
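These rules are easy to observe in the REPL (a sketch using only standard collections):

```clojure
(seq [])                ;=> nil - empty collections give nil
(seq [1 2 3])           ;=> (1 2 3)
(seq "abc")             ;=> (\a \b \c) - CharSequence wrapper
(seq {:a 1})            ;=> ([:a 1])   - Map wrapper
(seq (int-array [1 2])) ;=> (1 2)      - array wrapper
```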
There are only three basic functions in Clojure for working with sequences:
- first - take the first element of the sequence or nil if it is empty;
- rest - get the "tail", i.e. a sequence without the first element;
- cons - create a new immutable cons cell.
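In action (nothing here beyond clojure.core):

```clojure
(first [1 2 3]) ;=> 1
(first [])      ;=> nil
(rest [1 2 3])  ;=> (2 3)
(cons 0 [1 2])  ;=> (0 1 2)
```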
Instead of cons, the conj function is usually used to add new elements. The difference is that cons always creates a new cell of a singly linked list, while the result of conj is specific to each collection type - the element is added in the most suitable place: for cons cells the beginning of the list, for vectors the end, and for sets the order is not defined at all. Note that once we apply seq to a collection, we lose the information about its type:
(conj [1 2 3] 4) ;=> [1 2 3 4]
(conj (seq [1 2 3]) 4) ;=> (4 1 2 3)
Unfortunately, the names of the Java methods and the names of the Clojure functions do not always match: the conj function corresponds to the cons method, and rest to the more method. On the other hand, you only need to know the method names when writing your own collections, which is hardly an everyday task.
All standard functions call seq automatically. For example, if we write (cons 1 [2 3]), the created cons cell will actually refer not to the vector [2 3] itself but to the result of (seq [2 3]). The same trick happens inside the other sequence functions (map, filter, etc.) - they all operate on sequences, even though they accept collections as parameters.
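We can verify this in the REPL; the concrete class names below are as of recent Clojure versions and may differ in older ones:

```clojure
;; the cons cell's tail is a seq over the vector, not the vector itself
(class (cons 1 [2 3]))        ;=> clojure.lang.Cons
(class (rest (cons 1 [2 3]))) ;=> clojure.lang.PersistentVector$ChunkedSeq
```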
In fact, only lists and cons cells implement ISeq directly:
(seq? '(1 2)) ;=> true
(seq? ()) ;=> true
(seq? (cons 1 nil)) ;=> true
(seq? (cons 1 [2 3])) ;=> true
(seq? [1 2]) ;=> false
(seq? {:a :b}) ;=> false
(seq? #{1 2}) ;=> false
(seq? nil) ;=> false
Therefore, iterating over a list creates no temporary objects. But when we walk over a vector, one temporary ISeq instance is created per element. Actually, since version 1.1 things are a little different, but more on that below.
There is also a little-known interface, clojure.lang.IndexedSeq, implemented by the sequences over arrays and strings. It defines an index() method that returns the position of the current element (i.e. where the head of the sequence is in the original collection).
(.index (rest (to-array [1 2 3 4]))) ;=> 1
(.index (rest (rest (to-array [1 2 3 4])))) ;=> 2
In practice, this interface is not used anywhere (it is an undocumented feature), and it is genuinely hard to come up with a use for it. Incidentally, clojure.lang.APersistentVector$Seq objects also implement it too, but the catch is that seq on vectors nowadays returns instances of clojure.lang.PersistentVector$ChunkedSeq, which no longer have this dubious ability.
Lazy sequences
An important feature of sequences in Clojure is their potential laziness. This means that at any given moment the whole sequence need not have been constructed - its "tail" may not exist yet. In Clojure this is often used for streaming I/O.
Laziness is implemented very simply. There is a class clojure.lang.LazySeq (which, of course, implements ISeq). Objects of this class hold a reference to a function (an IFn instance) that, when called, must return a sequence (most likely also lazy, though not necessarily). Whenever a LazySeq method is accessed, a synchronized call to the generating function is made, its result is cached, and the reference to the function itself is cleared (so the garbage collector can reclaim it). Obviously, there is no way to get an element of a lazy sequence without realizing all the previous ones. If that is exactly the behavior you need, use the delay macro.
(nth (map #(print % "") (iterate inc 0)) 5) ; prints: 0 1 2 3 4 5
@(nth (map #(delay (print % "")) (iterate inc 0)) 5) ; prints only: 5
Generating lazy sequences in your own code is ridiculously simple: just put the code that calls cons inside the lazy-seq macro. In fact, you rarely even have to do that - most of the standard functions (with very few exceptions) return lazy collections anyway, doing all the "dirty" work for us. There are, however, some minor nuances.
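A minimal sketch of such a generator (the name integers-from is made up for illustration):

```clojure
;; An infinite lazy sequence of integers starting from n.
;; The body of lazy-seq is not evaluated until the sequence is consumed.
(defn integers-from [n]
  (lazy-seq (cons n (integers-from (inc n)))))

(take 5 (integers-from 10)) ;=> (10 11 12 13 14)
```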
As it is realized, a lazy sequence effectively turns into an ordinary singly linked list, and if you hold on to a reference to its first element, you can run into memory problems. The classic example:
(def all-odds (filter odd? (range)))
(println (nth all-odds 20000000))
Here we try to find the 20,000,000th odd number. We do find it, but all the intermediate ISeq instances stay in memory, because all-odds holds the head of the sequence. The right approach:
(defn make-all-odds [] (filter odd? (range)))
(println (nth (make-all-odds) 20000000))
The rest function never returns nil; instead, it may return an empty sequence. If we want to get nil, we use next:
(first '(1 2 3)) ;=> 1
(rest '(1 2 3)) ;=> (2 3)
(rest '(1)) ;=> ()
(rest ()) ;=> ()
(next '(1)) ;=> nil
(next ()) ;=> nil
This is done for the sake of laziness. If a sequence is lazy, then in general we do not know whether it has any more elements. For next to answer that question, at least the first two elements have to be realized: we drop the first, and the presence or absence of the second tells us whether to return nil. Realizing two elements is usually more than we bargained for.
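The difference is easy to observe with a lazy sequence that prints as it is realized (noisy is a made-up helper for this demonstration):

```clojure
;; Each cell announces itself when realized
(defn noisy [n]
  (lazy-seq (print n "") (cons n (noisy (inc n)))))

(with-out-str (rest (noisy 0))) ;=> "0 "   - rest realizes one element
(with-out-str (next (noisy 0))) ;=> "0 1 " - next realizes two
```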
If lazy behavior is not required, the entire sequence can be realized at once with the dorun and doall functions. The first returns nil, the second the realized sequence.
(dorun
(map #(println % "") (range 5))) ; prints: 0 1 2 3 4
Chunked sequences
Lazy collections are a very powerful tool, but in practice we often process the whole collection anyway, and lazy-seq brings tangible overhead. Therefore, version 1.1 introduced an important optimization - chunked sequences. The idea is simple: when processing a lazy sequence, we deal not with individual elements but with groups of them, eagerly realizing a batch of elements ahead of time.
Guess what this code outputs?
(first (map #(print % "") (range 100)))
Answer
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
So, Clojure defines a special interface, clojure.lang.IChunkedSeq. There is little point in describing the Java side of this mechanism, so I will omit it. Among collections, only the sequences over vectors implement the interface - and, as we just saw, so does the result of the range function.
The algorithm for processing such sequences:
- check whether the source is an IChunkedSeq (the chunked-seq? function);
- if not, fall back to the ordinary, element-by-element version of the algorithm;
- if so, take the first chunk of the sequence (chunk-first);
- look at its size (count works on chunks);
- create a new chunk buffer to hold the generated elements (chunk-buffer);
- do the actual useful work, appending elements to the buffer (chunk-append);
- wrap the new chunk in a ChunkedCons (chunk-cons).
As an example, let's write our own implementation of #(filter odd? %). A simple version:
(defn filter-odd-1 [a]
(lazy-seq
(when-let [s (seq a)]
(let [f (first s)
r (rest s)]
(if (odd? f)
(cons f (filter-odd-1 r))
(filter-odd-1 r))))))
An advanced version (with chunk support):
(defn filter-odd-2 [a]
(lazy-seq
(when-let [s (seq a)]
(if (chunked-seq? s)
(let [f (chunk-first s)
c (count f)
b (chunk-buffer c)]
(dotimes [i c]
(let [x (nth f i)]
(when (odd? x)
(chunk-append b x))))
(chunk-cons (chunk b) (filter-odd-2 (chunk-rest s))))
(let [f (first s)
r (rest s)]
(if (odd? f)
(cons f (filter-odd-2 r))
(filter-odd-2 r)))))))
Of course, the code got somewhat longer, and there is obvious duplication: we had to write two separate implementations. Such is the price of speed. Fortunately, most of the standard functions already fully support chunks. Note that a sequence may consist of chunks of different lengths (chunks are never merged, but they are sometimes split, as in the example above), and may even be a mix of individual elements and chunks.
What if we want to turn an ordinary sequence into a chunked one? Just convert it to a vector with the vec function. The reverse is a bit trickier:
(defn seq1 [s]
(lazy-seq
(when-let [x (seq s)]
(cons (first x) (seq1 (rest x))))))
(first (map println (seq1 (range 1000)))) ; prints only 0
Alternative solution
(defn seq1 [^clojure.lang.ISeq s]
(reify clojure.lang.ISeq
(first [_] (.first s))
(more [_] (seq1 (.more s)))
(next [_] (let [sn (.next s)] (and sn (seq1 sn))))
(seq [_] (let [ss (.seq s)] (and ss (seq1 ss))))
(count [_] (.count s))
(cons [_ o] (.cons s o))
(empty [_] (.empty s))
(equiv [_ o] (.equiv s o))))
As one might expect, the reduce function takes advantage of chunked sequences: each chunk has a special method that performs the fold in a plain Java loop without creating temporary objects. reduce has other optimizations as well; anyone interested can look into clojure.core.protocols.
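A small sanity check (not a benchmark - the speed difference depends on the version, but the results agree):

```clojure
;; (vec (range 1000)) is chunked, while (apply list ...) is a plain list;
;; reduce folds the former chunk by chunk in a tight Java loop
(reduce + (vec (range 1000)))        ;=> 499500
(reduce + (apply list (range 1000))) ;=> 499500
```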
Instead of a conclusion
Clojure sequences are powerful and convenient. Their undoubted advantage is that you often do not even have to think about such "trifles" as laziness or chunked processing - Clojure tries to do it for us.