The intricacies of foreach in PHP

In a recent digest of interesting PHP links, I came across Nikita Popov's answer on StackOverflow, where he explains in detail how the foreach control structure works “under the hood”.
Since foreach sometimes behaves in rather strange ways, I thought it would be useful to translate this answer.


Note: this text assumes basic knowledge of how zvals work in PHP; in particular, you should know what refcount and is_ref are.
foreach works with entities of different types: arrays, plain objects (where the accessible properties are iterated) and Traversable objects (or rather, objects that have an internal get_iterator handler defined). Here we mainly talk about arrays, but I will cover the rest at the very end.

Before we begin, a few words about arrays and their traversal, which are important for understanding the context.


How array traversal works



Arrays in PHP are ordered hash tables (the hash elements are linked into a doubly linked list), and foreach traverses the array in that order.

PHP has two mechanisms for traversing an array:
  • The first is the internal array pointer. This pointer is part of the HashTable structure and is simply a pointer to the current element of the hash table. The internal array pointer is safe against modification: if the current element is deleted, the internal pointer is moved to the next element.
  • The second is an external array pointer, called HashPosition. It is almost the same as the internal pointer, but it is not part of the HashTable. This external iteration method is not safe against modification: if you delete the element that a HashPosition points to, you are left with a dangling pointer, which leads to a segmentation fault.


Thus, external array pointers can only be used when you are completely sure that no user code will run during the iteration. And such code can hide in the most unexpected places, such as an error handler or a destructor. That is why in most cases PHP has to use the internal pointer instead of an external one. Otherwise PHP could crash with a segmentation fault as soon as the user starts doing something unusual.
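The internal pointer is the one that userland code can observe and move. As a minimal sketch (the array contents and variable names are made up for illustration), the standard current(), key(), next() and reset() functions read and move it:

```php
<?php
// Illustrative data; any array would do.
$array = ['a' => 1, 'b' => 2, 'c' => 3];

$first = current($array);      // value under the internal pointer
next($array);                  // advance the internal pointer by one element
$second = current($array);
$secondKey = key($array);      // key under the internal pointer
reset($array);                 // rewind the pointer to the first element
$afterReset = current($array);

var_dump($first, $second, $secondKey, $afterReset);
```

An external HashPosition has no userland counterpart; it exists only inside the engine's C code.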

The problem with the internal pointer is that it is part of the HashTable. So when you change it, the HashTable changes with it. And since arrays in PHP have by-value (not by-reference) semantics, you are forced to copy the array in order to loop over its elements.

A simple example that shows why copying matters (and it is not that rare, by the way) is nested iteration:

foreach ($array as $a) {
    foreach ($array as $b) {
        // ...
    }
}


Here you want both loops to be independent, rather than sharing a single pointer in some tricky way.
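A small runnable version of the nested loop makes the expectation concrete. This is only a sketch with made-up data and variable names; it collects every pair that the two loops visit:

```php
<?php
$array = [1, 2, 3];
$pairs = [];

foreach ($array as $a) {
    foreach ($array as $b) {
        // If both loops shared one pointer, some combinations would be missing.
        $pairs[] = "$a-$b";
    }
}

echo implode(' ', $pairs), "\n";
echo count($pairs), "\n";
```

With independent loops, all nine combinations of the three elements are visited.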

That brings us to foreach.

Array traversal in foreach



Now you know why foreach has to create a copy of the array before iterating over it. But this is clearly not the whole story. Whether PHP actually makes a copy depends on several factors:

  • If the iterated array is a reference, no copy is made; an addref is done instead:

    $ref =& $array; // $array has is_ref=1 now
    foreach ($array as $val) {
        // ...
    }
    

    Why? Because any change to the array must propagate through the reference, and that includes the internal pointer. If foreach made a copy in this case, it would break reference semantics.
  • If the array has refcount = 1, no copy is made either. refcount = 1 means that the array is not used anywhere else, so foreach can use it directly. If the refcount is greater than one, the array is shared with other variables, and to avoid modifying them foreach must copy it (apart from the reference case described above).
  • If the array is iterated by reference (foreach ($array as &$ref)), then, regardless of whether it was copied or not, the array will be turned into a reference.
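The visible side of by-reference iteration can be sketched as follows. The example uses made-up variable names and only shows the userland effect (writes through the loop variable reach the array), not the refcount or is_ref bookkeeping itself:

```php
<?php
$byValue = [1, 2, 3];
foreach ($byValue as $item) {
    $item *= 10;            // changes only the loop variable
}

$byRef = [1, 2, 3];
foreach ($byRef as &$item) {
    $item *= 10;            // writes through to the array element
}
unset($item);               // drop the reference left over from the loop

var_dump($byValue === [1, 2, 3], $byRef === [10, 20, 30]);
```

The unset() after the second loop is a common precaution: without it, $item would remain a reference to the last array element.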


So this is the first part of the mystery: the copying behavior. The second part is how the actual iteration is performed, and it is also rather strange. The “regular” iteration pattern that you already know (and which is often used in PHP outside of foreach) looks something like this (pseudocode):

reset();
while (get_current_data(&data) == SUCCESS) {
    code();
    move_forward();
}

foreach iteration looks a little different:

reset();
while (get_current_data(&data) == SUCCESS) {
    move_forward();
    code();
}


The difference is that move_forward() is executed at the beginning of the loop body, not at the end. Thus, while the user code is working with element $i, the internal array pointer already points to element $i+1.
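The two patterns can be imitated in userland with the internal-pointer functions. This is only an emulation of the pseudocode above using reset(), key(), current() and next(), not the engine's actual implementation; the array and the variable names are invented for illustration:

```php
<?php
$array = [1, 2, 3];

// "Regular" pattern: run the body, then advance the pointer.
$regular = [];
reset($array);
while (key($array) !== null) {
    $value = current($array);
    $regular[] = [$value, current($array)];      // body: pointer is on the same element
    next($array);
}

// foreach-style pattern: advance the pointer, then run the body.
$foreachStyle = [];
reset($array);
while (key($array) !== null) {
    $value = current($array);
    next($array);
    $foreachStyle[] = [$value, current($array)]; // body: pointer is already one ahead
}

echo json_encode($regular), "\n";
echo json_encode($foreachStyle), "\n";
```

In the second loop, each recorded pair holds the element being processed and the element the pointer has already moved on to; for the last element the pointer is past the end, so current() returns false.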

This behavior of foreach is also the reason why the internal array pointer moves to the next element, rather than to the previous one (as you might expect), when the current element is deleted. Everything is arranged so that it works well with foreach (but, obviously, it does not work as well with everything else, where elements get skipped).

Code Implications



The first consequence of the behavior described above is that foreach copies the iterated array in many cases (which is slow). But fear not: I tried removing the copy requirement and could not see any speedup anywhere except in artificial benchmarks (where iteration became twice as fast). It seems people just don't iterate enough.

The second consequence is that usually there should be no other consequences. The behavior of foreach is generally clear to the user and simply works as it should. You should not have to care how the copying happens (or whether it happens at all), or at which exact moment the pointer is advanced.

And the third consequence, and here we come to your problem, is that sometimes we see very strange behavior that is hard to understand. This happens specifically when you try to modify the very array you are iterating over.

A large collection of edge-case behavior that shows up when you modify an array during iteration can be found in the PHP tests. You can start with this test, then change 012 to 013 in the address, and so on. You will see how foreach behaves in different situations (all sorts of combinations of references, etc.).

Now back to your examples:

foreach ($array as $item) {
  echo "$item\n";
  $array[] = $item;
}
print_r($array);
/* Output in loop:    1 2 3 4 5
   $array after loop: 1 2 3 4 5 1 2 3 4 5 */


Here $array has refcount = 1 before the loop, so it is not copied but gets an addref. As soon as you assign a value to $array[], the zval is separated, so the array you append elements to and the iterated array are two different arrays.

foreach ($array as $key => $item) {
  $array[$key + 1] = $item + 2;
  echo "$item\n";
}
print_r($array);
/* Output in loop:    1 2 3 4 5
   $array after loop: 1 3 4 5 6 7 */


The same situation as in the first test.

// Move the pointer forward by one to make sure this does not affect foreach
var_dump(each($array));
foreach ($array as $item) {
  echo "$item\n";
}
var_dump(each($array));
/* Output
  array(4) {
    [1]=>
    int(1)
    ["value"]=>
    int(1)
    [0]=>
    int(0)
    ["key"]=>
    int(0)
  }
  1
  2
  3
  4
  5
  bool(false)
*/


The same story again. At the start of the foreach loop the array has refcount = 1, so only an addref is done and the internal pointer of $array itself is modified. At the end of the loop the pointer is NULL (which means the iteration is complete). each demonstrates this by returning false.

foreach ($array as $key => $item) {
  echo "$item\n";
  each($array);
}
/* Output: 1 2 3 4 5 */


foreach ($array as $key => $item) {
  echo "$item\n";
  reset($array);
}
/* Output: 1 2 3 4 5 */


The functions each and reset both take their argument by reference. $array has refcount = 2 when it is passed to them, so it has to be separated. Again, foreach ends up working on a separate array.

But these examples are not convincing enough. The behavior becomes truly unpredictable when you use current inside the loop:

foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 2 2 2 2 */


Here you should keep in mind that current also takes its argument by reference, even though it does not modify the array. This is necessary so that it works consistently with all the other functions, such as next, that take the array by reference (current is actually a prefer-ref function: it can take a value, but uses a reference if it can). Passing by reference means that the array must be separated, so $array and the copy of $array that foreach uses become independent. Why you get 2 and not 1 was also explained above: foreach advances the array pointer before running the user code, not after. So even though the code is still working with the first element, foreach has already moved the pointer to the second.

Now try to make a small change:

$ref = &$array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */


Here we have the is_ref = 1 case, so the array is not copied (just as above). But now that it has is_ref set, the array no longer needs to be separated when it is passed by reference to current. Now current and foreach work on the same array. You still see the off-by-one shift, because of the way foreach handles the pointer.

You will see the same thing when you iterate over the array by reference:

foreach ($array as &$val) {
    var_dump(current($array));
}
/* Output: 2 3 4 5 false */


The important point here is that foreach sets is_ref = 1 on $array when it iterates over it by reference, so you end up in the same situation as above.

Another small variation, this time assigning the array to another variable:

$foo = $array;
foreach ($array as $val) {
    var_dump(current($array));
}
/* Output: 1 1 1 1 1 */


Here the refcount of $array is 2 when the loop starts, so a copy has to be made before it begins. Thus $array and the array used by foreach are different from the very beginning. That is why you get the position the internal array pointer had before the loop started (in this case, the first position).

Iteration of objects



When iterating over objects, it makes sense to consider two cases:

The object is not Traversable (or rather, the get_iterator internal handler is not defined)


In this case, iteration works almost the same way as with arrays, with the same copying semantics. The only difference is that foreach runs some additional code to skip properties that are not accessible in the current scope. A couple more interesting facts:

  • For declared properties, PHP optimizes the property hash table away. If you do iterate over the object, it has to reconstruct this hash table (which increases memory usage). Not that you should worry about it, just keep it in mind.
  • On every iteration the property hash table is fetched anew, that is, PHP calls get_properties again, and again, and again. For "ordinary" properties this does not matter much, but if the properties are created dynamically (built-in classes often do this), the property table will be recomputed every time.


Traversable Object


In this case, almost nothing of what was said above applies. PHP will not make a copy and will not use any tricks such as advancing the pointer before the loop body runs. I think that iteration over a Traversable object is much more predictable and needs no further description.
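For userland classes that implement Iterator, the order of operations can be observed directly. The following is a minimal sketch; LoggingIterator is a hypothetical class written only for this illustration, and it records every method call that foreach makes:

```php
<?php
// Hypothetical iterator that logs each call foreach makes on it.
final class LoggingIterator implements Iterator
{
    public $log = [];
    private $items;
    private $pos = 0;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    public function rewind(): void { $this->log[] = 'rewind'; $this->pos = 0; }
    public function valid(): bool { $this->log[] = 'valid'; return $this->pos < count($this->items); }
    #[\ReturnTypeWillChange]
    public function current() { $this->log[] = 'current'; return $this->items[$this->pos]; }
    #[\ReturnTypeWillChange]
    public function key() { $this->log[] = 'key'; return $this->pos; }
    public function next(): void { $this->log[] = 'next'; $this->pos++; }
}

$it = new LoggingIterator(['x', 'y']);
foreach ($it as $value) {
    $it->log[] = "body($value)";
}

echo implode(' ', $it->log), "\n";
// prints: rewind valid current body(x) next valid current body(y) next valid
```

Here next() is logged after the loop body, which matches the "regular" iteration pattern rather than the advance-first pattern used for arrays.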

Replacing an iterable object during a loop



Another unusual case that I have not mentioned yet is that PHP allows the iterated entity to be replaced during the loop. You can start with one array and replace it with another halfway through. Or start with an array and then replace it with an object:

$arr = [1, 2, 3, 4, 5];
$obj = (object) [6, 7, 8, 9, 10];
$ref =& $arr;
foreach ($ref as $val) {
    echo "$val\n";
    if ($val == 3) {
        $ref = $obj;
    }
}
/* Output: 1 2 3 6 7 8 9 10 */


As you can see, PHP simply started iterating over the other entity as soon as the replacement happened.

Changing the internal array pointer during iteration



The last detail of foreach behavior that I have not mentioned (because it can be used to produce really strange behavior) is what happens if you try to change the internal array pointer during the loop.

Here you may not get what you expect: if you call next or prev in the loop body (in the by-reference case), you will see that the internal pointer has moved, but this has no effect on the iteration. The reason is that foreach backs up the current position and the hash of the current element into a HashPointer after each pass of the loop. On the next pass, foreach checks whether the internal position has changed and tries to restore it using this hash.

Let's see what "tries" means. The first example shows that changing the internal pointer does not change the behavior of foreach:

$array = [1, 2, 3, 4, 5];
$ref =& $array;
foreach ($array as $value) {
    var_dump($value);
    reset($array);
}
// output: 1, 2, 3, 4, 5


Now let's try to unset the element that foreach will be positioned at during the first pass (key 1):

$array = [1, 2, 3, 4, 5];
$ref =& $array;
foreach ($array as $value) {
    var_dump($value);
    unset($array[1]);
    reset($array);
}
// output: 1, 1, 3, 4, 5


Here you can see that the iteration was reset, because no element with a matching hash could be found.

Keep in mind that a hash is just a hash. Collisions happen. Now let's try this:

$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
$ref =& $array;
foreach ($array as $value) {
    unset($array['EzFY']);
    $array['FYFZ'] = 4;
    reset($array);
    var_dump($value);
}
// output: 1 1 3 4


This works as we expect. We removed the EzFY key (the one foreach was positioned at), so a reset was done. We also added an extra key, so we see 4 at the end.

And here comes the unexpected part. What happens if you replace the FYFZ key with FYFY? Let's try:

$array = ['EzEz' => 1, 'EzFY' => 2, 'FYEz' => 3];
$ref =& $array;
foreach ($array as $value) {
    unset($array['EzFY']);
    $array['FYFY'] = 4;
    reset($array);
    var_dump($value);
}
// output: 1 4


Now the loop went straight to the new element, skipping everything else. This is because the FYFY key collides with EzFY (in fact, all the keys in this array collide). Moreover, the FYFY element happens to be located at the same memory address as the EzFY element that was just deleted. So to PHP it looks like the same position with the same hash. The position is “restored”, and the loop jumps to the end of the array.
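Why these particular keys collide can be checked with a short sketch. The text above does not name the hash function, so this assumes a "times 33" (DJBX33A-style) string hash of the kind PHP uses for string keys; djbx33a() is a hypothetical userland helper, truncated to 32 bits, and not the engine's own routine:

```php
<?php
// Hypothetical helper: "times 33" string hash, truncated to 32 bits so the
// arithmetic stays within integer range (assumes a 64-bit PHP build).
function djbx33a(string $key): int
{
    $hash = 5381;
    for ($i = 0, $n = strlen($key); $i < $n; $i++) {
        $hash = ($hash * 33 + ord($key[$i])) & 0xFFFFFFFF;
    }
    return $hash;
}

// 'Ez' and 'FY' contribute the same amount (69*33 + 122 = 70*33 + 89 = 2399),
// so any keys built from these two-character blocks hash identically.
$collides = djbx33a('EzFY') === djbx33a('FYFY');
$differs  = djbx33a('EzFY') !== djbx33a('FYFZ');

var_dump($collides, $differs);
```

Under that assumption, FYFY lands on the same hash as the deleted EzFY key, while FYFZ does not, which is consistent with the different outputs of the last two examples.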
