Updating JunOS on EX4500 switches in a Virtual Chassis: what could go wrong? Part 2
Without further delay, here is the second part of the earlier post. Thank you for getting the first part published: it is nice that the article interested you and the topic found a continuation.
Here is what I saw after the reboot:
“WARNING: JUNOS versions running on dual partitions are not the same” is expected and harmless, because the new version exists only on the primary slice of the device.
“Connection to master failed ...” and “warning: This chassis is operating in a non-master role ...” are harmless too, since the VC needs time to restore communication between the members and synchronize the configuration.
After a few minutes of waiting the system itself asks me to restart the CLI session (WARNING: cli has been replaced by an updated version), and now the new version is running on the right RE.
Let's check:
Victory! Complete and unconditional! To say that I was pleased with myself is to say nothing: my sense of self-importance went through the roof. The work took about 4 hours, but that mattered little, because the clients did not feel it. I not only awarded myself a virtual medal but also saved my company a lot of money. I got so many impressions during those 4 hours that it took many days (and beers) to put everything together and see the whole picture.
All that remains is to take a snapshot to the primary partition of the internal storage and, a week or two later, to the backup partition. Why wait a week? To let the new version run in production first, since booting the old version from the backup partition is much easier than downgrading the whole device.
Let's analyze the situation.
According to Juniper TAC, the upgrade problems were caused by damage to the primary boot partition. Nothing can be done about that, and the switch has to be replaced under warranty. I still very much hope that the problem was caused by file system damage (an unclean reboot or the like) and was fixed during the upgrade (Un-Protected 1 sectors Erasing Flash .... done) when I set the environment variable.
Why on earth the device wanted to boot from disk2, when nobody had explicitly pointed to it and it did not exist in the system, remains unclear; TAC also found it hard to comment. The logs even show disk2 appearing out of nowhere (note how new boot device = disk1s2 changes to new boot device = disk2):
In practice this problem added an hour and a half to the work. Yes, the switch also complains about a missing kernel, but why the system then tries disk2, which it did not even see in loader>, is unclear. I can assume that when booting fails the device cycles through the disks, but again, the system did not see any disk2. How and why the very same flash drive later booted the device successfully also raises questions.
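The "new boot device" lines are easy to miss in a scrolling console, as I found out. Below is a minimal Python sketch for scanning a saved console log after the fact. It only assumes the line formats visible in the boot output in this article; the names `boot_attempts` and `looks_like_boot_loop` and the shortened sample log are mine, not anything Juniper provides.

```python
import re

# Shortened sample modelled on the repeating boot output in this article.
SAMPLE = """\
U-Boot 1.1.6 (Mar 26 2011 - 04:34:19)
bootsequencing is enabled
bootsuccess is not set
new boot device = disk2
can't load '/kernel'
can't load '/kernel.old'
Watchdog timed out. Resetting the board.
U-Boot 1.1.6 (Mar 26 2011 - 04:34:19)
bootsequencing is enabled
bootsuccess is not set
new boot device = disk2
can't load '/kernel'
can't load '/kernel.old'
Watchdog timed out. Resetting the board.
"""

BOOT_DEVICE = re.compile(r"^new boot device = (\S+)")


def boot_attempts(log):
    """Return one record per U-Boot banner found in the log."""
    attempts = []
    current = None
    for raw in log.splitlines():
        line = raw.strip()
        if line.startswith("U-Boot "):
            # A new banner means the board has been reset and starts over.
            current = {"device": None, "kernel_missing": False}
            attempts.append(current)
        elif current is not None:
            match = BOOT_DEVICE.match(line)
            if match:
                current["device"] = match.group(1)
            elif line.startswith("can't load '/kernel'"):
                current["kernel_missing"] = True
    return attempts


def looks_like_boot_loop(attempts, threshold=2):
    """True if the same device failed to load a kernel `threshold` or more times."""
    failed = [a for a in attempts if a["kernel_missing"]]
    return len(failed) >= threshold and len({a["device"] for a in failed}) == 1


if __name__ == "__main__":
    for number, attempt in enumerate(boot_attempts(SAMPLE), start=1):
        print(number, attempt["device"], attempt["kernel_missing"])
    print("boot loop:", looks_like_boot_loop(boot_attempts(SAMPLE)))
```

Listing the device for every attempt makes an unexpected name such as disk2 stand out immediately, instead of an hour and a half later.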
It is possible that I was mistaken here:
because the loader's settings are lost on restart. I should have tried “boot” instead of “reboot”, but at the time I did not.
The new version noticeably increased the load on the device. On the old version the CPU load during the day was about 27-30%; after the update it is 45-48%, although neither the fairly simple configuration of the device nor the traffic profile has changed. Several remote sessions with Juniper TAC did not establish the cause: a memory leak and similar problems were suspected but not confirmed. Strange, but I had to accept it as a fact.
An attentive reader may have noticed that the device name shown in the loader (disk0) differs from the names used for the successful boot (disk1 and then /dev/da1s1a). I will not venture to say why. My guess is that the names depend on how far the boot has progressed: the loader assigns one set of names, continuing from db> gives another, and from the CLI the devices are addressed as “media external” and “media internal”. For now this is only an assumption.
I had put most of the steps and commands above into a guide long before the update, and I kept rereading and extending it whenever a possible problem occurred to me. The only things it lacked were the db> mode and the => setenv procedure. Obviously I could not foresee everything, and some things did not work as they should. But honestly, without this guide and the time spent rehearsing it in my head, I would have given up. On top of that, it was night work and my mind was not at its sharpest.
Backups: although they did not help me much, having them put my conscience and soul at ease. In the worst case, even if the entire internal storage had been damaged, I could have pasted the text config in through the console. These two things, the guide and the backups, let you concentrate on the work itself rather than on working out how to return everything to its original state and what to do next.
Among the significant shortcomings: during the work I ran several PuTTY tabs that all logged to a single file. Sorting everything out afterwards by device and timestamp was very hard; it would have been better to use SecureCRT or to run a separate window per device, especially since I had the means to do so.
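For what it is worth, part of that sorting can be automated, because Junos prints a role marker such as {master:1} or {linecard:0} before each prompt in a Virtual Chassis. The Python sketch below assigns every line to the member named in the most recent marker. It assumes one session's lines stay contiguous in the file; the function name `split_by_member` is hypothetical, and lines printed before the first marker (login banners, boot output) land in an "unknown" bucket.

```python
import re
from collections import defaultdict

# Role markers as they appear in the transcripts: {master:1}, {linecard:0}, {backup:1}.
ROLE_MARKER = re.compile(r"^\{(master|backup|linecard):(\d+)\}$")


def split_by_member(lines):
    """Group log lines by the VC member named in the preceding role marker."""
    buckets = defaultdict(list)
    current = "unknown"
    for line in lines:
        marker = ROLE_MARKER.match(line.strip())
        if marker:
            current = f"member{marker.group(2)}"
            continue  # the marker itself is not copied into the bucket
        buckets[current].append(line)
    return dict(buckets)


if __name__ == "__main__":
    log = [
        "login as: user",
        "{master:1}",
        "user@switch> show virtual-chassis",
        "{linecard:0}",
        "user@switch> show system storage",
    ]
    for member, content in split_by_member(log).items():
        print(member, len(content))
```

This is a rough heuristic rather than a real fix: a separate log file per session is still the better habit.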
And to finish, a picture from the scene. I hope this post will be useful to you. Good luck with your upcoming updates!
PS For the command output I used the regular code markup, which looks worse than the markup with a background used for source code in a particular language or for BASH. However, the “code” markup allows bold text, which mattered to me for highlighting the interesting places in the output. If anyone can tell me how to get both (background plus bold inside), I will be grateful and promise to use it in the future.
Update: it turned out that the code markup is rendered differently in different browsers and versions. I will keep digging into how to make the text clearer and more readable.
As a reminder, the previous part ended with one of the VC members not working properly after the reboot. As one of the comments rightly pointed out, it reads as if I went home after all that. I did not: the first part covers about 20 minutes of my almost five-hour saga. Buckled up? Let's go!
After the reboot it is unclear what happened, or whether anything happened at all, but most importantly, client traffic is flowing. I am connected through the dedicated management Ethernet interface, and the first surprise is that member1 has become the master RE:
login as: user
user@switch password:
--- JUNOS 12.3R12.4 built 2016-01-20 04:27:51 UTC
{master:1}
user@switch>
In principle this can happen and is nothing to worry about: the devices are identical, the VC configuration is pre-provisioned, and either of them can be the master. The OS has been updated, which is good. This, however, is not good:
user@switch> show chassis routing-engine
Routing Engine status:
Slot 1:
Current state Master
DRAM 1024
Memory utilization 45 percent
CPU utilization:
User 14 percent
Background 0 percent
Kernel 11 percent
Interrupt 1 percent
Idle 74 percent
Model EX4500-40F
Serial ID
Start time 2016-06-02 01:28:45
Uptime 34 minutes, 55 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
0.59 0.80 0.66
{master:1}
user@switch>
The device sees only one RE, though there should be two. Further investigation confirms that the LEDs are dark for a reason:
user@switch> show virtual-chassis
Preprovisioned Virtual Chassis
Virtual Chassis ID:
Virtual Chassis Mode: Enabled
                                           Mstr            Mixed Neighbor List
Member ID  Status   Serial No  Model       prio  Role      Mode  ID  Interface
0 (FPC 0)  Inactive XXXXX      ex4500-40f  129   Linecard  N     1   vcp-1
                                                                 1   vcp-0
1 (FPC 1)  Prsnt    XXXXX      ex4500-40f  129   Master*   N     0   vcp-1
                                                                 0   vcp-0
{master:1}
user@switch>
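If you save this output to a file, the check I did by eye can be scripted. Here is a minimal Python sketch that pulls the member ID, status and role out of `show virtual-chassis` text and lists the members that are not "Prsnt". The column layout of the sample and the names `parse_members` and `inactive_members` are my assumptions; the regular expression also expects a non-empty serial number column.

```python
import re

# Sample modelled on the output above; serial numbers are masked and the
# exact column spacing is an assumption.
SAMPLE = """\
Member ID  Status   Serial No  Model       prio  Role      Mode  ID  Interface
0 (FPC 0)  Inactive XXXXX      ex4500-40f  129   Linecard  N     1   vcp-1
                                                                 1   vcp-0
1 (FPC 1)  Prsnt    XXXXX      ex4500-40f  129   Master*   N     0   vcp-1
                                                                 0   vcp-0
"""

# A member row looks like: "<id> (FPC <n>) <status> <serial> <model> <prio> <role> ..."
MEMBER_ROW = re.compile(
    r"^\s*(?P<member>\d+)\s+\(FPC\s+\d+\)\s+(?P<status>\S+)"
    r"\s+\S+\s+\S+\s+\d+\s+(?P<role>\S+)"
)


def parse_members(text):
    """Return a list of {member, status, role} dicts, one per member row."""
    members = []
    for line in text.splitlines():
        match = MEMBER_ROW.match(line)
        if match:
            members.append({
                "member": int(match.group("member")),
                "status": match.group("status"),
                # The trailing "*" marks the member you are logged in to.
                "role": match.group("role").rstrip("*"),
            })
    return members


def inactive_members(text):
    """Member IDs whose status is anything other than "Prsnt"."""
    return [m["member"] for m in parse_members(text) if m["status"] != "Prsnt"]


if __name__ == "__main__":
    print(parse_members(SAMPLE))
    print("not present:", inactive_members(SAMPLE))
```

Continuation rows that carry only a neighbor ID and interface are skipped, because they do not contain the "(FPC n)" part.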
The first device, member0, is recognized as a Linecard with the status Inactive, which means it takes no active part in the virtual chassis. The dedicated VC ports (vcp-1 and vcp-0) are up, so I can try connecting to it locally:
Connection and verification
{master:1}
user@switch> request session member 0
--- JUNOS 11.1R3.5 built 2011-06-25 01:18:46 UTC
{linecard:0}
user@switch> show system storage
fpc0:
--------------------------------------------------------------------------
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/da0s1a     370M   142M   198M      42%  /
devfs           1.0K   1.0K     0B     100%  /dev
/dev/md0         37M    37M     0B     100%  /packages/mnt/jbase
/dev/md1         12M   7.3M   3.6M      67%  /packages/mfs-jcrypto-ex
/dev/md2         22M    22M     0B     100%  /packages/mnt/jcrypto-ex-11.1R3.5
/dev/md3        8.7M   4.1M   3.9M      51%  /packages/mfs-jdocs-ex
/dev/md4        6.3M   6.3M     0B     100%  /packages/mnt/jdocs-ex-11.1R3.5
/dev/md5         64M    61M  -1.4M     102%  /packages/mfs-jkernel-ex
/dev/md6        162M   162M     0B     100%  /packages/mnt/jkernel-ex-11.1R3.5
/dev/md7         13M   8.5M   3.5M      71%  /packages/mfs-jpfe-ex45x
/dev/md8         24M    24M     0B     100%  /packages/mnt/jpfe-ex45x-11.1R3.5
/dev/md9         20M    15M   2.9M      84%  /packages/mfs-jroute-ex
/dev/md10        47M    47M     0B     100%  /packages/mnt/jroute-ex-11.1R3.5
/dev/md11        16M    11M   3.2M      78%  /packages/mfs-jswitch-ex
/dev/md12        35M    35M     0B     100%  /packages/mnt/jswitch-ex-11.1R3.5
/dev/md13        12M   7.8M   3.6M      68%  /packages/mfs-jweb-ex
/dev/md14        22M    22M     0B     100%  /packages/mnt/jweb-ex-11.1R3.5
/dev/md15       126M   8.0K   116M       0%  /tmp
/dev/da0s3e     243M   4.4M   219M       2%  /var
/dev/da0s3d     727M   130K   668M       0%  /var/tmp
/dev/da0s4d     123M   492K   113M       0%  /config
/dev/md16       118M    14M    95M      13%  /var/rundb
procfs          4.0K   4.0K     0B     100%  /proc
/var/jail/etc   243M   4.4M   219M       2%  /packages/mnt/jweb-ex-11.1R3.5/jail/var/etc
/var/jail/run   243M   4.4M   219M       2%  /packages/mnt/jweb-ex-11.1R3.5/jail/var/run
/var/jail/tmp   243M   4.4M   219M       2%  /packages/mnt/jweb-ex-11.1R3.5/jail/var/tmp
/var/tmp        727M   130K   668M       0%  /packages/mnt/jweb-ex-11.1R3.5/jail/var/tmp/uploads
devfs           1.0K   1.0K     0B     100%  /packages/mnt/jweb-ex-11.1R3.5/jail/dev
fpc1:
--------------------------------------------------------------------------
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/da0s2a     363M   130M   204M      39%  /
devfs           1.0K   1.0K     0K     100%  /dev
/dev/md0         69M    69M     0B     100%  /packages/mnt/jbase
/dev/md1        5.8M   1.1M   4.2M      21%  /packages/mfs-fips-mode-powerpc
/dev/md2        2.9M   2.9M     0B     100%  /packages/mnt/fips-mode-powerpc-12.3R12.4
/dev/md3        9.1M   4.4M   3.9M      53%  /packages/mfs-jcrypto-ex
/dev/md4         12M    12M     0B     100%  /packages/mnt/jcrypto-ex-12.3R12.4
/dev/md5        8.1M   3.5M   4.0M      47%  /packages/mfs-jdocs-ex
/dev/md6        6.2M   6.2M     0B     100%  /packages/mnt/jdocs-ex-12.3R12.4
/dev/md7         43M    39M   616K      98%  /packages/mfs-jkernel-ex
/dev/md8        109M   109M     0B     100%  /packages/mnt/jkernel-ex-12.3R12.4
/dev/md9         12M   7.9M   3.6M      69%  /packages/mfs-jpfe-ex45x
/dev/md10        22M    22M     0B     100%  /packages/mnt/jpfe-ex45x-12.3R12.4
/dev/md11        17M    12M   3.2M      79%  /packages/mfs-jroute-ex
/dev/md12        38M    38M     0B     100%  /packages/mnt/jroute-ex-12.3R12.4
/dev/md13        12M   7.2M   3.6M      67%  /packages/mfs-jswitch-ex
/dev/md14        21M    21M     0B     100%  /packages/mnt/jswitch-ex-12.3R12.4
/dev/md15        14M   9.5M   3.4M      73%  /packages/mfs-jweb-ex
/dev/md16        25M    25M     0B     100%  /packages/mnt/jweb-ex-12.3R12.4
/dev/da0s3e     243M    20M   204M       9%  /var
/dev/md17       252M    12K   232M       0%  /tmp
/dev/da0s3d     727M   107M   561M      16%  /var/tmp
/dev/da0s4d     123M   494K   113M       0%  /config
/dev/md18       118M    22M    86M      20%  /var/rundb
procfs          4.0K   4.0K     0B     100%  /proc
/var/jail/etc   243M    20M   204M       9%  /packages/mnt/jweb-ex-12.3R12.4/jail/var/etc
/var/jail/run   243M    20M   204M       9%  /packages/mnt/jweb-ex-12.3R12.4/jail/var/run
/var/jail/tmp   243M    20M   204M       9%  /packages/mnt/jweb-ex-12.3R12.4/jail/var/tmp
/var/tmp        727M   107M   561M      16%  /packages/mnt/jweb-ex-12.3R12.4/jail/var/tmp/uploads
devfs           1.0K   1.0K     0B     100%  /packages/mnt/jweb-ex-12.3R12.4/jail/dev
{linecard:0}
user@switch> exit
rlogin: connection closed
{master:1}
user@switch>
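Comparing two long storage listings by eye at night is error-prone, so here is a minimal Python sketch that extracts, for each fpc, the root file system device and the version embedded in the jkernel-ex mount path. It assumes the listing has been saved as text with the paths intact; the shortened sample and the names `summarize_storage` and `versions_differ` are hypothetical.

```python
import re

# Shortened sample with only the lines the sketch relies on.
SAMPLE = """\
fpc0:
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/da0s1a     370M   142M   198M      42%  /
/dev/md6        162M   162M     0B     100%  /packages/mnt/jkernel-ex-11.1R3.5
fpc1:
Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/da0s2a     363M   130M   204M      39%  /
/dev/md8        109M   109M     0B     100%  /packages/mnt/jkernel-ex-12.3R12.4
"""

FPC_HEADER = re.compile(r"^(fpc\d+):$")
KERNEL_MOUNT = re.compile(r"/packages/mnt/jkernel-ex-(\S+)$")


def summarize_storage(text):
    """Map each fpc to its root device and jkernel-ex package version."""
    summary = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = FPC_HEADER.match(line)
        if header:
            current = header.group(1)
            summary[current] = {"root": None, "version": None}
            continue
        if current is None:
            continue
        fields = line.split()
        # The root file system is the row whose mount point is exactly "/".
        if len(fields) >= 6 and fields[-1] == "/":
            summary[current]["root"] = fields[0]
        kernel = KERNEL_MOUNT.search(line)
        if kernel:
            summary[current]["version"] = kernel.group(1)
    return summary


def versions_differ(summary):
    """True if the members do not all run the same jkernel-ex version."""
    return len({entry["version"] for entry in summary.values()}) > 1


if __name__ == "__main__":
    info = summarize_storage(SAMPLE)
    for fpc, entry in info.items():
        print(fpc, entry["root"], entry["version"])
    print("mismatch:", versions_differ(info))
```

The root device is worth printing alongside the version: in the listing above fpc0 has its root on slice da0s1a while fpc1 has it on da0s2a.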
There it is! The OS was updated only on the second device, while the first still runs the old one (compare the package versions under fpc0 and fpc1), so the VC logic deactivated it. Either way, the device is reachable and I can try updating it again. One problem: following the Juniper guides, I had put the image in /var/tmp, which is now empty, so the image has to be uploaded again. I focus on this switch and try several times to update and reboot only it (member1 keeps working):
{master:1}
user@switch> request system software add /var/tmp/jinstall-XXX.tgz validate member 0
user@switch> request system reboot member 0
At the end of the installation process I see the same thing every time:
Installing disk0s3d:/jinstall-ex-4500-12.3R12.4-domestic-signed.tgz
Verified jinstall-ex-4500-12.3R12.4-domestic.tgz signed by PackageProduction_12_3_0
mode = 040700, inum = 38, fs = /instrootmnt/var
panic: ffs_valloc: dup alloc
###Entering boot mastership relinquish phase
KDB: enter: panic
###Entering boot mastership relinquish phase
[thread pid 316 tid 100041 ]
Stopped at kdb_enter+0x1a0: addis r3, r0, -0x7fa4
db>
Even with my limited knowledge of the Unix that JunOS is built on, the line “KDB: enter: panic” is not encouraging. Worse, the system drops into the kernel debugger (db>), which is very bad. For reference: Juniper has the normal console mode familiar to everyone, where a working device is configured and from which you can get to the Unix shell as root, and more; there is the loader> bootloader mode for restoring the system and installing an OS image, roughly corresponding to Cisco's rommon>; and there is the debug mode db>, which appears after a kernel panic and points to a serious low-level problem, such as a fault in the physical components of the device. There is very little you can do in this mode unless you are a Juniper TAC engineer. At that moment I do not really understand what it is, so, as a proud Windows user, I try to click “next”:
db> help
DDB Quick Help
-------------------
Type 'c' to continue, 'reset' or 'panic' to restart.
print p examine x search set write
w delete d break dwatch watch dhwatch
hwatch step s continue c until next
match trace alltrace where bt call show
ps gdb reset kill watchdog thread panic
ddbdumpsys dumpsys halt reboot
db> c
Uptime: 2m41s
Cannot dump. No dump device defined.
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
...Lots of output during the reboot...
***** FILE SYSTEM MARKED CLEAN *****
switch (ttyu0)
login: user
Logging to master
...
Connection to master failed, enabling local login
Password:
--- JUNOS 11.1R3.5 built 2011-06-25 01:18:46 UTC
{linecard:0}
user@switch>
A miracle: the system boots, albeit with the old version. At the time I did not realize that this old version was loaded from the backup partition (the alternate slice), because the updated version had been written to the primary partition and, in my case, could not boot from it. This is why it is so important to update the bootloader whenever possible: it is one more lifeline when things go wrong. A side note: pay attention to the lines “Logging to master ... Connection to master failed”. All devices joined in a VC share a single management console, so when you connect, for example over SSH, you land directly on the master's console. Since the VC is not operational in my case, I end up in the device's local management mode.
Along the way I think of uploading the OS image to the working RE and copying it between the VC members: this is faster and saves me from constantly switching to WinSCP. It works even in my situation, because the links between the devices are up.
user@switch> file copy fpc1:/var/tmp/jinstall-XXX.tgz fpc0:/var/tmp/jinstall-XXX.tgz
Nevertheless, every attempt to update and reboot ends the same way: I land in the kernel debugger and can then only boot the old version. The problem is clearly persistent, and repeating the same steps will get me nowhere. Then an idea comes to me: I have a device with a working system (member1) and a flash drive, so I can write a snapshot to the drive and boot from it. So I do:
{master:1}
umass1: SanDisk Corporation U3 Cruzer Micro, rev 2.00/0.10, addr 4
da1 at umass-sim1 bus 1 target 0 lun 0
da1: Removable Direct Access SCSI-2 device
da1: 40.000MB/s transfers
da1: 973MB (1994385 512 byte sectors: 64H 32S/T 973C)
user@switch> request system snapshot local partition media external
user@switch> show system snapshot media external
fpc0:
--------------------------------------------------------------------------
error: external media missing or invalid
fpc1:
--------------------------------------------------------------------------
Information for snapshot on external (/dev/da1s1a) (backup)
Creation date: Jun 2 02:28:20 2016
JUNOS version on snapshot:
jbase : 11.1R3.5
jkernel-ex: 11.1R3.5
jcrypto-ex: 11.1R3.5
jdocs-ex: 11.1R3.5
jswitch-ex: 11.1R3.5
jpfe-ex45x: 11.1R3.5
jroute-ex: 11.1R3.5
jweb-ex: 11.1R3.5
Information for snapshot on external (/dev/da1s2a) (primary)
Creation date: Jun 2 02:29:21 2016
JUNOS version on snapshot:
jbase : ex-12.3R12.4
jkernel-ex: 12.3R12.4
jcrypto-ex: 12.3R12.4
jdocs-ex: 12.3R12.4
jswitch-ex: 12.3R12.4
jpfe-ex45x: 12.3R12.4
jroute-ex: 12.3R12.4
jweb-ex: 12.3R12.4
fips-mode-powerpc: 12.3R12.4
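Since the slice names matter later, it helps to have an explicit map of which slice holds which version. A minimal Python sketch follows, assuming the snapshot report has been saved as text in the format shown above; the shortened sample and the function name `snapshot_slices` are hypothetical.

```python
import re

# Shortened sample modelled on the `show system snapshot media external` output.
SAMPLE = """\
Information for snapshot on external (/dev/da1s1a) (backup)
Creation date: Jun 2 02:28:20 2016
JUNOS version on snapshot:
jbase : 11.1R3.5
jkernel-ex: 11.1R3.5
Information for snapshot on external (/dev/da1s2a) (primary)
Creation date: Jun 2 02:29:21 2016
JUNOS version on snapshot:
jbase : ex-12.3R12.4
jkernel-ex: 12.3R12.4
"""

SLICE_HEADER = re.compile(
    r"^Information for snapshot on \S+ \((/dev/\S+)\) \((\w+)\)"
)
# Package lines start with a lowercase name, e.g. "jkernel-ex: 12.3R12.4".
PACKAGE_LINE = re.compile(r"^([a-z][\w-]*)\s*:\s*(\S+)$")


def snapshot_slices(text):
    """Map each slice device to its role and the package versions it holds."""
    slices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SLICE_HEADER.match(line)
        if header:
            current = header.group(1)
            slices[current] = {"role": header.group(2), "packages": {}}
            continue
        package = PACKAGE_LINE.match(line)
        if current is not None and package:
            slices[current]["packages"][package.group(1)] = package.group(2)
    return slices


if __name__ == "__main__":
    for device, info in snapshot_slices(SAMPLE).items():
        print(device, info["role"], info["packages"].get("jkernel-ex"))
```

The "Creation date" and "JUNOS version on snapshot" lines are ignored on purpose: they start with an uppercase letter and so do not match the package pattern.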
Note the messages that appear when the flash drive is plugged in: it is detected as device da1, which will matter later. The snapshot on the external flash drive mirrors the internal storage of the device: version 12.3 on the primary partition (/dev/da1s2a) and 11.1 on the backup (/dev/da1s1a). The slice names can also come in handy if you want to boot the system from a specific partition. I insert the USB flash drive into the problem device and continue:
user@switch> request session member 0
--- JUNOS 11.1R3.5 built 2011-06-25 01:18:46 UTC
{linecard:0}
user@switch> request system reboot member 0 media external
Reboot the system ? [yes,no] (no) yes
Here, again as a precaution, I logged in to the device's local session first; most likely member0 could have been rebooted from the master console just as well. After the restart I see the same sequence repeating endlessly:
U-Boot 1.1.6 (Mar 26 2011 - 04:34:19)
Board: EX4500-40F 10.4
EPLD: Version 6.2 (0x81)
DRAM: Initializing (1024 MB)
FLASH: 8 MB
Firmware Version: 01.00.00
USB: scanning bus for devices... 3 USB Device(s) found
scanning bus for storage devices... 1 Storage Device(s) found
ELF file is 32 bit
Consoles: U-Boot console
FreeBSD/PowerPC U-Boot bootstrap loader, Revision 2.4
(hmerge@svl-junos-pool130.juniper.net, Sat Mar 26 02:46:28 PDT 2011)
Memory: 1024MB
bootsequencing is enabled
bootsuccess is not set
new boot device = disk2
can't load '/kernel'
can't load '/kernel.old'
Press Enter to stop auto bootsequencing and to enter loader prompt.
Watchdog timed out. Resetting the board.
The switch gets no further than these repeating lines. What the?!? It cannot find the kernel? After a while I notice the penultimate line, press Enter and land in the loader:
loader> ?
Available commands:
bcachestat get disk block cache stats
boot boot a file or loaded kernel
autoboot boot automatically after a delay
help detailed help
? list commands
show show variable(s)
set set a variable
unset unset a variable
echo echo arguments
read read input from the terminal
more show contents of a file
nextboot set next boot device
lsdev list all devices
install install JUNOS
include read commands from a file
ls list files
load load a kernel or module
unload unload all modules
lsmod list loaded modules
export export variables to U-Boot environment
save save U-Boot environment
heap show heap usage
It is a little absurd, but still better than a reboot loop. The loader mode exists precisely for restoring the system, so I am in the right place. More than two hours have passed by now... I try different variants of the image location and of the install command, without result.
loader> install /var/tmp/jinstall-ex-4500-12.3R12.4-domestic-signed.tgz
invalid URL
loader> install --format file:///jinstall-ex-4500-12.3R12.4-domestic-signed.tgz
cannot open package (error 22)
loader> install --format file:///jinstall-ex-4500-12.3R12.4-domestic-signed.tgz
Device NOT ready
Request Sense returned 06 28 00
cannot open package (error 5)
These commands should actually work, but for some reason they do not: either I was no longer thinking straight by then, or something else was wrong. I see the same reboot loop and the same complaints about the missing kernel. While the switch keeps rebooting, another interesting detail catches my eye:
Firmware Version: 01.00.00
USB: scanning bus for devices... 3 USB Device(s) found
scanning bus for storage devices... 1 Storage Device(s) found
ELF file is 32 bit
Consoles: U-Boot console
FreeBSD/PowerPC U-Boot bootstrap loader, Revision 2.4
(hmerge@svl-junos-pool130.juniper.net, Sat Mar 26 02:46:28 PDT 2011)
Memory: 1024MB
bootsequencing is enabled
bootsuccess is not set
new boot device = disk2
At that moment it is no more than a guess, but given that Juniper numbers devices from 0, “disk2” looks strange to me: I have only one flash drive. Besides, when I plugged the flash drive in, it was detected as da1. Looking back, you can see that the device already tried to boot from disk2 right after the reboot from the console (when I specified the external USB flash drive as the boot device), but I had not noticed it until now. I return to the loader and confirm my fears: there is no disk2, and the flash drive is device 0:
loader> lsdev
disk devices:
disk0 - USB storage device 0
net devices:
net0:
loader> nextboot disk0:
loader> reboot
Resetting...
Is that all? Not at all! The system again tries to boot from disk2, but now I feel that I am on the right track. Along the way I go through nearby options with different slices on the flash drive (nextboot diskXsY), with no result. Almost in despair, I find information that the boot device should be set as an environment variable from U-Boot mode. I don't know how to describe this fourth mode or what else can be done there, but you get into it by interrupting the boot process with Ctrl+C at the very beginning, when the system scans for USB devices (USB: scanning bus for devices...). The first line contains INTERRUPT in <> delimiters, but they broke the markup and fonts, so I removed the delimiters:
=> INTERRUPT
=> setenv loaddev disk1
=> saveenv
Saving Environment to Flash...
. done
Un-Protected 1 sectors
Erasing Flash...
. done
Erased 1 sectors
Writing to Flash... writing to flash...
done
. done
Protected 1 sectors
=> reset
...Reboot...
...
Boot media /dev/da1 has dual root support
WARNING: JUNOS versions running on dual partitions are not same
** /dev/da1s1a
FILE SYSTEM CLEAN; SKIPPING CHECKS
clean, 274948 free (84 frags, 34358 blocks, 0.0% fragmentation)
switch (ttyu0)
login: user
Logging to master
...
Connection to master failed, enabling local login
Password:
--- JUNOS 12.3R12.4 built 2016-01-20 04:27:51 UTC
warning: This chassis is operating in a non-master role as part of a virtual-chassis (VC) system.
warning: Use of interactive commands should be limited to debugging and VC Port operations.
warning: Full CLI access is provided by the Virtual Chassis Master (VC-M) chassis.
warning: The VC-M can be identified through the show virtual-chassis status command executed at this console.
warning: Please logout and log into the VC-M to use CLI.
{linecard:1}
user@switch>
WARNING: cli has been replaced by an updated version:
CLI release 12.3R12.4 built by builder on 2016-01-20 03:55:45 UTC
Restart cli using the new version ? [yes,no] (yes)
Restarting cli ...
{master:0}
user@switch>
Let's go through what I saw after the reboot:
“WARNING: JUNOS versions running on dual partitions are not same” is not a problem and is expected, because the new version is present only in the primary slice of the device.
“Connection to master failed ...” and “warning: This chassis is operating in a non-master role ...” are not a problem either, since the VC needs time to restore communication between members and synchronize the configuration.
After several minutes of waiting, the system itself asks to restart the CLI (WARNING: cli has been replaced by an updated version), and now the new version is loaded on the correct RE.
Let's check:
user@switch> show chassis routing-engine
Routing Engine status:
Slot 0:
Current state Master
DRAM 1024
Memory utilization 50 percent
CPU utilization:
User 43 percent
Background 0 percent
Kernel 24 percent
Interrupt 1 percent
Idle 32 percent
Model EX4500-40F
Serial ID
Start time 2016-06-02 03:43:20
Uptime 3 minutes, 22 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
2.40 1.12 0.46
Routing Engine status:
Slot 1:
Current state Backup
DRAM 1024
Memory utilization 44 percent
CPU utilization:
User 40 percent
Background 0 percent
Kernel 30 percent
Interrupt 1 percent
Idle 28 percent
Model EX4500-40F
Serial ID
Start time 2016-06-02 01:28:45
Uptime 2 hours, 17 minutes, 57 seconds
Last reboot reason Router rebooted after a normal shutdown.
Load averages: 1 minute 5 minute 15 minute
0.49 0.46 0.44
{master:0}
user@switch> show virtual-chassis
Preprovisioned Virtual Chassis
Virtual Chassis ID:
Virtual Chassis Mode: Enabled
                                      Mstr           Mixed Neighbor List
Member ID  Status  Serial No   Model  prio  Role     Mode  ID  Interface
0 (FPC 0)  Prsnt   ex4500-40f  ХХХХ   129   Master*  N     1   vcp-1
                                                           1   vcp-0
1 (FPC 1)  Prsnt   ex4500-40f  ХХХХ   129   Backup   N     0   vcp-1
                                                           0   vcp-0
{master:0}
Victory! Complete and unconditional! To say that I was pleased with myself is to say nothing: my ego simply went through the roof. Although the work took about 4 hours, that mattered little, since the clients did not notice anything. I not only awarded myself a virtual medal, but also saved my company a lot of money. I got so many impressions in those 4 hours that it then took many days (and beers) to put everything together and understand the whole picture.
Now all that remains is to take snapshots to the internal storage: to the primary partition right away and, after a week or two, to the backup one. Why wait a week? To run in the new version in production, since booting the old version of the system from the backup partition is much easier than downgrading the whole device.
Let's analyze the situation.
According to Juniper TAC, the upgrade problems were caused by damage to the primary boot partition. Nothing can be done about that, and the switch has to be replaced under warranty. I still very much hope that the problem was caused by file system damage (an unclean reboot or the like) and was fixed along the way (Un-Protected 1 sectors Erasing Flash... done) when I set the environment variable.
Why on earth the device wanted to boot from disk2, when nobody had explicitly pointed to it and it did not exist in the system, is unclear; TAC also found it difficult to comment. In the logs you can even trace how disk2 appears out of nowhere (note that new boot device = disk1s2 changes to new boot device = disk2):
Change boot device
user@switch> request system reboot member 0 media external
Reboot the system ? [yes,no] (no) yes
Rebooting fpc0
*** FINAL System shutdown message from root@switch ***
System going down IMMEDIATELY
{linecard:0}
iuriia@CORE> JWaiting (max 300 seconds) for system process `vnlru_mem' to stop...done
Waiting (max 300 seconds) for system process `vnlru' to stop...done
Waiting (max 300 seconds) for system process `bufdaemon' to stop...done
Waiting (max 300 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining...2 2 2 0 1 1 1 0 0 0 0 0 done
syncing disks... All buffers synced.
Uptime: 23m53s
recorded reboot as normal shutdown
Rebooting...
U-Boot 1.1.6 (Mar 26 2011 - 04:34:19)
Board: EX4500-40F 10.4
EPLD: Version 6.2 (0x82)
DRAM: Initializing (1024 MB)
FLASH: 8 MB
Firmware Version: 01.00.00
USB: scanning bus for devices... 3 USB Device(s) found
scanning bus for storage devices... 1 Storage Device(s) found
ELF file is 32 bit
Consoles: U-Boot console
FreeBSD/PowerPC U-Boot bootstrap loader, Revision 2.4
(hmerge@svl-junos-pool130.juniper.net, Sat Mar 26 02:46:28 PDT 2011)
Memory: 1024MB
bootsequencing is enabled
bootsuccess is set
new boot device = disk1s2:
can't load '/kernel'
can't load '/kernel.old'
Press Enter to stop auto bootsequencing and to enter loader prompt.
Watchdog timed out. Resetting the board.
U-Boot 1.1.6 (Mar 26 2011 - 04:34:19)
Board: EX4500-40F 10.4
EPLD: Version 6.2 (0x81)
DRAM: Initializing (1024 MB)
FLASH: 8 MB
Firmware Version: 01.00.00
USB: scanning bus for devices... 3 USB Device(s) found
scanning bus for storage devices... 1 Storage Device(s) found
ELF file is 32 bit
Consoles: U-Boot console
FreeBSD/PowerPC U-Boot bootstrap loader, Revision 2.4
(hmerge@svl-junos-pool130.juniper.net, Sat Mar 26 02:46:28 PDT 2011)
Memory: 1024MB
bootsequencing is enabled
bootsuccess is not set
new boot device = disk2
In practice, this problem added an hour and a half to the time spent. Yes, the switch also complains about the missing kernel, but why the system then tries to use disk2, which it did not seem to see at the loader> prompt, is unclear. I can assume that when there are boot problems the device cycles through the disks, but again, the system did not see any disk2 device. How and why the same flash drive later booted the device successfully also raises questions.
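The change of boot device between attempts can also be pulled out of a saved console log automatically. Below is a minimal Python sketch, not a Juniper tool: the function name trace_boot_attempts and the regular expressions are assumptions made for illustration, based only on the "bootsuccess is ..." and "new boot device = ..." lines as they appear in the log above.

```python
import re

# Patterns taken from the loader messages shown in the log above.
BOOT_SUCCESS = re.compile(r"bootsuccess is (not set|set)")
BOOT_DEVICE = re.compile(r"new boot device = (\S+)")


def trace_boot_attempts(log_text):
    """Return a list of (bootsuccess_state, boot_device), one per boot attempt.

    bootsuccess_state is "set", "not set", or None if the line was not
    seen before the corresponding "new boot device" line.
    """
    attempts = []
    state = None
    for line in log_text.splitlines():
        found = BOOT_SUCCESS.search(line)
        if found:
            state = found.group(1)
        found = BOOT_DEVICE.search(line)
        if found:
            # The loader prints some devices with a trailing colon (disk1s2:).
            attempts.append((state, found.group(1).rstrip(":")))
            state = None
    return attempts
```

Fed with the log above, such a helper would list disk1s2 (bootsuccess set) followed by disk2 (bootsuccess not set), which makes the unexpected switch to disk2 visible at a glance.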
It is possible that I made a mistake here:
loader> nextboot disk0:
loader> reboot
because the loader's settings are lost on restart. I should have tried “boot” instead of “reboot”, but at the time I did not.
The new version of the system noticeably increased the load on the device. On the old version, CPU load during the day was about 27-30%; after the update it is 45-48%, although neither the fairly simple device configuration nor the traffic profile changed. After several remote sessions with Juniper TAC the cause could not be established: there were suspicions of a memory leak and similar problems, but none were confirmed. Strange, but it had to be accepted as a fact.
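To compare the load before and after an update with numbers rather than impressions, the "show chassis routing-engine" output can be parsed. Here is a rough Python sketch under stated assumptions: the output has the same layout as shown earlier in this post (one "Slot N:" line and one "Idle NN percent" line per RE), the load is taken simply as 100 minus idle, and cpu_load_by_slot is a hypothetical helper name, not part of any Juniper tooling.

```python
import re

SLOT = re.compile(r"\s*Slot (\d+):")
IDLE = re.compile(r"\s*Idle\s+(\d+) percent")


def cpu_load_by_slot(output):
    """Map RE slot number -> CPU load in percent, computed as 100 - idle."""
    loads = {}
    slot = None
    for line in output.splitlines():
        found = SLOT.match(line)
        if found:
            slot = int(found.group(1))
            continue
        found = IDLE.match(line)
        if found and slot is not None:
            loads[slot] = 100 - int(found.group(1))
    return loads
```

Collected at regular intervals on the old and the new version, such values would give a like-for-like comparison. A single sample taken a few minutes after boot, like the one in this post, says little about the daily average.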
An attentive reader may have noticed that the device names shown in the loader (disk0) and those used for the successful boot (disk1 and then /dev/da1s1a) are different. I will not venture to say what this is connected with. I can assume that the names change depending on how far the system boot has progressed: once the loader has loaded you get one set of device names, from db> there will be others, and from the CLI we refer to devices via “media external” and “media internal”. For now, this is only an assumption.
I had put most of the steps and commands above into a guide long before the update. After that, I periodically reread it and extended it whenever possible problems occurred to me. The only things it lacked were the db> mode and the => setenv procedure. Clearly, I could not foresee everything, and some things did not work as they should. But honestly, without this guide and the time spent mentally rehearsing it, I would have given up, all the more so because this was night work and my mind was not at its sharpest.
Backups: although they did not help me much, having them put my conscience and soul at ease. In the worst case, even if the entire internal storage had been damaged, I would have pasted the text config in through the console. These two things ensure that you concentrate on the work itself rather than on figuring out how to return everything to its original state and what to do next.
Of the significant shortcomings: during the work I had several PuTTY tabs open, all writing the log to a single file. Afterwards it was very hard to sort everything out by device and timestamp. It would have been better to use SecureCRT or to run a separate window for each device, especially since I had the means to do so.
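The idea of "one log per device, every line timestamped" can be sketched in a few lines of Python. This is only an illustration under assumptions: stamp and log_session are hypothetical names, and how the console lines actually reach the script (serial port, SSH session, terminal emulator) is deliberately left open.

```python
from datetime import datetime, timezone


def stamp(device, line, now):
    """Prefix one console line with an ISO timestamp and a device label."""
    return f"{now.isoformat(timespec='seconds')} [{device}] {line.rstrip()}"


def log_session(device, lines, out, clock=lambda: datetime.now(timezone.utc)):
    """Write every line of one device's session to `out`, timestamped.

    `lines` is any iterable of strings and `out` any writable text stream,
    for example a file opened separately for each device.
    Returns the number of lines written.
    """
    count = 0
    for line in lines:
        out.write(stamp(device, line, clock()) + "\n")
        count += 1
    return count
```

With a separate output file per device (for example member0.log and member1.log, both names being placeholders), the sessions stay apart, and the timestamps make it possible to put events from both switches on a single timeline afterwards.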
And at the end - a picture from the scene. I hope this post will be useful to you. Good luck in upcoming updates!
PS: for the command output I used regular code markup, which looks worse than markup with a source-code background for a specific language or Bash. However, the plain “code” markup allows bold text, which I needed to highlight the interesting places in the command output. If anyone can share how to get both (background + bold inside), I will be grateful and promise to use it in the future.
Update: it turned out that the code markup is rendered differently in different browsers and versions. I'm off to dig further into how to make the text clearer and more readable.