Is your computer reliable?

Here is a translation of Jeff Atwood's article on testing new computers. I have not seen another article of this quality on the subject: it is well structured and gives all the necessary information and nothing more. I hope you enjoy it too.

Jeff is a co-founder of Stack Overflow. He is currently working on the Discourse project.

Original article: Is Your Computer Stable?

Disclaimer: although the title asks “Is your computer reliable?”, the article is not about reliability in the strict sense but about stability. It describes how the author tests new computers for stability and durability.


If my memory serves me right, I have assembled about a hundred computers over the past twenty years. This is not so difficult and, in fact, it only gets easier over time, as computers become more and more compatible.

For example, here is everything it takes to build the Scooter Computer:

  1. Apply a little thermal paste to the top of the case.
  2. Place the motherboard in the case.
  3. Screw the motherboard to the case.
  4. Insert the SSD.
  5. Insert the RAM module.
  6. Connect external power.
  7. Boot up.

That's all.



It's ridiculously simple. My six-year-old son and I have assembled Lego sets that were much more complicated. Building a traditional PC takes only a couple of additional steps: install the processor and heatsink, and connect the cables. Building a server adds a few more minor steps, possibly with constraints on the size of the build. Mini-PC, regular PC, or server: if you have built one of them, you have built them all.

Each of us breathes a sigh of relief when a freshly built computer boots for the first time, no matter how many machines we have built before. But booting is only the beginning. It is great that it boots, but that will not impress anyone. What we really need to know is whether this computer is reliable.

Computer components become more reliable every year, and manufacturers run numerous tests before shipping, but there is no guarantee that all the parts will work reliably together under your particular conditions. And there is always a chance, however small, that you will get a part with a subtle internal defect.

Since we are scientists, we test things under proper conditions and collect data to show that our computer is stable. So once it boots, we start testing.

Memory


I like to start with memory testing, since it does not require an installed OS and works the same on all x86 computers. Memtest86 is the "great-grandfather" of all memory testers. I'm not sure why it and Memtest86+ forked, but they work almost identically. The PassMark version is the more recently updated one, which is why I recommend it.

Download the version that suits you, write it to a bootable USB flash drive, insert it into a new computer, boot up and let the program do its job. Everything works in automatic mode - just boot and see how the test runs.

image
(If your computer supports UEFI boot, you can use the newer 6.x version; otherwise use version 4.2, which is shown in the screenshot.)

I recommend at least one full memtest pass, and if you need to be confident in the stability of your computer, leave it running overnight. If you have a lot of memory, be patient. On our servers with 128 GB of memory, testing took about 3 hours.

The “Pass” percentage at the top of the screen should reach 100%, and the “Pass” count in the table should be greater than one. If you get any errors, or anything other than a clean 100%, your computer is not reliable. In that case, start pulling memory modules one at a time to find the faulty one.

Operating system


All subsequent tests need an installed OS, and the most important reliability test of all is whether an operating system can be installed on the computer. Choose your favorite free OS and start a normal installation. I recommend Ubuntu Server LTS x64, as it expects much less of your video hardware. Download the ISO, write it to a bootable USB flash drive, and boot from it.

image
(Hey, just look, there is an option for testing memory! How prudent!)

  • Make sure you have a stable internet connection with DHCP. This will allow the installation to go faster.
  • For the most part, you will press Enter many times, accepting all the default settings. Yes, I know we are installing Linux, but believe it or not, the installation process has become very friendly.
  • As for the login and password for the default account, I recommend jeff and password, as I am one of the most outstanding experts in computer security.
  • If you install the OS from a USB flash drive and get a message about a missing CD, simply remove and reinsert the flash drive. I don’t know why this works, but it works.

If anything happens during the installation that prevents it from completing ... your computer is not reliable. I know this does not tell you much about the problem, but installing an OS is a good broad test of the entire system.

In any case, the following tests need an installed OS. From here on I assume that you have installed Ubuntu, but in practice any Linux distribution will do.

CPU


Now let's make sure that the brains of our computer are in order. Honestly, if you have reached this point and the memory and OS tests passed, the chance that you have a faulty computer is almost zero. But we need to be sure, and the best way to get there is to turn to our old friend, Marin Mersenne.

image
Mersenne numbers are numbers of the form Mn = 2^n - 1, where n is a natural number. They are notable in that some of them are prime. They are named after the French mathematician Marin Mersenne, who studied their properties in the 17th century.
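The standard way to check whether a Mersenne number is prime is the Lucas-Lehmer test, the same iteration that mprime's torture test is built on. Below is a toy sketch in POSIX shell arithmetic, not mprime's FFT-based implementation. The function name lucas_lehmer is illustrative, the exponent p is assumed to be prime, and p is capped at 31 so that intermediate products fit in the 64-bit integers most shells use.

```shell
#!/bin/sh
# Toy Lucas-Lehmer test: exit status 0 if M_p = 2^p - 1 is prime.
# Assumes p itself is prime and 2 <= p <= 31, so that s*s fits in the
# 64-bit integers most shells use for arithmetic.
lucas_lehmer() (
  p=$1
  [ "$p" -eq 2 ] && exit 0          # M_2 = 3 is prime; the loop below needs odd p
  if [ "$p" -lt 3 ] || [ "$p" -gt 31 ]; then
    echo "p must be between 2 and 31" >&2
    exit 2
  fi
  m=$(( (1 << p) - 1 ))             # the Mersenne number M_p
  s=4                               # starting value of the sequence
  i=0
  while [ "$i" -lt $(( p - 2 )) ]; do
    s=$(( (s * s - 2) % m ))        # s <- s^2 - 2 (mod M_p), repeated p - 2 times
    i=$(( i + 1 ))
  done
  [ "$s" -eq 0 ]                    # M_p is prime exactly when the result is 0
)

for p in 3 5 7 11 13; do
  if lucas_lehmer "$p"; then
    echo "M$p = $(( (1 << p) - 1 )) is prime"
  else
    echo "M$p = $(( (1 << p) - 1 )) is composite"
  fi
done
```

Real exponents such as the M7471105 used in the self-test need arbitrary-precision arithmetic, which is exactly what makes the workload so heavy on the CPU.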

I usually use Prime95 and mprime, programs that crunch through enormous numbers to determine whether they are prime. Here is how to download and install mprime on our freshly installed Ubuntu Server:

mkdir mprime
cd mprime
wget mersenne.org/gimps/p95v287.linux64.tar.gz
tar xzvf p95v287.linux64.tar.gz
rm p95v287.linux64.tar.gz

(You may need to replace the version number in the commands with the current one from www.mersenne.org/download; the version above was the latest at the time of writing.)

Now run mprime:

./mprime

Answer N to the prompt that follows.

image



Next, you will be asked for the number of torture test threads. The program is smart and defaults to one thread per logical core, so just press Enter: we want every processor and core fully tested. Then select the type of test:

  1. Small FFTs (maximum heat and FPU stress, data fits in the L2 cache, RAM is barely tested).
  2. In-place large FFTs (maximum power consumption, some RAM tested).
  3. Blend (tests a little of everything, lots of RAM tested).

I should warn you that they are not kidding about the power consumption. Choose 2, then Y to start torturing your processor. Now watch it writhe in pain.

Accept the answers above? (Y):
[Main thread Feb 14 05:48] Starting workers.
[Worker #2 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Setting affinity to run worker on logical CPU #2
[Worker #4 Feb 14 05:48] Worker starting
[Worker #2 Feb 14 05:48] Setting affinity to run worker on logical CPU #3
[Worker #1 Feb 14 05:48] Worker starting
[Worker #1 Feb 14 05:48] Setting affinity to run worker on logical CPU #1
[Worker #4 Feb 14 05:48] Setting affinity to run worker on logical CPU #4
[Worker #2 Feb 14 05:48] Beginning a continuous self-test on your computer.
[Worker #4 Feb 14 05:48] Test 1, 44000 Lucas-Lehmer iterations of M7471105 using FMA3 FFT length 384K, Pass1=256, Pass2=1536.

Now is the right time to break out your Kill-a-Watt or a similar power meter, if you have one, and measure the maximum power draw of the processor. In most systems the CPU is the only significant power consumer, unless you have a powerful gaming graphics card. I also recommend running i7z in another terminal:

sudo apt-get install i7z
sudo i7z

That way you can monitor core temperatures and frequencies while mprime does its job.

Let mprime run all night at maximum heat. All calculations are carefully checked, so if an error occurs anywhere, the whole process halts and prints the error to the console. In short, if mprime stops ... your computer is not reliable.

image

Watch the processor temperature! Besides the absolute CPU temperature, you also need to keep an eye on the total heat generated in the system. The fans should spin up and the temperature of the whole system should stay within acceptable limits; otherwise you will end up with a faulty, overheating computer.
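If you want a scriptable check alongside the interactive monitor, the Linux kernel exposes thermal sensors under /sys/class/thermal, with readings in millidegrees Celsius. The sketch below is only an illustration: the function name check_temps and the 90 C limit are assumptions, not values from the article, and which zones exist depends on your hardware.

```shell
#!/bin/sh
# Print every thermal zone under a sysfs-style directory and flag any
# reading above a limit. The kernel reports millidegrees Celsius.
# check_temps and the 90 C default limit are illustrative assumptions.
check_temps() (
  base=${1:-/sys/class/thermal}
  limit_c=${2:-90}
  status=0
  for f in "$base"/thermal_zone*/temp; do
    [ -r "$f" ] || continue                      # no zones exposed on this machine
    milli=$(cat "$f" 2>/dev/null) || continue    # some zones refuse reads
    [ -n "$milli" ] || continue
    temp_c=$(( milli / 1000 ))
    zone=$(basename "$(dirname "$f")")
    if [ "$temp_c" -gt "$limit_c" ]; then
      echo "$zone: ${temp_c} C  TOO HOT (limit ${limit_c} C)"
      status=1
    else
      echo "$zone: ${temp_c} C"
    fi
  done
  exit "$status"
)

# Poll once; run it in a loop or under `watch` while the stress test is going.
check_temps /sys/class/thermal 90 || echo "warning: at least one zone is over the limit"
```

Pick a limit that matches your processor's documented maximum rather than trusting the placeholder value.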

The bad news is that in practice computers almost never see loads like this. The good news is that if your system survives a night in this mode, it is 100% ready for whatever work you throw at it.

Disk


Disks are probably the easiest component to replace, but they are also the most likely to fail. We know the disk cannot be completely broken, since we just installed an OS on it, but an extra test will not hurt.

Let's start with a bad blocks test (badblocks):

sudo badblocks -sv /dev/sda

Checking blocks 0 to 125034839
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

This tests the entire disk in safe read-only mode. It goes without saying that any errors here should make you doubt the health of your disk.

Now check the SMART information for the drive:

sudo apt-get install smartmontools
sudo smartctl -i /dev/sda

The command above tells you whether your drive supports SMART. If it does, enable it:

sudo smartctl -s on /dev/sda

Now we are ready to run SMART tests. But first, let's find out how long the different tests will take:

sudo smartctl -c /dev/sda

Run the long test if you have time, or the short one if not:

sudo smartctl -t long /dev/sda

Tests run asynchronously; once the indicated time has passed, open the SMART self-test log and make sure everything completed successfully:

sudo smartctl -l selftest /dev/sda

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 100 -

Next, run a simple benchmark to make sure that disk performance is roughly what you expect:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
sudo hdparm -Tt /dev/sda

On a system with a decent SSD you should see at least the following results, and most likely much better:

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads: 11434 MB in 2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads: 760 MB in 3.00 seconds = 253.09 MB/sec

Finally, let's run a more intensive test using bonnie++:

sudo apt-get install bonnie++
bonnie++ -f

The numbers themselves are not very important here; what matters is that the test finishes without errors. If you get errors during any of the steps above ... your computer is not reliable.



(I believe the tests above are fine for everyday use, in particular for disks in RAID. If you want to test your disks even more thoroughly, a good resource is the FreeNAS guide “how to burn in hard drives”.)
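When burning in several machines, it can help to turn the pass/fail judgment for the badblocks and SMART steps into an exit status. The helpers below are a rough sketch: the function names and the idea of saving tool output to log files are illustrative assumptions, and the matched strings are simply the ones visible in the sample output in this section.

```shell
#!/bin/sh
# Helpers that turn saved tool output into a pass/fail exit status.
# Function names and log file names are hypothetical; the matched
# strings come from the sample output shown in the article.

# Succeeds only if the badblocks summary reports zero bad blocks.
badblocks_clean() {
  grep -q 'Pass completed, 0 bad blocks found' "$1"
}

# Succeeds only if the newest self-test entry ("# 1") completed without error.
smart_selftest_clean() {
  grep -E '^# *1 ' "$1" | grep -q 'Completed without error'
}

# Example usage (run as root, /dev/sda as in the text):
#   badblocks -sv /dev/sda 2>&1 | tee badblocks.log
#   smartctl -l selftest /dev/sda > selftest.log
#   badblocks_clean badblocks.log && smart_selftest_clean selftest.log \
#     && echo "disk checks passed"
```

Matching on fixed strings is fragile across tool versions, so treat a failed match as a prompt to read the full output rather than as proof of a bad disk.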

Network


Honestly, I do not have much experience with network problems. But I believe in the importance of bandwidth, and that is something we can verify.

You will need two computers for the iperf test. Let's say our server has the address 10.0.0.1; here are the commands to run on it:

sudo apt-get install iperf
iperf -s

And here is the client, which connects to the server and measures how quickly we can transfer data between the machines:

sudo apt-get install iperf
iperf -c 10.0.0.1

------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.2 port 43220 connected with 10.0.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec

You should see about 120 megabytes/sec (960 megabits/sec) on a single gigabit Ethernet connection. If you are lucky enough to have a 10 gigabit connection, congratulations on your 1.2 gigabytes/sec.
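iperf reports megabits per second, while most people think about transfers in megabytes per second, so it helps to convert. A small sketch of the arithmetic (8 bits per byte), using the summary line from the sample output; the helper name mbits_to_mbytes is illustrative.

```shell
#!/bin/sh
# Convert a bandwidth figure in Mbits/sec to megabytes/sec (8 bits per byte).
mbits_to_mbytes() {
  awk -v mbits="$1" 'BEGIN { printf "%.1f\n", mbits / 8 }'
}

# Pull the bandwidth figure out of an iperf summary line and convert it.
line='[ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec'
mbits=$(echo "$line" | awk '{ for (i = 1; i <= NF; i++) if ($i == "Mbits/sec") print $(i - 1) }')
echo "$mbits Mbits/sec = $(mbits_to_mbytes "$mbits") MB/sec"
```

A measured 933 Mbits/sec works out to a little under 117 MB/sec, which is about what a healthy gigabit link delivers once protocol overhead is taken off the top.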



Video card


I don't cover this here, because very few of the computers I build need anything more than the integrated GPU. By the way, integrated GPUs are surprisingly good these days.

But you're a gamer, right? Then you need to boot into Windows and try something like FurMark. And you should test the video card, because video cards, especially gaming ones, are often the most powerful and complex component in the system and draw a huge number of watts. And yes, watch the temperature.

Well, maybe your computer is reliable


I apply everything described above to every computer I build, and it does its job well. This is how I find faulty processors, RAM, disks, and cooling systems before they cause problems in real work. None of this means the computer will never break down, but I have done everything I can to make sure my computers live long lives.

Who knows, maybe luck will be on your side and you will become known as the person whose server had 16 years of uptime before it was decommissioned.

image

All of these tests are just a starting point. Tell us which techniques you use to make sure your computers are stable and reliable. How would you improve the tests I proposed in accordance with your experience?
