adaptun October 20, 2013 at 23:56

Thirty Years Bug Fight

Transfer

The first edition spoke of a twenty-year bug. In fact, he is 30 years old. Thanks Sidnekin .

Today, reading some data, my program processed 36,916 possible dates. Two of the 36,916 failed the test. I did not attach any importance to this, because these dates were from the data provided by the client, and such data is often surprising. However, looking at the source data, it turned out that the verification did not pass on January 1, 2011 and January 1, 2007. There was a bug in the program that I wrote a month ago. But it turned out that this bug is 30 years old.

To anyone who doesn't really understand the software ecosystem, the writing below will seem strange, but that makes sense. Due to a decision made a long time ago to bring money to one company, my $ client spent money on paying me to fix a bug introduced by one company by accident and another specially. To explain this, I will have to talk about a third company that added a feature that ultimately became a bug, and a few other facts that influenced an incomprehensible bug that I fixed today.

In the good old days, Apple computers sometimes reset the date to January 1, 1904. The reason is simple: in those days , Apple computers used battery-powered system time.to keep track of date and time. What happened when the battery was low? Apple computers counted their dates as the number of seconds elapsed since the beginning of the era . The era in this case is just the start date of the countdown. And for Apple computers, that date was January 1, 1904. When the battery went down, this number became a new date. But why did this really happen?

In those days, Apple used 32 bits to store the number of seconds from the start date. One bit can contain two values: 0 or 1. Two bits - four values: 00, 01, 10, 11. Three bits - eight values: 000, 001, 010, 011, 100, 101, 110, 111. And so on. . How many values are contained in 32 bits? 32 bits contain 2 ³²(or 4'294'967'296) values. For Apple dates, this was approximately 136 years old, so old Apple computers cannot work with dates after 2040, and if the battery in the system clock went down, the date was again 0 seconds after the era and you had to manually set the current number each time you turn on the computer (before buying a new battery).

However, Apple’s decision to store dates as the number of seconds after an era meant that it was not possible to store dates untilera. As we will see, this had far-reaching consequences. This was a feature, not a bug, added by Apple. Among other things, this meant that the Macintosh operating system was immune to the Y2K problem (although many Mac programs did not, because they had their own dates to circumvent the limitations of Macs).

Moving on, we meet Lotus 1-2-3 , IBM’s killer application that helped launch the PC revolution, although VisiCalc is actuallyon Apple gave rise to personal computers. We can say that, if it were not for 1-2-3, then the PCs most likely did not leave their niche and computer technology developed in a completely different way. However, Lotus 1-2-3 incorrectly considered the 1900th leap year. When Microsoft released Multiplan, its first spreadsheet program, it was unable to conquer the market . So when developing Excel, it was decided not only to copy the column naming rules for Lotus 1-2-3, but also to make the products fully compatible with bugs, including intentional handling from 1900 as a leap year, the problem is still relevant since. So for 1-2-3 this was a bug, but for Excel - a feature that guarantees all 1-2-3 users the ability to import spreadsheets into Excel without any difference in dates, even if they were incorrect.

Over time, Microsoft decided to release Excel for the Apple Macintosh, but there was a problem. As already mentioned, the Macintosh did not understand the date until January 1, 1904, and for Excel, the era was January 1, 1900. So Excel tweaked to recognize the era and store the date relative to the corresponding era. In a Microsoft support article, this problem is described quite clearly . And that leads to my bug.

My current $ client receives spreadsheets from many of his clients. These tables could be made on Windows, or could be made on a Mac. As a result, the era in these tables may be January 1, 1900 or January 1, 1904. How to find out which one? The file format in Excel stores such information , but the parser that I use does not provide it and believes that you yourself know what era you are dealing with in this file. I probably should have spent a lot of time trying to figure out how to read the Excel binary format and send the patch to the parser developer, but I had other things to do with my $ client and I sketched a heuristic to determine which era this file belongs to. She was simple.

In Excel, it can be stored, for example, on July 5, 1998, but this number can be formatted as "07-05-98" (useless American format), "July 5, 98", "July 5, 1998", "5-JUL- 98 ”and many more useless options (ironically, the only format my version of Excel does not offer is ISO 8601). Inside, the unformatted value is either 35981 for the era of 1900, or 34519 for the era of 1904 (these numbers correspond to the number of days that have passed since the era). I used a persistent parser to extract a year from a formatted date. and then an Excel parser to extract the year from the unformatted value. If they differed by four, then the dates in the file were considered from 1904.

Why not use formatted dates right away? Because July 5, 1998 mayformatted as July 1998, losing a day. We get spreadsheets from so many companies and they create them in so many different ways that they expect from us (from me in this case) the ability to understand. Excel understands what's what, so I must!

It was then that 39082 and gave me a push. Remember how Lotus 1-2-3 considered the 1900th leap year, and how honestly copied it to Excel? Since this adds one day to 1900, many date calculation functions may be wrong for one day. This means that 39082 may be January 1, 2011 (on Mac), or maybebe December 31, 2006 (on Windows). It's great of course that my parser extracts 2011 from a formatted value. But since the Excel parser does not know from which era the dates in this file are calculated, it considers by default that it is from 1900, returns the year 2006, my program sees that the difference is five years old, considers that this is an error, writes it to the log and returns unformatted value.

To get around this, I came up with the following (pseudocode):

difference = formatted_year - parsed_year
if ( 0 == difference )
    assume 1900 date system
if ( 4 == difference )
    assume 1904 date system
if ( 5 == difference and parsed month is December and parsed day is 31 )
    assume 1900 date system

Now all 36'916 dates are parsed correctly.

Note : for fun, if you have a Mac with Excel, you can try to enter a date before 1904 and format it in a different format. You can enter it, but you cannot format it, because Excel will consider it to be plain text. At the same time, for Microsoft Excel, all days of the week until March 1, 1900 are incorrect due to a bug in the program released in January 1983.

Update : I was told that Spreadseet :: ParseExcel understands the 1904 flag . Unfortunately, I use Spreadsheet :: ParseExcel :: Stream, which does not understand. Even on huge machines, we don’t have enough memory when using a standard parser, so we use streaming. My attempts to get around this restriction came across another bug.

Update 2 : Turns out Microsoft first released Excel for Macintosh .

Update 3 : According to Joel Spolsky , a bug in Lotus 1-2-3 could be a deliberate attempt to simplify the program. I had previously hinted that Lotus did it intentionally, but since I'm not 100% sure, I did not write about it.

Tags:

Thirty Years Bug Fight

Also popular now: