Excel spoiled 20% of spreadsheets in genetics research
Fragment of the data table from the scientific work “ Tissue-specific expression and regulation of genes of sexual dimorphism in mice ” (doi: 10.1101 / gr.5217506)
In Microsoft Excel spreadsheets, the default cell format is “General”. Many have come up against the fact that Excel for no reason at all converts the entered value into a date. This is the most common automatic conversion error. To avoid it, you should change the data type from “General” to “Numeric” (for numbers) or “Text” (for text). In the latter case, the entered data will not be converted at all.
Inexperienced users of Excel suffer from automatic conversion. It is surprising that among these inexperienced users is a large number of scientists. Automatically scan publications in PubMed usinga set of bash-scripts showed that about 20% of the tables with data in scientific works on genetics contain errors in converting Excel data in the names of genes!
First, the authors of the study checked all the basic human genetic symbols - and found that Excel with the default settings automatically converts 35 human genetic symbols, turning them into dates: these are the genes FEB1, FEB2, FEB5, FEB6, FEB7, FEB9, FEB10, MARCH1, MARC1 MARCH2 MARC2 MARCH3 MARCH4 MARCH6 MARCH7 MARCH9 MARCH10 MARCH11 , DEC1.
The procedure for selecting the names of genetic symbols established by the rules, for which the maximum length is limited to six characters, and the name must be an abbreviated gene name or abbreviation, so that all septiins are encoded in SEPT, and all recombinant proteins (Membrane-Associated Ring Finger) begin with MAR.
The situation gets worse, given the synonyms. For example, the genetic symbol CRNN ( gene identifier 49860 ) has a synonym for SEP53, which Excel turns into September 1953.
The captin protein with an approved KPTN symbol has a synonym 2E4 ( gene identifier 11133 ), which is converted into the number 20,000.
The interferon-induced transmembrane IFITM1 protein with alias 9-27 ( gene identifier 8519 ) turns into September 27.
The Excel program is a real minefield for geneticists, because the genetic nomenclature includes identifiers such as 201E9, 9130022E09, 3e46, NA, NaN, etc. Thus, Excel converts genetic symbols not only into dates, but also floating-point numbers. For example, 2310009E13 turns into 2.31E + 13.
Having compiled a list of possible errors, the authors of the scientific work downloaded the accompanying spreadsheets from all scientific papers published in 18 scientific journals from 2005 to 2015. For interdisciplinary journals, the sample was limited to scientific work with the word "gene" in the title. The tables were converted to the .tsv format and passed through
regexwith the search for the specified rows. They took the 2004 scriptand added several more date formats there (DD / MM / YY and MM-DD-YY). Scripts launched on Ubuntu v14.04 LTS with the GNU bash shell version 4.3.11.
The script of 2004 , when a similar study of automatic conversion of genetic symbols in scientific papers was conducted.
Errors in scientific papers are contained in spreadsheets, which are published as accompanying materials for scientific works. Almost a thousand scientific papers with errors are listed in the first column of this table .
A total of 35,175 accompanying spreadsheets were analyzed. 7467 gene lists found in 3597 scientific papers. Excel conversion errors were found in 987 accompanying files for 704 published scientific articles. That is, for this sample, the percentage of scientific papers in which the accompanying Excel spreadsheets contain conversion errors is 19.6%.
For some reason, a positive correlation was found between the influence of the journal (journal impact factor, JIF) and the percentage of scientific papers with errors in the accompanying files (Spearman rho = 0.52, two-sided p value = 0.03).
The analysis also showed that the number of scientific papers with errors has increased by about 15% per year over the past five years, overtaking the overall growth in the number of scientific publications (3.8%).
The authors propose to contact the developers of Microsoft. Anyone can send them a message via Excel's Suggestion Box or via Twitter (Excel developers have a Twitter account ).
The problem with the automatic conversion of scientific symbols in Excel spreadsheets has been raised repeatedly . Back in 2004, authors of scientific papers warned that gene identifiers are automatically distorted in Excel. Unfortunately, over the past decade the problem has not been resolved. On the contrary, even more scientific works on bioinformatics with errors began to appear.
To some extent, the authors of scientific works themselves are to blame for the errors. They do not know the Excel interface well enough, cannot change the data type in the cell format, do not carefully check the tables with accompanying data. The reviewers who accept such works for publication are also to blame, without also checking the tables. The compilers of the table with errors specifically noted several cases when conversion errors are found in the first lines of the accompanying table, which indicates insufficient attention to the authors of scientific papers and reviewers.
In general, in the scientific works there are even more ridiculous mistakes. For example, a scientific work with a typo was published right in the headline in a journal on nuclear physics in 1991.. Instead of the phrase "hadron collider" ("hadron collider"), the phrase "hardon collider" went into print, which has a completely different meaning.
Computer interface developers are well aware that the absolute majority of users do not change the default settings in programs. If Excel is set to the default data type "General" - then it will be used, so that the appearance of errors with automatic conversion in scientific works on bioinformatics and genetics is predictable.
The authors of the study note that there is no way to permanently disable the automatic conversion of data into Excel and other LibreOffice Calc and OpenOffice Calc spreadsheets. Each time you need to set the data type manually. The exception is the Google Sheets table, where data is not corrupted. Moreover, when reopening Google Sheets tables in Excel, LibreOffice Calc or OpenOffice Calc, genetic symbols like SEPT1 and MARCH1 are protected from being converted to dates.