Editing CSV-files, so as not to break the data

    HFLabs products in industrial volumes process data: addresses, full names, company details and another car in total. Naturally, testers deal with this data every day: update test cases, study the results of cleaning. Often, customers provide a “live” base for the tester to set up a service for it.

    The first thing we learn about new QA is to keep the data in its original form. All the covenants: "Do no harm." In the article I will tell you how to work with CSV files in Excel and Open Office. Tips will help not to spoil anything, save the information after editing and generally feel more confident.

    The material is basic, the professionals will definitely get bored.

    What are CSV files?

    The CSV format is used to store tables in text files. Data is often packaged in tables, so CSV-files are very popular.

    The CSV file consists of data lines and separators, which denote the boundaries of the

    CSV columns, which is deciphered as comma-separated values ​​- “comma-separated values”. But don't let the name fool you: the semicolons and tabs can serve as column delimiters in a CSV file. This will still be a CSV file.

    CSV has a lot of advantages in front of the same Excel format: text files are as simple as a button, open quickly, read on any device and in any environment without additional tools.

    Because of its advantages, CSV is a super-popular data exchange format, although it is already 40 years old. CSV has been using industrial application programs and upload data from databases to it.

    One problem - a text editor to work with CSV is not enough. Nothing if the table is simple: in the first field the ID is of one length, in the second the date of one format, and in the third some address. But when the fields are of different lengths and there are more than three of them, the torment begins.

    Watch for delimiters and columns — break your eyes.

    Even worse with data analysis — try Notepad to at least add up all the numbers in a column. I'm not talking about beautiful graphics.

    Therefore, CSV-files are analyzed and edited in Excel and analogues: Open Office , LibreOffice and others.

    Veterans who still read it: guys, we know about the analysis directly in the database using SQL, we know about  Tableau and  Talend Open Studio . This article is for beginners, but at a basic level and a small amount of data Excel with analogs is enough.

    How Excel spoils data: from classics

    Everything would be fine, but Excel, barely opening a CSV file, starts its crafty freaks. Without a demand, he changes the data so that they become unusable. And it does this completely unnoticed. Because of this, in due time, we grabbed a bunch of problems.

    Most incidents are due to the fact that a program without demand converts strings with a set of numbers into numbers.

    Rounds. For example, in the source cell, two phones are stored separated by commas without spaces: "5235834,5235835". What will Excel do? Famously turn the numbers into one number and round up to two digits after the decimal point: "5235834,52". So we lose the second phone.

    Leads to exponential form. Excel carefully converts "123456789012345" to the number "1.2E + 15". The original value will lose completely.

    The problem is relevant for long, characters of fifteen, digital lines. For example, KLADR codes (this is the state identifier of the address object: city, street, house).

    Removes leading pros. Excel considers that a plus at the beginning of a line with numbers is a completely extra character. They say, and so it is clear that the number is positive, since it is not worth a minus. Therefore, the leading plus in the number “+74955235834” will be rejected as superfluous - you get “74955235834”. (In reality, the number will suffer even more, but for clarity, I’ll manage plus).

    The loss of a plus is critical, for example, if the data goes to a third-party system, and that when importing it rigidly checks the format.

    Splits three digits.A numeric string longer than three characters of Excel, kind soul, carefully disassemble. For example, "8 495 5235834" will turn into "84 955 235 834".

    Formatting is important at least for telephone numbers: spaces separate country and city codes from the rest of the number and from each other. Excel easily violates the correct division of the phone.

    Removes leading zeros. Excel will turn the string "00523446" into "523446".
    And in the TIN, for example, the first two digits are the region code. For the Republic of Altai, it starts from scratch - “04”. Without zero, the meaning of the number will be distorted, and the TIN format check will not pass at all.

    Changes dates for local settings. Excel will gladly fix the house number "1/2" to "01.fev." Because Windows has suggested that in this form it is more convenient for you to read the dates.

    We defeat data corruption with correct import.

    But seriously, it’s not Excel that’s to blame for the troubles, but the unobvious way to import data into the program.

    By default, Excel applies the General type to the data in the downloaded CSV file. Because of it, the program recognizes numeric strings as numbers. Such an order can be defeated using the built-in import tool.

    I launch the import mechanism which is built in Excel. In the menu this is “Data → Get External Data → From Text”.

    I choose a CSV file with data, a dialog opens. In the dialog, I click on the file type Delimited (with delimiters). Encoding - the one in the file, usually determined by the machine. If the first line of the file is a header, I mark “My Data Has Headers”.

    I turn to the second step of the dialogue.I select the field separator (usually a semicolon - semicolon). Turning off “Treat consecutive delimiters as one”, and “Text qualifier” I put in “{none}”. (Text qualifier is a symbol of the beginning and end of the text. If the separator in CSV is a comma, then the text qualifier is needed to distinguish commas inside the text from commas-separators.)

    In the third step, I choose the format of the fields , for which it was all. For all columns I set the type "Text". By the way, if you click on the first column, hold down the shift and click on the last one, all columns will be highlighted at once. Conveniently.

    Then Excel will ask where to insert the data from CSV - you can simply click “OK”, and the data will appear in the open sheet.

    Before importing you will have to create a new workbook in Excel

    But! If I plan to add data to CSV via Excel, I have to do one more thing.

    After importing, you need to force all cells on the sheet to the “Text” format. Otherwise, new fields will get the same type of “General”.

    • Press Ctrl + A twice, Excel selects all the cells on the sheet;
    • I click the right mouse button;
    • I select Format Cells from the context menu;
    • In the dialog that opens, I select the “Text” data type on the left.

    To select all cells, you need to press Ctrl + A twice. Just two, this is not a joke, try

    After that, if you are lucky, Excel will leave the original data alone. But this is not the strongest guarantee, so after saving we will definitely check the file through a text viewer.

    Alternative: Open Office Calc

    I use Calc to work with CSV files. It is not that it does not consider digital data at all as strings, but at least it does not apply reformatting to them in accordance with the regional settings of Windows. Yes, and import easier.

    Of course, you will need an Open Office (OO) package. During installation, it will offer to reassign MS Office files to itself. I do not recommend it: although OO is quite functional, it does not fully understand the cunning microsoftware formatting of documents.

    But assigning OO as the default program for CSV files is quite reasonable. You can do this after installing the package.

    So, we start importing data from CSV. After double clicking on the file, Open Office displays a dialog.

    Note that in OO you do not need to create a new workbook and force the import to run, all by itself

    1. Encoding - as in the file.
    2. A “delimiter” is a semicolon. Naturally, if it is the separator in the file.
    3. The “text delimiter” is empty (all the same as in Excel).
    4. In the "Fields" section I click into the left-upper square of the table, all columns are highlighted. I specify the type "Text".

    A thing that has spoiled a lot of blood: if by mistake you select several field separators or the wrong text is separated, the file may open correctly, but it may not be saved correctly.

    In addition to Calc, libreOffice is popular in HFLabs, especially under Linux. Both are used more for CSV than Excel.

    Bonus Track: Problems saving from Calc to .xlsx

    If you save data from Calc to Excel format .xlsx, keep in mind - OO sometimes inexplicably losing data on a large scale.

    The white wasteland stretched in the middle in the original CSV file is richly filled with data.

    Therefore, after saving, I open the file again and make sure that the data is in place.

    If something is lost, the treatment is resave from CSV to .xlsx. Or, if Windows is installed, import from CSV to Excel and save from there.

    After re-saving, I must once again check that all the data is in place and there are no extra empty lines.

    If you are interested in working with data, look at our vacancies . HFLabs almost always need analysts, testers, implementation engineers, developers. We will provide data so that will not find it :)

    Also popular now: