The 10:1 rule of writing and programming

Published on August 29, 2018

Original author: Yevgeniy Brikman
  • Translation
In this article, the author analyzes how much work goes into writing a book or a codebase and arrives at an interesting pattern, one that can be used when planning project timelines.


Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
- Douglas Hofstadter, Gödel, Escher, Bach

Writing prose and writing code have a lot in common. Perhaps the most noticeable similarity is that neither writers nor programmers can finish their work on time. Writers are notorious for missing deadlines. Programmers have earned a reputation for estimates that turn out to be wildly off. The question is: why?
 
Today I had an idea for how to answer that question, and what I found surprised me.

Analyzing my books


I wrote both of my books, Hello, Startup and Terraform: Up & Running, in the Atlas book-authoring environment, which manages all content with Git. This means that every line of text, every edit, and every change was recorded in the Git commit log.

Let's see how much effort went into writing the two books.
 
Hello, Startup

Let's start with my first book, Hello, Startup. It has 602 pages and about 190 thousand words. I ran cloc on the Hello, Startup Git repository and got the following results (fractional parts are dropped for simplicity):



The 602 pages contain 26,571 lines of text. The lion's share is written in AsciiDoc, a markup language similar to Markdown that Atlas uses for almost all content. Atlas uses HTML and CSS to define the book's layout and structure. The rest is in programming languages (Java, Ruby, Python, and others) used for the code examples that accompany the topics discussed in the book.
 
But 602 pages and 26,571 lines are just the final result. They do not reflect roughly 10 months of writing, editing, proofreading, stylistic adjustments, research, note-taking, and other work that went into publishing the book. So, to get a more useful picture, I used git-quick-stats to analyze the book's entire commit history.
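The article does not show which Git commands git-quick-stats runs under the hood, so the following is only a rough sketch of how similar totals could be gathered. It assumes that summing the per-file counts from `git log --numstat` over the whole history is a reasonable stand-in; the function names `parse_numstat` and `repo_churn` are hypothetical.

```python
import subprocess
from typing import Tuple


def parse_numstat(numstat_output: str) -> Tuple[int, int]:
    """Sum the added and removed line counts in `git log --numstat` output."""
    added = removed = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        # Rows look like "<added>\t<removed>\t<path>".
        # Binary files report "-" instead of numbers and are skipped here.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        added += int(parts[0])
        removed += int(parts[1])
    return added, removed


def repo_churn(repo_path: str = ".") -> Tuple[int, int]:
    """Total lines added and removed across the full history of a repository."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--pretty=tformat:"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_numstat(result.stdout)


# Example (needs a real repository on disk):
#   added, removed = repo_churn("/path/to/repo")
#   print(added, removed, added + removed)
```

Keeping the parsing separate from the `git` call makes the counting logic easy to check on a small hand-written sample without a repository.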



So I added 163,756 lines and removed 131,425, for a total of 295,181 lines of churn. In other words, I wrote or deleted 295,181 lines, of which 26,571 remained in the end. That is a ratio of slightly more than 10:1. For every line that was published, I had to write 10 others first!
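The arithmetic behind that ratio can be written out as a small sketch. The figures are the ones quoted above; the helper name `churn_ratio` is just an illustrative choice.

```python
def churn_ratio(added: int, removed: int, final_lines: int) -> float:
    """Lines written or deleted per line that survives in the final result."""
    return (added + removed) / final_lines


# Figures for Hello, Startup, as quoted in the text
added, removed, final_lines = 163_756, 131_425, 26_571

churn = added + removed
ratio = churn_ratio(added, removed, final_lines)
print(f"churn = {churn:,} lines, ratio = {ratio:.1f}:1")
```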
 
I admit that counting lines added to and deleted from Git is not an ideal metric of the editing process. If anything, though, it underestimates the work done, because a significant part of the process never made it into the Git commit log. For example, the first few chapters were written in Google Docs before I moved to Atlas, and many edits were made on my computer without ever being committed.
 
Even though this data is far from perfect, I believe the overall ratio of “raw material” to published text is about 10:1.

Terraform: Up & Running

Let's check whether this ratio holds for my second book, Terraform: Up & Running, which has 206 pages and about 52 thousand words.

Simplified cloc output:



The 206 pages consist of 8,410 lines of text. Again, most of the text is written in AsciiDoc, although this book has noticeably more code samples, written primarily in HCL, Terraform's main language. There is also a lot of Markdown, which I used to document the HCL examples.
 
Let's use git-quick-stats to check this book's edit history:



Over almost five months I added 32,209 lines and deleted 22,402, for a total of 54,611 lines of churn. The estimate of editing effort is even less accurate for this book, because it began as a series of blog posts that were substantially reworked before they moved into Atlas and Git. Those blog posts account for at least half of the book, so it is reasonable to increase the churn figure by 50%. That gives 54,611 * 1.5 ≈ 81,916 lines of churn for 8,410 final lines.
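As a worked version of this adjustment, here is a minimal sketch. The 50% uplift is the author's own rough correction for the blog-post material, and the variable names are illustrative.

```python
# Figures for Terraform: Up & Running, as quoted in the text
added, removed, final_lines = 32_209, 22_402, 8_410

recorded_churn = added + removed          # what the Git history shows
uplift = 1.5                              # author's +50% correction for work done outside Git
adjusted_churn = int(recorded_churn * uplift)  # fractional part dropped, as elsewhere in the article

ratio = adjusted_churn / final_lines
print(f"recorded = {recorded_churn:,}, adjusted = {adjusted_churn:,}, ratio = {ratio:.1f}:1")
```

Without the uplift, the recorded history alone gives a ratio of about 6.5:1, so the conclusion for this book depends on how much of the work is assumed to have happened outside Git.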
 
And again the ratio is about 10:1!

No wonder writers miss their deadlines. If the schedule calls for a 250-page book, in practice we end up writing 2,500 pages along the way.

What about programming?


How do things look in software development? I decided to check several open source Git repositories at different levels of maturity, from a few months old to 23 years old.
 
terraform-aws-couchbase (2018)
 
terraform-aws-couchbase is a set of modules for deploying Couchbase on AWS, open sourced in 2018.

Simplified cloc output:



And here are the git-quick-stats results:



That is 37,693 lines of churn for 7,481 final lines of code, a ratio of 5:1. Even in a repository less than 5 months old, every line has already been rewritten five times! No wonder software estimation is hard: we do not even realize that to end up with 7.5 thousand lines of final code, we actually have to write more than 35 thousand.
 
Let's see how things are in older products.
 
Terratest (2016)
 
Terratest is an open source library, created in 2016, for testing infrastructure code.

Simplified cloc output:



git-quick-stats results:


 
That is 49,126 lines of churn that turned into 6,140 final lines of code. For this two-year-old repository, the ratio is 8:1. But Terratest is still quite young, so let's look at older repositories.
 
Terraform (2014)
 
Terraform is an open source tool, created in 2014, for managing infrastructure as code.

Simplified cloc output:


 
git-quick-stats results:



That is 12,945,966 lines of churn that resulted in 1,371,718 final lines, a ratio of 9:1. Terraform has been around for almost 4 years, but it has not yet reached a 1.0 release, so even at this ratio its codebase is not yet mature. Let's look further into the past.
 
Express.js (2010)
 
Express is a popular open source JavaScript web framework released in 2010.

Simplified cloc output:



git-quick-stats results:



That is 224,211 lines of churn reduced to 15,325 final lines, a ratio of 14:1. Express is about 8 years old, and its latest releases are in the 4.x series. It is considered the most popular and battle-tested web framework for Node.js.
 
It seems that once the ratio reaches 10:1, we can say with confidence that the codebase is “mature”. Let's see what happens if we go even further into the past.
 
jQuery (2006)
 
jQuery is a popular open source JavaScript library released in 2006.

Simplified cloc output:



git-quick-stats results:



That is 730,146 lines of churn for 47,559 final lines: a ratio of 15:1 for a repository that is nearly twelve years old.
 
Let's go back another ten years.
 
MySQL (1995)
 
MySQL is a popular open source relational database created in 1995.

Simplified cloc output:



git-quick-stats results:



That is 58,562,999 lines of churn, 3,662,869 final lines of code, and a ratio of 16:1 for a repository that is almost twenty-three years old. Wow! Each line of MySQL code has been rewritten 16 times.

Findings


The summarized results for my books are as follows:

Title | Lines of churn | Final lines | Ratio
Hello, Startup | 295,181 | 26,571 | 11:1
Terraform: Up & Running | 81,916 | 8,410 | 10:1

Here is a summary table for the various programming projects:

Title | Year released | Lines of churn | Final lines | Ratio
terraform-aws-couchbase | 2018 | 37,693 | 7,481 | 5:1
Terratest | 2016 | 49,126 | 6,140 | 8:1
Terraform | 2014 | 12,945,966 | 1,371,718 | 9:1
Express | 2010 | 224,211 | 15,325 | 14:1
jQuery | 2006 | 730,146 | 47,559 | 15:1
MySQL | 1995 | 58,562,999 | 3,662,869 | 16:1
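The ratios for the programming projects can be recomputed from the two line counts. The sketch below copies the figures from the table and prints each ratio to one decimal place, whereas the table reports whole numbers. The 10:1 threshold for "mature" is the article's tentative rule of thumb, and the names `PROJECTS`, `ratios`, and `mature` are illustrative.

```python
# (title, year released, lines of churn, final lines), copied from the table
PROJECTS = [
    ("terraform-aws-couchbase", 2018, 37_693, 7_481),
    ("Terratest", 2016, 49_126, 6_140),
    ("Terraform", 2014, 12_945_966, 1_371_718),
    ("Express", 2010, 224_211, 15_325),
    ("jQuery", 2006, 730_146, 47_559),
    ("MySQL", 1995, 58_562_999, 3_662_869),
]


def ratios(projects):
    """Map each project title to its churn-to-final-lines ratio."""
    return {title: churn / final for title, _year, churn, final in projects}


def mature(projects, threshold=10.0):
    """Titles whose ratio meets the (assumed) 10:1 maturity threshold."""
    return [title for title, r in ratios(projects).items() if r >= threshold]


for title, r in ratios(PROJECTS).items():
    print(f"{title:<26}{r:5.1f}:1")
print("At or above 10:1:", ", ".join(mature(PROJECTS)))
```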

What do all these numbers mean?

The 10:1 rule of writing and programming

Given that my data set is limited, I can only draw some preliminary conclusions:

  1. The ratio of "raw material" to "final product" for a book is about 10:1. Keep this figure in mind when you discuss the delivery schedule with your editor. If you need to write a 300-page book, you will actually have to write about 3 thousand pages.
  2. A similar rule holds for mature, non-trivial software: the ratio of code churn to final code is at least 10:1. Keep this in mind when a manager or client asks you for a time estimate. A 10-thousand-line application will require you to write about 100 thousand lines.

These findings can be summarized as the 10:1 rule of writing and programming:
Writing good software or good prose requires each line to be rewritten about 10 times on average.
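Applied to estimation, the rule is a single multiplication. The sketch below reproduces the two examples from the list above; `estimated_churn` is a hypothetical helper, and the default ratio of 10 is the article's rule of thumb rather than a measured constant.

```python
def estimated_churn(final_size: float, ratio: float = 10.0) -> float:
    """Rough amount of material that must be written to end up with final_size."""
    return final_size * ratio


print(estimated_churn(300))      # pages written for a 300-page book
print(estimated_churn(10_000))   # lines written for a 10,000-line application
```

Passing a different `ratio` makes it easy to see how sensitive an estimate is to the assumed figure.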

Next steps


Of course, lines of code and lines of text are not an ideal measure. But I suspect that, with enough data, it would be possible to determine how universal the 10:1 rule is and how useful it is for estimating project timelines.
 
Some questions that I would like to answer:

  • Can the ratio of code churn to final lines of code serve as a quick metric for the maturity of a piece of software? For example, should we only trust databases, programming languages, or operating systems with key infrastructure tasks once their ratio has reached at least 10:1?
  • Does the amount of churn depend on the type of software? For example, Bill Scott found that at Netflix only about 10% of the user interface code survives for a year, and the remaining 90% has been completely rewritten by then. What is the rate of code replacement for backends, databases, command-line utilities, and other types of programs?
  • What percentage of the code is changed after the initial release? That is, what percentage of the work can be considered "software maintenance"?

If you are a book author and can do a similar analysis, I would be glad to hear about your results. And if someone has the time to automate such an analysis, it would be great to learn what ratios turn up in various open source projects.

August 13 Update

Discussions of this post on Hacker News and Reddit's r/programming brought up two more interesting points:
 
  1. Apparently, a similar 10:1 rule holds for movies, journalism, music, and photography! Who would have thought?
  2. Many readers commented that changing even a single character counts as a line inserted and a line deleted in Git, so a figure of 100 thousand changed lines does not mean that every line was substantially rewritten.
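The second point is easy to see with any line-based diff. The sketch below uses Python's `difflib` rather than Git itself, on the assumption that its line-level accounting is close enough for illustration; `line_churn` is a hypothetical helper.

```python
import difflib
from typing import Tuple


def line_churn(before: str, after: str) -> Tuple[int, int]:
    """Count lines a line-based diff reports as added and removed."""
    added = removed = 0
    for entry in difflib.ndiff(before.splitlines(), after.splitlines()):
        # ndiff prefixes: "+ " added, "- " removed, "  " unchanged, "? " hint
        if entry.startswith("+ "):
            added += 1
        elif entry.startswith("- "):
            removed += 1
    return added, removed


before = "one\ntwo\nthree\n"
after = "one\ntwo!\nthree\n"   # a single character added to the second line

added, removed = line_churn(before, after)
print(f"added={added} removed={removed} churn={added + removed}")
```

A one-character edit therefore contributes two lines to the churn total, which is the distortion the commenters describe.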

That last remark is valid, but, as I wrote above, my data also fails to account for other kinds of changes:

  1. I do not commit after every single line. I might change a line ten times but make only one commit.
  2. This is even more true of programming. While testing code, I might change one line 50 times and still make only one commit.
  3. Many rounds of writing and editing happened outside of Git (some chapters were written in Google Docs or Medium, and stylistic edits were made in PDF).

I think these factors compensate for the way Git counts inserted and deleted lines. Of course, my estimates may be off, and the actual ratio may be closer to 8:1 or 12:1. But overall the difference is not large, and 10:1 is easier to remember.

August 14 Update

GitHub user Decagon has created a repository called hofs-churn with a bash script that makes it easy to calculate the churn in your own repositories. He also used it to analyze a number of repositories, including React.js, Vue, Angular, RxJava, and many others, and the results were quite interesting.
