Becoming a contributor to PostgreSQL
- Tutorial
In this article I would like to talk about how the PostgreSQL development process looks like through the eyes of one of the contributors to this same PostgreSQL. I started developing this DBMS in December 2015, when I got a job at Postgres Professional. That is, not so long ago. And that means that memories of moments that at first seemed to me not quite obvious are still fresh. I would like to outline them so that it would be easier for new people coming to our team, as well as for all those who want to try themselves as a developer of an open relational DBMS. I will talk about how the PostgreSQL development process looks, what tools I use in my daily work, how patches should be made out, and so on. Interested persons please proceed to cat.
The question that excites the minds of millions is which IDE or text editor to use? :) Practice shows that you can develop PostgreSQL in anything. Some of my colleagues use Sublime Text, some prefer Vim, some Emacs, there are also KDevelop and Visual Studio Code users. I personally used CLion quite successfully at first, but now I switched to Vim + ctags. In general, the main thing is that the editor has syntax highlighting, a transition to the definition, maybe some simple things like renaming variables and checking spelling. You will hardly need some sophisticated automatic refactoring. The fact is that a patch with the result of such refactoring is unlikely to be so easily accepted.
The second equally exciting question is which OS or Linux distribution to choose? At our company, many developers use Ubuntu. There are also MacOS users. No one seems to be sitting under Windows - for development, a virtual machine is usually launched for this platform. There is one Arch Linux user. I personally used Ubuntu for a long time, but recently I hit my head and switched to FreeBSD. In general, any * nix system should work.
PostgreSQL is successfully compiled by GCC, CLang and Visual Studio, possibly by some other compilers (Intel C ++ Compiler?). Moreover, the community is committed to maintaining code compatibility with all of these compilers. So you can use any compiler. You can also use your favorite debugger, be it GDB, LLDB, something built into your IDE or some kind of WinDbg.
PostgreSQL code lives in Git . In addition to the official repository, there is still a mirror on GitHub , but this is purely a mirror. It’s pointless to open up issues there and send pullrequests there. During the development of the patch, nobody cares what version control system you use. But the patch is usually sent as the output of the git diff command.
In a first approximation, it seems, I forgot nothing. From time to time I also use perf, tcpdump, strace / truss, dtrace, rr, lcov, various static analyzers and other tools. But the need for them arises rather as an exception. The main development tools are a text editor, git, compiler, debugger and, of course, the brain. Yes, and another email client. But I will talk about this below.
PostgreSQL currently uses Autotools. Autotools alone is not a very nice thing. Also, not designed for Windows. Therefore, to build PostgreSQL, a special set of Perl scripts is provided for this platform, which is somewhat crippled. My colleague Yuri Zhuravlev is trying to push a patch that translates PostgreSQL to CMake . But everything is not easy there, since the current PostgreSQL extension system is heavily tied to Autotools.
All projects using Autotools are collected in approximately the same way:
For fast local deployment of PostgreSQL, I use this set of scripts , many of which Stas Kelvich shared with me .
The subtle point that all novice contributors fly into PostgreSQL without exception is that if you make a change to the .h file, be sure to run make clean. By default, when changing an .h file, the .c-dependent files are not recompiled. If you do not know this, you can observe the widest range of entertaining magical effects :)
Often a person searching for an idea for a patch is sent to the TODO list . In my opinion, this is pretty bad advice, for a number of reasons. Firstly, this list is not always up to date. Secondly, there are points about which no one knows exactly how to make them correctly, and therefore it was decided to simply add an item to TODO, maybe someday insight will come. Finally, thirdly, most of the tasks on this list are quite complex. I would advise starting with something simpler.
The easiest way is to look for typos in the code and documentation. There really are a lot of them. This happens for the reason that before the merge of the proposed patches, committers often rewrite them slightly, just a little bit. The result is a completely new patch that no one has read, hence the typos. You can just keep track of new commits and send 1-2 patches every week. Correcting comments on the code is difficult to break something, so your patch will be gladly accepted.
It happens that some pieces of code can be refactored a little. This is also a fairly simple change. We make the code more beautiful and correct, run the tests, if nothing breaks - we offer a patch.
Bug fixes. To the pgsql-bugs @ mailing listbugs will be regularly reported (usually minor). Usually fixing a bug is a freebie. We are writing a test that reproduces a bug. We rewrite the code so that the test no longer crashes. Helmet Patch.
Optimization. Also a freebie - the code should do the same, only faster. We write a benchmark that reproduces the performance problem, rewrite the code so that it works faster, helmet patch.
Improved documentation and comments. For example, you are trying to understand how the code works, but you do not understand. Looks like you found a place where code comments can be improved!
You can often find what to patch by building a project with some unusual compiler (for example, a very old or very new version of GCC), on an unusual platform (ARM, PowerPC, ...) under an unusual operating system (NetBSD, OpenIndiana). Tests do not usually spill over, but a couple of warnings may slip through during compilation. It often helps to run some kind of static analyzer through the code.
If you have no idea for your patch, you can essentiallyHelp the project by making a code review and / or testing someone else’s patch. Programmers, as a rule, really like to write code, but they don’t really like to review and test it. Therefore, there are really not enough reviewers in the PostgreSQL community. By the way, being a reviewer is quite simple. You need to make sure that the patch is applied, the code is then compiled and passes the tests, and also that the task that the author set for himself is solved. If it is not clear to you how to check this, perhaps the author has not described it well enough. This is an occasion to ask the author a question in the corresponding thread and put the patch in the waiting on author state. And if at the same time you are also able to read the code and give adequate advice on renaming variables and splitting procedures into several, then you simply have no price! About code review,
All communication between PostgreSQL developers takes place on the pgsql-hackers @ mailing list . It also makes sense to subscribe to pgsql-committers @ . Notifications about the latest merges arrive to the master there, sometimes a discussion of a specific commit is tied up. The traffic in these two mailing lists is not that big; it's not LKML for you. It is quite possible to read them from your main mailbox without any filters (although I do not read all threads in a row). I personally receive them all on a working e-mail.
It may also make sense to subscribe to pgsql-general @ (general questions) and the already mentioned pgsql-bugs @ (bug reports). But, strictly speaking, this is not required for development.
Regarding the choice of email client. In principle, anyone will do. Many use Thunderbird. I sat on Claws Mail for a long time, and now I crawled on Mutt . I saw one of my colleagues use GMail.
It’s good practice not to send HTML emails. The text of the letter in width should be limited to 72 characters. Of course, only English can be used. Using attachments, unlike the same LKML, is not prohibited. It’s better to upload heavy attachments somewhere, rather than send them directly to the mailing list.
In the PostgreSQL community, as far as I know, there is no code of conduct. But this does not negate the need to be polite, not to use sarcasm, never to get personal, and so on. Emails, especially in English, often come out somewhat dry. Therefore, it would be a good idea to use more words in the text like please, thank you, and so on. I personally try to start any letter with words like “Thank you everyone for great comments!” and end with something like “As always, please don't hesitate to share any thoughts on this topic!”. Try it and you will be surprised how friendly the community will be towards you.
Perhaps it would be worth saying a few words about the main actors in the mailing list, such as Tom Lane, Simon Riggs, Robert Haas, Andres Freund, Alvaro Herrera, Bruce Momjian, and others. But the problem is that there are quite a few actors, and it is difficult to say in advance who will be interested in your patch. Therefore, I can only say that it would be a good idea for the first time to read the signatures of the people who answer you, look at which domains their e-mails are on, look for their names in git log or in Google in the end.
By the way, some people from the PostgreSQL community have blogs (which can be found thanks to Google), which are not without reason to subscribe. I personally currently subscribe to the following PostgreSQL-related RSS feeds:
Note that the list includes PostgreSQL Planet , which aggregates many blogs that are not on the list.
In the general case, before starting work on some big patch, it makes sense to write in pgsql-hackers @ a proposal letter describing what you want to do, how, and why. It may turn out that nobody really needs it. Or vice versa, that it is so necessary that over the past 5 years several solutions have been proposed that you do not know about, and which you should first familiarize yourself with. Well, or you can just give a couple of implementation tips, where to look, which boundary cases to consider, and so on. PostgreSQL developers are busy people who have enough of their own business, so don’t be afraid that someone will steal your brilliant idea. Rather, they will tell you that this is unlikely to work, and provide an opportunity to prove the opposite.
Regarding the design of the code. PostgreSQL uses ANSI C, so forget about C11, C ++ or Rust. The pgindent utility is used to format the code. You will find instructions on how to build it in the PostgreSQL source, in the src / tools / pgindent / README file . Before creating a patch, always run the code through pgindent, otherwise no one will even look at it. (But make sure pgindent doesn’t make changes where you didn’t change anything! In this case, it may be easier to format the code manually.) Otherwise, there are no particularly strict rules. Just look at how the code is designed in the area of the place where you stick, and try to write in the same way.
When the patch is ready, send it to pgsql-hackers @, specifying the [PATCH] label and a short description in the subject. In the body of the letter, tell us what problem the patch solves, and how it does it. Read the newsletter archive to see how it usually looks. If the patch is small, for example, corrects a couple of typos, it can be accepted immediately and without any special questions. In more complex cases, the patch must be sent to the nearest commitfest :
Commitfest is the local name for a sprint. One commitfest lasts one month. For example, the September commitfest is now open. All new patches are added to it. In early September, consideration of patches from the September commitfest will begin, and all new patches will be added to November (there is no commitfest in October, bugs are fixed for a month, and so on). This continues until March, with only 4 commitfests - in September, November, January and March. Then comes the codefreeze, bugs are fixed, alpha and beta releases are formed.
Patches at commitfest are in different states. They all have speaking names. Needs review means a patch requires a review. Waiting on author means that some actions are required by the author of the patch. Ready for committer means that the patch has passed the codeview and there are no more questions about it. One of the committers can get acquainted with it and either hold it back or return it to the author for revision. Well and so on.
Be patient. If nobody responds to your patch, this does not mean that nobody needs it. Just now everyone is busy with other patches. If your patch is in the commitfest and does not hang in Waiting on author, no one will forget about it, do not worry. If the reviewer or the committer answered you, carefully read the answer, make the appropriate changes to the patch and send its new version. To argue with reviewers or committers, in my personal experience, this is a very thankless task. Faster to fix the code and send the corrected patch. Moreover, often then you realize that the reviewer or committer was generally right, and you were not. However, some of my colleagues have a different experience, and on the contrary, they believe that it is always necessary to argue.
While you are waiting for a response to your patch, it is a good idea to browse someone else’s patch yourself. There is such an unwritten rule in the PostgreSQL community - if you send a patch for a patch and you don’t review anyone yourself, then they will very quickly stop reviewing you too. Moreover, the faster other patches on the commitfest are accepted or rejected, the faster the queue will reach yours, the more time you will have to make changes before closing the commitfest.
Additional materials for self-study:
That is all I wanted to talk about today. If you have any questions, I will be happy to answer them in the comments.
Continued: Contribute to PostgreSQL: Examples of Real Patches, Part 1 of N
Set of tools
The question that excites the minds of millions is which IDE or text editor to use? :) Practice shows that you can develop PostgreSQL in anything. Some of my colleagues use Sublime Text, some prefer Vim, some Emacs, there are also KDevelop and Visual Studio Code users. I personally used CLion quite successfully at first, but now I switched to Vim + ctags. In general, the main thing is that the editor has syntax highlighting, a transition to the definition, maybe some simple things like renaming variables and checking spelling. You will hardly need some sophisticated automatic refactoring. The fact is that a patch with the result of such refactoring is unlikely to be so easily accepted.
The second equally exciting question is which OS or Linux distribution to choose? At our company, many developers use Ubuntu. There are also MacOS users. No one seems to be sitting under Windows - for development, a virtual machine is usually launched for this platform. There is one Arch Linux user. I personally used Ubuntu for a long time, but recently I hit my head and switched to FreeBSD. In general, any * nix system should work.
PostgreSQL is successfully compiled by GCC, CLang and Visual Studio, possibly by some other compilers (Intel C ++ Compiler?). Moreover, the community is committed to maintaining code compatibility with all of these compilers. So you can use any compiler. You can also use your favorite debugger, be it GDB, LLDB, something built into your IDE or some kind of WinDbg.
PostgreSQL code lives in Git . In addition to the official repository, there is still a mirror on GitHub , but this is purely a mirror. It’s pointless to open up issues there and send pullrequests there. During the development of the patch, nobody cares what version control system you use. But the patch is usually sent as the output of the git diff command.
In a first approximation, it seems, I forgot nothing. From time to time I also use perf, tcpdump, strace / truss, dtrace, rr, lcov, various static analyzers and other tools. But the need for them arises rather as an exception. The main development tools are a text editor, git, compiler, debugger and, of course, the brain. Yes, and another email client. But I will talk about this below.
Assembly, test run and so on
PostgreSQL currently uses Autotools. Autotools alone is not a very nice thing. Also, not designed for Windows. Therefore, to build PostgreSQL, a special set of Perl scripts is provided for this platform, which is somewhat crippled. My colleague Yuri Zhuravlev is trying to push a patch that translates PostgreSQL to CMake . But everything is not easy there, since the current PostgreSQL extension system is heavily tied to Autotools.
All projects using Autotools are collected in approximately the same way:
./configure --prefix=...
make -j4 -s
make check
make install
For fast local deployment of PostgreSQL, I use this set of scripts , many of which Stas Kelvich shared with me .
The subtle point that all novice contributors fly into PostgreSQL without exception is that if you make a change to the .h file, be sure to run make clean. By default, when changing an .h file, the .c-dependent files are not recompiled. If you do not know this, you can observe the widest range of entertaining magical effects :)
The idea for the first patch, and how else can help the project
Often a person searching for an idea for a patch is sent to the TODO list . In my opinion, this is pretty bad advice, for a number of reasons. Firstly, this list is not always up to date. Secondly, there are points about which no one knows exactly how to make them correctly, and therefore it was decided to simply add an item to TODO, maybe someday insight will come. Finally, thirdly, most of the tasks on this list are quite complex. I would advise starting with something simpler.
The easiest way is to look for typos in the code and documentation. There really are a lot of them. This happens for the reason that before the merge of the proposed patches, committers often rewrite them slightly, just a little bit. The result is a completely new patch that no one has read, hence the typos. You can just keep track of new commits and send 1-2 patches every week. Correcting comments on the code is difficult to break something, so your patch will be gladly accepted.
It happens that some pieces of code can be refactored a little. This is also a fairly simple change. We make the code more beautiful and correct, run the tests, if nothing breaks - we offer a patch.
Bug fixes. To the pgsql-bugs @ mailing listbugs will be regularly reported (usually minor). Usually fixing a bug is a freebie. We are writing a test that reproduces a bug. We rewrite the code so that the test no longer crashes. Helmet Patch.
Optimization. Also a freebie - the code should do the same, only faster. We write a benchmark that reproduces the performance problem, rewrite the code so that it works faster, helmet patch.
Improved documentation and comments. For example, you are trying to understand how the code works, but you do not understand. Looks like you found a place where code comments can be improved!
You can often find what to patch by building a project with some unusual compiler (for example, a very old or very new version of GCC), on an unusual platform (ARM, PowerPC, ...) under an unusual operating system (NetBSD, OpenIndiana). Tests do not usually spill over, but a couple of warnings may slip through during compilation. It often helps to run some kind of static analyzer through the code.
If you have no idea for your patch, you can essentiallyHelp the project by making a code review and / or testing someone else’s patch. Programmers, as a rule, really like to write code, but they don’t really like to review and test it. Therefore, there are really not enough reviewers in the PostgreSQL community. By the way, being a reviewer is quite simple. You need to make sure that the patch is applied, the code is then compiled and passes the tests, and also that the task that the author set for himself is solved. If it is not clear to you how to check this, perhaps the author has not described it well enough. This is an occasion to ask the author a question in the corresponding thread and put the patch in the waiting on author state. And if at the same time you are also able to read the code and give adequate advice on renaming variables and splitting procedures into several, then you simply have no price! About code review,
About mailing lists and blogs
All communication between PostgreSQL developers takes place on the pgsql-hackers @ mailing list . It also makes sense to subscribe to pgsql-committers @ . Notifications about the latest merges arrive to the master there, sometimes a discussion of a specific commit is tied up. The traffic in these two mailing lists is not that big; it's not LKML for you. It is quite possible to read them from your main mailbox without any filters (although I do not read all threads in a row). I personally receive them all on a working e-mail.
It may also make sense to subscribe to pgsql-general @ (general questions) and the already mentioned pgsql-bugs @ (bug reports). But, strictly speaking, this is not required for development.
Regarding the choice of email client. In principle, anyone will do. Many use Thunderbird. I sat on Claws Mail for a long time, and now I crawled on Mutt . I saw one of my colleagues use GMail.
It’s good practice not to send HTML emails. The text of the letter in width should be limited to 72 characters. Of course, only English can be used. Using attachments, unlike the same LKML, is not prohibited. It’s better to upload heavy attachments somewhere, rather than send them directly to the mailing list.
In the PostgreSQL community, as far as I know, there is no code of conduct. But this does not negate the need to be polite, not to use sarcasm, never to get personal, and so on. Emails, especially in English, often come out somewhat dry. Therefore, it would be a good idea to use more words in the text like please, thank you, and so on. I personally try to start any letter with words like “Thank you everyone for great comments!” and end with something like “As always, please don't hesitate to share any thoughts on this topic!”. Try it and you will be surprised how friendly the community will be towards you.
Perhaps it would be worth saying a few words about the main actors in the mailing list, such as Tom Lane, Simon Riggs, Robert Haas, Andres Freund, Alvaro Herrera, Bruce Momjian, and others. But the problem is that there are quite a few actors, and it is difficult to say in advance who will be interested in your patch. Therefore, I can only say that it would be a good idea for the first time to read the signatures of the people who answer you, look at which domains their e-mails are on, look for their names in git log or in Google in the end.
By the way, some people from the PostgreSQL community have blogs (which can be found thanks to Google), which are not without reason to subscribe. I personally currently subscribe to the following PostgreSQL-related RSS feeds:
# PostgreSQL
http://postgresmen.ru/news.xml
http://planet.postgresql.org/rss20.xml
http://habrahabr.ru/rss/company/postgrespro/blog/
http://www.postgrespro.ru/rss
http://www.postgresql.org/news.rss
http://postgresweekly.com/rss/1ijl6aaa
http://postgres-edu.blogspot.com/feeds/posts/default
http://feeds.feedburner.com/depesz
http://rhaas.blogspot.com/feeds/posts/default
http://amitkapila16.blogspot.com/feeds/posts/default
http://obartunov.livejournal.com/data/rss
Note that the list includes PostgreSQL Planet , which aggregates many blogs that are not on the list.
How to send a patch
In the general case, before starting work on some big patch, it makes sense to write in pgsql-hackers @ a proposal letter describing what you want to do, how, and why. It may turn out that nobody really needs it. Or vice versa, that it is so necessary that over the past 5 years several solutions have been proposed that you do not know about, and which you should first familiarize yourself with. Well, or you can just give a couple of implementation tips, where to look, which boundary cases to consider, and so on. PostgreSQL developers are busy people who have enough of their own business, so don’t be afraid that someone will steal your brilliant idea. Rather, they will tell you that this is unlikely to work, and provide an opportunity to prove the opposite.
Regarding the design of the code. PostgreSQL uses ANSI C, so forget about C11, C ++ or Rust. The pgindent utility is used to format the code. You will find instructions on how to build it in the PostgreSQL source, in the src / tools / pgindent / README file . Before creating a patch, always run the code through pgindent, otherwise no one will even look at it. (But make sure pgindent doesn’t make changes where you didn’t change anything! In this case, it may be easier to format the code manually.) Otherwise, there are no particularly strict rules. Just look at how the code is designed in the area of the place where you stick, and try to write in the same way.
When the patch is ready, send it to pgsql-hackers @, specifying the [PATCH] label and a short description in the subject. In the body of the letter, tell us what problem the patch solves, and how it does it. Read the newsletter archive to see how it usually looks. If the patch is small, for example, corrects a couple of typos, it can be accepted immediately and without any special questions. In more complex cases, the patch must be sent to the nearest commitfest :
Commitfest is the local name for a sprint. One commitfest lasts one month. For example, the September commitfest is now open. All new patches are added to it. In early September, consideration of patches from the September commitfest will begin, and all new patches will be added to November (there is no commitfest in October, bugs are fixed for a month, and so on). This continues until March, with only 4 commitfests - in September, November, January and March. Then comes the codefreeze, bugs are fixed, alpha and beta releases are formed.
Patches at commitfest are in different states. They all have speaking names. Needs review means a patch requires a review. Waiting on author means that some actions are required by the author of the patch. Ready for committer means that the patch has passed the codeview and there are no more questions about it. One of the committers can get acquainted with it and either hold it back or return it to the author for revision. Well and so on.
Be patient. If nobody responds to your patch, this does not mean that nobody needs it. Just now everyone is busy with other patches. If your patch is in the commitfest and does not hang in Waiting on author, no one will forget about it, do not worry. If the reviewer or the committer answered you, carefully read the answer, make the appropriate changes to the patch and send its new version. To argue with reviewers or committers, in my personal experience, this is a very thankless task. Faster to fix the code and send the corrected patch. Moreover, often then you realize that the reviewer or committer was generally right, and you were not. However, some of my colleagues have a different experience, and on the contrary, they believe that it is always necessary to argue.
While you are waiting for a response to your patch, it is a good idea to browse someone else’s patch yourself. There is such an unwritten rule in the PostgreSQL community - if you send a patch for a patch and you don’t review anyone yourself, then they will very quickly stop reviewing you too. Moreover, the faster other patches on the commitfest are accepted or rejected, the faster the queue will reach yours, the more time you will have to make changes before closing the commitfest.
Conclusion
Additional materials for self-study:
- Video course "Hacking PostgreSQL" by Anastasia Lubennikova . Great course on the internal structure of PostgreSQL. Available videos and slides.
- Book Database System Implementation . As it says, this is how PostgreSQL works.
- The basics of debugging with GDB . There you will also find links to articles about debugging with LLDB, using the wonderful RR tool, and more.
- About how to profile code using perf , bcc / eBPF and other tools. You will also find links to DTrace and SystemTap material in the articles.
- Valgrind Tutorials and Static Analyzers Tutorials for C / C ++ . These tools help to find various kinds of errors in the code; it is extremely useful to be able to use them.
- Our company permanently hires . The work is interesting, although somewhat specific. Accustomed to weekly sprints with weekly rolling out of new code, for a long time it was not easy for me to rebuild.
That is all I wanted to talk about today. If you have any questions, I will be happy to answer them in the comments.
Continued: Contribute to PostgreSQL: Examples of Real Patches, Part 1 of N