Dutoi Group @ University of the Pacific

What is Version Control?

Coming soon.

Centralized vs. Decentralized

Coming soon. [Biggest point for us: decentralized is low overhead to set up.]

Tools for Decentralized Version Control

My favored tool for version control of software (and pretty much anything else) is Mercurial. Mercurial is similar in spirit to git, but its command names are closer to those of Subversion, which is probably the biggest reason I started using it. That is not an important distinction, however. Though I still have other, much more minor reasons for preferring Mercurial over git, git is more widely used, and repositories of one can often be converted to repositories of the other.

The biggest point of the foregoing is that, if you will write/share code in my group, you will probably be using Mercurial (henceforth, just “Hg”), and so the rest of this page is written with that option exclusively in mind (though some of the more general advice is transferable).

Our Repository Model

Whereas the equality of repositories in Hg can be very powerful in large delocalized collaborations, with subgroups of developers pushing and pulling within a branch, we are a small localized collaboration.

For the most part, and this is not strict, each code has a specially designated repository to and from which we will do all our pushing and pulling. This is for no other reason than that it is the simplest possible usage model of Hg; if we need something more complicated later, we can do it then. The special repository will be designated by the existence of a local (hidden) file named .repository. The file does contain some text explaining its meaning, but really the existence of this file should be taken as the marker. Locally, that file is ignored by Hg (it is included in the .hgignore file), and it is not checked in, so that it is not propagated. This repository is thought of as the development trunk.
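As a minimal sketch of the marker convention above, the following Python helper reports whether a directory is the designated central repository by checking for the .repository file. The function name is_central_repository is hypothetical, chosen only for illustration; it is not part of Hg.

```python
from pathlib import Path


def is_central_repository(repo_root):
    """Return True if repo_root holds the .repository marker file.

    Only the existence of the file matters; its contents are ignored.
    (Hypothetical helper for illustration, not part of Mercurial.)
    """
    return (Path(repo_root) / ".repository").is_file()
```

A script that pushes or pulls could call this on its target first, as a guard against treating an ordinary clone as the trunk.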

Occasionally a branch will be created from the trunk, cleaned of all its worst bugs, and this will be checked in as a “release” version (though it may only be released to other computers in the group). That version will then be merged back with the trunk, even if this involves simply putting everything back the way it was. In this way the trunk still only ever has one head, but, when cloned, there is a clean release version available. On machines where we need the code to run stably, the local working directory will be updated to a release. Should we ever get so far as releasing code to the general public, or even just to end-user-type collaborators, it would be such a release that we then copy and strip of all the development junk (like the .hg and attic directories).

In general, the structure of a repository should be such that the following directories live inside it.
    <package-name>/
    Documentation/
    Applications/
    Development/

The idea is that the directory in which these directories live can be named whatever you need it to be named to keep track of however many local branches you are working on, but that <package-name> is the constant name under which the package is imported into other code. If PYTHONPATH is set to find the parent directory of these items, then the import statements will find the package therein.

User documentation should live in the Documentation directory. The Applications directory appropriately lives outside of the package, as will most user code. The applications therein must import the package as if it lived in some arbitrary location on the system. Tests should live in the Applications directory (which can and should be highly structured), even if they are only unit tests. Finally, the Development directory contains stuff useful to developers of the code, like “to do” lists or a Wiki instance.

Should the code ever be distributed to end-users, then the Development directory will be stripped, along with the .hg directory and any .hgignore file (not shown above). Applications may also be thinned, and may or may not be renamed to Samples, depending on whether the code acts more like a library or an application suite. Some cleaning of local development garbage out of the package itself may be necessary (in the best case, only attic/ directories).
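The layout and import convention can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names missing_layout_dirs and import_from_root are hypothetical, and inserting the directory into sys.path stands in for what setting PYTHONPATH would normally do.

```python
import importlib
import sys
from pathlib import Path

# Top-level directories expected next to the package itself.
EXPECTED_DIRS = ("Documentation", "Applications", "Development")


def missing_layout_dirs(root, package_name):
    """List the expected top-level directories that are absent under root."""
    root = Path(root)
    names = (package_name,) + EXPECTED_DIRS
    return [name for name in names if not (root / name).is_dir()]


def import_from_root(root, package_name):
    """Import package_name with root on the module search path.

    Inserting root into sys.path here stands in for setting PYTHONPATH
    to the parent directory of the package (an assumption of this sketch).
    """
    root = str(Path(root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    importlib.invalidate_caches()
    return importlib.import_module(package_name)
```

Because only the parent directory is put on the search path, the same import statement works no matter what that directory is called, which is what lets several local branches coexist under different names.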

What to check in and how often?

In short, check your code into your local repository liberally and often. It is hard for it to hurt you to have a very fine-grained history of your changes. Do not worry about checking in buggy code; remember, it stays local at that point. No one else sees it (yet), and you will have a chance to fix it before anyone gets it. They are unlikely to update their working directory to an old version in your development branch, especially if you note that it is buggy in the commit message (and it is their problem if they do update to that code).

I also say “liberally” because it never hurts to make more of your development garbage accessible. Do not make too much clutter though. A good idea is to make a folder called attic where you can put some things that are not in use but might contain valuable reminders of how to do stuff. There is not much point in putting old checked-in code into the attic though, because you could just get that out of an older version of the repository. An attic is a good place for ideas that you nixed in early stages, for which there is little point in checking them in even as a buggy rough draft. You might be tempted to be protective of code that you think is the early stages of a really good idea, or that you are sort of embarrassed by. But remember, if you do not check it in, and then you want it back after deleting it, you are digging around in the general system backups for it, which is dicier, and involves asking me and waiting for it. (It is also fine to delete a checked-in attic, once you think it is really no longer relevant.)

One exception to liberally checking in everything would be big files, especially output data. It may be necessary to archive data from tests, but, if that data is big, we should consider an independent repository. The problem is that, once checked in, even if deleted later, that file is archived, and so it still takes up space and slows people down when cloning the repository.
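One way to catch big files before they are checked in is a quick scan of the working directory. The sketch below is only a suggestion, not an established group tool: the function name find_large_files and the 1 MB default threshold are assumptions made for illustration.

```python
import os
from pathlib import Path


def find_large_files(root, max_bytes=1_000_000):
    """Return (path, size) pairs for files under root larger than max_bytes.

    The .hg directory is skipped, since it holds repository history rather
    than working files. The 1 MB default is an arbitrary assumption.
    Results are sorted with the largest file first.
    """
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .hg in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        for name in filenames:
            path = Path(dirpath) / name
            # lstat does not follow symlinks, so broken links cannot raise.
            size = path.lstat().st_size
            if size > max_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Running this before a commit gives a short list of candidates to leave out, or to move to a separate data repository.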

The other major exception is .hgignore files. I do not want to go through all the theory, but it has bitten me a couple of times to have these be part of the repository. If you do not like seeing .hgignore listed when you type hg st because it is not checked in, consider adding .hgignore to the .hgignore file! Note that the special central repository (with the .repository file) may need to have a .hgignore file. In that case, it is likely a user will want to at least start with something similar, and it would be good to have it under version control. To that end, a file named hgignore should be checked into the Development subdirectory of the repository. In the central repository, .hgignore will just be a soft link to this file, but users may also choose to copy and modify it in their local branch.
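The soft-link arrangement for the central repository could be set up with a few lines of Python. This is a sketch under assumptions: the helper name link_hgignore is hypothetical, and it presumes a POSIX-like system where creating symbolic links needs no special privileges.

```python
import os
from pathlib import Path


def link_hgignore(repo_root):
    """Soft-link <repo_root>/.hgignore to Development/hgignore.

    Does nothing if .hgignore already exists, so a user's own copy is
    never overwritten. Returns True if a link was created.
    (Hypothetical helper for illustration, not part of Mercurial.)
    """
    repo_root = Path(repo_root)
    source = repo_root / "Development" / "hgignore"
    target = repo_root / ".hgignore"
    if not source.is_file():
        raise FileNotFoundError(source)
    if target.exists() or target.is_symlink():
        return False
    # A relative link keeps working if the repository directory is moved.
    os.symlink(os.path.join("Development", "hgignore"), target)
    return True
```

In a local branch, copying Development/hgignore to .hgignore instead of linking it leaves the checked-in version untouched when the local ignore rules are edited.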

What to push and how often?

Clean, merged code only! (usually)