Why We Use Unix and Linux in Bioinformatics



This page is under construction!


Having worked in bioinformatics for almost 3 decades, and used most of the major desktop and multiuser platforms available during that time (IBM 370, Apple ][, VAX/VMS, Macintosh, DOS, Windows, Unix, NeXT, Linux), I can speak from experience about why Unix is the preferred platform for both teaching and research in bioinformatics. These points are probably quite relevant in most areas of science.

Although many reasons for preferring Unix in bioinformatics are discussed below, a recurrent theme runs throughout, which can be stated as follows:

We are faced with the task of giving the biologist easy access to dozens or hundreds of programs that must work the same for all users on all machines on the system.





The workbench vs. the Swiss Army knife - One of the most fundamental concepts in bioinformatics is that many tasks are best done by putting together several tools in series, each of which performs part of the task. This is called "pipelining", and it applies to very simple tasks as well as tremendously complex ones, such as assembly and annotation of complete genomes. Pipelining is illustrated in the accompanying figure. Just as an enzyme takes a substrate and generates a product, a program takes input and produces output. For example, one program might extract an mRNA sequence from a larger genomic sequence, a second would translate the mRNA into protein, and a third might predict the secondary structure of the protein.
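On a Unix system the pipe operator is what joins such tools in series. A trivial sketch using only standard commands (the tiny FASTA file is made up for illustration): each tool does one small job, and the pipe hands its output to the next.

```shell
# Build a tiny FASTA file, then chain three small tools with pipes:
# grep selects the header lines, sed strips the '>', sort orders the names.
printf '>seq2\nGGTA\n>seq1\nATGC\n' > tiny.fasta
grep '^>' tiny.fasta | sed 's/^>//' | sort
```

Real bioinformatics pipelines follow exactly the same pattern, with sequence-analysis programs in place of grep and sed.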

A fundamental design concept in Unix has always been to have a 'workbench' of small tools, each of which only does one thing but does it very well. Windows has always followed the "Swiss Army Knife" philosophy of continuing to add more and more functionality to user applications that get bigger and more cumbersome as time goes on.  Eventually, usability begins to suffer as illustrated, for example, by the proliferation of autocorrect functions that are often more frustrating than they are useful. And, to continue with the analogy, the tools on a Swiss Army Knife, say, a screwdriver, are seldom as easy to use as the workbench variety. This is why Canadian Tire sells more than just Swiss Army Knives.

Bioinformatics is highly-dependent on the existence of a good workbench of tools, rather than one or two Swiss Army Knives. In a multiuser system, we are faced with the problem of ensuring that every one of potentially hundreds of programs works the same way for all users on all machines. This is especially a problem because the system administrator will seldom have enough knowledge of biology to be able to test each program that he or she installs.

Let's say that we wanted to implement a bioinformatics workbench of 100 programs on either Unix or Windows.

Unix (as implemented on the BIRCH system)

A BIRCH system can be run from a standard user account requiring no special system administration privileges. That means that a biologist with some knowledge of Unix can do all installation, updating, and problem solving without waiting for the system administrator to get around to it.

All that is required to install each program is to copy it to a world-readable 'bin' (binary) directory. Installing 100 programs can be as easy as copying 100 programs to the same bin directory. In principle this could be done with a single Unix command.
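The copy-to-bin model is easy to sketch with ordinary shell commands. The directory names and the toy "program" below are illustrative stand-ins, not BIRCH's actual layout:

```shell
# A sketch of the copy-to-bin install model (paths are illustrative).
mkdir -p birch/bin
printf '#!/bin/sh\necho hello from tool1\n' > tool1
chmod a+rx tool1        # world-readable and executable
cp tool1 birch/bin/     # "installation" is just a copy
birch/bin/tool1         # the tool now runs for any user who can read bin
```

Installing 100 programs is the same `cp` with 100 arguments; there is no registry to update and no per-machine install procedure.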

Windows (as implemented on the ACN open area machines)

There is no mechanism for individual users on a multiuser system to install a program so that everyone on the system can use it. That means the system administrator must do all installation and updating, and fix all problems. Installing 100 programs means 100 times the work of installing one: each program must go through the install process, update the registry, and then be tested on individual machines.

Network-centric computing vs. the standalone PC -

The concept that a computer is used by a single user, who keeps all of his or her software and files on one machine, is inextricably entrenched in Windows. Even though Windows now allows a user to have a home directory of sorts, no software that I have seen defaults to it. Windows software always wants to write files to a directory, usually named 'Application Data', deep within the 'Programs' hierarchy. If multiple users share a machine, every time you open a file you have to do an awful lot of clicking to get to a directory you own. Unless you could afford to put each specialized molecular biology software package onto each person's PC, you are stuck with data from numerous users mingling in the same directory.


File organization - Just about any task in bioinformatics will generate large numbers of files. For example, one might run an analysis that generates 3 output files on each of 25 samples, creating 75 output files per run. The need to organize these files is therefore paramount.

I run a multiuser sequence analysis resource called BIRCH, for 'Biological Research Computer Hierarchy' (see http://home.cc.umanitoba.ca/~psgendb). Notice the word hierarchy. In a well organized Unix system, the entire system throughout the campus or corporation behaves as a single hierarchy. To make programs and databases available to our users, I put them into a world-readable directory structure. To access the programs, the first-time user has to run a single setup script that adds lines to his or her .login and .cshrc files, telling the shell to look to the BIRCH global cshrc and login scripts for setup commands. That way, even when the configuration has to be changed (e.g. new environment variables telling programs where to find files), the changes take effect the next time the user logs in, without them having to do a thing. I have been managing BIRCH as a multiuser system since 1991, and we now have over 140 users campuswide. In all that time, I have never once had to log in to user accounts, one by one, to make some change take effect.
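The setup-script idea can be sketched in a few lines. The file name and the BIRCH path below are hypothetical stand-ins, not the actual BIRCH scripts:

```shell
# Sketch of the one-time setup: append a "source" line to the user's
# startup file, so the site-wide config takes effect at the next login.
# 'example.cshrc' and the path are illustrative stand-ins, not BIRCH's.
RC=example.cshrc
SITE=/usr/local/birch/admin/cshrc.source
touch "$RC"
# Only add the line if it is not already present (script is re-runnable).
grep -q "cshrc.source" "$RC" || echo "source $SITE" >> "$RC"
cat "$RC"
```

After this, all future configuration changes go into the site-wide file, and every user picks them up automatically.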

Windows still has drive letters (C:, H:, S:, etc.) that can be redefined by each user on their own PC, unless every PC is individually configured by somebody with enough time to go to each machine and configure it. Environment variables persist from DOS as well. When one considers how many different PC configurations there are on a campus, the idea of trying to implement a BIRCH-like system on Windows makes me queasy. I don't know that it's impossible, but I wouldn't want to make the attempt.

The concept of a HOME directory - A HOME directory is important for several reasons. It is a default location for all of a given user's files. It helps to contain all of that user's files in a single place. Most importantly, it gives the system a standardized location in which to find configuration and preference files. The most notable example would be bookmarks for your web browser. 

In Unix, all programs utilize the HOME directory concept. This means that on a system in which many machines mount the same HOME directory (e.g. all ACN Unix machines), the user's workspace and programs retain their preferences and customizations no matter which machine you log into. Unix also has the concept of the current working directory: start a program in any directory and, by default, it will read and write files within that directory. Together, these two concepts make it easier to keep your directories organized by topic, rather than by program. For example:

/home/plants/frist
    courses
        agbiotech
        bioinformatics
        cyto
    research
        grants
            nserc05
            nserc08
        papers
            birch
            stii
            yw1
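A minimal sketch of those two conventions in action, using only standard commands (the file names are made up):

```shell
# $HOME holds per-user configuration; the current working directory is
# the default location for data files. A quick demonstration:
workdir=$(mktemp -d)      # a scratch directory standing in for a project
cd "$workdir"
echo "ATGC" > input.txt   # a relative path lands in the current directory
ls input.txt
echo "$HOME"              # preferences and dotfiles live here instead
```

Because data follows the working directory and settings follow $HOME, a topic-organized tree like the one above stays clean no matter which programs you run in it.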

In Windows, there is no standard home directory for the user. Although an H: drive exists on most multiuser Windows systems, it is almost never used by programs. Almost none of the programs in the START menu of PCs in Agriculture 237 saved preferences or bookmarks from session to session. The one exception was Firefox. That means that users of open-area Windows machines on this campus cannot do something as fundamental as setting bookmarks or preferences for the programs they use. This is even true of Microsoft Office. In this way, Windows undermines the ability of a user to have a seamlessly identical session from one machine to the next.

Windows does have the capability to use a roaming user-profile, but it does not appear to be implemented at the Univ. of Manitoba. The reason may be that the roaming user profile must be downloaded to each machine, each time a user logs in, and synced with the server copy each time a change is made, which slows down performance. (In Unix this is easy because all settings are saved in the HOME directory. This would be equivalent to saving settings on the H: drive.)


Beyond the "one window owns the screen" model - I guess commercial software vendors want to show how important their programs are by making them default to taking up the entire screen. Yes, you can click at the top of a window to make it take up only part of the screen, but every time you start another task you have to do it again. This kind of behavior prevents people from learning how to work with multiple windows (e.g. a sequence analysis in one window, a web browser in another, a database in a third). Also, most Windows apps are written with the intention that they take up the entire screen, so they use a lot of screen real estate, especially at the top of the window. Many Windows apps look pretty ugly if you try to make them take up less than the full screen. In contrast, apps written for X-windows tend to economize on screen real estate (xv and acedb are champs here).

Windows claims to have preemptive multitasking. However, there are still lots of situations in which a program is waiting to do something, or hangs, and the entire screen freezes up. You literally can't do anything else at that point except, of course, reboot. In a research environment, you have to be able to do many tasks at the same time. With Windows, it seems that the more simultaneous tasks are going on, the higher the likelihood that one of them will freeze the system. While I have, a few times in my life, seen an X11 application freeze up the screen, requiring me to log in from another terminal to cancel the job (which I don't think is possible in Windows), these occurrences have been rare.


The need for a command line - There are many kinds of tasks in bioinformatics that are much more difficult, or perhaps impossible, to do by pointing and clicking than by using a command line.
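Batch-renaming files is a simple example of such a task. Three files stand in here for the hundreds a real analysis produces; in a file manager this would mean one click-rename-type cycle per file, but on the command line it is one loop regardless of the count:

```shell
# Rename every .txt result file to .seq in one loop -- the same one
# line works for 3 files or 300.
mkdir -p results && cd results
touch sample1.txt sample2.txt sample3.txt
for f in *.txt; do mv "$f" "${f%.txt}.seq"; done
ls
```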


The best evidence for the perceived value of a command line is the fact that when Windows NT was created, the small number of DOS commands (originally patterned after Unix commands) was augmented with hundreds of new commands. At the same time, the complete re-write of the Macintosh platform, from OS 9 to OS X, was accomplished by reimplementing the Macintosh GUI on top of BSD Unix. Thus, Macintosh is essentially a Unix system with a proprietary graphic interface.

Well, Windows is better than DOS, but not by much. Compared to the thousands of Unix commands, the improvement in Windows is barely significant. Many of the more sophisticated sequence tools don't have graphic interfaces and need to be run from the command line. With GDE, it has been possible to automate the running of text-based, command-line programs from a graphic interface, solely because the underlying Unix commands were available to do so. I use a GUI for most things, but when I want a robust command line, Unix has what I need.

An interesting exception that proves the rule is the Transproteomic Pipeline found at http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP. This pipeline was implemented on a Windows system, but the way they chose to do it was using Cygwin (http://www.cygwin.org/cygwin), which provides most of the common Unix commands under Windows and makes it possible to run Unix programs that have been re-compiled for the Windows platform. The point is that even on a Windows system, the developers realized that they needed Unix tools to create a good data pipeline.

Unix desktops are easy to use - For biologists with little formal training in computers, it is important to shorten as many of the learning curves as we can. Fortunately, the evolution of Unix desktops has now made them as easy to use as those on Macintosh or Windows. A typical Unix or Linux system has a full suite of office productivity tools, drawing and graphics, web browsers, multimedia, and other desktop applications.

Ease of system administration ultimately benefits the user - It's as simple as this: the easier things are for the system administrator, the more likely it is that the end user will have a clean system on which everything works.

Security and data integrity are important - Bioinformatics began gaining prominence with the genomics era, in which the small-lab basic research paradigm gave way to the big-collaboration targeted research approach. The latter implies fierce competition, as well as the need to protect potential intellectual property.

Security has always been intrinsic to Unix, whereas Microsoft has only grudgingly accepted the need for security, which has led to the expression "usability trumps security."

It is odd that we also seem to accept as normal that all of our data is stored on a single hard drive, or perhaps a memory stick.

Cost - Funding specifically targeted at bioinformatics at this institution has so far been non-existent. Bioinformatics seems to be an area that falls between the cracks when it comes to funding. Because it is interdisciplinary, it tends to be perceived as "someone else's responsibility". No funding agencies in Canada have dedicated funding lines for bioinformatics, although these do exist in the US.

In general, doing bioinformatics is cheaper on Unix than on Windows or Macintosh....

Unix is the Green choice -   This is not specifically relevant to bioinformatics, but it's timely, and relevant to the fact that the mainstream public has finally realized that minimizing each person's environmental footprint is important. We're biologists. We should be setting a good example.

The Windows/Intel partnership maximizes profits by designing a rapid obsolescence cycle into its products. That is, Microsoft pushes the envelope of machine performance by loading the OS and other software with features that expand to fill the available machine resources, while Intel designs chips to keep up with the software bloat. The net result is that the world thinks it's normal to replace computers every 3 or 4 years.

Because Unix has always been designed as a multiuser, multitasking system, it deliberately optimizes the utilization of hardware resources. Consequently, a Unix system can continue to be updated to the latest version for twice as long as Windows on identical hardware, cutting in half the amount of computer pollution that goes into landfills. A further reduction in computer waste can be realized by recycling Windows machines that can no longer run current Windows releases and installing Linux on them, rather than buying new computers. Finally, because Unix lends itself to the use of thin clients, it encourages the further savings in energy, cooling, and computer waste that are intrinsic to thin clients.



Installed base of software

It is still true that more molecular biology software is developed on Unix than on any other platform, largely because for serious computing, and in particular for serious programming, development is easier there. The same is probably true in many other areas of science. A funny irony, when you consider that the main reason Windows holds such a strong monopoly is its market share in office desktop software.

Much of the software being developed in bioinformatics today uses platform-independent languages such as Perl, Python or Java. The problem is that none of these is present as a standard part of Windows systems. They must be installed and configured on each Windows machine.