Why We Use Unix and Linux in Bioinformatics
Having worked in bioinformatics for almost three decades, and having
used most of the major desktop and multiuser platforms available during
that time (IBM 370, Apple ][, VAX/VMS, Macintosh, DOS, Windows, Unix,
NeXT, Linux), I can speak from experience about why Unix is the
preferred platform for both teaching and research in bioinformatics.
These points are probably quite relevant in most areas of science.
Although many reasons why Unix is the preferred system for
bioinformatics are discussed below, a recurrent theme runs throughout,
and it can be stated as follows: we are faced with the task of giving
the biologist easy access to dozens or hundreds of programs that must
work the same way for all users on all machines on the system.
The workbench vs. the Swiss Army knife - One of the most fundamental
concepts in bioinformatics is that many of the tasks you need to do are
best done by putting together several tools in series, each of which
performs part of the task. This is called "pipelining", and it applies
to very simple tasks, as well as tremendously complex tasks such as
assembly and annotation of complete genomes. Pipelining is illustrated
in the accompanying figure. Just as an enzyme takes a substrate and
generates a product, a program takes input and produces output. For
example, one program might extract an mRNA sequence from a larger
genomic sequence. A second program would translate the mRNA into
protein, and a third might predict the secondary structure of the protein.
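As a minimal sketch of the mechanics (the sequence and tool choices are purely illustrative: standard Unix commands stand in for real analysis programs, and these stages simply extract and reverse-complement a sequence rather than translate it):

```shell
# Each stage reads stdin and writes stdout, so stages chain with '|'
# the way enzymes chain in a pathway.
# Lowercase letters mark flanking genomic text; uppercase is the "mRNA".
echo "gggATGAAATGAggg" |
  tr -d 'a-z' |             # stage 1: "extract" the mRNA (keep uppercase)
  rev | tr 'ACGT' 'TGCA'    # stage 2: reverse-complement it
```

Any program that follows the same read-stdin/write-stdout convention can be dropped into such a chain without modifying the others.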
A fundamental design concept in Unix has always been to have a
'workbench' of small tools, each of which only does one thing but does
it very well. Windows has always followed the "Swiss Army Knife"
philosophy of continuing to add more and more functionality to user
applications that get bigger and more cumbersome as time goes on.
Eventually, usability begins to suffer as illustrated, for example, by
the proliferation of autocorrect functions that are often more
frustrating than they are useful. And, to continue with the analogy,
the tools on a Swiss Army Knife, say, a screwdriver, are seldom as easy
to use as the workbench variety. This is why Canadian Tire sells more
than just Swiss Army Knives.
Bioinformatics is highly dependent on the existence of a good workbench
of tools, rather than one or two Swiss Army Knives. In a multiuser
system, we are faced with the problem of ensuring that every one of
potentially hundreds of programs works the same way for all users on
all machines. This is especially a problem because the system
administrator will seldom have enough knowledge of biology to be able
to test each program that he or she installs.
Let's say that we wanted to implement a bioinformatics workbench of 100
programs on either Unix or Windows.
Unix (as implemented on the BIRCH system)
A BIRCH system can be run from a standard user account requiring no
special system administration privileges. That means that a biologist
with some knowledge of Unix can do all installation, updating, and
problem solving without waiting for the system administrator to get
around to it.
All that is required to install each program is to copy it to a
world-readable 'bin' (binary) directory. Installing 100 programs can be
as easy as copying 100 programs to the same bin directory. In principle
this could be done with a single Unix command.
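A sketch of what that looks like, with made-up paths and program names:

```shell
# All programs live in one world-readable bin directory; installing
# them is a single copy. (Paths and names are illustrative.)
mkdir -p /tmp/downloads /tmp/birch/bin
touch /tmp/downloads/prog1 /tmp/downloads/prog2 /tmp/downloads/prog3
chmod a+rx /tmp/downloads/*

cp /tmp/downloads/* /tmp/birch/bin/   # the single install command

ls /tmp/birch/bin                     # every user now sees the programs
```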
Windows (as implemented on the ACN open area machines)
There is no mechanism for individual users on a multiuser system to
install a program such that it could be used by everyone on the system.
That means that the system administrator must do all installation,
updating, and fix all problems. Installing 100 programs means 100 times
the work of installing one.
Each program must go through the install process, update the registry,
and then be tested on individual machines.
Network-centric computing vs. the standalone PC -
The concept that a computer is used by a single user, who has all
his/her software and files on
one machine, is inextricably entrenched in Windows. Even though Windows
now allows a user to have a home directory of sorts, no software that I
have seen defaults to it. Windows software always wants to write files
to a directory, usually named 'Application Data' deep within the
'Programs' hierarchy. If multiple users use a machine, every time you
open a file you have to do an awful lot of clicking to get to a
directory you own. Unless you can afford to put each specialized
molecular biology software package onto each person's PC, you are stuck
with data from numerous users mingling in the same directory.
File organization - Just about any task in bioinformatics will
generate large numbers of files. For example, one might run an
analysis which generates 3 output files on 25 samples, creating 75
output files per run. The need to organize these files is
therefore paramount.
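To make the scale concrete, here is a sketch (directory and file names are made up) of keeping one run's 75 output files together in a per-run directory:

```shell
# 25 samples x 3 output files = 75 files per run; giving each run its
# own directory keeps runs from mingling.
mkdir -p /tmp/run01
for s in $(seq -w 1 25); do
  for ext in aln tree log; do
    : > "/tmp/run01/sample${s}.${ext}"   # placeholder for real output
  done
done
ls /tmp/run01 | wc -l    # counts the run's output files
```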
I run a multiuser sequence analysis resource called BIRCH, for
'Biological Research Computer Hierarchy' (see
http://home.cc.umanitoba.ca/~psgendb). Notice the word hierarchy. In a
well organized Unix system, the entire system throughout the campus or
corporation behaves as a single hierarchy. To make programs and
databases available to our users, I put them into a world-readable
directory structure. To access the programs, a first-time user has to
run a single setup script that adds lines to his/her .login and .cshrc
files, telling the shell to look to the BIRCH global cshrc and login
scripts for setup commands. That way, even when the configuration has
to be changed (e.g., new environment variables telling programs where
to find files), the changes take effect the next time the user logs in,
without them having to do a thing. I have been managing BIRCH as a
multiuser system since 1991, and we now have over 140 users campuswide.
In all that time, I have never once had to login to user accounts, one
by one, to make some change take effect.
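The mechanism can be sketched like this (the paths, file names, and variable names below are illustrative, not the actual BIRCH files): all settings live in one site-wide file, and the setup script only makes each user's rc file source it.

```shell
# Site-wide csh-style settings, maintained in one place by the admin:
mkdir -p /tmp/birch_demo
cat > /tmp/birch_demo/cshrc.source <<'EOF'
setenv BIRCH /home/psgendb
set path = ($path $BIRCH/bin)
EOF

# What the one-time setup script appends to a user's ~/.cshrc:
cat >> /tmp/birch_demo/user.cshrc <<'EOF'
# --- added by BIRCH setup ---
source /tmp/birch_demo/cshrc.source
EOF

# From now on, editing cshrc.source updates every user at next login.
```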
Windows still has drives (C:, H:, S: etc) that can be redefined by each
user on their own PC, unless every PC is individually configured by
somebody with enough time to go to each machine and configure it.
There are still environment variables, just as DOS had. When one considers
how many different configurations of PCs
there are on a campus, the idea of trying to implement a BIRCH-like
system on Windows makes me queasy. I don't know that it's impossible,
but I wouldn't want to make the attempt.
The concept of a HOME directory - A HOME directory is important for several
reasons. It is a default location for all of a given user's files. It
helps to contain all of that user's files in a single place. Most
importantly, it gives the system a standardized location in which to
find configuration and preference files. The most notable example would
be bookmarks for your web browser.
In Unix, all programs utilize the HOME directory concept. This means
that on a system in which many machines mount the same HOME directory
(eg. all ACN Unix machines) the user's workspace and programs retain
their preferences and customizations no matter which machine you log
into. Unix also has the concept of the current working directory.
Start a program in any directory, and it will read and write files,
by default, within that directory. Together, these two concepts make it
easier to keep your directories organized by topic, rather than by
program. For example:
/home/plants/frist
    courses
        agbiotech
        bioinformatics
        cyto
    research
        grants
            nserc05
            nserc08
        papers
            birch
            stii
            yw1
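A quick sketch of the working-directory behavior (paths are illustrative):

```shell
# A program started in a project directory reads and writes there by
# default, so each topic's files stay together.
mkdir -p /tmp/research/papers/birch
cd /tmp/research/papers/birch
echo "draft outline" > notes.txt    # lands in the current directory
ls
```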
In Windows, there is no standard home directory for the user. Although
an H: drive exists on most multiuser Windows systems, it is almost
never used by programs. Almost none of the programs in the START menu
of PCs in Agriculture 237 saved Preferences or bookmarks from session
to session. The one exception was Firefox. That means that users of
open-area Windows machines on this campus cannot do something as
fundamental as setting bookmarks or Preferences for the programs they
use. This is even true of Microsoft Office. In this way, Windows
undermines a user's ability to have a seamlessly identical session
from one machine to the next.
Windows does have the capability to use a roaming user-profile, but it
does not appear to be implemented at the Univ. of Manitoba. The reason
may be that the roaming user
profile must be downloaded to each machine, each time a user logs
in, and synced with the server copy each time a change is made, which
slows down performance. (In Unix this is easy because all settings are
saved in the HOME directory. This would be equivalent to saving
settings on the H: drive.)
Beyond the "one window owns the screen" model - I guess commercial
software vendors want to show how important their
programs are by making them default to taking up the entire screen.
Yes, you can click at the top of a window to make it take up only part
of the screen, but every time you start another task, you have to keep
doing that. This kind of behavior prevents people from learning how to
work with multiple windows (e.g., a sequence analysis generating
several windows, a web browser in another, a database in yet another).
Also, most Windows apps are written with the intention that they should
take up the entire screen, so they use a lot of screen real estate,
especially at the top of the window. Many Windows apps look pretty ugly if you try to
make them take up less than a full screen. In contrast, apps written
for X-windows tend to economize on screen real estate. (xv and acedb
are champs, here).
Windows claims to have preemptive multitasking. However, there are
still lots of situations in which a program is waiting to do something,
or the program hangs, and the entire screen freezes up. You literally
can't do anything else at this point, except, of course, to reboot. In
a research environment, you have to be able to do a lot of tasks at the
same time. With Windows, it seems that the more simultaneous tasks are
going on, the higher the liklihood that one of them will freeze the
system. While I have, a few times in my life seen an X11 application
freeze up the screen, requiring me to log in from another terminal to
cancel the job (which I don't think is possible in Windows), these
occurrences have been rare.
The need for a command line - There are many kinds of tasks in
bioinformatics that are much more difficult, or perhaps impossible, to
do by pointing and clicking than by using a command line.
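A classic example is applying the same analysis to every file in a directory. Here 'wc -c' stands in for a real analysis program, and the files and names are made up:

```shell
# Point-and-click means opening each file by hand; the command line
# handles any number of files with one loop.
mkdir -p /tmp/seqs
printf 'ATGC\n'     > /tmp/seqs/a.seq
printf 'ATGCATGC\n' > /tmp/seqs/b.seq
for f in /tmp/seqs/*.seq; do
  printf '%s %s\n' "$f" "$(wc -c < "$f" | tr -d ' ')"   # analysis stand-in
done
```

The same loop works unchanged whether the directory holds two files or two thousand.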
The best evidence for the perceived value of a command line is the
fact that, when Windows NT was created, the small number of DOS
commands (originally patterned after Unix commands) was augmented
with hundreds of new commands. At the same time, the complete re-write
of the Macintosh platform, from OS9 to OSX, was accomplished by
reimplementing the Macintosh GUI on top of BSD Unix. Thus,
Macintosh is essentially a Unix system with a proprietary graphic
interface.
Well, Windows is better than DOS, but not by much. Compared to the
thousands of Unix commands, the improvement in Windows is barely
significant. Lots of the more sophisticated sequence analysis tools
don't have graphic interfaces and need to be run from the command line.
With GDE, it has been possible to automate the running of
text-based/command-line programs from a graphic interface, solely
because the underlying Unix commands were available to do so.
I use
a GUI for most things, but when I want a robust command line, Unix has
what I need.
An interesting exception that proves the rule is the Trans-Proteomic
Pipeline found at http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP.
This pipeline was implemented on a Windows system, but the way its
developers chose to do it was using Cygwin
(http://www.cygwin.org/cygwin), which provides most of the common Unix
commands under Windows and makes it possible to run Unix programs that
have been re-compiled for the Windows platform. The point is that even on a Windows system,
the developers realized that they needed Unix tools to create a good
data pipeline.
Unix desktops are easy to use - For biologists with little formal
training in computers, it is important to shorten as many of the
learning curves as we can. Fortunately, the evolution of Unix desktops
has now made them as easy to use as those on Macintosh or Windows. A typical Unix or
Linux system has a full suite of office productivity tools, drawing and
graphics, web browsers, multimedia and other desktop applications.
Ease of system administration ultimately benefits the user - It's
as simple as this: the easier things are for the system administrator,
the more likely it is that the end user will have a clean system on
which everything works.
Security and data integrity are important - Bioinformatics began
gaining prominence with the genomics era, in which the small-lab basic
research paradigm gave way to the big-collaboration targeted research
approach. The latter implies fierce competition, as well as the need to
protect potential intellectual property.
Security has always been intrinsic to Unix, whereas Microsoft has only
grudgingly accepted the need for security, which has led to the
expression "usability trumps security." It is also odd that we seem to
accept as normal that all of our data is stored on a single hard drive,
or perhaps a memory stick.
Cost - Funding specifically targeted at bioinformatics at this
institution has so far been non-existent. Bioinformatics seems to be an
area that falls between the cracks, when it comes to funding. Because
it is interdisciplinary, it tends to be perceived as "someone else's
responsibility". There are no funding agencies in Canada that
specifically have dedicated funding lines for bioinformatics, although
these do exist in the US.
In general, the cost of doing bioinformatics is lower on Unix than on
Windows or Macintosh.
Unix is the Green choice - This is not specifically relevant to
bioinformatics, but it's timely, and relevant to the fact that the
mainstream public has finally realized that minimizing each person's
environmental footprint is important. We're biologists. We should be
setting a good example.
The Windows/Intel partnership has the strategy of maximizing profits by
designing a rapid obsolescence cycle into product design. That
is,
Microsoft pushes the envelope of machine performance by loading the OS
and other software with features that expand to fill the available
machine resources. Intel designs chips to keep up with software bloat.
The net result is that the world thinks that it's normal to replace
computers every 3 or 4 years.
Because Unix has always been designed as a multiuser, multitasking
system, it deliberately optimizes the utilization of hardware
resources. Consequently, a Unix system can continue to be updated to the
latest version for twice as long as Windows on identical hardware,
cutting in half the amount of computer pollution that goes into
landfills. A further reduction in computer waste can be realized by
re-cycling Windows machines that can no longer run current Windows
releases and installing Linux on them, rather than buying new
computers. Finally, the fact that Unix lends itself to the use of thin
clients encourages further savings in energy, cooling, and computer
waste.
Installed base of software - It is still true that more molecular
biology software is developed on
Unix than any other platform, largely because for serious computing,
and in particular, for serious programming, development is easier. The
same is probably true in many other areas of science. This is a funny
irony, considering that the main reason Windows has such a strong
monopoly is its market share in office desktop software.
Much of the software being developed in bioinformatics today uses
platform-independent languages such as Perl, Python, or Java. The
problem is that none of these is a standard part of Windows; each must
be installed and configured on every Windows machine.