6/13/2016

A comparison of bioinformatics programming languages |

A comparison of bioinformatics programming languages


Posted on 20 November, 2012 by Mark Christie

[Image caption: If you program enough, it can change the way you look at the world]

The times are a-changin', and most molecular ecologists and evolutionary biologists are no longer asking
themselves, "Should I learn a programming language?", but rather "Which programming language should I learn?".
There are a variety of programming languages that are used by the bioinformatics community, and the number of
bioinformatics-compatible computer languages available is on the rise. As such, it can be a little daunting to
decide which programming languages to master. From my perusing of various online forums, many professional
programmers will insist that you should pick a
programming language that works best for each particular purpose. I somewhat agree with that sentiment, but
how many languages can you realistically expect to learn? Furthermore, it is often more efficient to be an expert
in a handful of languages than to be an intermediate-level programmer in a greater number of languages. On the
flip-side, being dogmatically attached to a single language can be detrimental to productivity. From a statistical
and quantitative point of view, I prefer R because it is open source. I also like Linux as both a glue to bind
analyses and for quick data management tasks. But what language should you use for all those other
bioinformatics-type tasks that you need to accomplish (e.g., filtering reads, mapping reads, parsing BLAST files,
identifying SNPs)?
A paper by Fourment and Gillings provides a nice comparison of languages commonly used in bioinformatics. In
this paper, the programming languages are divided into scripting languages (Perl and Python), semi-compiled
languages (Java and C#), and fully compiled languages (C and C++). Perl and Python programs are (typically)
compiled each time before they run and they are often not compiled to the same extent as C and C++ (but see
PyPy for Python). This means that C and C++ typically run faster and require less memory after a program has
been completed. Like most things in life, however, there is a tradeoff in that C and C++ programs usually require
more lines of code because there are more details that have to be specified in each program. Thus there is a
tradeoff between time spent developing, writing, and debugging code and the time that the program takes to run
through completion. This tradeoff is nicely illustrated in Figures 1 and 5 from the paper.

Figures 1 and 5 from Fourment and Gillings, which illustrate the tradeoff between lines of code written and the
speed at which a global alignment program runs to completion. Notice that the compiled and semi-compiled
languages run much faster, but can take more lines of code to write. The semi-compiled languages (C# and Java)
do not necessarily take more lines (though see notes on Perl below).

I would wager that there are a number of Perl gurus who could substantially reduce the number of lines of code in
the Perl program depicted in the figure above. However, the authors of the paper understandably wanted the
programs to be readable and easily documentable. This is, in fact, a common complaint about Perl: it can be
unreadable, and a nightmare for anyone but the original programmer to comprehend. Below are four lines of Perl
that have been purposefully obfuscated, but which illustrate the need to program carefully in Perl.

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
These four lines actually represent a fairly sophisticated program, but they are difficult to decipher. Many Perl users
answer this concern by rightly pointing out that it is up to the programmer to provide clear, concise code and
appropriate comments and documentation. I have spent the last two years learning and working with Perl and
when I first started I was guilty of creating strange-looking code. If I went back to a program after a few months,
it could take me quite a long time to figure out what I had written. When working with Perl now, I comment on
almost every line and write detailed comments before the script. For some reason, this process of reflection helps
me write cleaner, more concise code.
Choosing a programming language is also simply a matter of personal taste. Personally, I neither like nor dislike
programming in Perl; I am somewhat ambivalent about it. However, I really love programming in R. I love the
structure, the syntax, the clever-but-simple ways to optimize code, etc. Recently, I have begun using Python and
have found it to be similar, in many respects, to programming in R. I find that I am now using Python more often
than Perl for the simple reason that I find it to be a more enjoyable experience. Unfortunately, the only way to
figure this out is to spend time working with both languages. Perl and Python both have bioinformatics libraries
ready for use so that you don't have to reinvent the wheel: BioPerl and Biopython.
And while I am on the subject of reinventing the wheel: despite what everyone will tell you, it can be a good
thing to occasionally reinvent the wheel when it comes to becoming proficient with a programming language.
Obviously, once you have become an advanced programmer it is a waste of time to recreate well-designed code,
but you are only going to become an expert by starting with simple programs and building up from there. Why
not create a script to filter your Illumina reads? Sure, there are hundreds of them out there, but you may not
understand how to create more sophisticated scripts until you give it a shot yourself.
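As one possible starting point for the read-filtering exercise suggested above, here is a minimal Python sketch. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names (read_fastq, mean_quality, filter_reads) and the default mean-quality threshold of 20 are illustrative choices, not taken from any existing tool.

```python
import io
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence, quality) tuples.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line also stops the reader)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header, seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)


def filter_reads(
    handle: TextIO, min_mean_q: float = 20.0, min_length: int = 1
) -> Iterator[Record]:
    """Keep reads that are long enough and whose mean quality is high enough."""
    for header, seq, qual in read_fastq(handle):
        if len(seq) >= min_length and mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    # Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
    demo = io.StringIO(
        "@read1\nACGT\n+\nIIII\n"
        "@read2\nACGT\n+\n!!!!\n"
    )
    for header, seq, qual in filter_reads(demo):
        print(header, seq)
```

With the two made-up reads in the demo, only read1 passes the filter. For real data you would pass an open file handle instead of the in-memory string, and a mature library such as Biopython would handle the edge cases this sketch ignores.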
P.S. I have not used C or Java enough to comment on them. If you have used these (or other) programming
languages, please add your experiences with them to the comments. Do you enjoy using them? Why
did you pick them?
References:
Fourment, M. and Gillings, M.R. 2008. A comparison of common programming languages used in bioinformatics.
BMC Bioinformatics 9: 82.
See also:
Dudley, J.T. and Butte, A.J. 2009. A quick guide for developing effective bioinformatics programming skills. PLoS
Computational Biology 5(12).

About Mark Christie


Mark Christie is an assistant professor in the Department of Biological Sciences and Department of Forestry & Natural
Resources at Purdue University.

This entry was posted in bioinformatics, next generation sequencing, software.

10 Comments

The Molecular Ecologist


Sean Hoban

4 years ago

I use Java, and highly recommend it. There are a variety of approaches to creating loops and conditional statements, it is relatively
straightforward to read and write files, and very easy to create and manage arrays. Another important advantage is there are a huge
number of examples and code snippets scattered across the internet for beginners and advanced users alike. Programming for either
command line executables or GUI interfaces is also easy and intuitive. I think Java has also been designed to be very good at
preventing run-time errors. I tried picking up C and Java simultaneously and found Java much easier than C to lay out a program. I
use R sometimes, but I mostly use Java because it's easier to interact with files and write much more complex code. Well, those are
my thoughts. I do hope to get into Python sometime soon.

Tim Vines

4 years ago

Does Mathematica count as a language or a program? The lab I did my PhD in was all M'ca, all the time. As a low-level user, I found it had a
much shallower learning curve than R, especially when it came to manipulating lists (R is a complete jerk with lists).

Mark Christie

4 years ago

I still haven't completely mastered the power of lists in R. My understanding is that they hold many different data types (data.frames,
vectors, matrices etc) and so would be useful in large projects. Any avid R list users care to comment?


unionx

3 years ago

Java is good, but I don't recommend it for bioinformatics tasks. JVM takes a long time to start, and numerical computation in Java is
not as good as in Python or R.

Mark Christie > unionx

3 years ago

Although I do not use Java myself, one interesting thing I have noticed is that the people running it on our cluster are almost
always (1) running it in parallel and (2) using very little computational resources. My guess would be that initial
development in R or Python is a good idea, but that moving the code over to Java or C might be worthwhile when you start scaling
up your applications.

unionx > Mark Christie

3 years ago

I know some bioinfo guys who just write batch scripts to do some calculation. I am not sure whether they need to build
online service. Yes, Java is very good for online service, and I use Clojure for that.



Jon Puritz

3 years ago

I still rely on others to actually write the heavy duty analysis code, but I find bash incredibly easy and useful for analysis pipelines. I
highly recommend that every bioinformatician be familiar with what bash and baseline unix commands can do for data
manipulation.

Eric Thomas

3 years ago

I used to wave the Java and C++ flags high, but with solid libraries like Biopython and SciPy it's hard to justify the time you would
need to replicate a lot of this in Java. Python is just quick and can handle most things you need. The in-house GUI (tkinter) doesn't
have as much going for it as Java's, but it usually more than fills the needs of a basic program. After doing this for about a year and a half, I
have all but fully converted to Python.


Matt

2 years ago

For use-once research code that filters/formats data, anyone using C or Java doesn't value their own time and likely just doesn't know
any other languages. I know all of the languages mentioned in this article, with the exception of C#, which I have only played with
because portability matters to me. Each has its place, but for day-to-day data munging only Python and Perl are viable options,
with Perl genuinely nicer in syntax for shell-script activities (e.g. no significant white space in if statements). For stats and plots, R or
Python with statsmodels and matplotlib are both great. Personally, I try to stick more with Python because it's just a lot less clunky in
syntax IMHO and more useful to know if you ever decide to leave science. If you wish to write a large-scale application that others will
contribute to and that is highly algorithmic rather than focused on data processing, Java or Python are equally viable. If you want something that's
solving some serious problems, maybe in a large combinatorial space, you want to be using C with MPI or OpenMP at a minimum; if
you didn't already know this, then you aren't solving really serious combinatorial problems! Most of us aren't, unless we are dealing
with short-read assembly or phylogenetic tree search. The most valuable thing is your own time; line count isn't a great measure of
productivity. In Java you copy and paste the same 50-100 lines every time, so once you have some of your own libs written it's not too bad.
Ultimately, for really simple stuff Perl cannot be beaten, since you can write it inline in the shell: perl -lane 'print $F[2]*$F[4]' <
input.tsv. For this sort of task (pulling out the third and fifth columns of a table and multiplying them, or running any other function, such as a
sequence comparison), Perl mastery cannot be beaten. Converting a history of one-liners like this into a full script doesn't take much effort
at a later date either. The real issue is that I haven't met a single person in bioinfo who wasn't comp-sci trained get beyond 'intermediate' in
any single language. The OP suggests you need a lot of time to learn many languages. That isn't true: after a deep understanding of two
languages with varied syntax, it gets very easy to learn more. If you know only one language (other than perhaps C), it's not possible to be a
true master, since you lack an understanding of how things might be working underneath the high-level syntax. Like Java if you don't
see more

Edward Kirton

2 years ago

I've been working in bioinformatics for over a decade and have used a dozen languages over the years, including the ones discussed
above. The bread-and-butter coding of bioinformaticians is writing scripts which wrap powerful third-party programs and
manipulate files, often to create a pipeline (usually on the cluster). For this, the best are Perl, Python, bash, maybe C#. Each has
pros/cons. Start with one of these and learn good coding practices (e.g. use of repositories like Git, good documentation habits,
test-driven development, agile project management, etc.). Which to learn? I recommend you use whichever one you can get good
coaching on. Do you have someone at work who is willing to answer questions, do code reviews, and do paired programming with
you?


http://www.molecularecologist.com/2012/11/a-comparison-of-bioinformatics-programming-languages/
