
So you want to be a computational biologist?

Two computational biologists give advice on starting out on computational projects.

The term ‘computational biologist’ can encompass several roles, including data analyst, data curator, database developer, statistician, mathematical modeler, bioinformatician, software developer, ontologist—and many more. What’s clear is that computers are now essential components of modern biological research, and scientists are being asked to adopt new skills in computational biology and master new terminology. Whether you’re a student, a professor, or somewhere in between, if you increasingly find that computational analysis is important to your research, follow the advice below and start along the road toward becoming a computational biologist!

Understand your goals and choose appropriate methods

Key to good computational biology is the selection and use of appropriate software. Before you can usefully interpret the output of a piece of software, you must understand what the software is doing. You wouldn’t go into the laboratory and perform a polymerase chain reaction without a basic understanding of the method. Why would you treat a computational analysis any differently? Understanding the underlying methods and algorithms gives you the tools to interpret the results. That doesn’t mean you need to read through each line of source code, but you should have a grasp of the concepts.

Software tools are often implementations of a particular algorithm that may be well-suited for particular types of data; for example, in de novo assembly, an Overlap-Layout-Consensus assembler is optimized for longer sequence reads, whereas de Bruijn graphs were designed with short reads in mind. Choosing software employing the most appropriate algorithm will save you a lot of time.
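
To make the de Bruijn idea concrete, here is a minimal sketch in Python. It is a toy illustration, not how any particular assembler is implemented: the function name `de_bruijn_graph`, the example reads and the choice of k = 3 are all assumptions made for this example. Each read is cut into overlapping k-mers, and every k-mer becomes an edge from its first k-1 bases to its last k-1 bases.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration only; real assemblers also handle
    sequencing errors, reverse complements and memory-efficient storage.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two made-up overlapping short reads.
reads = ["ATGGC", "TGGCA"]
graph = de_bruijn_graph(reads, k=3)

for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In this sketch, a k-mer seen in both reads appears twice in its node's successor list, so the overlap between the reads is found without ever comparing the reads to each other directly. An Overlap-Layout-Consensus assembler takes a different approach: it compares whole reads to find overlaps, which fits a smaller number of longer reads.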

Set traps for your own scripts and other people’s


How do you know your script, software or pipeline is working? Computers will happily output results for the most bizarre of input data, and the absence of an error message is not an indication of success. Create tests: small datasets for which the answer is known. Then check that the software or pipeline can reproduce that answer. Try to do that for every ‘type’ of answer you expect to find. Double-check the results of everything to see if those results make sense. Laboratory scientists wouldn’t dream of running experiments without the necessary positive and negative controls, and these tests are the computational biology equivalent.
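
As a rough sketch of what such controls might look like, the Python below tests a small reverse-complement function. The function `reverse_complement`, the helper `run_controls` and the test sequences are hypothetical examples chosen for illustration; the same pattern applies to any script for which you can work out a few answers by hand.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    seq = seq.upper()
    unknown = set(seq) - set(complement)
    if unknown:
        # Fail loudly rather than silently producing a plausible-looking result.
        raise ValueError(f"unexpected characters in sequence: {sorted(unknown)}")
    return "".join(complement[base] for base in reversed(seq))


def run_controls() -> None:
    # Positive controls: inputs whose correct answers were worked out by hand.
    assert reverse_complement("ATGC") == "GCAT"
    assert reverse_complement("aaccgg") == "CCGGTT"
    assert reverse_complement("") == ""

    # Sanity check: applying the function twice must give back the input.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"

    # Negative control: invalid input must raise an error, not return output.
    try:
        reverse_complement("ATGX")
    except ValueError:
        pass
    else:
        raise AssertionError("invalid input was silently accepted")


if __name__ == "__main__":
    run_controls()
    print("all controls passed")
```

Note the negative control: it checks that bad input is rejected, which guards against the case where the absence of an error message is mistaken for success.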

You’re a scientist, not a programmer

The perfect is the enemy of the good. Remember you are a scientist and the quality of your research is what is important, not how pretty your source code looks. Perfectly written, extensively documented, elegant code that gets the answer wrong is not as useful as a basic script that gets it right. Having said that, once you’re sure your core algorithm works, spend time making it elegant and documenting how to use it. Use your biological knowledge as much as possible—that’s what makes you a computational biologist.

Use version control software

Versioning will help you track changes to your code, maintain multiple versions and work collaboratively with others. If you use a standard tool, such as Git or Subversion, you will also be able to publish your code easily. Be nice to your future self. A few well-placed README files explaining the choices you made and why you made them will be a boon when you return to a project months or years later. Document your code and scripts so that you understand what they do. When you come to publish your work, try publishing the scripts and methods you used to generate your results so that others can reproduce them. Also consider keeping a digital laboratory notebook to document your analyses as you perform them. Hosting services such as GitHub are ideal for this, and they also keep copies of your repository that serve as off-site backups.

More information is available at: https://www.nature.com/articles/nbt.2740