Paul Joseph Davis

Broken Bioinformatics Formats. JSON FTW.
========================================

As a recently anointed bioinformaticist I have to come out and say it.
Bioinformatics has some fucking shit file formats. As an entire discipline we
need to think about the future and move away from the horrible state in which
data is regularly produced and consumed by software. My vote is for JSON.
Hopefully I can convince a few brave souls to jump on board.

Broken file formats
-------------------

Current formats are broken. If you don't believe me, first go write a GenBank
[1] parser and then read the three specifications of the General Feature Format
(GFF [2]). And those are just the tip of the iceberg. If that hasn't irritated
the crap out of you, write a Blast [3] result parser. I'll even be nice and
suggest the XML output. By now I should have the majority convinced. And if
not, well then maybe I should think about a different career before I commit
seppuku [4].

Badly designed formats (Now, not then)
--------------------------------------

Let's face it. Each of these formats was poorly designed. And I say this with
all possible respect, but each of these formats was designed with human
consumption in mind. And that's just wrong. What's that you say? These formats
were designed years ago when they were viewed directly by humans? I don't
fucking care. This is the world of today, not 10 or 15 years ago. Scientists
don't normally view results as raw text any more. They view them as markup or
as graphical displays in some desktop application.

Recent pushes towards XML
-------------------------

I've noticed that EBI appears to be doing a lot of work on distributing XML [5].
XML sucks. I used to like XML. Now it just irritates the crap out of me. It all
started when I tried researching DTD [6], RelaxNG [7], and XSD [8]. Go read up
on those a bit. They'll make you want to punch something. And it's not that
they're intrinsically bad at what they do. It's that what they do is
intrinsically bad (for applications in biology, more on that in another post).

Fact is though, XML has failed to take hold. It's overly complicated to deal
with for the many small tasks that are common for bioinformaticists. We live in
a land of one-off scripts for testing random ideas. Quick development to test
approaches etc. Sitting down and writing a decent parser for each highly
repetitive yet slightly different task gets old quick.

XML is just too heavy for bioinformatics.

Libraries! Use the libraries!
-----------------------------

This is a cop-out. Why not hide the ugliness of the data behind a nice library
interface? I'm a purist, and throwing the bathroom mat over the puddle of vomit
doesn't change the fact you had one or ten too many tequila shots. This only
works as long as the libraries exist. And these huge lumbering libraries just
don't keep up with every language. I mean, how many BioErlang or BioLisp (I
should probably google those, but let's just assume they don't exist for now...)
libraries have you evaluated lately?

Think how much better our libraries would be if we didn't have to deal with
these horrible data formats. Think about how a standardized data format would
allow any tool in any language to communicate without forcing data through
some obtuse, outdated format.

JSON - No seriously. JSON
--------------------------

I used to think JSON [9] was a weird little cousin of the other markup
languages. Turns out, it is the weird little cousin of the markup languages but
in a slightly less deformed way. It's a dead simple format that has bindings in
practically every language. It's a public specification, so for those languages
that don't have bindings, writing new ones would be fairly straightforward.
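
To show how little ceremony is involved, here's a minimal sketch in Python of
what a single sequence feature might look like as JSON. The field names are my
own invention, loosely modeled on GFF columns, and not any agreed standard. The
only thing used is the json module from the standard library.

```python
import json

# A hypothetical feature record. The keys are illustrative only,
# loosely modeled on GFF columns; no standard schema is implied.
feature = {
    "seqid": "chr1",
    "source": "example",
    "type": "gene",
    "start": 1000,
    "end": 2500,
    "strand": "+",
    "attributes": {"ID": "gene0001", "Name": "hypothetical_gene"},
}

# Serialize to text, then parse it straight back. No custom parser needed.
text = json.dumps(feature, sort_keys=True)
parsed = json.loads(text)

# Fields come back as native types, so coordinates are already integers.
length = parsed["end"] - parsed["start"] + 1
```

The round trip is the whole point: the same two calls exist, under slightly
different names, in pretty much any language you'd reach for.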

Seriously. Those of you feeling underwhelmed think a bit harder on it. Go toy
with it in your language of choice. Keep in mind all the brilliant possibilities
that having a common file format would open up. Imagine how easy it'd be to
design large customizable pipelines for shuffling JSON documents around.
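
As a rough idea of what such a pipeline could look like, here's a sketch in
Python. It assumes one JSON document per line, which is a convention I'm
picking for the example and not something mandated by JSON itself. The record
fields and the function names (read_records, keep_type, add_length,
run_pipeline) are all hypothetical.

```python
import json

def read_records(lines):
    """Parse one JSON document per non-blank line (an assumed convention)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def keep_type(records, wanted):
    """Pass through only records whose hypothetical "type" field matches."""
    for rec in records:
        if rec.get("type") == wanted:
            yield rec

def add_length(records):
    """Annotate each record with a length computed from inclusive coordinates."""
    for rec in records:
        out = dict(rec)  # copy so the input record is left untouched
        out["length"] = rec["end"] - rec["start"] + 1
        yield out

def run_pipeline(lines, wanted="gene"):
    """Chain the stages and serialize each surviving record back to JSON."""
    stage = add_length(keep_type(read_records(lines), wanted))
    return [json.dumps(rec, sort_keys=True) for rec in stage]

sample = [
    '{"type": "gene", "start": 10, "end": 19}',
    '{"type": "exon", "start": 10, "end": 12}',
    '',
    '{"type": "gene", "start": 100, "end": 149}',
]
result = run_pipeline(sample)
```

In a real script you'd probably hand sys.stdin to run_pipeline and print each
line of the result, so stages written in different languages can be chained
with ordinary shell pipes.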

Obviously I'm no dreamer. Getting JSON to actually take hold would be incredibly
difficult. Official committees and standards bodies would be involved. But I
think if we poke around at the idea and start with a few people designing
JSON-dependent libraries and tools, we can spread JSON like a virus through the
bioinformatics world.

References
----------

[1]:  ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
[2]:  http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
[3]:  http://blast.ncbi.nlm.nih.gov/Blast.cgi
[4]:  http://en.wikipedia.org/wiki/Seppuku
[5]:  http://en.wikipedia.org/wiki/XML
[6]:  http://en.wikipedia.org/wiki/Document_Type_Definition
[7]:  http://relaxng.org/
[8]:  http://www.w3.org/XML/Schema
[9]:  http://www.json.org/




Copyright Notice
----------------

Copyright 2008-2010 Paul Joseph Davis

License
-------

http://creativecommons.org/licenses/by/3.0/