Note: I should have written this ages ago, but only now has the cold weather caught up with my nocturnal habits.
We organised the first Keystone summer school over four days in July 2015 at the University of Malta. The summer school was titled Keyword Search over Big Data, and I was asked to help organise the one-and-a-half-day big data hackathon. With my bioinformatics hat on, it was easy to fish for “big data” for the event.
The hackathon consisted of six questions related to search in Big Data. The questions were set in the molecular biology/bioinformatics domain and involved searching for specific genes and DNA/protein sequences across the human genome. The students were given the human genome (~3.2GB) and an annotation text file (~1.2GB); both are supplied below. As the designated “expert in the field”, I gave the students a one-hour introductory lecture on the area (I love the looks on computer scientists' faces when they are shown the wonderful/“exceptional” world of molecular biology).
The files required for the data hackathon are (if you use these tasks, please make sure to acknowledge properly):
The protein sequence file (required for task 6)
We asked participants to form groups of two or three. In total, nine groups took part in the data hackathon. The participants were free to choose the operating system, programming language and techniques used to implement their solutions. This flexibility was possible thanks to the availability (and sponsorship) of Microsoft’s Azure platform, which also enabled students to try distributed solutions.
At the end of the hackathon, the participants were asked to demonstrate their solutions to the five judges and to deliver a five-minute lightning presentation in front of the other teams, describing the techniques they employed. The students were given a test suite against which their solutions were judged.
A number of interesting techniques were used to solve the problems. These included breaking the genome file down into a number of smaller files to speed up the search, and using compression techniques to reduce the size of the genome data file. Some students also tried using Lucene for indexing and searching.
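To give a flavour of the "break it into smaller pieces" idea, here is a minimal Python sketch. It is my own illustration, not any team's actual solution: the function names, the in-memory string standing in for a chromosome, and the default chunk size are all assumptions. The point it shows is that consecutive chunks need to overlap by the pattern length minus one, otherwise a match that straddles a chunk boundary is lost.

```python
from typing import Iterator, List, Tuple


def chunk_with_overlap(sequence: str, chunk_size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    for offset in range(0, len(sequence), step):
        yield offset, sequence[offset:offset + chunk_size]
        if offset + chunk_size >= len(sequence):
            break  # this chunk already reaches the end of the sequence


def find_in_chunk(chunk: str, pattern: str) -> List[int]:
    """Return every start index of `pattern` in `chunk`, overlapping matches included."""
    hits = []
    start = chunk.find(pattern)
    while start != -1:
        hits.append(start)
        start = chunk.find(pattern, start + 1)
    return hits


def search_chunked(sequence: str, pattern: str, chunk_size: int = 1_000_000) -> List[int]:
    """Search chunk by chunk and report match positions relative to the full sequence."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    # An overlap of len(pattern) - 1 is just enough to catch boundary-spanning
    # matches, while being too short for any match to be reported twice.
    overlap = len(pattern) - 1
    positions = []
    for offset, chunk in chunk_with_overlap(sequence, chunk_size, overlap):
        positions.extend(offset + hit for hit in find_in_chunk(chunk, pattern))
    return positions


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTACGAACGT"  # hypothetical toy data, not the real genome
    print(search_chunked(toy_genome, "ACGT", chunk_size=5))
```

Since each chunk is searched independently, the inner loop is the part one could hand out to several processes or machines; the sketch keeps it sequential for clarity.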
Awards went to the three best-performing teams: each member of the first-placed team received 100 euros, each member of the second 70 euros and each member of the third 40 euros. The prizes were sponsored by one of our local industrial partners, Altaro Software.