SAM is a new file format for representing genomic sequence alignments. There are APIs available for Java, Perl, and Python, but none for Ruby as far as I know, which is a shame when you want to manipulate SAM files in a Rakefile. But there is a little hackette we can pull. First, install RJB (the Ruby Java Bridge). Then we can do interesting things with the Picard Java API.
This routine will load a BAM file (i.e. a binary SAM file), excise the sequences overlapping the regions listed in the file “regions” (one “chr start end” per line), and print them out in FASTQ format:
require 'rjb'
#Set up the JVM (arguments: classpath, then a list of JVM options):
Rjb::load('.:./sam-jdk-1.0.3.jar', [])
bam = "alignment.bam"
bai = bam+".bai"
#Some ugly bridging code:
file = Rjb::import('java.io.File')
samfilereader = Rjb::import('net.sf.samtools.SAMFileReader')
#most BAM files I've seen have errors, so let's be lenient:
samfilereader.setDefaultValidationStringency(
  Rjb::import('net.sf.samtools.SAMFileReader$ValidationStringency').LENIENT
)
#Instantiate the java object:
sam = samfilereader.new_with_sig(
  'Ljava.io.File;Ljava.io.File;', file.new(bam), file.new(bai)
)
#From here on it's plain sailing
File.foreach("regions") do |line|
  l = line.split
  next if l.length < 3 #skip blank or malformed lines
  #we can use ruby ints and strings in java methods:
  overlapping = sam.queryOverlapping(l[0], l[1].to_i, l[2].to_i)
  while overlapping.hasNext
    r = overlapping.next
    #pull out what we need from the SAMRecord object:
    puts "@" + r.getReadName
    puts r.getReadString
    puts "+"
    puts r.getBaseQualityString
  end
  overlapping.close
end
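The region parsing and FASTQ formatting in the loop above can be tried without a JVM. Here is a minimal pure-Ruby sketch; `parse_region` and `fastq_record` are hypothetical helper names (not part of RJB or Picard), and the sample read is made up:

```ruby
# Hypothetical helpers (not part of RJB or Picard) mirroring the loop above.

# Parse one "chr start end" line into [name, start, end]; nil if malformed.
def parse_region(line)
  chr, start, stop = line.split
  return nil if stop.nil?
  [chr, start.to_i, stop.to_i]
end

# Build a four-line FASTQ record from a read name, bases and quality string.
def fastq_record(name, bases, quals)
  "@#{name}\n#{bases}\n+\n#{quals}\n"
end

region = parse_region("chr1 100 200")
record = fastq_record("read1", "ACGT", "IIII")
print record
```

In the real script the three strings would come from `getReadName`, `getReadString` and `getBaseQualityString` on each SAMRecord.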
I was pleasantly surprised that once the SAMFileReader object is created, using the API is quite nice! Hope this is useful to someone using SAM in Ruby.
Db4o is a native Java/C# database framework. In a db4o database, you can easily
store and retrieve Java objects. This all seems rather nice, but when using db4o
with Scala, there is a snag.
Db4o allows easy construction of queries by implementing their Predicate interface. An example, from their tutorial:
List<Pilot> pilots = db.query(new Predicate<Pilot>() {
    public boolean match(Pilot pilot) {
        return pilot.getPoints() == 100;
    }
});
The idea is that, rather than running every object in the database past the
filter .getPoints()==100, db4o parses the bytecode and looks for
objects that match using the database index. However, this doesn’t seem to work for Scala bytecode.
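As a toy illustration of the difference between filtering every object and looking matches up through an index, here is a short Ruby sketch. It is not how db4o is implemented; the `Pilot` struct, the sample pilots and the hash-based index are all made up for the example:

```ruby
# Toy illustration only; this is not db4o's implementation.
# Pilot and the sample data are made up.
Pilot = Struct.new(:name, :points)
pilots = [Pilot.new("pilot A", 100), Pilot.new("pilot B", 99), Pilot.new("pilot C", 100)]

# Unoptimised: run every stored object past the predicate.
scanned = pilots.select { |p| p.points == 100 }

# Optimised: build an index on points once, then look matches up directly.
points_index = pilots.group_by(&:points)
indexed = points_index.fetch(100, [])

puts scanned == indexed
```

Both approaches return the same pilots; the point of the index is that the lookup does not have to touch the objects that cannot match.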
Maybe at some point this will be fixed. But for now, Scala’s syntax comes to the rescue. Db4o converts Predicate implementations into something called a SODA query. The code for constructing a SODA query from scratch in Java looks like:
Query query=db.query();
query.constrain(Pilot.class);
Query pointQuery=query.descend("points");
query.descend("name").constrain("Rubens Barrichello")
.or(pointQuery.constrain(new Integer(99)).greater()
.and(pointQuery.constrain(new Integer(199)).smaller()));
ObjectSet result=query.execute();
listResult(result);
(Again, this is from the tutorial).
Pretty ugly, huh? However, a Scala version is not too bad:
val query = db.query()
query constrain classOf[Pilot]
// group the two points constraints explicitly, as in the Java version
query descend "name" constrain "Rubens Barrichello" or (
  (query descend "points" constrain 99 greater) and
  (query descend "points" constrain 199 smaller)
)
val result = query.execute()
listResult(result)
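From a few minutes' play, it seems that things work like they should: you can use parentheses to surround subqueries and connect them with and or or. Without parentheses, Scala chains infix calls like these from left to right, so it is worth grouping explicitly.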
For example:
(
(query descend "name" constrain "Rubens Barrichello") or
(query descend "points" constrain 99 greater)
) and
(query descend "points" constrain 199 smaller)
means (“Rubens Barrichello” OR > 99 points) AND < 199 points, while:
query descend "name" constrain "Rubens Barrichello" or
(
(query descend "points" constrain 99 greater) and
(query descend "points" constrain 199 smaller)
)
means “Rubens Barrichello” OR (> 99 points AND < 199 points).
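To see that the two groupings really select different pilots, here is a quick check of the plain boolean logic, written in Ruby and not touching db4o at all. The `Pilot` struct and the 250-point sample are made up for illustration:

```ruby
# Plain boolean check of the two groupings; no db4o involved.
# Pilot and the sample data are made up for illustration.
Pilot = Struct.new(:name, :points)

# ("Rubens Barrichello" OR > 99 points) AND < 199 points
def first_grouping(p)
  (p.name == "Rubens Barrichello" || p.points > 99) && p.points < 199
end

# "Rubens Barrichello" OR (> 99 points AND < 199 points)
def second_grouping(p)
  p.name == "Rubens Barrichello" || (p.points > 99 && p.points < 199)
end

rubens = Pilot.new("Rubens Barrichello", 250)
puts first_grouping(rubens)   # false: 250 is not below 199
puts second_grouping(rubens)  # true: the name alone is enough
```

A pilot with 199 or more points and the right name is rejected by the first grouping but accepted by the second, so the placement of the parentheses matters.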
This actually isn’t too bad to work with, after all!