Logspace

Using RJB to manipulate SAM files in Ruby

SAM is a new file format for representing genomic sequence alignments. There are apis available for Java, Perl, and Python. But none for Ruby as far as I know, which is a shame when you want to manipulate SAM files in a rakefile. But – there is a little hackette we can pull. First, install RJB (the Ruby Java Bridge). Then we can do interesting things with the Picard Java API:

This routine will load a bam file (i.e. a binary SAM file), excise the sequences overlapping the regions listed in file “regions” (as “chr start end”), and print them out in fastq format.

require 'rjb'
#Setup JVM:
Rjb::load(classpath = '.:./sam-jdk-1.0.3.jar', jvmargs=[])

bam = "alignment.bam"
bai = bam+".bai"

#Some ugly bridging code:
file = Rjb::import('java.io.File')
samfilereader = Rjb::import('net.sf.samtools.SAMFileReader')

#most BAM files I've seen have errors, so let's be lenient:
samfilereader.setDefaultValidationStringency(
   Rjb::import('net.sf.samtools.SAMFileReader$ValidationStringency').LENIENT
)

#Instantiate the java object:
sam=samfilereader.new_with_sig(
 'Ljava.io.File;Ljava.io.File;',file.new(bam),file.new(bai)
)

#From here on it's plain sailing
File.open("regions").each_line do |line|
        l = line.split
        #we can use ruby ints and strings in java methods:
        overlapping = sam.queryOverlapping(l[0],l[1].to_i,l[2].to_i)

        while (overlapping.hasNext)
                r = overlapping.next
                #pull out what we need from the SAMRecord object:
                puts "@"+r.getReadName 
                puts r.getReadString
                puts "+" 
                puts r.getBaseQualityString
        end
        overlapping.close
end

I was pleasantly suprised that once the SAMFileReader object is created, using the API is quite nice! Hope this is useful to someone using SAM in Ruby.

03 Sep 2009
blog comments powered by Disqus