[gt-users] New feature announcement
Gordon Gremme
gremme at gmail.com
Tue Jan 27 03:13:56 CET 2009
> i'm still foggy at best on understanding the Stream or vistor stuff.
Let me try to shed some light on the subject ;-)
Parts of the following explanation are taken from another email,
therefore it also explains things you probably know already.
GenomeTools uses the GenomeNode interface and most importantly its
FeatureNode implementation to represent all kinds of genome annotations.
A FeatureNode is basically a directed acyclic graph (DAG) in which each
node represents an annotated genomic region (e.g. an exon from position
20 to 30) and each edge represents a part-of relationship (an exon is
part-of a gene). The nice thing about this data structure is its
versatility. You can create it automatically from GFF3 files (via a
GFF3InStream) or manually in a script (as is done in some of the
AnnotationSketch examples). Once you have all your GenomeNodes you can
easily store them in a GFF3 file with the GFF3OutStream.
The other implementations (RegionNode, SequenceNode, CommentNode) are
mostly used to represent other parts of GFF3 files (sequence-region
lines, embedded Fasta files, and comment lines) and are usually not
used if one constructs annotations manually (FeatureNodes suffice for
this).
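To make the part-of structure a bit more concrete, here is a minimal toy
sketch in plain C. It is not the GenomeTools API: the ToyFeature type and
the toy_feature_* functions are made-up names for illustration only, and
the sketch does not release memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a feature node; NOT the GenomeTools FeatureNode type. */
typedef struct ToyFeature ToyFeature;
struct ToyFeature {
  const char *type;          /* e.g. "gene" or "exon" */
  unsigned long start, end;  /* annotated region, inclusive */
  ToyFeature **children;     /* part-of edges: each child is part of this node */
  size_t num_children;
};

/* Allocate a node without children; the caller owns the result. */
ToyFeature* toy_feature_new(const char *type, unsigned long start,
                            unsigned long end)
{
  ToyFeature *f = malloc(sizeof *f);
  assert(f && start <= end);
  f->type = type;
  f->start = start;
  f->end = end;
  f->children = NULL;
  f->num_children = 0;
  return f;
}

/* Add a part-of edge: <child> is part of <parent>. */
void toy_feature_add_child(ToyFeature *parent, ToyFeature *child)
{
  ToyFeature **tmp = realloc(parent->children,
                             (parent->num_children + 1) * sizeof *tmp);
  assert(tmp);
  parent->children = tmp;
  parent->children[parent->num_children++] = child;
}

/* Count the nodes of the given type reachable from <root>.
   A node with several parents is counted once per path leading to it. */
size_t toy_feature_count(const ToyFeature *root, const char *type)
{
  size_t i, count = strcmp(root->type, type) == 0 ? 1 : 0;
  for (i = 0; i < root->num_children; i++)
    count += toy_feature_count(root->children[i], type);
  return count;
}
```

Building a gene from 1 to 100 and adding two exons as children gives a
small graph in which toy_feature_count(gene, "exon") returns 2.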
To process annotations, e.g. to retrieve all exons which are below a
certain length, two basic approaches exist: sequential processing via
NodeStreams or random access via a FeatureIndex.
Sequentially via NodeStreams is the approach most GFF3-related tools
contained in GenomeTools take. They implement the NodeStream interface,
which makes it easy to plug together modules that transform the
FeatureNodes, both on the C code level and on the shell level.
Example: Our FilterStream filters FeatureNodes according to
different criteria.
On the C code level we create three streams and plug them together:
The GFF3InStream reads from a set of GFF3 files and returns
GenomeNodes, the FilterStream takes any NodeStream and filters the
nodes in accordance with its settings, and the GFF3OutStream takes any
NodeStream and writes its content as GFF3 output.
At the end we pull nodes through the GFF3OutStream, which in turn asks
its predecessor (which asks its predecessor, and so forth) for new
GenomeNodes until they are exhausted.
  /* create a gff3 input stream */
  gff3_in_stream = gt_gff3_in_stream_new_unsorted(argc - parsed_args,
                                                  argv + parsed_args);
  /* create a filter stream */
  filter_stream = gt_filter_stream_new(gff3_in_stream, arguments->seqid, ...);
  /* create a gff3 output stream */
  gff3_out_stream = gt_gff3_out_stream_new(arguments->targetbest
                                           ? targetbest_filter_stream
                                           : filter_stream,
                                           arguments->outfp);
  /* pull the features through the stream and free them afterwards */
  while (!(had_err = gt_node_stream_next(gff3_out_stream, &gn, err)) &&
         gn) {
    gt_genome_node_delete(gn);
  }
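If the pulling mechanism is still foggy, the following self-contained toy
sketch may help. It is not the NodeStream interface from GenomeTools: the
ToyStream type, its fields and the toy_* functions are hypothetical names
I made up here. It only shows the principle that each stream asks its
predecessor for the next node, with a filter that keeps nodes up to a
maximum length.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node and stream types; NOT the GenomeTools NodeStream interface. */
typedef struct {
  unsigned long start, end;  /* inclusive range */
} ToyNode;

typedef struct ToyStream ToyStream;
struct ToyStream {
  /* Return the next node, or NULL when the stream is exhausted. */
  const ToyNode* (*next)(ToyStream *self);
  ToyStream *in;             /* predecessor, NULL for a source */
  const ToyNode *nodes;      /* used by the array source only */
  size_t num_nodes, pos;     /* used by the array source only */
  unsigned long max_length;  /* used by the length filter only */
};

/* Source: hand out the array elements one after another. */
static const ToyNode* array_next(ToyStream *self)
{
  return self->pos < self->num_nodes ? &self->nodes[self->pos++] : NULL;
}

/* Filter: pull from the predecessor until a short enough node shows up. */
static const ToyNode* filter_next(ToyStream *self)
{
  const ToyNode *n;
  while ((n = self->in->next(self->in)) != NULL) {
    if (n->end - n->start + 1 <= self->max_length)
      return n;
  }
  return NULL;
}

ToyStream toy_array_stream(const ToyNode *nodes, size_t num_nodes)
{
  ToyStream s = { array_next, NULL, nodes, num_nodes, 0, 0 };
  return s;
}

ToyStream toy_length_filter(ToyStream *in, unsigned long max_length)
{
  ToyStream s = { filter_next, in, NULL, 0, 0, max_length };
  assert(in);
  return s;
}

/* Pull all nodes through <last> and return how many came out. */
size_t toy_stream_drain(ToyStream *last)
{
  size_t count = 0;
  while (last->next(last) != NULL)
    count++;
  return count;
}
```

Note that nothing happens when the streams are created. Only the drain
loop at the end of the chain makes nodes flow, one at a time.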
This approach is very memory efficient, because you do not have to
read the sometimes rather large annotation files all at once. But,
since the streams are sequential, you have to read the whole file every
time you process it.
The sequential approach also makes it easy to combine tools on the
shell level. Example:
  gth -gff3out -skipalignmentout ... | gt gff3 -sort - |
    gt filter -seqid chr21 -overlap 1200 2000 | gt sketch test.png -
If you need random access or multiple queries to the same annotation
set (as we did for AnnotationSketch, which performs multiple range
queries), the FeatureIndex interface is probably the place to start.
Our current implementation stores all the features in main memory and
allows only simple queries, but more sophisticated indexing and query
strategies could be implemented in another FeatureIndex implementation.
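As a rough idea of what a simple in-memory range query looks like, here
is a toy sketch. It is not the FeatureIndex interface: ToyIndex, ToyRange
and toy_index_query are hypothetical names, and the linear scan is just
the simplest strategy I can show in a few lines, not a description of how
our implementation works internally.

```c
#include <assert.h>
#include <stddef.h>

/* Toy types; NOT the GenomeTools FeatureIndex interface. */
typedef struct {
  unsigned long start, end;  /* inclusive range */
} ToyRange;

typedef struct {
  const ToyRange *features;  /* all features, kept in main memory */
  size_t num_features;
} ToyIndex;

/* Two inclusive ranges overlap if each one starts before the other ends. */
static int toy_range_overlap(const ToyRange *a, const ToyRange *b)
{
  return a->start <= b->end && b->start <= a->end;
}

/* Store pointers to the features overlapping <query> in <hits> (at most
   <max_hits>) and return the number of hits stored. Linear scan. */
size_t toy_index_query(const ToyIndex *idx, const ToyRange *query,
                       const ToyRange **hits, size_t max_hits)
{
  size_t i, n = 0;
  assert(idx && query && hits);
  for (i = 0; i < idx->num_features && n < max_hits; i++) {
    if (toy_range_overlap(&idx->features[i], query))
      hits[n++] = &idx->features[i];
  }
  return n;
}
```

The point is that the features stay in memory, so the same index can
answer many different range queries without reading any file again.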
So now we have covered the GenomeNode interface (to represent annotations),
the NodeStream interface (to process annotations sequentially), and
the FeatureIndex interface (to process annotations with random
access).
The missing NodeVisitor serves mainly software engineering purposes.
It allows you to process different implementations of the GenomeNode
interface without excessive downcasting. I can't describe visitors
better than the Design Patterns book does
(http://en.wikipedia.org/wiki/Design_Patterns). See also
http://en.wikipedia.org/wiki/Visitor_pattern, but I find the
explanation in the book better.
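For what it's worth, here is a minimal toy sketch of the visitor idea in
plain C. It is not the NodeVisitor interface from GenomeTools: all Toy*
types and toy_* functions are hypothetical names for illustration. Each
node implementation knows its own type and calls the matching visitor
callback, so the code that walks over the nodes never has to downcast.

```c
#include <assert.h>
#include <stddef.h>

/* Toy types; NOT the GenomeTools GenomeNode/NodeVisitor interfaces. */
typedef struct ToyVisitor ToyVisitor;
typedef struct ToyGenomeNode ToyGenomeNode;

struct ToyGenomeNode {
  /* Each implementation calls the visitor callback for its own type. */
  void (*accept)(ToyGenomeNode *self, ToyVisitor *v);
};

/* Two implementations; the base must be the first member. */
typedef struct { ToyGenomeNode base; unsigned long start, end; } ToyFeatureNode;
typedef struct { ToyGenomeNode base; const char *text; } ToyCommentNode;

struct ToyVisitor {
  void (*visit_feature)(ToyVisitor *v, ToyFeatureNode *fn);
  void (*visit_comment)(ToyVisitor *v, ToyCommentNode *cn);
  unsigned long features, comments;  /* state of the counting visitor */
};

/* The only casts live here, once per node implementation. */
static void feature_accept(ToyGenomeNode *self, ToyVisitor *v)
{
  if (v->visit_feature)
    v->visit_feature(v, (ToyFeatureNode*) self);
}

static void comment_accept(ToyGenomeNode *self, ToyVisitor *v)
{
  if (v->visit_comment)
    v->visit_comment(v, (ToyCommentNode*) self);
}

ToyFeatureNode toy_feature_node(unsigned long start, unsigned long end)
{
  ToyFeatureNode fn = { { feature_accept }, start, end };
  return fn;
}

ToyCommentNode toy_comment_node(const char *text)
{
  ToyCommentNode cn = { { comment_accept }, text };
  return cn;
}

/* A visitor which counts how many nodes of each kind it has seen. */
static void count_feature(ToyVisitor *v, ToyFeatureNode *fn)
{
  (void) fn;
  v->features++;
}

static void count_comment(ToyVisitor *v, ToyCommentNode *cn)
{
  (void) cn;
  v->comments++;
}

ToyVisitor toy_counting_visitor(void)
{
  ToyVisitor v = { count_feature, count_comment, 0, 0 };
  return v;
}

/* Walk over mixed nodes without knowing their concrete types. */
void toy_visit_all(ToyGenomeNode **nodes, size_t num_nodes, ToyVisitor *v)
{
  size_t i;
  assert(nodes && v);
  for (i = 0; i < num_nodes; i++)
    nodes[i]->accept(nodes[i], v);
}
```

Adding a new kind of processing then means writing a new visitor, not
touching the node implementations or the loop.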
I hope this makes things clearer. If not, please ask!
Gordon