The AnnotationSketch annotation drawing library

The AnnotationSketch module is a versatile and efficient C-based drawing library for GFF3-compatible genomic annotations. It is included in the GenomeTools distribution. Additionally, bindings to the Lua, Python and Ruby programming languages are provided.

Overview

AnnotationSketch consists of several classes (see Fig. 1), which take part in three visualization phases.

[Dataflow]

Figure 1: Schematic of the AnnotationSketch classes involved in image creation. Interfaces are printed in italics.

Phase 1: Feature selection

The GFF3 input data are parsed into a directed acyclic graph (annotation graph, see Fig. 2 for an example) whose nodes correspond to single features (i.e. lines from the GFF3 file). The edges of the graph represent the part-of relationships between genomic features according to the Sequence Ontology hierarchy. Note that GFF3 input files must be valid according to the GFF3 specification; otherwise they cannot be read for AnnotationSketch drawing or for any other kind of manipulation using GenomeTools. A validating GFF3 parser is available in GenomeTools and can be run using gt gff3validator.

[GFF3 tree]

Figure 2: Example sequence region containing two genes in an annotation graph depicting the part-of relationships between their components.
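
The graph construction described above can be illustrated with a minimal sketch in plain Python. It does not use the GenomeTools parser and performs no validation; the names Feature and build_annotation_graph are hypothetical, and only the standard GFF3 ID and Parent attributes are interpreted.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One GFF3 line; hypothetical stand-in for a feature node."""
    seqid: str
    type: str
    start: int
    end: int
    attributes: dict
    children: list = field(default_factory=list)


def build_annotation_graph(gff3_text):
    """Return the top-level features (those without a Parent attribute)."""
    by_id, features = {}, []
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
        feat = Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), attrs)
        features.append(feat)
        if "ID" in attrs:
            by_id[attrs["ID"]] = feat
    # Second pass: connect children to parents, so that the order of the
    # lines in the input does not matter.
    top_level = []
    for feat in features:
        parents = feat.attributes.get("Parent")
        if parents is None:
            top_level.append(feat)
        else:
            # a feature may be part of several parents (comma-separated IDs)
            for pid in parents.split(","):
                by_id[pid].children.append(feat)
    return top_level


EXAMPLE = "\n".join([
    "##gff3-version 3",
    "chr1\t.\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\t.\texon\t100\t300\t.\t+\t.\tParent=mrna1",
    "chr1\t.\texon\t600\t900\t.\t+\t.\tParent=mrna1",
])

genes = build_annotation_graph(EXAMPLE)
```

Here genes holds a single top-level gene node whose mRNA child in turn carries the two exons.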

Each top-level node (i.e. a node without a parent) is then registered in a persistent FeatureIndex object. For each sequence region, the FeatureIndex holds the top-level nodes of all features in an interval tree, which can be queried efficiently for features in a genomic region of interest. All child nodes of a top-level node are then reachable via traversal functions.
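
As a rough illustration of the kind of query a feature index answers, the following Python sketch groups top-level features by sequence region and returns those overlapping a range. SimpleFeatureIndex and its methods are hypothetical names, not the GenomeTools API, and a linear scan stands in for the interval tree.

```python
from collections import defaultdict


class SimpleFeatureIndex:
    """Hypothetical stand-in: top-level features grouped by sequence region."""

    def __init__(self):
        self._by_seqid = defaultdict(list)

    def add(self, seqid, start, end, name):
        self._by_seqid[seqid].append((start, end, name))

    def features_in_range(self, seqid, start, end):
        # Two closed intervals overlap iff each starts before the other ends.
        # A linear scan is used here for brevity; an interval tree answers
        # the same query without visiting every feature.
        return [name for (s, e, name) in self._by_seqid.get(seqid, [])
                if s <= end and start <= e]


index = SimpleFeatureIndex()
index.add("2R", 100, 900, "gene1")
index.add("2R", 1500, 2400, "gene2")
index.add("3L", 200, 700, "gene3")
hits = index.features_in_range("2R", 800, 1600)
```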

Alternatively, users can build annotation graphs themselves by creating each node explicitly and then connecting the nodes so that the graph structure reflects the relationships between them (see the example section for example code).

Phase 2: Layout

The next step consists of processing the features (given via a FeatureIndex or a simple array of top-level nodes) into a structural Diagram object which represents a single view of the annotations of a genomic region. First, semantic units are formed from the annotation subgraphs. This is done by building blocks out of connected features by grouping and overlaying them according to several user-defined collapsing options (see Collapsing).

By default, a separate track is then created for each Sequence Ontology feature type. Alternatively, if more granularity in track assignment is desired, track selector functions can be used to create tracks and assign blocks to them based on arbitrary feature characteristics. A track selector function does this by returning a unique identifier string per track. The Diagram object can also hold one or more custom tracks, which allow users to develop their own graphical representations as plugins.

The Diagram is then prepared for image output by calculating a compact Layout in which the Block objects in a track are distributed into Line objects, each containing non-overlapping blocks (see Fig. 3). The layout calculated this way tries to keep lines as compact as possible, minimising the amount of vertical space used. How new Lines are created depends on the chosen implementation of the LineBreaker interface. By default, a Block is pushed into a new Line when either the Block or its caption overlaps with another one.

[Diagram]

Figure 3: The components of the Layout class reflect sections of the resulting image.
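
The distribution of blocks into lines of non-overlapping blocks can be sketched with a generic first-fit strategy. This is an illustration only: it is not necessarily what the LineBreaker implementations in AnnotationSketch do, it ignores captions, and pack_into_lines is a hypothetical name.

```python
def pack_into_lines(blocks):
    """Distribute (start, end, name) blocks into lines of non-overlapping blocks."""
    lines = []  # each line is a list of blocks, ordered by start position
    for block in sorted(blocks):
        start, end, _ = block
        for line in lines:
            last_end = line[-1][1]
            if start > last_end:      # fits to the right of the last block
                line.append(block)
                break
        else:
            lines.append([block])     # overlaps in every line: open a new one
    return lines


blocks = [(100, 300, "a"), (200, 400, "b"), (350, 500, "c"), (450, 600, "d")]
lines = pack_into_lines(blocks)
```

Because the blocks are processed in order of their start position, only the last block of each line has to be checked for an overlap. In this example the four mutually overlapping blocks fit into two lines.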

Phase 3: Rendering

In the final phase, the Layout object is used as a blueprint to create an image of a given type and size, considering user-defined options. The rendering process is invoked by calling the sketch() method of a Layout object. All rendering logic is implemented in classes implementing the Canvas interface, whose methods are called during traversal of the Layout members. The Canvas interface encapsulates the state of a drawing and is itself independent of the chosen rendering back-end. Back-end-dependent subclasses of Canvas, in turn, are closely tied to a specific implementation of the Graphics interface, which provides methods to draw a number of primitives to a drawing surface abstraction. A Graphics implementation wraps the respective low-level graphics engine, so that the engine can easily be extended or replaced.
Currently, there is a Graphics implementation for the Cairo 2D graphics library (GraphicsCairo) and two Canvas subclasses providing access to the image file formats supported by Cairo (CanvasCairoFile) and to arbitrary Cairo contexts (CanvasCairoContext, which directly accesses a cairo_t). The latter class can be used, for example, to draw AnnotationSketch output directly in any graphical environment which is supported by Cairo (see list of supported surface types).
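
The division of labour between layout traversal, canvas and graphics back-end might be pictured as in the following Python sketch. All names (RecordingGraphics, SimpleCanvas, visit_block, draw_box, sketch) are hypothetical and chosen for illustration; they do not reproduce the actual Canvas or Graphics interfaces, and the back-end merely records primitives instead of drawing them with Cairo.

```python
class RecordingGraphics:
    """Hypothetical back-end: records primitives instead of drawing them."""

    def __init__(self):
        self.calls = []

    def draw_box(self, x, y, width, height):
        self.calls.append(("box", x, y, width, height))


class SimpleCanvas:
    """Hypothetical canvas: maps genomic coordinates to pixels and delegates
    the actual drawing to whatever graphics back-end it was given."""

    def __init__(self, graphics, view_start, view_end, width_px, line_height=20):
        self.graphics = graphics
        self.view_start = view_start
        self.scale = width_px / (view_end - view_start + 1)
        self.line_height = line_height

    def visit_block(self, line_no, start, end):
        x = (start - self.view_start) * self.scale
        width = (end - start + 1) * self.scale
        self.graphics.draw_box(x, line_no * self.line_height,
                               width, self.line_height)


def sketch(lines, canvas):
    """Traverse a layout given as a list of lines of (start, end) blocks."""
    for line_no, line in enumerate(lines):
        for start, end in line:
            canvas.visit_block(line_no, start, end)


graphics = RecordingGraphics()
canvas = SimpleCanvas(graphics, view_start=1, view_end=1000, width_px=500)
sketch([[(1, 200), (401, 600)], [(101, 300)]], canvas)
```

Swapping RecordingGraphics for another object with the same draw_box method changes the output medium without touching the traversal or the coordinate mapping, which is the point of the separation.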

Collapsing

By default, Lines are grouped by the Sequence Ontology type associated with the top-level elements of their Blocks, resulting in one track per type. To obtain a more compact output, tracks for parent types in the feature graph can be made to contain all the features of their child types. The features of the child type are then drawn on top of their parent features (e.g. all exon and intron features are placed into their parent mRNA or gene track). This process is called collapsing. Collapsing can be enabled by setting the collapse_to_parent option for the respective child type to true, e.g. the following options:

style = {
  exon = {
    ...,
    collapse_to_parent = true,
    ...,
  },
  intron = {
    ...,
    collapse_to_parent = true,
    ...,
  },
  CDS = {
    ...,
    collapse_to_parent = true,
    ...,
  },
}
would lead to all features of the exon, intron and CDS types collapsing into the mRNA track (see Figs. 4 and 5).
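
The effect of these options on track assignment can be approximated by the following Python sketch. It is a simplification under stated assumptions: PARENT_TYPE is a hypothetical fixed type hierarchy, whereas AnnotationSketch derives parents from the annotation graph of each feature, and track_for is a hypothetical helper name.

```python
# Hypothetical, simplified type hierarchy: child type -> parent type
PARENT_TYPE = {"mRNA": "gene", "exon": "mRNA", "intron": "mRNA", "CDS": "mRNA"}


def track_for(feature_type, style):
    """Follow collapse_to_parent settings upwards to find the track type."""
    current = feature_type
    while (style.get(current, {}).get("collapse_to_parent")
           and current in PARENT_TYPE):
        current = PARENT_TYPE[current]
    return current


style = {
    "exon": {"collapse_to_parent": True},
    "intron": {"collapse_to_parent": True},
    "CDS": {"collapse_to_parent": True},
}
tracks = {t: track_for(t, style)
          for t in ["gene", "mRNA", "exon", "intron", "CDS"]}
```

With the options shown, exon, intron and CDS all end up in the mRNA track, while gene and mRNA keep their own tracks. Enabling collapse_to_parent for mRNA as well would move everything into the gene track.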

Figure 4: Schematic of the relationships between the gene, mRNA, exon, intron and CDS types and the colors of their representations in a diagram. The arrows illustrate how the relationships influence the collapsing process if collapsing is enabled for the exon, intron and CDS types. In this example, they will be drawn on top of their parent mRNA features.

[collapsing]

[collapsed/uncollapsed views]

Figure 5: Example image of the cnn and cbs genes from Drosophila melanogaster (Ensembl release 51, positions 9326816-9341000 on chromosome arm 2R) as drawn by AnnotationSketch. At the bottom, the calculated GC content of the respective sequence is drawn via a custom track attached to the diagram. (a) shows a collapsed view in which all exon, intron and CDS types are collapsed into their parent type's track. In contrast, (b) shows the cbs gene with all collapsing options set to false, resulting in each type being drawn in its own track.

Styles

The Lua scripting language is used to provide user-defined settings. Settings can be imported from a script that is executed when loaded, thus eliminating the need for another parser. The Lua configuration data are made accessible to C via the Style class. Configurable options include assignment of display styles to each feature type, spacer and margin sizes, and collapsing parameters.

Instead of giving direct values, Lua callback functions can be used in some options to generate feature-dependent configuration settings at run-time. During layout and/or rendering, the GenomeNode object for the feature to be rendered is passed to the callback function, which can evaluate it and return a value of the appropriate type.

For example, setting the following options in the style file (or via the Lua bindings):

style = {
  ...,
  mRNA = {
    block_caption      = function(gn)
                           local rng = gn:get_range()
                           return string.format("%s/%s (%dbp, %d exons)",
                                 gn:get_attribute("Parent"),
                                 gn:get_attribute("ID"),
                                 rng:get_end() - rng:get_start() + 1,
                                 #(gn:get_exons()))
                         end,
    ...
  },

  exon = {
    -- Color definitions
    fill               = function(gn)
                           -- use the feature's score (if any) as fill opacity
                           local aval = 0.0
                           if gn:get_score() then
                             aval = gn:get_score()*1.0
                           end
                           return {red=1.0, green=0.0, blue=0.0, alpha=aval}
                         end,
    ...
  },
  ...
}

will result in a changed rendering (see Fig. 6). The block_caption function overrides the default block naming scheme, so that a custom caption can be set for each block depending on feature properties. Color definitions such as the fill setting for a feature's fill color can also be styled individually using callbacks. In this case, the opacity (alpha value) of the red fill color is taken from the exon feature's score value (e.g. given in a GFF file), so that low-scoring exons appear fainter.

[Example rendering with callback functions]

Figure 6: Example rendering of a GFF file using callback functions to enable custom block captions and score-dependent shading of exon features.

An overview of the keywords used in a style file is available.

Track assignment

Tracks are normally created for each annotation source and/or Sequence Ontology type encountered in the annotation graph. More control is possible using track selector functions (read more).
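
The idea behind track selector functions, namely mapping each block to a track identifier string, can be sketched in Python as follows. The dictionary-based blocks and the names assign_tracks, by_type and by_source_and_strand are hypothetical and do not reflect the actual selector signature in the AnnotationSketch API.

```python
from collections import defaultdict


def assign_tracks(blocks, selector):
    """Group block names by the track identifier the selector returns."""
    tracks = defaultdict(list)
    for block in blocks:
        tracks[selector(block)].append(block["name"])
    return dict(tracks)


def by_type(block):
    # roughly the default behaviour: one track per feature type
    return block["type"]


def by_source_and_strand(block):
    # a finer-grained choice: one track per annotation source and strand
    return "%s (%s strand)" % (block["source"], block["strand"])


blocks = [
    {"name": "g1", "type": "gene", "source": "ensembl", "strand": "+"},
    {"name": "g2", "type": "gene", "source": "ensembl", "strand": "-"},
    {"name": "g3", "type": "gene", "source": "flybase", "strand": "+"},
]
default_tracks = assign_tracks(blocks, by_type)
custom_tracks = assign_tracks(blocks, by_source_and_strand)
```

With by_type all three blocks share one track, whereas by_source_and_strand yields three separate tracks, one per distinct identifier string.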

Custom tracks

Even more customisability is possible using custom tracks. Custom tracks are classes which contain custom drawing functionality, making it easy to add user-defined graphical content to any AnnotationSketch drawing (read more).

The gt sketch tool

The GenomeTools gt executable provides a tool, gt sketch, which uses the AnnotationSketch library to create a drawing in PNG, PDF, PostScript or SVG format from GFF3 annotations. The annotations can be given by supplying one or more file names as command line arguments:
$ gt sketch output.png annotation.gff3
$
or by receiving GFF3 data via the standard input, here prepared by the gt gff3 tool:
$ gt gff3 -sort -addintrons annotation.gff3 | gt sketch output.png
$
The region to create a diagram for can be specified in detail by using the -seqid, -start and -end parameters. For example, if the D. melanogaster gene annotation is given as a file dmel_annotation.gff3, use
$ gt sketch -format pdf -seqid 2R -start 9326816 -end 9332879 output.pdf dmel_annotation.gff3
$
to plot the gene from the FlyBase default view in PDF format. The -force switch forces overwriting of an already existing output file. The -pipe option additionally passes the GFF3 input through the sketch tool to the standard output, allowing intermediate results to be visualised within a longer pipeline of connected GFF3 tools. More command line options are available; their documentation can be viewed using the -help option.

If an input file cannot be plotted because of parsing errors, the strict GFF3 validator tool included in GenomeTools can be used to check whether the input file is in valid GFF3 format. Simply run a command like the following:
$ gt gff3validator input_file.gff3
input is valid GFF3
$
This validator also allows one to check the SO types occurring in a GFF3 file against a given OBO ontology file. This checking can be enabled by specifying the file as an argument to the -typecheck option.

If the PDF, SVG and/or PostScript output format options are not available in the gt binary, the most likely cause is that PDF, SVG and/or PostScript support is disabled in your local Cairo headers and thus also not available in your local Cairo library. This issue is not directly related to AnnotationSketch and can be resolved by recompiling the Cairo library with the proper backend support enabled.

Code examples

Besides the native C programming interface, AnnotationSketch is usable from a variety of programming languages. Code examples for the individual languages can be found here.

API Reference

A function reference for the AnnotationSketch classes can be found in the GenomeTools C API reference.