The GenomeTools design

Major design goals of the GenomeTools system:

Correctness
- built-in unit tests
- extensive test suite
Portability
- should run on every POSIX-compliant UNIX system
- depend only on a C/C++ compiler and the GNU Make build system
Efficiency
Minimalism (if in doubt, use the simpler solution)

The GenomeTools C code is split into components plus the runtime and the main program. There are the following types of components:

classes
modules
unit tests
tools

It is assumed that the reader is familiar with the terminology of object-oriented software design.

The runtime

The runtime class gtr.h/gtr.c is the nucleus of the GenomeTools system. An object of this class is the place of execution in GenomeTools, like a process in an operating system. A new runtime object can be created with the gtr_new() function. gtr_register_components() registers all built-in components with the runtime, and gtr_run() starts it.

Rationale

Having an explicit runtime class unifies the design: a runtime instance is just another object. In theory, this would allow several instances of the runtime class to run in parallel, for example in separate threads.

The runtime class is also the place where we start an embedded Lua scripting language interpreter.

The main program

The GenomeTools main program given in the file gt.c simply creates a new runtime object and starts it.

Classes

The central component type in GenomeTools is the class. Structuring the C code into classes and modules gives us a unified design approach. This simplifies reasoning about design issues and keeps the codebase from becoming monolithic, a problem often encountered in C programs.

Simple classes

For most classes, a simple class suffices. A simple class is a class which does not inherit from other classes and from which no other classes inherit. Using mostly simple classes avoids the problems of large class hierarchies, namely the interdependence of classes which inherit from one another. The major advantage of simple classes over simple C structs is information hiding.

Implementing simple classes

We now describe how to implement a simple class, using the string class str.[ch] of GenomeTools as an example. The interface of a class is always given in its .h or _api.h header file (str_api.h in our example). To achieve information hiding, the header file must not contain implementation details of the class. The implementation can always be found in the corresponding .c file (str.c in our example). We therefore start with the following C construct to define our Str class in str.h:

typedef struct Str Str;

This seldom-used feature of C introduces a new data type named Str, a synonym for struct Str, whose definition need not be known at this point (an incomplete type). Within the scope of the header file, objects of type Str cannot be defined, since the size of Str is unknown to the compiler. Pointers to Str can still be declared, however, because in C all pointers to structure types have the same size and representation, regardless of the structure they point to. Using this fact, we can declare a constructor function:

Str*          str_new(void);

which returns a new string object, and a destructor function

void          str_free(Str*);

which destroys a given string object. This gives us the basic structure of the string class header file: a new data type (which represents the class and its objects), a constructor function, and a destructor function.

#ifndef STR_H
#define STR_H

/* the string class, string objects are strings which grow on demand */
typedef struct Str Str;

Str*          str_new(void);
void          str_free(Str*);

#endif
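What information hiding means for client code can be shown in a single file. The following is a minimal sketch, not GenomeTools code: the Counter class and all counter_* functions are hypothetical and exist only for illustration, and plain malloc() is used in place of any library wrapper. The client function is placed before the struct definition, so it sees exactly what a user of the header file would see.

```c
#include <assert.h>
#include <stdlib.h>

/* --- what would live in counter.h (hypothetical class) --- */
typedef struct Counter Counter;      /* incomplete type: size unknown here */

Counter*      counter_new(void);
void          counter_inc(Counter*);
unsigned long counter_get(const Counter*);
void          counter_free(Counter*);

/* --- client code: sees only the declarations above --- */
unsigned long count_to_three(void)
{
  /* "Counter c;" would not compile here: the size of Counter is unknown. */
  /* "c->value"   would not compile here: the members are hidden.         */
  Counter *c = counter_new();        /* pointers to Counter are fine */
  unsigned long result;
  counter_inc(c);
  counter_inc(c);
  counter_inc(c);
  result = counter_get(c);
  counter_free(c);
  return result;
}

/* --- what would live in counter.c --- */
struct Counter {
  unsigned long value;               /* the hidden member variable */
};

Counter* counter_new(void)
{
  Counter *c = malloc(sizeof (Counter));
  if (!c) abort();                   /* this sketch gives up if out of memory */
  c->value = 0;
  return c;
}

void counter_inc(Counter *c)
{
  assert(c);
  c->value++;
}

unsigned long counter_get(const Counter *c)
{
  assert(c);
  return c->value;
}

void counter_free(Counter *c)
{
  free(c);                           /* free(NULL) is a no-op */
}
```

Because the client can only go through the counter_* functions, the members of struct Counter can change without touching any client code.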

Now we look at the implementation side of the story, which can be found in the str.c file. First, we include the str.h header file to make sure that the newly defined data type is known:

#include "str.h"

Then we define struct Str which contains the actual data of a string object (the member variables in OO lingo).

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

Finally, we code the constructor

Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}

and the destructor

void str_free(Str *s)
{
  if (!s) return;           /* return without action if 's' is NULL */
  free(s->cstr);            /* free the stored C string */
  free(s);                  /* free the actual string object */
}

Our string class implementation so far looks like this

#include "str.h"
#include "xansi.h"

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}

void str_free(Str *s)
{
  if (!s) return;           /* return without action if 's' is NULL */
  free(s->cstr);            /* free the stored C string */
  free(s);                  /* free the actual string object */
}

Since these string objects are pretty much useless so far, we define a couple more (object) methods in the header file str_api.h.

Because C does not allow the traditional object.methodname syntax often used in object-oriented programming, we use the convention to pass the object always as the first argument to the function (methodname(object, ...)). To make it clear that a function is a method of a particular class classname, we prefix the method name with classname_. That is, we get classname_methodname(object, ...) as the generic form of method names in C. The constructor is always called classname_new() and the destructor classname_free(). See str.c for examples.
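To make the naming convention concrete, here is a minimal sketch of a few methods added to the string class. It is not a copy of str.c: the method names str_append_cstr(), str_get() and str_length() and the buffer-doubling strategy are assumptions made for this illustration, and plain malloc(), calloc() and realloc() stand in for the wrappers from xansi.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Str Str;

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

Str* str_new(void)
{
  Str *s = malloc(sizeof (Str));
  if (!s) abort();                    /* this sketch gives up if out of memory */
  s->cstr = calloc(1, sizeof (char)); /* init the string with '\0' */
  if (!s->cstr) abort();
  s->length = 0;
  s->allocated = 1;
  return s;
}

/* append the C string 'cstr' to 's', growing the buffer on demand */
void str_append_cstr(Str *s, const char *cstr)
{
  size_t cstrlen, needed;
  assert(s && cstr);
  cstrlen = strlen(cstr);
  needed = s->length + cstrlen + 1;   /* + 1 for the trailing '\0' */
  if (needed > s->allocated) {
    size_t new_size = s->allocated;
    char *p;
    while (new_size < needed)         /* double until the result fits */
      new_size *= 2;
    p = realloc(s->cstr, new_size);
    if (!p) abort();
    s->cstr = p;
    s->allocated = new_size;
  }
  memcpy(s->cstr + s->length, cstr, cstrlen + 1); /* copies the '\0' too */
  s->length += cstrlen;
}

/* return the stored C string (owned by 's', valid until 's' changes) */
char* str_get(const Str *s)
{
  assert(s);
  return s->cstr;
}

/* return the length of the string (without the trailing '\0') */
unsigned long str_length(const Str *s)
{
  assert(s);
  return s->length;
}

void str_free(Str *s)
{
  if (!s) return;
  free(s->cstr);
  free(s);
}
```

Every method takes the object as its first argument and carries the str_ prefix, so a call such as str_append_cstr(s, "Tools") reads much like s.append_cstr("Tools") would in a language with built-in objects.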

Reference counting

Adding reference counting to our newly created string class is pretty simple. First, we add a new function to the header file, which returns a new reference to the given object:

Str*          str_ref(Str*);

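To implement this function, we add a new reference_count member variable to the Str implementation. It counts the additional references to the object, so the constructor str_new() must also initialize it to 0.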

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
  unsigned int reference_count;
};

implement the str_ref() function

Str* str_ref(Str *s)
{
  if (!s) return NULL;
  s->reference_count++; /* increase the reference counter */
  return s;
}

and finally modify the destructor to take care of reference counting:

void str_free(Str *s)
{
  if (!s) return;           /* return without action if 's' is NULL */
  if (s->reference_count) { /* there are multiple references to this string */
    s->reference_count--;   /* decrement the reference counter */
    return;                 /* return without freeing the object */
  }
  free(s->cstr);            /* free the stored C string */
  free(s);                  /* free the actual string object */
}
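How the pieces interact can be traced with a small, self-contained sketch. It repeats the reference-counted string class in reduced form, with plain malloc() and calloc() standing in for the xansi.h wrappers. The refcount_demo() function is hypothetical and exists only to make the counter observable.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Str Str;

struct Str {
  char *cstr;                   /* the actual string */
  unsigned long length;         /* currently used length */
  size_t allocated;             /* currently allocated memory */
  unsigned int reference_count; /* number of additional references */
};

Str* str_new(void)
{
  Str *s = malloc(sizeof (Str));
  if (!s) abort();              /* this sketch gives up if out of memory */
  s->cstr = calloc(1, sizeof (char));
  if (!s->cstr) abort();
  s->length = 0;
  s->allocated = 1;
  s->reference_count = 0;       /* no additional references yet */
  return s;
}

Str* str_ref(Str *s)
{
  if (!s) return NULL;
  s->reference_count++;         /* increase the reference counter */
  return s;
}

void str_free(Str *s)
{
  if (!s) return;
  if (s->reference_count) {     /* other references still exist */
    s->reference_count--;
    return;
  }
  free(s->cstr);
  free(s);
}

/* returns the counter value seen after the first reference is dropped */
unsigned int refcount_demo(void)
{
  Str *s = str_new();
  Str *t = str_ref(s);          /* second reference, counter is now 1 */
  unsigned int observed;
  assert(t == s);               /* both names refer to the same object */
  str_free(s);                  /* only decrements, the object survives */
  observed = t->reference_count; /* 't' is now the last reference */
  str_free(t);                  /* this call releases the memory */
  return observed;
}
```

Each str_ref() call must be matched by exactly one str_free() call, in addition to the str_free() that matches str_new(). Note that this simple counter is not protected against concurrent access.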

Modules

Modules bundle related functions which do not belong to a class. Examples:

dynalloc.h, the low level module for dynamic allocation, e.g., used to implement arrays in array.c and the above-mentioned strings
sig.h, bundles signal related functions (high level)
xansi_api.h, contains wrappers for the standard ANSI C library
xposix.h, contains wrappers for POSIX functions we use

When designing new code, one rarely has to introduce new modules. Usually, defining a new class is the better approach.
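A module in this sense is just a set of related functions without a data type of its own. The following minimal sketch shows what a wrapper module for allocation functions could look like. The names sketch_xmalloc() and sketch_xrealloc() are hypothetical, and the actual functions in xansi_api.h may differ in signature and error handling.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* like malloc(), but terminates the program instead of returning NULL;
   callers are expected to pass size > 0 */
void* sketch_xmalloc(size_t size)
{
  void *p;
  assert(size > 0);
  p = malloc(size);
  if (!p) {
    fprintf(stderr, "cannot allocate %lu bytes\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}

/* like realloc(), but terminates the program instead of returning NULL;
   callers are expected to pass size > 0 */
void* sketch_xrealloc(void *ptr, size_t size)
{
  void *p;
  assert(size > 0);
  p = realloc(ptr, size);
  if (!p) {
    fprintf(stderr, "cannot reallocate %lu bytes\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}
```

With wrappers of this kind, calling code does not need to check every allocation result, which keeps constructors such as str_new() short.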

The `genometools.h` header file

The genometools.h header file includes all other header files of the GenomeTools library. That is, to write programs employing the library, it suffices to include the genometools.h header file.

Unit tests

Many classes and modules contain a *_unit_test(void) function which performs a unit test of the class/module and returns 0 in case of success and -1 in case of failure. The unit test components are loaded into the GenomeTools runtime in the function gtr_register_components() and can be executed on the command line with:

$ gt -test
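The return value convention for unit tests can be sketched as follows. All names in this example (ulong_max(), ulong_max_unit_test(), always_failing_unit_test(), UnitTest, run_unit_tests()) are hypothetical. How the real runtime collects and reports its unit tests is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* a tiny hypothetical module function to be tested */
unsigned long ulong_max(unsigned long a, unsigned long b)
{
  return a > b ? a : b;
}

/* unit test of the module: returns 0 on success and -1 on failure */
int ulong_max_unit_test(void)
{
  if (ulong_max(1, 2) != 2) return -1;
  if (ulong_max(2, 1) != 2) return -1;
  if (ulong_max(3, 3) != 3) return -1;
  return 0;
}

/* a deliberately failing test, to show how a failure is reported */
int always_failing_unit_test(void)
{
  return -1;
}

/* every unit test has the same signature, so tests fit into one table */
typedef int (*UnitTest)(void);

/* run all given tests and return the number of tests that failed */
size_t run_unit_tests(const UnitTest *tests, size_t num_tests)
{
  size_t i, failed = 0;
  assert(tests || num_tests == 0);
  for (i = 0; i < num_tests; i++) {
    if (tests[i]() != 0)
      failed++;
  }
  return failed;
}
```

Because all unit tests share one signature, registering a new test only means adding one more function pointer to the table.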

Tools

A "tool" is the most high-level type of component GenomeTools has to offer. We consider the "eval" tool here as an example. It evaluates a gene prediction against a given annotation. In principle, a tool could be compiled as a separate binary linked against libgenometools. For this reason the header files gt_*.h for tools contain only a single function, which resembles a main() function (see gt_eval.h), and the gt_*.c files include only the genometools.h header file (see gt_eval.c). All tools are linked into the single gt binary, though, and are loaded into the runtime via the gtr_register_components() function. Every tool can be called like the eval tool in the following example:

$ gt eval
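How a single binary can dispatch to tools by name may be pictured with a toy runtime. This is a minimal sketch and not the actual gtr.c: every name starting with toy_ or Toy is hypothetical, the fixed-size table is a simplification, and the real gtr_new(), gtr_register_components() and gtr_run() may take different arguments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical tool signature, shaped like main() */
typedef int (*ToyTool)(int argc, const char **argv);

#define TOY_MAX_TOOLS 8

typedef struct {
  const char *names[TOY_MAX_TOOLS]; /* names the tools are called by */
  ToyTool tools[TOY_MAX_TOOLS];     /* the matching entry points */
  size_t num_tools;                 /* number of registered tools */
} ToyRuntime;

/* create a new runtime object without any registered tools */
ToyRuntime* toy_runtime_new(void)
{
  ToyRuntime *rt = calloc(1, sizeof (ToyRuntime));
  if (!rt) abort();                 /* this sketch gives up if out of memory */
  return rt;
}

/* register 'tool' under 'name'; returns 0, or -1 if the table is full */
int toy_runtime_register(ToyRuntime *rt, const char *name, ToyTool tool)
{
  assert(rt && name && tool);
  if (rt->num_tools == TOY_MAX_TOOLS) return -1;
  rt->names[rt->num_tools] = name;
  rt->tools[rt->num_tools] = tool;
  rt->num_tools++;
  return 0;
}

/* run the tool named by argv[1]; returns its result, or -1 if unknown */
int toy_runtime_run(const ToyRuntime *rt, int argc, const char **argv)
{
  size_t i;
  assert(rt);
  if (argc < 2) return -1;
  for (i = 0; i < rt->num_tools; i++) {
    if (!strcmp(rt->names[i], argv[1]))
      return rt->tools[i](argc - 1, argv + 1); /* tool sees its own name first */
  }
  return -1;
}

void toy_runtime_free(ToyRuntime *rt)
{
  free(rt);
}

/* two trivial example tools */
int toy_true(int argc, const char **argv)
{
  (void) argc; (void) argv;
  return 0;
}

int toy_false(int argc, const char **argv)
{
  (void) argc; (void) argv;
  return 1;
}
```

A main program in the spirit of gt.c would create the runtime, register the tools, pass its own argc and argv to toy_runtime_run(), free the runtime, and return the result.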

Getting started

To get started with GenomeTools development yourself, we recommend the following:

Install the Git version control system.
Read the Git documentation.

Clone the GenomeTools Git repository with:

$ git clone git://genometools.org/genometools.git

Start hacking on your own feature branch:

$ cd genometools
$ git checkout -b my_feature_branch_name

Have fun!

Acknowledgment

We want to thank Patrick Maaß for introducing us to some techniques described in this document.