The GenomeTools design
Major design goals of the GenomeTools system:
- Correctness
  - built-in unit tests
  - extensive test suite
- Portability
  - should run on every POSIX-compliant UNIX system
  - depend only on a C/C++ compiler and the GNU Make build system
- Efficiency
- Minimalism (if in doubt, use the simpler solution)
The GenomeTools C code is split into components plus the runtime and the main program. There are the following types of components:
- classes
- modules
- unit tests
- tools
It is assumed that the reader is familiar with the terminology of object-oriented software design.
The runtime
The runtime class gtr.h/gtr.c is the nucleus of the GenomeTools system. An
object of this class is the place of execution in GenomeTools, like a process
in an operating system. A new runtime object can be created with the gtr_new()
function. gtr_register_components() registers all built-in components with the
runtime, and gtr_run() starts it.
Rationale
Having an explicit runtime class unifies the design: a runtime instance is just another object. In theory, this would allow several instances of the runtime class to run in parallel, for example in threads.
The runtime class is also the place where we start an embedded Lua scripting language interpreter.
The main program
The GenomeTools main program, given in the file gt.c, simply creates a new
runtime object and starts it.
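The create/register/run life cycle described above can be illustrated with a small sketch. It is not the real gtr API: ToyRuntime, the toy_* functions, the fixed-size component table and the lookup by argv[1] are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy runtime; none of these names belong to the real gtr API. */
typedef int (*ToyTool)(int argc, const char **argv);

typedef struct {
  const char *name; /* name used on the command line */
  ToyTool tool;     /* main()-like entry point of the component */
} ToyEntry;

typedef struct ToyRuntime {
  ToyEntry entries[8]; /* registered components (fixed capacity for brevity) */
  size_t num_entries;
} ToyRuntime;

/* an example component */
static int toy_hello(int argc, const char **argv)
{
  (void) argc;
  (void) argv;
  puts("hello from a tool");
  return EXIT_SUCCESS;
}

ToyRuntime* toy_runtime_new(void)
{
  ToyRuntime *tr = calloc(1, sizeof *tr);
  if (!tr) abort();
  return tr;
}

void toy_runtime_register(ToyRuntime *tr, const char *name, ToyTool tool)
{
  assert(tr && name && tool);
  assert(tr->num_entries < sizeof tr->entries / sizeof tr->entries[0]);
  tr->entries[tr->num_entries].name = name;
  tr->entries[tr->num_entries].tool = tool;
  tr->num_entries++;
}

/* look up argv[1] among the registered components and run it */
int toy_runtime_run(ToyRuntime *tr, int argc, const char **argv)
{
  size_t i;
  assert(tr);
  if (argc < 2) return EXIT_FAILURE;
  for (i = 0; i < tr->num_entries; i++) {
    if (!strcmp(tr->entries[i].name, argv[1]))
      return tr->entries[i].tool(argc - 1, argv + 1);
  }
  return EXIT_FAILURE; /* unknown component */
}

void toy_runtime_free(ToyRuntime *tr)
{
  free(tr);
}

/* what a main program built on such a runtime would do */
int toy_main(int argc, const char **argv)
{
  int rval;
  ToyRuntime *tr = toy_runtime_new();
  toy_runtime_register(tr, "hello", toy_hello);
  rval = toy_runtime_run(tr, argc, argv);
  toy_runtime_free(tr);
  return rval;
}
```

Because the runtime is an ordinary object, the main program in this sketch reduces to creating it, registering the components, running it and freeing it again.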
Classes
The central component type in GenomeTools is the class. Structuring the C code into classes and modules gives us a unified design approach which simplifies thinking about design issues and keeps the codebase from becoming monolithic, a problem often encountered in C programs.
Simple classes
For most classes, a simple class suffices. A simple class is a class which does
not inherit from other classes and from which no other classes inherit. Using
mostly simple classes avoids the problems of large class hierarchies, namely the
interdependence of classes which inherit from one another.
The major advantage of simple classes over plain C structs is information
hiding.
Implementing simple classes
We now describe how to implement a simple class, using the string class
str.[ch] of GenomeTools as an example. The interface to a class is always given
in the .h or _api.h header file (str_api.h in our example). To achieve
information hiding, the header file must not contain implementation details of
the class. The implementation can always be found in the corresponding .c file
(str.c in our example).
Therefore, we start with the following C construct to define our Str class in
str.h:
typedef struct Str Str;
This seldom used feature of C introduces a new data type named Str which is a
synonym for the struct Str data type, which does not need to be known at this
point. In the scope of the header file, objects of the new data type Str cannot
be defined, since its size is unknown to the compiler at this point.
Nevertheless, pointers to Str can still be defined, because in C all pointers
to structure types have the same size and representation, regardless of the
structure they point to. Using this fact, we can define a constructor function:
Str* str_new(void);
which returns a new string object and a destructor function
void str_free(Str*);
which destroys a given string object. This gives us the basic structure of the string class header file: a new data type (which represents the class and its objects), a constructor function, and a destructor function.
#ifndef STR_H
#define STR_H

/* the string class, string objects are strings which grow on demand */
typedef struct Str Str;

Str* str_new(void);
void str_free(Str*);

#endif
Now we look at the implementation side of the story, which can be found in the
str.c file. First, we include the str.h header file to make sure that the newly
defined data type is known:
#include "str.h"
Then we define struct Str which contains the actual data of a string object
(the member variables in OO lingo).
struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};
Finally, we code the constructor
Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}
and the destructor
void str_free(Str *s)
{
  if (!s) return; /* return without action if 's' is NULL */
  free(s->cstr);  /* free the stored C string */
  free(s);        /* free the actual string object */
}
Our string class implementation so far looks like this
#include "str.h"
#include "xansi.h"

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}

void str_free(Str *s)
{
  if (!s) return; /* return without action if 's' is NULL */
  free(s->cstr);  /* free the stored C string */
  free(s);        /* free the actual string object */
}
Since these string objects are pretty much useless so far, we define a couple
more (object) methods in the header file str_api.h.
Because C does not allow the traditional object.methodname syntax often used in
object-oriented programming, we use the convention of always passing the object
as the first argument to the function (methodname(object, ...)). To make it
clear that a function is a method of a particular class classname, we prefix
the method name with classname_. That is, we get
classname_methodname(object, ...) as the generic form of method names in C. The
constructor is always called classname_new() and the destructor
classname_free().
See str.c for examples.
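To make the naming convention concrete, here is a minimal single-file sketch of a growable string class with a few methods. The class name MyStr and the mystr_* methods are hypothetical and are not the contents of str_api.h. The sketch also assumes plain malloc()/realloc() with abort() on failure instead of the xansi wrappers, and a doubling growth strategy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* --- would live in the header: opaque type plus method prototypes --- */
typedef struct MyStr MyStr;

MyStr*        mystr_new(void);
void          mystr_append_cstr(MyStr*, const char*);
const char*   mystr_get(const MyStr*);
unsigned long mystr_length(const MyStr*);
void          mystr_free(MyStr*);

/* --- would live in the .c file: the hidden implementation --- */
struct MyStr {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

MyStr* mystr_new(void)
{
  MyStr *s = malloc(sizeof *s);
  if (!s) abort();
  s->cstr = calloc(1, sizeof (char)); /* init the string with '\0' */
  if (!s->cstr) abort();
  s->length = 0;
  s->allocated = 1;
  return s;
}

/* append the C string 'cstr', growing the buffer on demand */
void mystr_append_cstr(MyStr *s, const char *cstr)
{
  size_t cstrlen, needed;
  assert(s && cstr);
  cstrlen = strlen(cstr);
  needed = s->length + cstrlen + 1; /* + 1 for the trailing '\0' */
  if (needed > s->allocated) {
    size_t new_size = s->allocated;
    while (new_size < needed)
      new_size *= 2;                /* assumed growth strategy: doubling */
    s->cstr = realloc(s->cstr, new_size);
    if (!s->cstr) abort();
    s->allocated = new_size;
  }
  memcpy(s->cstr + s->length, cstr, cstrlen + 1); /* copies the '\0', too */
  s->length += cstrlen;
}

/* read-only access to the stored C string */
const char* mystr_get(const MyStr *s)
{
  assert(s);
  return s->cstr;
}

unsigned long mystr_length(const MyStr *s)
{
  assert(s);
  return s->length;
}

void mystr_free(MyStr *s)
{
  if (!s) return; /* return without action if 's' is NULL */
  free(s->cstr);
  free(s);
}
```

Client code only ever sees MyStr pointers and the mystr_* functions, so the members of struct MyStr can change without affecting callers.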
Reference counting
Adding reference counting to our newly created string class is pretty simple. First we add a new function to the header file, which returns a new reference to the given object:
Str* str_ref(Str*);
To implement this function, we add a new reference_count member variable to the
Str implementation
struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
  unsigned int reference_count;
};
implement the str_ref() function
Str* str_ref(Str *s)
{
  if (!s) return NULL;
  s->reference_count++; /* increase the reference counter */
  return s;
}
and finally modify the destructor to take care of reference counting:
void str_free(Str *s)
{
  if (!s) return;           /* return without action if 's' is NULL */
  if (s->reference_count) { /* there are multiple references to this string */
    s->reference_count--;   /* decrement the reference counter */
    return;                 /* return without freeing the object */
  }
  free(s->cstr);            /* free the stored C string */
  free(s);                  /* free the actual string object */
}
With this scheme reference_count holds the number of additional references, so the constructor str_new() must also initialize it to 0.
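The following self-contained sketch shows the scheme end to end. RcStr and the rcstr_* functions are hypothetical names chosen for this example, the class is stripped down to a single data member, and the live-object counter exists only to make the behaviour observable.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct RcStr RcStr;

struct RcStr {
  char *cstr;                   /* the actual string */
  unsigned int reference_count; /* number of additional references */
};

/* demonstration aid only: how many objects are currently allocated */
static unsigned long rcstr_live_objects = 0;

RcStr* rcstr_new(void)
{
  RcStr *s = malloc(sizeof *s);
  if (!s) abort();
  s->cstr = calloc(1, sizeof (char));
  if (!s->cstr) abort();
  s->reference_count = 0;       /* no additional references yet */
  rcstr_live_objects++;
  return s;
}

/* return a new reference to the given object */
RcStr* rcstr_ref(RcStr *s)
{
  if (!s) return NULL;
  s->reference_count++;
  return s;
}

/* drop one reference; the object is freed when the last one is dropped */
void rcstr_free(RcStr *s)
{
  if (!s) return;
  if (s->reference_count) {
    s->reference_count--;
    return;
  }
  free(s->cstr);
  free(s);
  rcstr_live_objects--;
}

unsigned long rcstr_live(void)
{
  return rcstr_live_objects;
}
```

Every owner calls the destructor exactly once for each reference it holds, whether it obtained that reference from the constructor or from rcstr_ref(); only the last call releases the memory.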
Modules
Modules bundle related functions which do not belong to a class. Examples:
- dynalloc.h, the low-level module for dynamic allocation, e.g., used to implement arrays in array.c and the above-mentioned strings
- sig.h, bundles signal related functions (high level)
- xansi_api.h, contains wrappers for the standard ANSI C library
- xposix.h, contains wrappers for POSIX functions we use
When designing new code, one rarely has to introduce new modules. Usually defining a new class is the better approach.
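As an illustration of what a wrapper module can look like, here is a sketch of a few functions that wrap standard C library calls and terminate the program on failure, so that callers do not need to check for NULL. The toy_x* names, the error messages and the exit-on-failure policy are assumptions for this example and are not taken from xansi_api.h.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* like malloc(), but terminates the program instead of returning NULL */
void* toy_xmalloc(size_t size)
{
  void *p;
  assert(size);
  if (!(p = malloc(size))) {
    fprintf(stderr, "cannot malloc(%lu) memory\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}

/* like calloc(), but terminates the program instead of returning NULL */
void* toy_xcalloc(size_t nmemb, size_t size)
{
  void *p;
  assert(nmemb && size);
  if (!(p = calloc(nmemb, size))) {
    fprintf(stderr, "cannot calloc(%lu, %lu) memory\n",
            (unsigned long) nmemb, (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}

/* return a freshly allocated copy of the C string 'cstr' */
char* toy_xstrdup(const char *cstr)
{
  size_t size;
  char *copy;
  assert(cstr);
  size = strlen(cstr) + 1; /* + 1 for the trailing '\0' */
  copy = toy_xmalloc(size);
  memcpy(copy, cstr, size);
  return copy;
}
```

The functions share no state and operate on no common object, which is why they form a module rather than a class.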
The genometools.h header file
The genometools.h header file includes all other header files of the
GenomeTools library. That is, to write programs employing the library, it
suffices to include the genometools.h header file.
Unit tests
Many classes and modules contain a *_unit_test(void) function which performs a
unit test of the class/module and returns 0 in case of success and -1 in case
of failure. The unit test components are loaded into the GenomeTools runtime in
the function gtr_register_components() and can be executed on the command line
with:
$ gt -test
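A unit test function following this return convention might be sketched as below. The function under test (toy_max), the TOY_ENSURE macro and the small test runner are hypothetical helpers invented for this example; they are not part of GenomeTools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* hypothetical function under test */
unsigned long toy_max(unsigned long a, unsigned long b)
{
  return a > b ? a : b;
}

/* check a condition; on failure report it and make the test return -1 */
#define TOY_ENSURE(expr)                                       \
  do {                                                         \
    if (!(expr)) {                                             \
      fprintf(stderr, "ensure failed: %s, file %s, line %d\n", \
              #expr, __FILE__, __LINE__);                      \
      return -1;                                               \
    }                                                          \
  } while (0)

/* returns 0 in case of success and -1 in case of failure */
int toy_max_unit_test(void)
{
  TOY_ENSURE(toy_max(1, 2) == 2);
  TOY_ENSURE(toy_max(2, 1) == 2);
  TOY_ENSURE(toy_max(3, 3) == 3);
  return 0;
}

/* a deliberately failing test, to show the failure path */
int toy_broken_unit_test(void)
{
  TOY_ENSURE(toy_max(1, 2) == 1);
  return 0;
}

typedef int (*ToyUnitTest)(void);

/* run all given unit tests and return the number of failed ones */
size_t toy_run_unit_tests(const ToyUnitTest *tests, size_t num_tests)
{
  size_t i, failed = 0;
  assert(tests || !num_tests);
  for (i = 0; i < num_tests; i++) {
    if (tests[i]() != 0)
      failed++;
  }
  return failed;
}
```

Since every unit test has the same signature, a runtime can keep them in a simple table of function pointers and run them all in one loop.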
Tools
A "tool" is the highest-level type of component GenomeTools has to offer. We
consider the "eval" tool here as an example. It evaluates a gene prediction
against a given annotation.
In principle, a tool could be compiled as a single binary linked against
libgenometools. Therefore, the header files gt_*.h for tools contain only a
single function which resembles a main() function (see gt_eval.h), and the
gt_*.c files include only the genometools.h header file (see gt_eval.c).
All tools are linked into the single gt binary, though. They are also loaded
into the runtime via the gtr_register_components() function. All tools can be
called like the eval tool in the following example:
$ gt eval
Getting started
To get started with GenomeTools development yourself, we recommend the following:
- Install the Git version control system.
- Read the Git documentation.
- Clone the GenomeTools Git repository with:
$ git clone git://genometools.org/genometools.git
- Start hacking on your own feature branch:
$ cd genometools
$ git checkout -b my_feature_branch_name
- Have fun!
Acknowledgment
We want to thank Patrick Maaß for introducing us to some techniques described in this document.