Major design goals of the GenomeTools system:
The GenomeTools C code is split into components plus the runtime and the main program. There are the following types of components:
It is assumed that the reader is familiar with the terminology of object oriented software design.
The runtime class gtr.h/gtr.c is the nucleus of the GenomeTools system. An object of this class is the place of execution in GenomeTools, like a process in an operating system. A new runtime object can be created with the gtr_new() function. gtr_register_components() registers all built-in components with the runtime, and gtr_run() starts it.
Having an explicit runtime class unifies the design: a runtime instance is just another object. In theory, this would allow multiple instances of the runtime class to be started in parallel, for example in threads.
The runtime class would be the place to start an embedded scripting language interpreter, like Lua. This would allow new components to be loaded at runtime.
The GenomeTools main program given in the file gt.c simply creates a new runtime object and starts it.
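The lifecycle described above (create, register, run) can be sketched in a few lines. This is a stand-in and not the real gtr.c: only the names gtr_new(), gtr_register_components(), and gtr_run() come from the text. The layout of the runtime struct, the Component type, the gtr_register() helper, gtr_free(), run_main_program(), and all parameter lists are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical component: a name plus a main()-like entry point. */
typedef int (*ComponentFunc)(int argc, char *argv[]);

typedef struct {
  const char *name;
  ComponentFunc run;
} Component;

#define MAX_COMPONENTS 16

/* Hypothetical runtime object; the real struct layout is not shown in the
   text. */
typedef struct GTR GTR;

struct GTR {
  Component components[MAX_COMPONENTS];
  size_t num_components;
};

GTR* gtr_new(void)
{
  GTR *gtr = calloc(1, sizeof *gtr);
  if (!gtr) {
    fprintf(stderr, "out of memory\n");
    exit(EXIT_FAILURE);
  }
  return gtr;
}

/* assumed helper: add one named component to the runtime */
static void gtr_register(GTR *gtr, const char *name, ComponentFunc run)
{
  assert(gtr && name && run);
  assert(gtr->num_components < MAX_COMPONENTS);
  gtr->components[gtr->num_components].name = name;
  gtr->components[gtr->num_components].run = run;
  gtr->num_components++;
}

/* a trivial built-in component used for the demonstration */
static int hello_component(int argc, char *argv[])
{
  (void) argc;
  (void) argv;
  puts("hello from a component");
  return 0;
}

void gtr_register_components(GTR *gtr)
{
  gtr_register(gtr, "hello", hello_component);
}

/* look up argv[1] among the registered components and run it;
   returns -1 if no component was named or none matches */
int gtr_run(GTR *gtr, int argc, char *argv[])
{
  size_t i;
  assert(gtr);
  if (argc < 2) return -1;
  for (i = 0; i < gtr->num_components; i++) {
    if (!strcmp(gtr->components[i].name, argv[1]))
      return gtr->components[i].run(argc - 1, argv + 1);
  }
  return -1;
}

void gtr_free(GTR *gtr)
{
  free(gtr);
}

/* what the body of a main program in the style of gt.c could look like */
int run_main_program(int argc, char *argv[])
{
  GTR *gtr = gtr_new();
  int rval;
  gtr_register_components(gtr);
  rval = gtr_run(gtr, argc, argv);
  gtr_free(gtr);
  return rval;
}
```

Because the runtime is an ordinary object, nothing in this sketch prevents a program from creating several of them.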
The central component type in GenomeTools is the class. Structuring the C code into classes and modules gives us a unified design approach which simplifies thinking about design issues and keeps the codebase from becoming monolithic, a problem often encountered in C programs.
For most classes, a simple class suffices. A simple class is a class which does
not inherit from other classes and from which no other classes inherit. Using
mostly simple classes avoids the problems of large class hierarchies, namely the
interdependence of classes which inherit from one another.
The major advantage of simple classes over simple C structs is information hiding.
We now describe how to implement a simple class, using the string class str.[ch] of GenomeTools as an example. The interface to a class is always given in the .h header file (str.h in our example). To achieve information hiding, the header file must not contain implementation details of the class. The implementation can always be found in the corresponding .c file (str.c in our example).
Therefore, we start with the following C construct to define our Str class in str.h:
typedef struct Str Str;
This little-known feature of C introduces a new data type named Str, which is a synonym for the struct Str data type. The definition of struct Str need not be known at this point. In the scope of the header file, objects of type Str cannot be defined, since the size of the type is unknown to the compiler at this point (Str is an incomplete type). Nevertheless, pointers to Str can still be declared, because in C all pointers to structure types have the same size and representation, regardless of the structure they point to. Using this fact, we can declare a constructor function:
Str* str_new(void);
which returns a new string object and a destructor function
void str_free(Str*);
which destroys a given string object. This gives us the basic structure of the string class header file: a new data type (which represents the class and its objects), a constructor function, and a destructor function.
#ifndef STR_H
#define STR_H

/* the string class, string objects are strings which grow on demand */
typedef struct Str Str;

Str* str_new(void);
void str_free(Str*);

#endif
Now we look at the implementation side of the story, which can be found in the str.c file. First, we include the str.h header file to make sure that the newly defined data type is known:
#include "str.h"
Then we define struct Str, which contains the actual data of a string object (the member variables in OO lingo).
struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};
Finally, we code the constructor
Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}
and the destructor
void str_free(Str *s)
{
  if (!s) return; /* return without action if 's' is NULL */
  free(s->cstr);  /* free the stored C string */
  free(s);        /* free the actual string object */
}
Our string class implementation so far looks like this:
#include "str.h"
#include "xansi.h"

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

Str* str_new(void)
{
  Str *s = xmalloc(sizeof(Str));      /* create new string object */
  s->cstr = xcalloc(1, sizeof(char)); /* init the string with '\0' */
  s->length = 0;                      /* set the initial length */
  s->allocated = 1;                   /* set the initially allocated space */
  return s;                           /* return the new string object */
}

void str_free(Str *s)
{
  if (!s) return; /* return without action if 's' is NULL */
  free(s->cstr);  /* free the stored C string */
  free(s);        /* free the actual string object */
}
Since these string objects are pretty much useless so far, we define a couple more (object) methods in the header file str.h.
Because C does not allow the traditional object.methodname syntax often used in object-oriented programming, we use the convention of always passing the object as the first argument to the function (methodname(object, ...)). To make it clear that a function is a method of a particular class classname, we prefix the method name with classname_. That is, we get classname_methodname(object, ...) as the generic form of method names in C. The constructor is always called classname_new() and the destructor classname_free().
See str.c for examples.
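To make the naming convention concrete, here is a self-contained sketch of a growable string class with three methods. The method names str_append_cstr(), str_get(), and str_length(), the doubling growth strategy, and the checked_realloc() helper are our assumptions for illustration. The real str.c builds on xansi.h and dynalloc.h, which are not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Str Str;

struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
};

/* hypothetical stand-in for the xansi.h wrappers: exit on failure */
static void* checked_realloc(void *ptr, size_t size)
{
  void *p = realloc(ptr, size);
  if (!p) {
    fprintf(stderr, "cannot allocate %lu bytes\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}

Str* str_new(void)
{
  Str *s = checked_realloc(NULL, sizeof *s);
  s->cstr = checked_realloc(NULL, 1);
  s->cstr[0] = '\0';
  s->length = 0;
  s->allocated = 1;
  return s;
}

/* classname_methodname(object, ...): append the C string 'cstr' to 's' */
void str_append_cstr(Str *s, const char *cstr)
{
  size_t len, needed;
  assert(s && cstr);
  len = strlen(cstr);
  needed = s->length + len + 1; /* + 1 for the trailing '\0' */
  if (needed > s->allocated) {
    size_t new_size = s->allocated;
    while (new_size < needed)
      new_size *= 2;            /* grow on demand by doubling */
    s->cstr = checked_realloc(s->cstr, new_size);
    s->allocated = new_size;
  }
  memcpy(s->cstr + s->length, cstr, len + 1); /* copies the '\0', too */
  s->length += len;
}

/* return the stored C string (owned by the string object) */
const char* str_get(const Str *s)
{
  assert(s);
  return s->cstr;
}

/* return the length of the string (without trailing '\0') */
unsigned long str_length(const Str *s)
{
  assert(s);
  return s->length;
}

void str_free(Str *s)
{
  if (!s) return;
  free(s->cstr);
  free(s);
}
```

Client code never touches the members of struct Str directly. It calls str_new(), then for example str_append_cstr(s, "genome"), and finally str_free(s).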
Adding reference counting to our newly created string class is pretty simple. First, we add a new function to the header file, which returns a new reference to the given object:
Str* str_ref(Str*);
To implement this function, we add a new reference_count member variable to the Str implementation,
struct Str {
  char *cstr;           /* the actual string (always '\0' terminated) */
  unsigned long length; /* currently used length (without trailing '\0') */
  size_t allocated;     /* currently allocated memory */
  unsigned int reference_count;
};
implement the str_ref() function, and finally modify the destructor to take care of reference counting:
void str_free(Str *s)
{
  if (!s) return;             /* return without action if 's' is NULL */
  if (s->reference_count) {   /* there are multiple references to this string */
    s->reference_count--;     /* decrement the reference counter */
    return;                   /* return without freeing the object */
  }
  free(s->cstr);              /* free the stored C string */
  free(s);                    /* free the actual string object */
}
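The reference counting pieces can be put together in a self-contained sketch. It uses plain malloc() and calloc() instead of the xansi.h wrappers. Two things are assumptions on our part: the body of str_ref() and the initialization of reference_count to 0 in the constructor, which is needed for the destructor shown above to work. With this convention, reference_count holds the number of additional references, so a freshly created string has a count of 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct Str Str;

struct Str {
  char *cstr;                   /* the actual string ('\0' terminated) */
  unsigned long length;         /* currently used length */
  size_t allocated;             /* currently allocated memory */
  unsigned int reference_count; /* number of additional references */
};

Str* str_new(void)
{
  Str *s = malloc(sizeof *s);
  if (!s) abort();
  s->cstr = calloc(1, sizeof (char));
  if (!s->cstr) abort();
  s->length = 0;
  s->allocated = 1;
  s->reference_count = 0;       /* assumed: no additional references yet */
  return s;
}

/* assumed implementation: hand out a new reference to the same object */
Str* str_ref(Str *s)
{
  if (!s) return NULL;
  assert(s->reference_count < UINT_MAX); /* guard against overflow */
  s->reference_count++;
  return s;
}

void str_free(Str *s)
{
  if (!s) return;
  if (s->reference_count) {     /* other references remain */
    s->reference_count--;
    return;
  }
  free(s->cstr);
  free(s);
}
```

Every call to str_new() or str_ref() must be matched by exactly one call to str_free(). The object is released by the last of them.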
Modules bundle related functions which do not belong to a class. Examples:
dynalloc.h, the low-level module for dynamic allocation, used for example to implement arrays in array.c and the above-mentioned strings
sig.h, which bundles signal-related functions (high level)
xansi.h, which contains wrappers for the standard ANSI C library
xposix.h, which contains wrappers for the POSIX functions we use
When designing new code, one rarely has to introduce new modules. Usually, defining a new class is the better approach.
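The string class above calls xmalloc() and xcalloc() from the xansi.h module. A wrapper of this kind typically checks the result of the underlying library call and terminates the program on failure, so that callers need no error handling of their own. The following is only a guess at the general shape: the actual signatures, error messages, and failure behavior of xansi.h are not shown in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* sketch of a checked malloc(): never returns NULL */
void* xmalloc(size_t size)
{
  void *p;
  assert(size > 0); /* this sketch does not support zero-sized requests */
  p = malloc(size);
  if (!p) {
    fprintf(stderr, "cannot malloc(%lu) memory\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}

/* sketch of a checked calloc(): never returns NULL, memory is zeroed */
void* xcalloc(size_t nmemb, size_t size)
{
  void *p;
  assert(nmemb > 0 && size > 0);
  p = calloc(nmemb, size);
  if (!p) {
    fprintf(stderr, "cannot calloc(%lu, %lu) memory\n",
            (unsigned long) nmemb, (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}
```

Memory obtained this way is released with the ordinary free(), as in the str_free() destructor.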
gt.h header file
The gt.h header file includes all other header files of the GenomeTools library. That is, to write programs employing the library, it suffices to include the gt.h header file.
Many classes and modules contain a *_unit_test(void) function which performs a unit test of the class/module and returns 0 in case of success and -1 in case of failure. The unit test components are loaded into the GenomeTools runtime in the function gtr_register_components() and can be executed on the command line with:
$ gt -test
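A unit test function following the return convention described above (0 on success, -1 on failure) could look like the sketch below. The module function example_count_char() and the name example_unit_test() are hypothetical and exist only to have something to test.

```c
#include <assert.h>

/* hypothetical module function: count occurrences of 'c' in 'cstr' */
unsigned long example_count_char(const char *cstr, char c)
{
  unsigned long count = 0;
  assert(cstr);
  for (; *cstr; cstr++) {
    if (*cstr == c)
      count++;
  }
  return count;
}

/* unit test for the function above;
   returns 0 in case of success and -1 in case of failure */
int example_unit_test(void)
{
  if (example_count_char("", 'a') != 0) return -1;
  if (example_count_char("banana", 'a') != 3) return -1;
  if (example_count_char("banana", 'x') != 0) return -1;
  return 0;
}
```

Returning a status code instead of aborting lets a caller run all registered unit tests and report every failure.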
A "tool" is the most high-level type of component GenomeTools has to offer. We consider the "eval" tool here as an example. It evaluates a gene prediction against a given annotation.
In principle, a tool could be compiled as a single binary linked against "libgt". Therefore, the header files gt_*.h for tools contain only a single function which resembles a main() function (see gt_eval.h), and the gt_*.c files include only the gt.h header file (see gt_eval.c).
For coolness reasons, all tools are linked into the single gt binary, though. They are also loaded into the runtime via the gtr_register_components() function. All tools can be called like the eval tool in the following example:
$ gt eval
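The shape of a tool can be sketched as follows. The tool name gt_example, its parameter list, its return values, and its behavior are hypothetical. The text only states that a tool header contains a single function resembling main(), so the exact signature used by gt_eval.h is an assumption here.

```c
#include <assert.h>
#include <stdio.h>

/* gt_example.h would contain only this prototype (hypothetical tool) */
int gt_example(int argc, char *argv[]);

/* gt_example.c: the tool itself, written like a small main() function;
   returns 0 on success and 1 on a usage error (assumed convention) */
int gt_example(int argc, char *argv[])
{
  int i;
  assert(argc >= 1 && argv);
  if (argc < 2) {
    fprintf(stderr, "usage: %s file [...]\n", argv[0]);
    return 1;
  }
  for (i = 1; i < argc; i++)
    printf("would process %s\n", argv[i]);
  return 0;
}

/* A standalone binary linked against the library would need only:

   int main(int argc, char *argv[])
   {
     return gt_example(argc, argv);
   }
*/
```

Because the tool function has the same shape as main(), the same code can serve both a standalone binary and a combined binary that dispatches on the tool name.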
To get started with GenomeTools development yourself, I recommend the following:
$ git clone http://genometools.org/genometools.git
$ cd genometools
$ git checkout -b my_feature_branch_name
I want to thank Patrick Maaß for introducing me to some techniques described in this document.