C For Smarties: User-defined types

User-defined (and abstract) types

C has three ways of defining new types. The most general one is to use the struct keyword.^[1] C's union and enum keywords also create new types. Perhaps surprisingly, the typedef keyword does not create a new type. Instead, it makes an alias for some existing type. This shows through the type system because a pointer to the aliased type is considered identical to a pointer to the original type-name. For instance, the code fragment:

typedef int x;
int *a;
x *b;
int **p = &b;

makes ‘x’ an alias for int, and then creates two pointers of type int * and one of type int ** (pointer to ‘pointer to int’). Since both a and b have type ‘pointer to int’, p can point to either one, and the initializer for p is entirely valid.

This is a crucial point: typedef defines aliases, not types.

All three of these type-defining keywords share some syntax. In particular, structures, unions, and enumerations use tags. These tags live in the ‘tag’ name space [link to namespaces]; all three keywords share a single tag name space, so if you have a struct blat, you cannot also have a union blat, nor an enum blat. While the tags are optional, there is rarely a good reason to omit them, and I prefer to think of the tag as the ‘true name’ of the type. In essence, struct blat { int x; }; really means ‘define a new type named blat’ (with one integer as its sole member).

We can refer back to these types by repeating the type-name at any point. A future struct blat this_blat; defines a variable using the type we just defined. Importantly, we can also refer forward to types we have not yet defined. This allows self-referential and mutually-referential types, including simple data structures like linked lists.

Type names (as declared or defined by the struct keyword and a tag) spring into being as soon as they are mentioned. This is often convenient, but carries a price: misspelled type names are not caught, as they simply create a new type. A C compiler is usually not able to distinguish between the case in which the programmer wanted to create a new type, and the one in which the programmer simply misspelled the type name. For this reason, some C programmers prefer to use typedef aliases.

Consider the following code fragment:

struct temperature;
void apply_heat(struct temperature *);
void remove_heat(struct tempure *);

The ‘vacuous’ declaration of struct temperature serves to declare that the type exists, without actually defining it. This creates an incomplete type. We then declare two functions taking pointers to this incomplete type (this is OK as we can always use a pointer to an incomplete type, whether or not we will eventually complete it in this translation unit). Unfortunately, we dropped quite a few keystrokes from one of the function declarations: remove_heat() actually takes a struct tempure *. Since type names spring up as soon as they are mentioned, this is not actually an error.

Better compilers will often produce a warning here, because the type we created here was declared in an inner scope [insert link to scopes and linkage rules once I write the html page]. Specifically, anything inside the parentheses of a function prototype [insert link] has ‘prototype scope’. An incomplete type can only be completed in the same scope in which it was created, and can only be referred-to (as with pointers) in that scope or any more-deeply-nested scope. Since the prototype scope ends immediately, we can never refer to that particular type again, and there is no way to pass a valid argument to the remove_heat() function.^[2] Thus, we are probably well-served by a compiler that complains that this type's scope is so limited as to be useless.

One of the biggest drawbacks I find with C's typedef syntax is that it mucks up C's declarations (and C's ‘declaration mirrors use’ syntax is already one of C's most confusing features). For all of C's built-in types, we declare variables by naming the base type first, using a keyword. Since the set of keywords is fixed, we can always be sure that it is a type name. This rule even holds for user-defined types, as long as we start with struct (or union or enum): we just have a tag inserted before the rest of the declaration. With typedef names, however, we see only an ordinary identifier, indistinguishable from any other ordinary identifier, except that it has been declared as an alias, perhaps in some #include header that we cannot easily inspect. We need to be able to tell -- preferably at a glance -- that it is in fact a typedef-name. Otherwise we end up with the situation exemplified by this (C89-specific) code fragment:

void f(x);

Can you tell from this code fragment whether x is a typedef name, or a variable name whose type is defaulted to int? If it is a variable, this defines f() as a function taking a single int parameter, but if it is a typedef, this defines f() as a function taking a single x-type parameter.

For this reason, those who use typedefs almost invariably invent a typographic convention (or several conventions) to make them stand out. If we can tell at a glance that some identifier is a typedef-alias, the syntactic problem vanishes. In my experience, the three most common conventions are:

Appending _t to the name, producing identifiers like uid_t, pthread_t, and so on. (The T here stands for ‘type’. Note that POSIX reserves all such identifiers.)
Using an initial capital letter for the identifier (provided, of course, that initial capitals are never used for other purposes).
Spelling the name in all-capitals, like Standard C's FILE type.

While I personally dislike typedef and am entirely willing to write out the struct keyword every time, we can actually use this syntactic quirk to our advantage. Suppose we rewrite the earlier code fragment to use a typedef-name to alias the incomplete structure type, and then use the alias in the function prototypes:

typedef struct temperature TEMPERATURE;
void apply_heat(TEMPERATURE *);
void remove_heat(TEMPURE *);

While the actual type-declaration still happens because of the struct keyword, we now get a compiler diagnostic for the third line, because the misspelled identifier is not a typedef-name. Of course, the quality of any given diagnostic from any particular compiler is problematic -- we might get anything from ‘syntax error’ to ‘did you misspell TEMPERATURE?’ -- but it is nice to be guaranteed something. Moreover, we get a diagnostic from any such misspelling, not just those in prototype scopes seen by good compilers (although in my experience, prototype scopes are where the vast majority of such misspellings occur anyway, and compilers have gotten much better in the last few decades).

Creating self- and mutually-referential types

It is always safe to write a ‘vacuous’ declaration (one without the braces and contents) for any structure type, but it is often not required. A type-name springs into being (at the current scope) the first time it is mentioned, unless, of course, it already exists at this or some outer scope. Thus, a vacuous declaration is required only if:

the first mention of the type will be at an inner scope, such as a prototype declaration, or
we are in some inner scope (such as a function or a block within a function), and we want to define a new type that has the same name as an existing type at some outer scope, for use in mutually-referential structures.

The second situation above hardly ever occurs, and is generally a bad idea, for much the same reason as it is a bad idea to have a file-scope (‘global’) variable with the same name as a block-scope (local) variable: it becomes too confusing to talk or reason about code with radically different entities which all have the same name. (It is much like going to a party at which everyone is named Chris. While you may never forget each person's name, you may have a hard time remembering whether you were speaking to Chris, Chris, or Chris.) The first situation, however, is not uncommon. In any case, they are always safe, and I will use them in these examples.

The classic self-referential structure is a linked list. Let us consider a somewhat more complicated data structure, in which we have a list of items that themselves have sub-lists, and the sub-lists can refer back to the top level lists. For instance, in an operating system kernel, we might have a list of files, in which each file contains of a list of cached file-blocks. At the same time, each cached file-block needs to point back to its containing file.

struct fileinfo;
struct cached_block;

struct fileinfo {
    struct fileinfo *fi_next;     /* list of all files */
    struct cached_block *fi_blks; /* cached blocks for this file */
    /* ... */
};

struct cached_block {
    int    cb_lbn;                /* logical block number */
    struct cached_block *cb_next; /* next cached block for this file */
    struct fileinfo *cb_file;     /* file containing this block */
    /* ... */
};

Here, each data structure needs a pointer to the other data structure. There is no way we can completely define both types without mentioning the other type. Fortunately, C's type-names simply spring into being as needed, so this works whether or not we pre-declare the types.

If we wish to use typedefs, however, we have a problem. At least one of typedef must occur first, but because typedef-names do not simply ‘spring into being’ we cannot use both typedef names until we have defined both. But this is not really a serious problem after all: all we have to do is define the type-names and create the aliases, then complete the two types:

struct fileinfo;
struct cached_block;

typedef struct fileinfo FILEINFO;
typedef struct cached_block CACHED_BLOCK;

struct fileinfo {
    FILEINFO     *fi_next; /* list of all files */
    CACHED_BLOCK *fi_blks; /* cached blocks for this file */
    /* ... */
};

struct cached_block {
    int          cb_lbn;   /* logical block number */
    CACHED_BLOCK *cb_next; /* next cached block for this file */
    FILEINFO     *cb_file; /* file containing this block */
    /* ... */
};

The two keys here are to remember that it is the struct keyword that defines the type, and that we can use incomplete type names whenever we like, provided we stick with pointers to those incomplete types. C programmers get into trouble when they think -- logically enough, but incorrectly -- that it is the typedef keyword that defines the types, and attempt to omit the structure tags, giving something like:

typedef struct {
    FILEINFO     *fi_next; /* list of all files */
    CACHED_BLOCK *fi_blks; /* cached blocks for this file */
} FILEINFO;

typedef struct {
    int          cb_lbn;   /* logical block number */
    CACHED_BLOCK *cb_next; /* next cached block for this file */
    FILEINFO     *cb_file; /* file containing this block */
} CACHED_BLOCK;

which of course does not work at all. A simple set of rules to remember, that always work, are:

Always use structure tags.
Write vacuous declarations up front.
If you are going to use typedefs at all, write all of them up front too, after the vacuous declarations.

Note that if you do use typedefs, and you avoid the situation I describe as a ‘bad idea’ -- that is, you always avoid defining an inner-scope type with the same name as some outer-scope type -- you can always omit the initial vacuous structure declarations. (Exercise: why?)

back

^[1] I like to claim that the keyword is an abbreviation for strange spelling for user-defined abstract type. Of course, this is actually a tortured retronym. It really is just short for structure, and it is mostly luck that -- at least as of C99 -- it really works 100% as a user-defined abstract type, with full strong typing. In C89, using structures as user-defined abstract types makes it impossible to pass constants of an appropriate type to functions. We can still fake it with const-qualified variables, but this is rather awkward.
^[2] Technically, we could call it with a value of type void *, but this is unlikely to be what we intended.