A code book specifies the codes used to organize data.
For quantitative data, code books describe what each data series represents, linking the variable names (for rectangular data such as spreadsheets, these are the column names) to more details about the data series’ provenance (for example how those data were obtained, which categories are represented by which values, et cetera).
For qualitative data, such as coded sources from interview studies (e.g., transcribed audio recordings) or systematic reviews (e.g., extracted text fragments), code books describe what each code represents, linking the code identifiers as used in the coded sources to the code label, definition, and coding instructions.
This blog post concerns qualitative code books, and will describe which elements a code book should have and why.
What is a code?
In this blog post, a code is defined as the interface between a theoretical concept and empirical data. The code is used to label bits of a qualitative data source (such as data fragments in a transcribed interview or parts of a text fragment extracted from a scientific article). To fulfill their function of linking a theoretical concept to empirical data, codes have a number of components.
Code identifier
The code identifier is a short sequence of characters that uniquely identifies the code in a project. The identifier (or id for short) is used to label (or annotate or mark) data fragments as being coded with that code. That act signifies the coder’s observation that the data describe or otherwise express the theoretical concept that the code represents.
Code identifiers may only consist of lower ([a-z]) and upper case ([A-Z]) Latin letters, Arabic digits ([0-9]), and underscores (_) and must start with a letter. Because code identifiers must be unique, if you use a hierarchical organizational mode for your codes (i.e. tree coding as opposed to flat coding or network coding), it is often a good idea to include an abbreviation for the parent code in the code identifier.
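The character rules above can be checked mechanically. The following is a minimal sketch in Python; the names CODE_ID_PATTERN and is_valid_code_id are made up for this illustration and are not part of any existing tool.

```python
import re

# Letters, digits, and underscores only; the first character must be a letter.
# (CODE_ID_PATTERN and is_valid_code_id are hypothetical names for this sketch.)
CODE_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_code_id(code_id: str) -> bool:
    """Return True if code_id follows the identifier rules described above."""
    return CODE_ID_PATTERN.fullmatch(code_id) is not None


# Invented identifiers, used only to exercise the check.
for candidate in ["Theme_2", "2theme", "_theme", "my-theme"]:
    print(candidate, is_valid_code_id(candidate))
```

Only the first of these four invented identifiers passes: the second starts with a digit, the third starts with an underscore, and the fourth contains a hyphen.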
For example, let’s say you have codes for advantages (let’s use code identifier adv) of various holiday destinations (let’s use identifier slugs beach and city). You use holiday destinations as parent codes (or clusters, or themes). You cannot then use the code identifier adv to represent both the advantages of beach holidays and the advantages of city holidays: those refer to different things and so should have unique identifiers. You can use for example beach_adv and city_adv as identifiers so it’s unequivocal what you code a data fragment with.
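The holiday example can be sketched in a few lines of Python. This is only an illustration of the naming idea: the helper names make_code_id and find_duplicate_ids are assumptions made for this sketch, and joining the parent and child slugs with an underscore is just one possible convention.

```python
from collections import Counter


def make_code_id(parent_slug: str, child_slug: str) -> str:
    """Prefix the child's slug with its parent's slug (beach + adv -> beach_adv)."""
    return f"{parent_slug}_{child_slug}"


def find_duplicate_ids(code_ids: list[str]) -> list[str]:
    """Return every identifier that occurs more than once, sorted alphabetically."""
    counts = Counter(code_ids)
    return sorted(code_id for code_id, n in counts.items() if n > 1)


parents = ["beach", "city"]

# Reusing the bare child identifier under every parent produces a clash.
flat_ids = ["adv" for _ in parents]

# Including the parent's slug keeps every identifier unique.
tree_ids = [make_code_id(parent, "adv") for parent in parents]

print(find_duplicate_ids(flat_ids))
print(find_duplicate_ids(tree_ids))
```

The first call reports adv as a duplicate; the second reports nothing, because beach_adv and city_adv are distinct.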
Code label
The code label is the human-readable name for a code. Because identifiers secure the ability to uniquely refer to a code, code labels do not need to be unique. In the above example, you could use the code label Advantages for both beach_adv and city_adv. However, this would still be messy; future you (and other researchers) will probably not be very grateful for that particular decision.
Code labels have no rules; they just have to be short. You can describe the concept that the code represents more fully in the code description.
Code description
The code description describes the concept that the code represents in as much detail as you want. Or, more accurately, maybe more detail than you want, but as much detail as is possible and useful. If you give somebody the code label and this description, they should obtain the same representation as you have. This means it’s important to be careful to avoid tacit knowledge, jargon, and other terms that may themselves be defined differently by different people.
In qualitative work, it is often said that the coding phase represents the analysis of the data. That work does not consist of applying codes; that work consists of iterative improvement of the codes, as specified in the code book. It can be tempting to pay lip service to the code book once you and your project team have consensus about the codes you want to use. After all, documenting things is time-consuming; and by making your assumptions about your shared representations explicit, it may turn out you didn’t agree quite as much as you thought beforehand. This prompts more discussion and maybe changes to the code description.
It is important to keep in mind that that is a feature, not a bug. The result of those discussions will be a comprehensive description of the theoretical concept you are interested in; one that is clear to all project members as well as, hopefully, to others. The product of the coding exercise is the development of the theoretical concepts the codes represent, and that work is done through code explication.
For example, imagine a project where we are interested in why people have the sleeping patterns they do. One code may be Descriptive norms (i.e. that’s the code label), with identifier descriptive_norms, and its description could be:
Descriptive norms are defined as somebody’s perception of the behavior of others in their environment (so-called “social referents”). Note that this is different from injunctive norms, which refer to somebody’s perception of the approval or disapproval of social referents.
(For more information about descriptive norms, see https://psycore.one/descriptiveNorms_73dnt5zp.)
Coding instructions
The coding instructions differ from the code description in that they describe when to apply the code, when not to apply the code, and which other codes to consider before deciding which code to apply. Coding instructions therefore do not describe the concept the code represents (as code descriptions do) but instead provide specific instructions to follow when evaluating data fragments as ones you potentially want to code with a code.
An example from the hypothetical project into sleeping patterns would be the instruction for the descriptive_norms code:
Apply this code when a data fragment expresses that somebody engages in certain sleeping pattern-related behavior because (they believe) other people do the same. This code should not be applied to people going to bed earlier or later in response to other behaviors of others (e.g. going to bed later because housemates play music until a late time).
Contrast this with the description: this is much more focused on the operational act of coding, not on defining what the code represents.
Examples
Code development is often guided by concrete examples: data fragments that you decide do or do not “fall within” a code. Similarly, concrete examples can help people to understand what your code represents “in the real world” and what it doesn’t. Together with the label, description, and coding instructions, the examples furnish the reader (which includes future you) with the full understanding of what the code represents, both conceptually and in its interface to empirical data.
Examples fall into four categories, defined by two characteristics. An example can be either an edge case (where it is not immediately obvious whether the code should be applied) or a core case (where it is clear whether the code should be applied); and an example can be either a match (where the code should be applied) or a mismatch (where the code should not be applied).
The four categories of code examples, then, are core matches, core mismatches, edge matches, and edge mismatches. By having a few examples of each, you make it clear to everybody what exactly you mean by the code (and what you don’t mean).
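To make the pieces concrete, here is a minimal sketch in Python of one code book entry that holds the five components discussed in this post, with the four example categories derived from the two characteristics. The class name Code, its field names, and its methods are assumptions made for this illustration, and the data fragment is invented. This is not the machine-readable standard mentioned below.

```python
from dataclasses import dataclass, field
from itertools import product

# Two characteristics, two values each, give the four example categories:
# core_match, core_mismatch, edge_match, edge_mismatch.
CASE_TYPES = ("core", "edge")
MATCH_TYPES = ("match", "mismatch")
EXAMPLE_CATEGORIES = [f"{c}_{m}" for c, m in product(CASE_TYPES, MATCH_TYPES)]


@dataclass
class Code:
    """One code book entry (hypothetical structure for this sketch)."""

    identifier: str
    label: str
    description: str
    instructions: str
    examples: dict[str, list[str]] = field(
        default_factory=lambda: {cat: [] for cat in EXAMPLE_CATEGORIES}
    )

    def add_example(self, category: str, fragment: str) -> None:
        """Store a data fragment under one of the four example categories."""
        if category not in self.examples:
            raise ValueError(f"Unknown example category: {category!r}")
        self.examples[category].append(fragment)

    def missing_example_categories(self) -> list[str]:
        """List the categories that do not have any example yet."""
        return [cat for cat, frags in self.examples.items() if not frags]


norms = Code(
    identifier="descriptive_norms",
    label="Descriptive norms",
    description="Somebody's perception of the behavior of others in their environment.",
    instructions="Apply when somebody engages in sleep-related behavior because others do the same.",
)
# The fragment below is invented for illustration only.
norms.add_example("core_match", "I go to bed at eleven because everyone in my house does.")
print(norms.missing_example_categories())
```

A check like missing_example_categories can serve as a reminder of which of the four categories still lack an example while the code book is being developed.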
A machine-readable code book specification
There is a standard for machine-readable code book specifications; an example is available at https://rock.science/codebook.