ROCK Basics

open science
research
qualitative methods
open standards
open source
Author

Gjalt-Jorn Peters

Published

December 7, 2024

Warning

This blog post is still a work in progress!

The Reproducible Open Coding Kit, the ROCK, allows you to do qualitative research in an Open Science way. It is a so-called Open Standard and a number of Open Source applications exist to work with ROCK files. This blog post is meant as a brief introduction to this standard.

Starting from coded ROCK sources

The simplest ROCK source

A very minimal coded ROCK source might look like this:

Being proud of oneself: is a
self-conscious emotion or self-directed [[emotion]]
emotion. It is based on considerations
of social standards and/or evaluation [[social]]
by oneself or by others.

Two codes are applied here. The first is identified by the code identifier “emotion”, and the second by the code identifier “social”. Each code identifier is delimited by double square brackets to designate that it represents the application of a code to a data fragment.

The key feature of the ROCK standard is that this coded source can be read by both humans and computers. The former enables easy sharing of coded data: people don’t need to install specific software to be able to understand what you did. The latter enables processing by computers to do everything that software for qualitative analysis can do.

Each line of this source is a data fragment. In the ROCK, such a data fragment is called an “utterance”, and utterances are the shortest codable segments of data. They are separated by line breaks.
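To illustrate how little is needed for a computer to read such a source, here is a minimal Python sketch. It is not how the {rock} package does it; the names `CODE_PATTERN` and `parse_utterances` are made up for this illustration, and the sketch only handles flat codes.

```python
import re

# Matches a flat code such as [[emotion]] (assumption: a letter followed by
# letters, digits, or underscores, in line with the identifier rule in this post).
CODE_PATTERN = re.compile(r"\[\[([a-zA-Z][a-zA-Z0-9_]*)\]\]")

source = """Being proud of oneself: is a
self-conscious emotion or self-directed [[emotion]]
emotion. It is based on considerations
of social standards and/or evaluation [[social]]
by oneself or by others."""

def parse_utterances(text):
    """Return one dict per line: the utterance without codes, and its code identifiers."""
    utterances = []
    for line in text.splitlines():
        codes = CODE_PATTERN.findall(line)
        clean = CODE_PATTERN.sub("", line).strip()
        utterances.append({"text": clean, "codes": codes})
    return utterances

for utterance in parse_utterances(source):
    print(utterance["codes"], utterance["text"])
```

Every line becomes one utterance, and the codes applied to it are listed next to the text they were applied to.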

Adding more information

It is good practice to attach unique identifiers to each utterance as well, to enable unequivocal reference to them and to make it easy to compare coding from independent coding efforts. Such identifiers are prepended to the utterances (so they don’t distract from the data). They are called UIDs (Utterance Identifiers), and you can see examples in the source below.

In addition, you often want to encode data provenance (e.g. the data provider). These are attached using so-called class instance identifiers. These identify instances of classes, where a class can be, for example, participants, the time of the data collection, the questions asked in an interview, or who coded the interview. Class instance identifiers look the same as utterance identifiers: they start with the class identifier (e.g., participantId), followed by an equals sign (=) or a colon (:), and end with the instance identifier (e.g., base).

If we add this to the rudimentary example source, it looks like this:

[[uid=7xf3dl1g]] [[participantId=base]] [[tssid=20240901T0700Z]]
[[uid=7xf3dl1k]] 
[[uid=7xf3dl1l]] [[taskId=label]]
[[uid=7xf3dl1m]] 
[[uid=7xf3dl1n]] pride
[[uid=7xf3dl1p]] 
[[uid=7xf3dl1q]] [[taskId=definition]]
[[uid=7xf3dl1r]] 
[[uid=7xf3dl1s]] Being proud of oneself: is a
[[uid=7xf3dl1t]] self-conscious emotion or self-directed [[emotion]]
[[uid=7xf3dl1w]] emotion. It is based on considerations
[[uid=7xf3dl1x]] of social standards and/or evaluation [[social]]
[[uid=7xf3dl1y]] by oneself or by others.

As you can see, one of the benefits of UIDs is that they enable efficient reference to a specific data fragment (‘utterance’). Within a source, the last three characters of a UID suffice as unique references, so we will use these to further discuss this source.

On the line identified by l1g we see only two class instance identifiers. [[participantId=base]] encodes that these data were provided by a participant with identifier base, and [[tssid=20240901T0700Z]] is the so-called “time-stamped source identifier” (the TSSID). This is the minute the data collection started in ISO 8601 standard format and converted to the UTC timezone. Together with UIDs, TSSIDs are a virtually guaranteed unique reference to any data fragment from any qualitative data collection.1
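As a small illustration of the TSSID format, the following Python sketch converts between a TSSID and a timezone-aware timestamp. The function names are hypothetical, and the format string is inferred from the example `20240901T0700Z` above rather than taken from the ROCK specification.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout, based on the example in this post: YYYYMMDD, 'T', HHMM, 'Z' (UTC).
TSSID_FORMAT = "%Y%m%dT%H%MZ"

def parse_tssid(tssid):
    """Parse a TSSID such as '20240901T0700Z' into a timezone-aware UTC datetime."""
    return datetime.strptime(tssid, TSSID_FORMAT).replace(tzinfo=timezone.utc)

def make_tssid(moment):
    """Format a timezone-aware datetime as a TSSID, converting to UTC first."""
    return moment.astimezone(timezone.utc).strftime(TSSID_FORMAT)

started = parse_tssid("20240901T0700Z")
print(started.isoformat())

# A data collection that started at 09:00 in a UTC+2 timezone
local_start = datetime(2024, 9, 1, 9, 0, tzinfo=timezone(timedelta(hours=2)))
print(make_tssid(local_start))  # 20240901T0700Z
```

Converting to UTC before formatting is what makes TSSIDs comparable across studies run in different timezones.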

On the line identified by l1l, we see [[taskId=label]]. Apparently, what follows now are data provided for that task (label). Knowing what that task entails requires knowing more about the study itself (these data are from the Alice study, see here). The data that follow are on the line identified by l1n: the word “pride”, which has not been coded yet.

On the line identified by l1q, [[taskId=definition]] tells us that the next section of data relates to the second task, identified as ‘definition’. These data are the data fragments we saw earlier, on the lines identified by l1s to l1y.

The ROCK standard describes how software can process ROCK sources such as this example. Software that does that (such as the {rock} package in R, or the Shiny ROCK Feldspar web app that processes one or more sources into what’s called the Qualitative Data Table) will then label all data following a class instance identifier with that identifier.

As such, after processing the data, it’s clear that the data on the line identified with l1n belongs to task ‘label’, and the data on the lines identified by l1s to l1y belongs to task ‘definition’. For all data, it’s clear they belong to the participant identified by base, and that they came from the source identified by 20240901T0700Z.
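The labelling logic described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (one utterance per line, empty utterances skipped, instance identifiers made of letters, digits, and underscores); it is not the {rock} implementation, and `label_utterances` is a hypothetical name.

```python
import re

# [[classId=instanceId]] or [[classId:instanceId]]
INSTANCE_PATTERN = re.compile(r"\[\[([a-zA-Z][a-zA-Z0-9_]*)[=:]([a-zA-Z0-9_]+)\]\]")
# Anything in double square brackets, used to strip codings from the text
ANY_CODING_PATTERN = re.compile(r"\[\[[^\]]*\]\]")

source = """[[uid=7xf3dl1g]] [[participantId=base]] [[tssid=20240901T0700Z]]
[[uid=7xf3dl1l]] [[taskId=label]]
[[uid=7xf3dl1n]] pride
[[uid=7xf3dl1q]] [[taskId=definition]]
[[uid=7xf3dl1s]] Being proud of oneself: is a
[[uid=7xf3dl1t]] self-conscious emotion or self-directed [[emotion]]"""

def label_utterances(text):
    """Attach the most recent instance of every class to each non-empty utterance."""
    current = {}   # class identifier -> instance identifier currently in force
    rows = []
    for line in text.splitlines():
        uid = None
        for class_id, instance_id in INSTANCE_PATTERN.findall(line):
            if class_id == "uid":
                uid = instance_id          # UIDs apply to this line only
            else:
                current[class_id] = instance_id  # persists for following lines
        content = ANY_CODING_PATTERN.sub("", line).strip()
        if content:
            rows.append({"uid": uid, "text": content, **current})
    return rows

for row in label_utterances(source):
    print(row)
```

When a new instance of the same class appears (here, a second `taskId`), it simply replaces the previous one from that line onward.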

These class instance identifiers can also be used to attach attributes to data fragments. For example, participant base may be somebody who, at the time of data collection, was aged between 22 and 30, identified as a woman, and resided in a rural area. These attributes will be attached to all data fragments coded with the corresponding class instance identifier (in our example, all data, since this class instance identifier was on the first line).

If another class instance identifier for another instance of the same class appears later in the source, from that point on the attributes of that class instance will be attached instead. Attributes can be specified in a ROCK source in the YAML format, which will be explained below.

Now, it’s time to dive a bit into the idea of the ROCK standard.

The ROCK fundamentals

The ROCK format has a number of features, but for now I’ll focus on the most commonly used ones.

A ROCK file is a so-called plain text file. Plain text files are pretty self-explanatory: they’re files with text, plain text. So unlike text processor files, they have no markup: no bold, no italic, no bulleted lists, no tables, no hyperlinks, nothing. Kieran Healy wrote an excellent introduction with The Plain Person’s Guide to Plain Text Social Science, as did Tim Elder with Introduction to Plaintext for Research and Writing.

The ROCK uses plain text because it’s the most open file format there is. Anybody can open plain text files; in fact, pretty much all operating systems come with plain text editors.

The ROCK file extension

File extensions are the last characters of the filename, the ones following the period (.). They tell computers and humans about the type of information in a file. Some common file extensions are .jpg for photographs, .html for webpages, .mp3 for audio files, and .txt for regular plain text files. Some operating systems even go so far as to hide the file extension from users, instead showing icons to represent the file type. The extension for ROCK files is .rock, so this is how you can recognize ROCK files.

ROCK codes

The ROCK standard consists of two things: concepts and conventions on how to represent them in plain text files. The concepts are the most important part, but also invisible: they embody a set of decisions on how to represent coded qualitative data.2

The way the ROCK represents coding (i.e. the application of a code to data), on the other hand, is very visible. Most codings are delimited by double square brackets, which contain one or more identifiers, such as code identifiers or class identifiers and class instance identifiers.

Identifiers (“ids” for short) are unique character sequences that represent something; for example, a code or a person. Identifiers start with a Latin letter (a-z or A-Z), followed by one or more Latin letters, Arabic digits (0-9), and/or underscores (_).
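The identifier rule can be written down as a regular expression. The Python sketch below follows the wording above literally (one letter followed by one or more further characters); if single-letter identifiers turn out to be allowed, the `+` would become a `*`. Note that the UIDs and TSSIDs in the earlier examples start with a digit, so this sketch is only meant for code identifiers and class identifiers. The function name is hypothetical.

```python
import re

# A Latin letter, followed by one or more Latin letters, digits, or underscores
IDENTIFIER_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]+")

def is_valid_identifier(candidate):
    """Check whether the whole string follows the identifier rule described above."""
    return IDENTIFIER_PATTERN.fullmatch(candidate) is not None

for candidate in ["emotion", "causal_pos", "warmWeather", "2cool", "self-directed"]:
    print(candidate, is_valid_identifier(candidate))
```

The last two candidates are rejected: one starts with a digit, the other contains a hyphen.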

Coding patterns

The following coding patterns are defined.

Basic coding patterns

First the three we already saw:

  • [[codeId]]: the simplest coding just applies a flat code, without specifying any relationship to other codes.

  • [[classId:instanceId]] or [[classId=instanceId]]: class instance identifiers encode data provenance, such as the data provider (e.g. a participant), the data collection location (e.g. a venue), or the coder (e.g. a researcher). Class instances can be used to efficiently attach attributes to data fragments; for example, to label all data provided by a participant from a rural area as such.

  • [[uid:xxx]] or [[uid=xxx]]: Utterance identifiers enable unequivocal reference to specific data fragments. They are also required to compare or merge codings from independent coders.

Advanced coding patterns

There are three more advanced patterns. The first is supported by most qualitative software:

  • [[parentCodeId>childCodeId]]: a hierarchical coding, or tree coding, specifies a parent-child relationship between codes.

The other two are as yet unique to the ROCK, and are mostly used for Advanced Qualitative/Unified Analysis (see below):

  • [[codeId||value]]: this specifies a value for the code referenced by the code identifier. This is useful for registering values inductively. Examples are age or gender identity, but also categories or degrees, for instance to code the intensity of statements.

  • [[fromCodeId->toCodeId||edgeCodeId]]: a network coding specifies a generic relationship. Unlike with hierarchical/tree coding, it is possible to specify the type of relationship. In addition, structures don’t necessarily need to be nested. This is discussed in more detail in the Qualitative Network Approach (QNA) section below.
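To make the differences between these coding patterns concrete, here is a rough Python sketch that tells them apart by their delimiters. It is a simplification for illustration only: `classify_coding` is a hypothetical helper, `[[intensity||3]]` is a made-up value coding, and real software would validate the identifiers as well.

```python
def classify_coding(coding):
    """Classify one coding such as '[[codeId]]' by the pattern it follows."""
    body = coding.strip()[2:-2]  # drop the surrounding [[ and ]]
    if "||" in body:
        target = body.split("||")[0]
        # A relationship between two codes before the '||' marks a network coding
        return "network" if ">" in target else "value"
    if "=" in body or ":" in body:
        if body.startswith(("uid=", "uid:")):
            return "utterance identifier"
        return "class instance"
    if ">" in body:
        return "hierarchical"
    return "flat"

examples = [
    "[[emotion]]",
    "[[participantId=base]]",
    "[[uid=7xf3dl1g]]",
    "[[parentCodeId>childCodeId]]",
    "[[intensity||3]]",
    "[[coffee->energetic||causal_positive]]",
]
for example in examples:
    print(example, "is a", classify_coding(example), "coding")
```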

Structural coding patterns

In addition to specifying code identifiers to indicate how you want to code data, you can also specify the data structure in the ROCK. To do this, you use structural coding patterns.

Segmenting the data

The first type are section breaks, used to segment the data. These look as follows:

--<<sectionBreakId>>--

They enable analysing the data per segment, for example collapsing the codings per segment. You can use multiple segmentation schemes in parallel by using different section break identifiers (note that segmentation decisions can have far-reaching implications; see doi.org/j5sx).
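As an illustration of collapsing codings per segment, consider the Python sketch below. The source text, the section break identifier `topicSwitch`, and the function name are all made up for this example, and the sketch assumes that every section break sits on its own line.

```python
import re

SECTION_BREAK_PATTERN = re.compile(r"--<<([a-zA-Z][a-zA-Z0-9_]*)>>--")
CODE_PATTERN = re.compile(r"\[\[([a-zA-Z][a-zA-Z0-9_]*)\]\]")

# Hypothetical coded source with two section breaks
source = """I usually sleep well. [[sleep]]
--<<topicSwitch>>--
Coffee helps me wake up. [[coffee]]
I drink it every morning. [[coffee]]
--<<topicSwitch>>--
Exercise matters too. [[exercise]]"""

def codes_per_segment(text, section_break_id):
    """Collect the flat codes per segment, splitting on one section break identifier."""
    segments = [[]]
    for line in text.splitlines():
        match = SECTION_BREAK_PATTERN.search(line)
        if match:
            # Only the requested segmentation scheme starts a new segment;
            # breaks belonging to other schemes are skipped.
            if match.group(1) == section_break_id:
                segments.append([])
            continue
        segments[-1].extend(CODE_PATTERN.findall(line))
    return segments

print(codes_per_segment(source, "topicSwitch"))
```

Because the segmentation scheme is passed in as an argument, the same source can be segmented in several ways in parallel.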

Nested data

For data that are nested (e.g. social media data where posts reply to other posts), this nesting can be indicated using tildes (~), for example:

~ This is a post.
~~ This is a reply to that post.
~~~ This is a third reply.
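A sketch of how software could recover this nesting, in Python: the number of tildes gives the depth, and each line is assumed to reply to the most recent line that is one level shallower. That parent rule and the function name are assumptions made for this illustration.

```python
import re

# One or more tildes at the start of the line, then the utterance itself
NESTING_PATTERN = re.compile(r"^(~+)\s*(.*)$")

source = """~ This is a post.
~~ This is a reply to that post.
~~~ This is a third reply."""

def parse_nesting(text):
    """Return (depth, parent_index, text) per line; parent_index is None at the top level."""
    rows = []
    last_at_depth = {}  # depth -> index of the most recent row at that depth
    for line in text.splitlines():
        match = NESTING_PATTERN.match(line)
        if not match:
            continue
        depth = len(match.group(1))
        # Forget deeper levels: they belonged to an earlier branch
        last_at_depth = {d: i for d, i in last_at_depth.items() if d < depth}
        parent = last_at_depth.get(depth - 1)
        rows.append((depth, parent, match.group(2)))
        last_at_depth[depth] = len(rows) - 1
    return rows

for depth, parent, text in parse_nesting(source):
    print(depth, parent, text)
```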

Anchors to enable synchronization

Finally, anchors can be used to attach timestamps, useful to synchronize multiple data streams. Anchors look like this:

--+-{ anchorId }-+--

As with all identifiers, anchor identifiers have to be unique, and they have to appear in exactly the same order in all sources that are to be synchronized.
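These two requirements can be checked before attempting to synchronize anything. The Python sketch below is a strict reading of them (both sources must contain the same anchors in the same order); the two example sources and the function names are hypothetical.

```python
import re

# --+-{ anchorId }-+--  with optional whitespace around the identifier
ANCHOR_PATTERN = re.compile(r"--\+-\{\s*(.+?)\s*\}-\+--")

def anchors_in(text):
    """Return the anchor identifiers in the order in which they appear."""
    return ANCHOR_PATTERN.findall(text)

def can_synchronize(primary, secondary):
    """Check that anchors are unique within each source and in the same order in both."""
    first, second = anchors_in(primary), anchors_in(secondary)
    unique = len(set(first)) == len(first) and len(set(second)) == len(second)
    return unique and first == second

# Two made-up streams covering the same session
interview = "--+-{ start }-+--\nHello.\n--+-{ break }-+--\nWelcome back."
notes = "--+-{ start }-+--\nParticipant smiles.\n--+-{ break }-+--\nParticipant sits down."

print(anchors_in(interview))
print(can_synchronize(interview, notes))
```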

Advanced Qualitative/Unified Analysis (AQUA)

As mentioned above, the ROCK supports Advanced Qualitative/Unified Analysis (AQUA). In this blog post, three AQUA analyses will be briefly covered: Anchor-based Stream Synchronization, the Qualitative Network Approach, and Qualitative/Unified Exploration of State Transitions.

Anchor-based Stream Synchronization

Sometimes you have multiple data streams that you want to code and analyze together. In that case you will want the data streams to be synchronized, so that the codes correspond to roughly the same moments in time.

To this end, you can specify anchors as described above. This makes it possible for software that supports this functionality to synchronize multiple streams to a primary stream. If you want to do this, you have to include, in addition to anchors, class instance identifiers to identify the source and the stream. Specifically, the class identifiers to use are sourceId and streamId.

The following fragment of a coded source shows how this can look:

[[uid=7xy80zrp]] [[sourceId=participant6]] [[tssid=20241219T0954Z]]  
[[uid=7xy80zrq]] [[streamId=interviews]]
[[uid=7xy80zrr]]
[[uid=7xy80zrs]] --+-{ 2024-12-19 09:54 }-+--
[[uid=7xy80zrt]] 
[[uid=7xy80zrw]] At first, we were just friends.
[[uid=7xy80zrx]] However, soon I started developing feelings

For an application of Anchor-based Stream Synchronization, see doi.org/nxdj.

Qualitative Network Approach (QNA)

To explain QNA, we’ll first zoom out a bit. In qualitative research, your data collection is usually aimed at being as “lossless” as possible: you aim to register everything that is relevant, erring on the side of registration. As a consequence, you often collect audio or video data. To systematically detect patterns in the data (in a way where we make it as hard as possible to fool ourselves), we devise codes that define those patterns, and we attach those codes to the data where we identify the pattern. This simplest mode of code organization is called flat coding.

Because “everything is related”, as it were, the phenomena we study are also usually related to each other. Therefore, we often want to also code relationships. The simplest way to do this is to nest codes within each other. This mode of code organization is called hierarchical or tree coding.

However, hierarchical or tree coding requires that code relationships are all the same type (and the type of this relationship is usually not specified). Often, codes can be related in different ways; and not all relationships are nested.

In those situations, you can use network coding, as implemented by the Qualitative Network Approach (QNA). When you use QNA, you always specify three codes: one representing at which code the association starts; one representing at which code the association ends; and one representing what type of association you observed. For example, if you code an interview transcript and somebody expresses the belief that drinking coffee causes them to feel energetic, you could use [[coffee->energetic||causal_positive]].

Software (like the {rock} R package or the Shiny ROCK Crystal web app) can then parse a coded source and produce the network that was coded, visualizing the relationships you observed.

For example, this fragment of coded data:

[[uid=7yww3938]] When I'm tired, I often get cranky. [[tired->cranky||causal_pos||1]]
[[uid=7yww3939]] In general, such things matter for my mood. [[cranky->mood||structural||1]]
[[uid=7yww393b]] For example, when I'm hungry, I also get cranky. [[hungry->cranky||causal_pos||1]] [[hungry->cranky||causal_pos||1]]
[[uid=7yww393c]] And when I have coffee, I feel cheerful. [[cheerful->mood||structural||1]] [[coffee->cheerful||causal_pos||1]]
[[uid=7yww393d]] But that's also because it's a drug I think. [[coffee->drug||structural||1]]
[[uid=7yww393f]] And coffee also makes me feel less tired of course. [[coffee->tired||causal_neg||1]]
[[uid=7yww393g]] My mood is also influenced by the weather. [[weather->mood||causal||1]]
[[uid=7yww393h]] Actually, the weather also matter for how hungry I get. [[weather->hungry||causal||1]]
[[uid=7yww393j]] For example, if it's very warm, I get much less hungry. [[warmWeather->weather||structural||1]] [[warmWeather->hungry||causal_neg||1]]

Produces this network:

source <- "

[[uid=7yww3938]] When I'm tired, I often get cranky. [[tired->cranky||causal_pos||1]]
[[uid=7yww3939]] In general, such things matter for my mood. [[cranky->mood||structural||1]]
[[uid=7yww393b]] For example, when I'm hungry, I also get cranky. [[hungry->cranky||causal_pos||1]] [[hungry->cranky||causal_pos||1]]
[[uid=7yww393c]] And when I have coffee, I feel cheerful. [[cheerful->mood||structural||1]] [[coffee->cheerful||causal_pos||1]]
[[uid=7yww393d]] But that's also because it's a drug I think. [[coffee->drug||structural||1]]
[[uid=7yww393f]] And coffee also makes me feel less tired of course. [[coffee->tired||causal_neg||1]]
[[uid=7yww393g]] My mood is also influenced by the weather. [[weather->mood||causal||1]]
[[uid=7yww393h]] Actually, the weather also matter for how hungry I get. [[weather->hungry||causal||1]]
[[uid=7yww393j]] For example, if it's very warm, I get much less hungry. [[warmWeather->weather||structural||1]] [[warmWeather->hungry||causal_neg||1]]

";

parsedSource <- rock::parse_source(text = source);

DiagrammeR::render_graph(
  parsedSource$networkCodes$network$graph
);
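To show that network codings are also easy to read without dedicated software, here is a minimal Python sketch that extracts an edge list from a few of the lines above. It is not what the {rock} package does internally. The pattern is inferred from the examples in this post, and because the meaning of the trailing number is not described here, the sketch just keeps it as-is under the hypothetical key `value`.

```python
import re

# [[from->to||type]] with an optional trailing ||number, as in the examples above
NETWORK_PATTERN = re.compile(
    r"\[\[([a-zA-Z][a-zA-Z0-9_]*)->([a-zA-Z][a-zA-Z0-9_]*)"
    r"\|\|([a-zA-Z][a-zA-Z0-9_]*)(?:\|\|([0-9.]+))?\]\]"
)

source = """[[uid=7yww3938]] When I'm tired, I often get cranky. [[tired->cranky||causal_pos||1]]
[[uid=7yww3939]] In general, such things matter for my mood. [[cranky->mood||structural||1]]
[[uid=7yww393f]] And coffee also makes me feel less tired of course. [[coffee->tired||causal_neg||1]]"""

def network_edges(text):
    """Return one dict per network coding found in the text."""
    return [
        {"from": start, "to": end, "type": edge_type, "value": value or None}
        for start, end, edge_type, value in NETWORK_PATTERN.findall(text)
    ]

for edge in network_edges(source):
    print(edge["from"], "->", edge["to"], "(" + edge["type"] + ")")
```

An edge list like this is all a graph library needs to draw the network.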

Qualitative/Unified Exploration of State Transitions (QUEST)

When you want to use Qualitative/Unified Exploration of State Transitions (QUEST), you have to specify state identifiers in the data to code the state. State identifiers are class instance identifiers for class state.3

For example, the following fragment of a coded source contains four coded states, and therefore three state transitions:

[[uid=7xy80zrw]] At first, we were just friends. [[state=friends]]
[[uid=7xy80zrx]] However, soon I started developing feelings [[state=crush]]
[[uid=7xy80zry]] for them. For a few weeks, I wasn't sure
[[uid=7xy80zrz]] how they felt, but eventually we started dating. [[state=dating]]
[[uid=7xy80zs0]] As we got to know each other better, we discovered
[[uid=7xy80zs1]] that we wanted very different things from
[[uid=7xy80zs2]] life. We ultimately decided to part as friends. [[state=friends]]

Analysing this source with appropriate software (e.g., the {rock} R package) can then show the state transition diagram. Specifically, the {rock} package produces the following result for this coded data fragment (admittedly not incredibly exciting):

source <- "

[[uid=7xy80zrw]] At first, we were just friends. [[state=friends]]
[[uid=7xy80zrx]] However, soon I started developing feelings [[state=crush]]
[[uid=7xy80zry]] for them. For a few weeks, I wasn't sure
[[uid=7xy80zrz]] how they felt, but eventually we started dating. [[state=dating]]
[[uid=7xy80zs0]] As we got to know each other better, we discovered
[[uid=7xy80zs1]] that we wanted very different things from
[[uid=7xy80zs2]] life. We ultimately decided to part as friends. [[state=friends]]

";

parsedSource <- rock::parse_source(text = source);

exampleTable <- rock::get_state_transition_table(
  parsedSource
);

exampleStateDf <- rock::get_state_transition_df(
  exampleTable
);

exampleDotCode <- rock::get_state_transition_dot(
  exampleStateDf
);

DiagrammeR::grViz(exampleDotCode);

You can then use the states and state transitions for analyzing coding patterns (for example, study what precedes transitions from a given state to another state). For an example in the wild, see doi.org/nxdj.
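The idea behind a state transition table can be sketched in Python as follows. This is a simplified illustration, not the {rock} implementation: it counts every step from one coded state to the next coded state, so a state that is coded twice in a row would count as a transition to itself. The function name is hypothetical.

```python
import re
from collections import Counter

# State identifiers are class instance identifiers for the class 'state'
STATE_PATTERN = re.compile(r"\[\[state[=:]([a-zA-Z0-9_]+)\]\]")

source = """[[uid=7xy80zrw]] At first, we were just friends. [[state=friends]]
[[uid=7xy80zrx]] However, soon I started developing feelings [[state=crush]]
[[uid=7xy80zry]] for them. For a few weeks, I wasn't sure
[[uid=7xy80zrz]] how they felt, but eventually we started dating. [[state=dating]]
[[uid=7xy80zs0]] As we got to know each other better, we discovered
[[uid=7xy80zs1]] that we wanted very different things from
[[uid=7xy80zs2]] life. We ultimately decided to part as friends. [[state=friends]]"""

def state_transitions(text):
    """Count the transitions between consecutive coded states."""
    states = STATE_PATTERN.findall(text)
    # Pair every state with the one that follows it
    return Counter(zip(states, states[1:]))

for (from_state, to_state), count in state_transitions(source).items():
    print(from_state, "->", to_state, ":", count)
```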

Getting started with the ROCK

If you would like to try the ROCK, there are roughly four routes.

First, you could just get going and start playing around. If you’re familiar with R, you can install the {rock} package and then you have access to all functions there; see rock.opens.science for the PkgDown website with the documentation.

If you’re not familiar with R, you can play with the web apps. There’s a series of Shiny ROCK apps. For example, you can prepare a ROCK source using Shiny ROCK Emerald. You can then code the source with Shiny ROCK Diamond. Finally, if you applied network coding, you can view the produced network with Shiny ROCK Crystal. You can also produce the Qualitative Data Table with Shiny ROCK Feldspar.

Second, you can start a project. There’s a great template for qualitative projects, the Simple Qualitative Administration For File Organization, Licensing, and Development (SQAFFOLD), available at www.sqaffold.org. You can find both a workshop and a prepared template there.

Third, there’s a two-hour workshop on the ROCK available, as well as a three-hour workshop. There is also a tutorial available in the ROCK book.

Fourth, there’s an open access article discussing the ROCK standard as well as Epistemic Network Analysis, a way to visualize qualitative data that can be helpful depending on your research question.

Footnotes

  1. UIDs are always exactly eight characters, at least until 2177-11-28 13:00:00 UTC, when they become nine characters; see rock::numericToBase30(as.numeric(as.POSIXct("2177-11-28 13:00:00 UTC")) * 100) in R.

  2. These decisions underlie any software. They are necessary to enable analyses; and at the same time, they constrain what is possible, and have nontrivial epistemological consequences.

  3. By default the class identifier for states is state. In theory you could use another class identifier. However, this is discouraged as it means others will not be able to understand your coded sources unless they know which state identifier you used instead.