The Shapes of Data: Associative

01 February 2025

tl;dr The shape of data that's associative, or the key-value data store, is a style of single-dimensional, making the key the biggest part of the shape.

Refresher

In a key-value store, like Redis, data is stored in a collection made up of tuples of keys and values. Keys must always be unique, at least within a single "database". (Some data stores may support multi-tenancy "databases", where each is entirely isolated from one another.) Keys are typically complex data types in their own right, often made up of multiple parts, much like filesystem directory paths are made up of pieces. Values are not given any special treatment within the database, treated entirely as an opaque binary byte array.

Key	Value
K1	AAA
K2	Call me Ishmael.
K3	CCC123DDD456

Different key-value stores will have different semantics around this fundamental data structure. Some will treat the parts of the key as a physical segregation of the dataset into smaller (and more easily retrieved) collections, while others will keep all the tuples in one large collection. Still others may allow for distinct collections, essentially providing a means to provide multi-tenancy, similar to how a relational database can host different named "databases" each with its own schema.

Key-value stores typically are entirely schemaless beyond the basic key-value data structure. Because the database infers no structure on the value, the values themselves have no database-enforced requirements on length, content, or structure.

A Thought Experiment: A schema'ed key-value store could put schematic requirements on the value, verifying that the value follows some kind of file format, like BMP or JPG, or other binary format, like Java or .NET object serialization formats. Describing the binary formatting requirements would necessitate a highly-flexible descriptor language, maybe something akin to how other binary formats are specified (there's a history of different formats and IDLs here to choose from), and it would ideally be something open-ended to allow for pluggable run-time verification (maybe), but the idea is intriguing.

Query capabilities

This one's pretty easy, since realistically, there are none: the database typically allows for retrieval of the value by looking up the key, and that's it.

Some databases will return a multivalued response, allowing the query to put wildcards in for some part of the key (much as ls /Users/Ted/Documents/* returns a list of the files and directories in that location).

Because the associative data shape doesn't do any inference in the value half of the tuple, there's really little to nothing we can do with anything except the key itself. The key, as noted earlier, can be seen as a kind of partitioning element to segregate the collection(s) in some sort of logical manner, but fundamentally this is a hierarchical partitioning, at best.

As for the value, most (if not all) associative data stores support zero sort of filtering criteria on the value. We query on the key--that's it.

Shape analysis

In many respects, the programming model here is very similar as the "untyped collections" that popularized the Java and .NET communities early in their respective lifecycles (C#/.NET 1.0, where Java didn't get strongly-typed collections until Java5). In both Java and C#, programmers clamored for type-parameterizable collections, citing the concern over errors introduced by accidental insertion of unanticipated types in the collections held. However, it's also important to note that in the "untyped" programming languages Javascript, Python, and Ruby, the same call for type-safety hasn't been as strong, likely because those developers write their code in such a way to verify their expectations before use.

With the compound nature of keys, different "parts" of the key can be hierarchically variable; for example, if the stored value is "user session data", the key can be written to consist of both static and variable parts, the static being users and the variable being the user profile ID, such that the overall appearance of the key looks like users:{user-profile-ID}. This can also easily mirror what we see in URL slugs, a la topic:{topic-name}:articles:{article-id}. So long as the key's parts are inherently hierarchical, the key makes for a convenient means to segregate the space. As a result, any desirable filter criteria should be encoded into the key.

As one of the simplest "shapes" in the database world, the associative shape's entire ignorance of the value offers some interesting options:

rather than being seen as a single value, the value can instead be an open-ended collection of values, such as a list or an array. Here we assume either some sort of structure or defining element markers (such as the comma in a comma-delimited list) which the programmer will parse after retrieving.
in a logical extension of the above, we can treat the value associated with a specific key as an "append-only" and/or "growable" storage--take the value, append additional data to it, and re-store it under the same key.
the extreme flexibility of values means the keyspace can be entirely heterogenous, storing "a little bit of anything", creating an unparalleled flexibility, but which in turn will require the most amount of programmer-required validation, verification, or maintenance, because the database offers absolutely zero guarantees about what can (or is) stored. This flexibility is also what makes associative datastores popular as caching systems.

tuplespace systems, as exemplified by the academic sysem Linda and its semi-commercial spinoffs TSpaces (from IBM) and JavaSpaces (from Sun) were essentially associative storage, using the identity of an object (extended in this case to include the node in the network from whence it came) as the key and a serialized (often binary, using the object serialization capabilities of the Java language) representation of the object as the value. Microsoft's Orleans seems to be operating along similar lines, calling the key-value tuple a "grain" and offering several different flavors of keys:

* IGrainWithGuidKey: Marker interface for grains with Guid keys.
* IGrainWithIntegerKey: Marker interface for grains with Int64 keys.
* IGrainWithStringKey: Marker interface for grains with string keys.
* IGrainWithGuidCompoundKey: Marker interface for grains with compound keys.
* IGrainWithIntegerCompoundKey: Marker interface for grains with compound keys.

Because of the specific key-based nature of the database operations, most associative stores are extremely fast, and often highly scalable, since the likelihood of two separate processes operating on the same key are highly unlikely. Some databases will, in fact, trust the developer to manage concurrency entirely. This also suggests that the data stored in an associative system should be more or less independent of one another, since modifying two key-value tuples at once may not have transactional semantics.

Variant: Tupled data stores

A variant of the associative data approach is that of the "tupled" data, such as an Entity-value-attribute mode store, or a "triplestore", such as RDF triplets. Here, the association is a multi-part mesh of data elements, such as in RDF, which is "subject-predicate-object" (a la "Bob is 35" or "Bob knows Fred"), or more formally in the entity-attribute-value model, wherein an entity has an attribute that holds a value (such as "Bob-age-35" or "Bob-list(knows)-[Fred, Barney, Wilma]").

The interesting thing about these "tupled data shapes" is that they are significantly flexible, since the metadata is inherent in the structure; a "Person", for example, would be an object consisting of three tuples or entity-attribute-value triplets, so that the "person" object would look something like:

ted-firstname-"Ted"
ted-lastname-"Neward"
ted-age-50

In some ways, this feels like a traditional object-field-value, but the inherent dynamic nature of the tuples means that not every "person"-shaped object must have the same set of attributes:

cher-name-"Cher"
cher-age-"timeless"

meghan-firstname-"Meghan"
meghan-lastname-"Markle"
meghan-title-"Princess"
meghan-court-"England"

More importantly, each of these is (somewhat) independent of any others, we can add new tuples/triplets at any point, making this effectively a highly "dynamically-typed" way to think about the organization of data and objects. That said, thogh, it's still important to remember that these are not "objects", per se, but a list, essentially of tuples/triplets that, taken collectively, can be viewed as objects.

Or, if desired, as tables:

1-firstname-"Ted"
1-lastname-"Neward"
1-age-50

... where we assume the "entity" in the entity-attribute-value tuple is a primary key value.

The reason this feels like it falls under the associative data style is that we can map another tuple, like person-primarykey-1, which, so long as this tuple is unique to the associative store, we gain an interesting degree of flexibility. We can also imagine that tuple as parts of the key, as well, such that we can envision the data looking like the following:

Key	Value
person / 1 / firstname	"Ted"
person / 1 / lastname	"Neward"
person / 1 / age	50
person / 2 / name	"Cher"
person / 3 / firstname	"Meghan"
person / 3 / lastname	"Markle"
person / 3 / title	"Princess"

... and so on. This can create some interesting flexibility in structure.

Popular implementations ¹

In alphabetical order:

https://www.dragonflydb.io/guides/key-value-databases
↩

Tags: engineering storage database