31 January 2025

tl;dr Dating all the way back to the earliest of databases, data has had a "shape" to it, a natural schematic form that defines the various "atoms" and "molecules" that the database understands naturally. Relational data, such as that described by SQL and relational databases, is only one such shape; the various "NoSQL" databases often offer a different shape, but it's not always the same. What's more, just because a database uses a non-relational shape doesn't mean it's a natural fit to your object-programming language's code. When it doesn't, you have an impedance mismatch, which has to be solved somewhere (usually to the detriment of your code and complexity budget).

(BTW, much of this material is stuff I've covered in the past in talks and other materials, but never really with this level of focus; if it sounds familiar, you've probably already heard me yak on about it before, so you might choose to move on.)

The Atom of Data: The Bit

Fundamentally, all data boils down to a single bit, 1 or 0. Everything that ever crosses a CPU, resides in RAM, or is stored on disk, is made up of dozens, hundreds, thousands, millions or even billions of these. Beyond that, though, the bit is entirely uninteresting, because with the very rare exception of either writing assembly code by hand or working with bitmasks in a higher-level language (lookin' at you, C), we don't really deal with bits very often. Even native boolean types are often "widened" out to a larger value.

Er, I mean...

So while the bit is useful and necessary to understand at a foundational level, its purpose here is to combine in various ways to create higher-level data types, which are usually made atomic at the language or data-storage level:

But these are just numbers; a good chunk of our data is stored in human text, a la strings, which are collections of 8-bit bytes (if it's stored in ASCII) or 16- to 64-bit Unicode Transfer Format chunks; this is where UTF-8, UTF-16, and/or UTF-32 comes into play. And, in many cases, since we don't know how long a string should be (are we storing a single word from the dictionary, or the entire collected works of Shakespeare?), we have to either set a special value aside to indicate the end of the string (C used the value 0, or NULL, to indicate the end of its byte-wide strings storing ASCII characters, hence giving us the term "NULL-terminated ASCII string"), or we have to capture the length somewhere (Pascal put the length in the first byte, so that the first character in the string was always at index 1, which neatly offset the C quirk of the first byte being at index 0, but also led to a max size of 255 characters for the string), or we just have to set a fixed size and assume the string falls under that length, or something.

Wait, we were talking about databases

Any database system worth its weight in weightless bits will define a set of "native" types that it understands atomically, just as programming languages do. The SQL standard, for example, defines a set of types like VARCHAR, a length-unlimited string type, or VARCHAR(2000), a string that can store up to 2000 characters. There's usually some performance- or feature-related consequences for choosing one over another, but these types form some of the core "atoms" of what we can store in the database.

Then, the database typically defines some kind of "higher", "molecule"-like, thing that allows for programmers to "assemble" larger things (often called "entities", since that's a nice vendor-agnostic term) out of the core atoms. So, for example, if we want to store demographic information about a person, we might want to store:

erDiagram
    PERSON {
        string familyName
        string givenName
        int age
    }

We sometimes even get a little nutty and suggest that persons can be in a relationship to one another:

erDiagram
    PERSON  |o--o| PERSON : marries
    PERSON {
        string familyName
        string givenName
        int age
    }

It's not all that unlike creating user-defined types in programming languages (be they classes, records, unions, whatever). The really interesting question is if and/or how the database provides support for the definition of this user-defined type. If, for example, the database natively understands that "marries" relationship defined above, then we can use the database's native facilities (query language, for example) to retrieve the data we need more precisely (and more quickly and more scalably and more efficiently and...).

(Of course, in many scenarios, scalability, efficiency, or performance aren't high-need requirements--of course they're always nice to have, but sometimes you just don't care about that as much as you do other concerns, like "development time". In those cases, developers should of course use whatever tools are most appropriate--in a shop that knows Postgresql really well, use Postgres for everything! Just be aware that there's always some cost to navigating between these different "shapes"--the "impedance mismatch"--and after a certain point, that cost will want to be recuperated.)

Schema vs schemaless

Those databases which allow us to define a formal definition for these user-defined types are said to support the notion of a database-enforced schema, and the most common examples of these are the relational databases of the world. (As a matter of fact, I can't think of an RDBMS that doesn't enforce, nay require, a schema definition for the database.) The database will have you define what your user-defined entities look like, and then go to great lenths to make sure that whatever gets stored in the database is in agreement with what's defined in the schema.

Contrast this with the recent spate of "NoSQL" databases, most of which not only changed the shape of the data they were storing, but also removed the notion of schema from the picture, because it turns out that most developers weren't really all that frustrated with SQL (well, not really), but with the fact that the databse was enforcing that the schema had to be defined, and then enforcing what was in the schema. Developers were finding that as they iterate on a project, they often need to refactor the "things" (classes, tables, files, whatever) they're defining, and while programming languages had pretty decent refactoring support, schema-enforcing relational databases did not. So the NoSQLs said, "Let's drop the schema!" And developers said, "Awesome!" (And then a typo wrecked the developers' whole week when they missed it in testing and it made its way into Prod, but that came later.)

The point to understand here is schema is not a property of the data's shape. We could have schema-less relational databases (we often call them CSV files), and we could have schema-enforced databases of other shapes (XML, for example, defined XSD, aka "XML Schema", to do exactly this). We don't see the full spectrum of all the combinations mostly because... well, I dunno why. Just seems like developers aren't interested in what some of those other combinations are (which, to be clear, I think is a mistake--those other combinations are fascinating).

Now, the last "atomic" type the database typically understands, natively, is some kind of association between atoms. It might be something as simple as a key-value pairing (associating the key to the value), or it might be as complicated as a foreign-key relationship in a relational data model, but typically each database understands that things have to relate to one another in some fashion.

And it's these relationships that typically define the "interesting" parts of what makes up the shape of the data in the database.

The known universe

As near as I can count, there's only a few data shapes, which I'll go over in separate posts:

(I haven't finished writing up the graph and hierarchical notes yet, those are next)

There's often some variance between particular examples of each, but fundamentally, these appear (to me!) to be the high-level distinguishing models. (Of course, if more models make themselves appear, I'll add them to the list.)

One of the older data shapes is that of the navigational or "network" data model, but I think, after some initial analysis/thought, that the network shape really is either entirely subsumed inside the object or graph shapes, or it sits somewhere in between the two on the Venn diagram, overlapped by each of the other two, and lying entirely within the space of the two combined. (I might be convinced otherwise, but somebody would have to work at it.)

There's also a class of database products that try to embrace more than one of these shapes simultaneously, calling themselves "multi-model" databases. In my limited experience with them, I've found that they usually embrace one model at the heart of their internal operation, and then project other models on top of that singular, internal model. (Which usually means that in a head-to-head shootout with a database product that "natively" understands a shape, the "projected" shape multimodel database loses. Badly.)

By the way, let's also make one thing very clear: How we represent these models in memory may look somewhat different from how they are serialized and stored to disk. For example, when I new up two objects in (pick-your-favorite-OO-language-here) and point them to each other, I don't see the pointer value of their location in memory because the language hides that from me (typically). If I were to store these two to disk, though, we'd need to (a) identify them and (b) put those identifiers someplace where they could be turned back into pointers later. The pointers are natively understood by the language (and, by extension, the database engine that we're talking about), so they're a "natural" part of the database's model/"shape".

Along those same lines, it's also important to recognize that the shape of the data understood by the database is not always going to be entirely reflected in how the data is stored on disk; lots of databases do some very interesting things at the storage engine level in order to gain performance, and then "make things right" when they load data into memory for analysis or processing. (In other words, the magic behind highly-performant storage engines is out-of-scope for what I'm trying to discuss here!) In other words, just because every major relational database stores their data in a B-tree (or -derived) format on disk doesn't make every relational database a hierarchical, graph, or associative shape. If we go down the stack far enough, we discover that everything is just made up of 1s and 0s! I choose to draw the line at how the data looks and interacts at the level at which the developer sees it, through either its query interface (SQL, OQL, Cypher, whatever), its API, or its storage format (if there is no consistent API--this applies to XML, JSON, CSV, TSV, and others).

And yes, we can convert the database's native format into another format, often using tools found in the database itself--most RDBMSes in 2025 can return a SQL query in JSON or XML, if you ask it nicely enough. Far from invalidating this post, though, I think it actually justifies it: If it was trivial to consume the results in the desired manner, why does the database need to provide the feature? :-)

Thought experiments

In this section, I'm going to idly wonder about other models and/or thoughts around the shape of data in general or implementation ideas. Some thought experiments appear in each of the specific data shapes, too, if they're closely aligned to a particular shape. I don't know if I have any particular point or solution here--call this the "thinking out loud" section.


Tags: engineering   storage   database