01 February 2025

tl;dr One of the "OG" data formats, the tabular data structure (aka "the flat file"), is still a handy and reasonable way to exchange data in an automatable fashion without significant integration work. Its shape is ideal for a multitude of data molecules that all share the exact same structure.

Refresher

In a serialized representation of the tabular format, such as a CSV file:

Shape analysis

Within most programming languages, the tabular shape is often most easily represented as an array or list of strongly-typed records/structs. However, there are some distinct differences. In most O-O languages, an "array of objects" doesn't actually store the objects themselves; it stores only pointers/references to objects, which allows the same object to appear (aliased) more than once in a collection:

Example 1:

// This is pseudo-Java/C#/C++:

Person p = new Person("Fred", "Flintstone", 39);
List<Person> people = new List<Person>();
people.append(p); // Here's Fred!
people.append(p); // Uh, here's Fred again!
people.append(p); // Wait, Fred, third time?

Notice that while we assume the array looks like this:

flowchart TB
    people-->p1[Fred]
    people-->p2[Fred]
    people-->p3[Fred]

In truth it's more like this:

flowchart TB
    people-->p1[cell 0]-->person[Fred]
    people-->p2[cell 1]-->person[Fred]
    people-->p3[cell 2]-->person[Fred]

In other words, each cell in the array is pointing to the same object. We certainly could create the more "tabular-correct" representation, like so:

Example 2:

// This is pseudo-Java/C#/C++:

List<Person> people = new List<Person>();
people.append(new Person("Fred", "Flintstone", 39)); // Here's Fred!
people.append(new Person("Fred", "Flintstone", 39)); // Uh, here's Fred again!
people.append(new Person("Fred", "Flintstone", 39)); // Wait, Fred, third time?

But this now has three distinct instances of Fred running around, and that duplication might cause havoc in other ways.

(Hmm, I think they actually did an episode of the Flintstones where that exact thing happened... Yup, season 4, episode 104!)

While the difference here is subtle, the key is that in the object/language scenario, we have references to Fred, rather than copies of Fred, such that if we change Example 1 slightly to read:

Example 1A:

// This is pseudo-Java/C#/C++:

Person p = new Person("Fred", "Flintstone", 39);
List<Person> people = new List<Person>();
people.append(p); // Here's Fred!
people.append(p); // Uh, here's Fred again!
people.append(p); // Wait, Fred, third time?
people[0].firstName = "Wilma";

Now the array appears to contain three Wilma Flintstones, because all three cells point to the one object we just modified. This is clearly different from Example 2, where having three distinct copies means only the first Person is modified.
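For anyone who wants to run the comparison rather than take it on faith, here is a minimal Java sketch of Example 1A next to Example 2 with the same edit applied. The `AliasingDemo` class, its method names, and the mutable `Person` with public-ish fields are assumptions made for illustration; the pseudocode above doesn't commit to any of them.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    // Hypothetical mutable Person carrying the three atoms used above.
    static class Person {
        String firstName;
        String lastName;
        int age;

        Person(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
        }
    }

    // Example 1A: one Person object, three references to it.
    static List<Person> sharedReference() {
        Person p = new Person("Fred", "Flintstone", 39);
        List<Person> people = new ArrayList<>();
        people.add(p);
        people.add(p);
        people.add(p);
        people.get(0).firstName = "Wilma"; // changes the one shared object
        return people;
    }

    // Example 2 with the same edit: three Person objects, one reference each.
    static List<Person> distinctCopies() {
        List<Person> people = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            people.add(new Person("Fred", "Flintstone", 39));
        }
        people.get(0).firstName = "Wilma"; // changes only the first copy
        return people;
    }

    public static void main(String[] args) {
        for (Person p : sharedReference()) {
            System.out.println("shared:   " + p.firstName);
        }
        for (Person p : distinctCopies()) {
            System.out.println("distinct: " + p.firstName);
        }
    }
}
```

Running `main` prints "Wilma" three times for the shared list, and "Wilma", "Fred", "Fred" for the list of copies. The second result is the one a row-per-record tabular store would give you.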

In addition, the tabular format's lack of any reference mechanism means that if we widen the definition of "Person" to include a spouse as a fourth atom, we run smack into a problem: we cannot reference another Person from within a row. The best we can do is copy some of that Person's atoms and do a "fixup" later, a la:

FirstName  LastName    Age  SpouseFirstName  SpouseLastName
Fred       Flintstone  39   Wilma            Flintstone
Wilma      Flintstone  35   Fred             Flintstone
Barney     Rubble      38   Betty            Rbuble
Betty      Rubble      35   Barney           Flintstone

(Column labels appear only as a convenience to us humans.)

Notice, however, that the mistakes in the Rubbles' records (the misspelled "Rbuble" in Barney's row is syntactically incorrect; "Barney Flintstone" in Betty's row is semantically incorrect) are completely acceptable to the tabular data store, leaving it up to the developer to catch them "by hand". This lack of verification/validation is a rampant problem in any flat-file-based data interchange, and has plagued developers for years.
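To make "by hand" concrete, here is a minimal Java sketch of the kind of fixup check a developer ends up writing. Everything in it is an assumption for illustration: the `SpouseCheck` class, the `danglingSpouses` helper, the choice to hold each row as a `String[]` in the column order shown above, and above all the decision to treat "first name plus last name" as a person's identity, which the tabular format itself never promises.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SpouseCheck {
    // The four rows from the table above, mistakes included.
    static List<String[]> sampleRows() {
        return List.of(
            new String[] {"Fred", "Flintstone", "39", "Wilma", "Flintstone"},
            new String[] {"Wilma", "Flintstone", "35", "Fred", "Flintstone"},
            new String[] {"Barney", "Rubble", "38", "Betty", "Rbuble"},
            new String[] {"Betty", "Rubble", "35", "Barney", "Flintstone"});
    }

    // Reports every row whose spouse columns name nobody in the table.
    // Assumed column order: FirstName, LastName, Age, SpouseFirstName, SpouseLastName.
    static List<String> danglingSpouses(List<String[]> rows) {
        // First pass: collect every "First Last" that actually has a row.
        Set<String> known = new HashSet<>();
        for (String[] row : rows) {
            known.add(row[0] + " " + row[1]);
        }
        // Second pass: flag spouse names that match none of them.
        List<String> problems = new ArrayList<>();
        for (String[] row : rows) {
            String spouse = row[3] + " " + row[4];
            if (!known.contains(spouse)) {
                problems.add(row[0] + " " + row[1] + " -> " + spouse);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        danglingSpouses(sampleRows()).forEach(System.out::println);
    }
}
```

On the sample rows this flags both of the Rubbles' records and neither of the Flintstones'. It is still only a partial check: a typo that happens to spell some other real person's name would sail through, and two people with the same name would be indistinguishable, which is exactly the cost of having copies of atoms instead of references.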

