Distributed Systems and Organization Design

22 January 2025

tl;dr Engineers (and their managers) have spent much of the last forty years learning (and sometimes re-learning and re-learning and re-learning...) the various repercussions and implications of distributed systems. As an engineering manager, I've discovered that there is a remarkable similarity between distributed systems design and engineering organization design.

For most engineering managers, if they know much about "org design", it's either Conway's Law ("A software architecture will mirror the organization that produced it") or it's what they learned from the book Team Topologies. (By the way, if you were wondering about the "truthiness" of Conway's Law, check out this paper by Harvard Business Review; tl;dr, yes, it holds up.) Beyond that, though....

It's a little beyond the scope of this post, but suffice it to say, there's a whole area of study around organizational design, and much of it gets the same kind of rigor and analysis that we bring to distributed system design. That is to say, some people (particularly academics) study it with a great deal of rigor and discipline, and the rest of us wait for somebody to summarize their findings so we can completely ignore them later in favor of some hot new tool or technique. I fully admit to being no expert on org systems, but over the last decade, as I've seen (both observed and led) more organizations from a managerial viewpoint, I've come to realize that there are a lot of parallels between the two.

If we make the basic assumption that a "team" in an org chart is the rough equivalent to a "server" or "compute node" in a distributed system diagram, and that a "team member" is roughly a compute process, we start to find some very interesting parallels.

Work is done much faster by an individual ("process") than by committee ("cluster of processes").
Where a process in a distributed system will "make API calls" or "pass data" or "send a message", people will "collaborate" or "hand off" or "send a message" to each other.
It's much faster for an individual to have data memorized (cached) than to have to look it up in a book or some other external storage (database).
Changing the data that's currently memorized by an individual requires that individual spend a non-trivial amount of time to re-memorize the new information, starting from notifying them about the change (cache updates and propagation).
Passing information gets dramatically more expensive the further away the intended recipient is, whether in the next chair, next room, next building, or next continent.

... and so on.

And it's really hard not to notice that a distributed system topology--a collection of nodes with links connecting some or all of them--is a remarkably similar diagram to a process flow diagram--a collection of nodes (teams) with links (communication or process handoffs) connecting some or all of them. It seems no accident that we refer to "workflow" as a term that covers both human actions and/or interactions and computer actions and/or interactions.

In the distributed systems world, it turns out to be helpful to use the Eight Fallacies of Distributed Computing as a framework for thinking about distributed systems. Using these as guideposts, or more likely, warning signs, we can often ship a reasonable distributed system without having to get too academic or methodical. So, in the spirit of the Eight Fallacies of Distributed Computing, we can derive the Eight Fallacies of Distributed Teams:

1. The organization is reliable. We all know that nodes in a network can drop on a regular basis. Granted, with modern computing trends and safeguards, it's rare, but it still happens. Similarly, even though we know most people will show up to work most days, there's still that non-zero possibility that a given individual won't. They'll be sick, they'll be distracted, they'll be re-tasked to something different by their management, and so on. If that weren't enough, we can of course never ignore the possibility that they'll be there at work, ready to carry out your request, if only they'd received your email/text message/inter-office memo/voicemail, because somewhere along the communication path your message was just flat-out lost. (Who says ACKs are just for low-level network protocols?)
2. Latency is zero. The time required to engage in communication between two people is very obviously non-zero. In fact, it can take quite a bit of time to get what's in my head over into your head. Sometimes this is because of the time required by the medium of communication (ahem inter-office memos), but....
3. Bandwidth is infinite. In the human case, it's not so much communication bandwidth we worry about as it is an individual's mental bandwidth. Most people cannot juggle more than one or two things simultaneously. Even those of us gifted with Attention Deficit Disorder find that we have limits to the number of things we can keep our minds on, and we're better at multitasking than most. A human will eventually need to prioritize and serialize the things they're working on, which means that some work goes to the bottom of the pile.
4. The network is secure. Sure, HR isn't going to write down a performance-management report on a piece of paper, find the first person outside their office, hand them that piece of paper, and ask that person to hand-deliver the message to the VP of Sales over in the next building--such a practice would be entirely unprofessional, deeply illegal, wildly impersonal, and hugely unethical. We get that. However, if HR is having a conversation with the VP of Sales, who else is listening to their conversation as the HR rep is sitting in the middle of the open-office plan? More importantly, when two loan officers are discussing a mortgage client's financials, who is listening in? (A long, long time ago, Intel found itself in an interesting quandary when they discovered tech news journalists were simply riding the shuttle flight between San Jose and Sacramento (where Intel had fabrication plants, in Folsom), listening in to the seat chatter between Intel employees, who were easily identifiable by the Intel work badges they proudly wore.) If nodes in your org chart must communicate with one another, how are they doing it?
5. Topology doesn't change. Organization charts are famously unstable, perhaps even more so than computer network topologies. If team A depends on team B for some of the work that team A needs to get done, what happens when team B is re-orged into an entirely different division?
6. There is one administrator. Well, technically, if everybody works for the same company, there's the CEO, but the larger point here is that frequently teams need to collaborate "across" the org chart, where the only point where both teams are under the same manager is well up the org chart--when Product and Engineering work together, for example, the Engineering team will ultimately report to a VP of Engineering, and the Product team to a VP of Product, who in turn each report to the CEO. This means that there is no one individual close to the teams that can resolve any inter-personal conflicts or "turf battles"--the teams must sort the issues out among themselves. Neither side can effectively make an appeal to authority for resolution. This means the teams are going to have to each feel some pressure to make things work, even if they have an intellectual or moral disagreement on some part of the collaboration.
7. Transport cost is zero. In the original Fallacies, this one is often seen as related to the "Latency is zero" fallacy--where that one focuses on the time required to actually transmit the data, this one focuses more on the work required to prepare the data for transmission as well as interpret the data upon receipt. (In RPC terminology, for example, we talk about "serializing and deserializing" the objects transmitted across the wire.) In an organizational scenario, this transport cost is often the vastly larger time-consumer, as sometimes teams will have to create bespoke presentations or documents solely for the purpose of communicating one or more complex ideas to their collaborators (not to mention the delays introduced while we wait for everybody who must be in the meeting to have a free spot on their calendar to be able to discuss the topic at hand!).
8. The network is homogeneous. In the distributed systems fallacies, James Gosling introduced this one to point out that (almost) no computer network is actually made up of the same kind of hardware/OS combination (it's never all-Intel/Windows, it's never all-ARM/macOS, it's never all-anything). His argument, then, was to suggest "This is why everything should be written in Java!", since that would render the fallacy moot. And, of course, we promptly wrote everything in Java, C#, Ruby, Smalltalk, Go, Dart, some C++, a little COBOL.... In the organizational sense, it's important to realize that not only is each team built for a different purpose than the others, each person within the team often has different skills and perspectives from the rest. This has the positive quality that a team that can take advantage of its members' strengths can often outperform expectations, but it also means that a team that doesn't recognize that "not everybody is great at accounting" is going to run into problems when it tries to swap one individual for another without taking into account their individual skillsets. (I'm looking very squarely at you, "full-stack engineer"-hunting recruiters....)
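
To make the first fallacy's aside about ACKs a little more concrete, here is a minimal Python sketch of a handoff that gets re-sent until the recipient confirms receipt. Everything in it is hypothetical and for illustration only: `Handoff`, `send_with_ack`, and `flaky_channel` are made-up names, and the "channel" is a stand-in for whatever medium (email, chat, inter-office memo) a real team would use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Handoff:
    """A request handed to another person or team, tracked until acknowledged."""
    request: str
    attempts: int = 0
    acknowledged: bool = False


def send_with_ack(request: str,
                  deliver: Callable[[str], bool],
                  max_attempts: int = 3) -> Handoff:
    """Re-send a request until the recipient acknowledges it or attempts run out.

    `deliver` is a hypothetical stand-in for any communication channel;
    it returns True only when the recipient confirms receipt.
    """
    handoff = Handoff(request)
    while handoff.attempts < max_attempts and not handoff.acknowledged:
        handoff.attempts += 1
        handoff.acknowledged = deliver(request)
    return handoff


def flaky_channel(outcomes: Iterable[bool]) -> Callable[[str], bool]:
    """Build a channel that replays a fixed list of delivery outcomes.

    Once the list is exhausted, every further message counts as lost.
    """
    remaining = iter(outcomes)
    return lambda _request: next(remaining, False)


# The first two messages are lost; the third is acknowledged.
result = send_with_ack("Please review the Q3 budget",
                       flaky_channel([False, False, True]))
print(result.attempts, result.acknowledged)  # 3 True
```

The point of the sketch is only that the sender, not the recipient, owns the job of noticing a lost message. A request that never comes back acknowledged after `max_attempts` is a signal to escalate or switch channels, not to assume the work is under way.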

Note that these don't capture all of the dos and don'ts of organization design (not by a tenth), but they certainly establish an interesting start to an interesting thought project, that being: What if we thought about human-centric processes and workflows in the same way that we think about distributed systems design and architecture?

It would give us some familiar tools by which to document and/or analyze processes, a la those "workstream" efforts that certain process coaches love to conduct. (I call them "process" coaches because some of them do so in service to agile efforts, while others do them in service to efforts that are anything but guided by the Agile Manifesto.)
It helps set some expectations about what happens when work leaves an individual, a team, or an organization. (In other words, if you hand it off to another team, it's best to expect it to take orders of magnitude longer to get a response than it would for something that you're doing yourself.)
We might want to prevent "chatty" exchanges between individuals, and we definitely want to think twice about such exchanges between teams, if we need something processed quickly.
Most of all, it behooves us to look for ways to keep everybody working on a particular large-scale task relatively close together on the org chart, to minimize the latency (#2) and transport cost (#7) between the teams.
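
One way to put a rough number on "close together on the org chart" is to treat the chart as a tree and count reporting hops between two teams through their nearest shared manager. The Python sketch below does that for a hypothetical org chart modeled on the Product/Engineering example above; the team names, the `REPORTS_TO` mapping, and the idea of using hop count as a proxy for coordination cost are all assumptions for illustration, not a validated metric.

```python
# Hypothetical org chart: each node maps to whoever it reports to.
REPORTS_TO = {
    "CEO": None,
    "VP Engineering": "CEO",
    "VP Product": "CEO",
    "Platform Team": "VP Engineering",
    "Mobile Team": "VP Engineering",
    "Growth Product Team": "VP Product",
}


def chain_of_command(node, reports_to):
    """Return the path from a node up to the top of the org chart."""
    chain = []
    while node is not None:
        chain.append(node)
        node = reports_to[node]
    return chain


def common_manager(a, b, reports_to):
    """Find the lowest shared manager of a and b, plus the reporting hops between them."""
    chain_a = chain_of_command(a, reports_to)
    chain_b = chain_of_command(b, reports_to)
    for hops_a, manager in enumerate(chain_a):
        if manager in chain_b:
            return manager, hops_a + chain_b.index(manager)
    raise ValueError("no shared manager")


print(common_manager("Platform Team", "Mobile Team", REPORTS_TO))
# ('VP Engineering', 2)
print(common_manager("Platform Team", "Growth Product Team", REPORTS_TO))
# ('CEO', 4)
```

In this toy chart, two engineering teams can settle a disagreement one level up, while an Engineering/Product disagreement has nobody to appeal to short of the CEO, which is the "one administrator" fallacy in miniature.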

It feels like this is a topic worth exploring in more detail, to me at least.

Tags: management engineering
