Weirdness of Experiment

My job in ops has always been to keep things running. I never considered myself “working in software”, but have recently begun embracing the fact that I do. What I accomplish as an operations and infrastructure engineer is part of the system; it isn’t dislocated from its composition.

Relatedly, I have been considering the nature of the experiment in Chaos Engineering, and how continuous verification is becoming a crucial part of the complex systems we build because there really is no end. Developing a software system isn’t just about writing it; it’s every bit as much about running it. Unless there is some kind of evil catastrophic end-game planned from a volcano island hideout, most of us want to keep our systems running.

I’m big on experimental music. You probably know what I mean when I say that, but you might not because genres, in general, are horrible overgeneralizations. Similarly, after the composer John Cage had written his “silent piece” in 1952 (see also Living 4’33”), he seemed to struggle with the concept of calling the music composed by him and others he admired experimental.

In science, we often think of an experiment as a method to (dis)prove a hypothesis. We perform experiments to answer a question or assertion, often during the process of reaching an end goal. To Cage, this implied that calling something an “experiment” meant it was not complete, not finished. That there is a final state determined by products of the experimentation, and he thought that his (and others’) music was complete when performed. There was no “final state” that was decided as a result of an experiment either succeeding or failing, if it was itself called experimental.

Cage revised this view, however. He began embracing the term and actually ended up preferring it. The reason for this is the way he evolved to think about the context of sound. At the beginning of the decade, he experienced an anechoic chamber (an “echoless” room) and the non-presence of total silence, because he could hear both high and low sounds — explained to him by the engineer as his nervous system in operation and blood circulating, respectively. Whether or not that is physiologically probable, he had the now famous revelation that it is impossible to remove sound completely. 4’33” and an entire philosophy about the nature of sound and silence in music was not far behind.

To him then, the moniker experimental came to mean that which is undiscovered, because even if a piece of music requires certain sounds, environmental sounds are impossible to predict. This experimental music isn’t about the search for failure or success, but an experience of discovery, where questions become more interesting than answers. When applied to composition, each performance of a musical work is always new and different due to its context and sonic environment. Indeed, it is impossible to know ahead of time any structure of the interpenetrating sounds both intentional and not, themselves independent and unique (whether or not they are consonant). It is in fact in a total state of chaos, each and every time.

When complex systems run, they do so at the hand of indeterminacy and randomness. There certainly is a “steady state”, but it is continuously in need of verification. Just like Cage observing that no performance of a musical work is a repeat, the nature (structure and form) of distributed systems we operate cannot in truth be predicted with any kind of regularity.

So while it is useful to be very specific in defining and running our Chaos experiments, the nature of what we’re doing is more about asking questions and making discoveries, not testing for answers we already think we know or think we can guess. The “breaking things in production” mantra implies we are interested in failure when what we’re really interested in is what was determined and what questions arose, good or bad.
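To make that framing a little more concrete, here is a minimal sketch of an experiment that records what was observed and what questions came up, rather than returning a bare pass or fail. Everything in it is an assumption for illustration: `probe_latency_ms` is a hypothetical stand-in for measuring a real request, the latency range and the injected delay are invented, and the tolerance is arbitrary.

```python
import random
import statistics


def probe_latency_ms(rng, injected_delay_ms=0.0):
    """Hypothetical probe: simulates one request's latency in milliseconds."""
    return rng.uniform(20.0, 40.0) + injected_delay_ms


def run_experiment(samples=50, injected_delay_ms=100.0, tolerance_ms=50.0, seed=1):
    """Observe the steady state, perturb the system, and report what was seen."""
    rng = random.Random(seed)

    # 1. Verify the steady state before touching anything.
    baseline = [probe_latency_ms(rng) for _ in range(samples)]

    # 2. Introduce the perturbation (here, an invented network delay).
    perturbed = [probe_latency_ms(rng, injected_delay_ms) for _ in range(samples)]

    baseline_median = statistics.median(baseline)
    perturbed_median = statistics.median(perturbed)
    drift = perturbed_median - baseline_median

    # 3. Keep the observations and the questions, not just a verdict.
    questions = []
    if drift > tolerance_ms:
        questions.append("Why did median latency drift beyond tolerance?")
        questions.append("Which dependency absorbed the delay, and which did not?")

    return {
        "baseline_median_ms": baseline_median,
        "perturbed_median_ms": perturbed_median,
        "drift_ms": drift,
        "within_tolerance": drift <= tolerance_ms,
        "questions": questions,
    }


if __name__ == "__main__":
    report = run_experiment()
    for key, value in report.items():
        print(f"{key}: {value}")
```

The point of the shape is that `within_tolerance` is only one field among several: the report carries the measurements and the open questions forward, which is closer to discovery than to a test that merely succeeds or fails.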

Appendix

Here are a couple of PDFs taken from Cage’s writing that highlight his viewpoint on the subject of “experimental music” as a title of what he did.

  • Experimental Music: Doctrine (1955) ::: This article, there titled Experimental Music, first appeared in The Score and I. M. A. Magazine, London, issue of June 1955. The inclusion of a dialogue between an uncompromising teacher and an unenlightened student, and the addition of the word “doctrine” to the original title, are references to the Huang-Po Doctrine of Universal Mind.
  • Experimental Music (1957) ::: The following statement was given as an address to the convention of the Music Teachers National Association in Chicago in the winter of 1957. It was printed in the brochure accompanying George Avakian’s recording of my twenty-five-year retrospective concert at Town Hall, New York, in 1958.

The photo above is the album cover from a release by Craque called Meat Hacker.

To Build or To Buy, That is the Contradiction.

It’s dead simple. Focus on your team and product.

Yes! An easy tweetable answer! Except that it’s loaded with questions and assumptions. Instead, what I will talk about here are technology decisions. Those that really matter are about things that elevate your ability to build what you are building at the velocity you need.

In Operations and SRE, we are not only tasked with evaluating and testing technologies, we are often requested (or sometimes instructed) to support the decisions others make. This isn’t an ideal situation, and in some cases it is impractical because it feels like policing. In fact, it might sound a little selfish to think that SRE needs the ability to say “yea” or “nay” to a decision. If you think that feels slightly siloed, you’d be right. I prefer working with all vested parties to come up with a solution that fits the problem, not just make blind leaps of faith on a particular platform. Flexibility and cooperative evaluation work really really well, believe it or not.

In my experience, it takes a minimum of six months for a team to work up a brand new (to them) technology or platform and have it be supportable in production. Even then, if the software team building it won’t be permanently on call for it (which is something else I believe), they should at least be on the hook for a minimum of six months further as the kinks are worked out. This seems fairly straightforward for something like a development framework or a particular platform specific to operating the software system in question. For example, adopting Redis as a DBRE-supported platform in production. There are clear reasons why the software needs this kind of ultraspeedy k/v store, and building up widespread support in SRE for it means other software teams benefit. That is a highly opinionated decision and one that’s fairly easy to make, especially because it’s free!

But is it? In contrast, what happens when you need to fulfill the needs of infrastructure? This is where Operations tires really hit the road. This doesn’t mean it’s a siloed decision. For example, introspecting the system is as important to reliability engineering as it is to the folks developing the software. The Redis example above shows how technology decisions become a crucial part of the collaboration among everyone involved building your product, whether it be dev or sec or ops or everything in-between. However, it doesn’t quite reflect the reality of infrastructure requirements, particularly tooling that supports the needs of reliability. “We need monitoring and we need it yesterday!”

So, I posit: There is no “Build or Buy”. There is only Buy.

I build electronic instruments. My product is the music I make, and one of the central philosophies in my approach to improvisation and performance is the concept of “original sound”. What music does a plucked cactus make when amplified, or how can a mercury tilt-switch and photoresistor create sound from electrons? The top picture here is of a breadboard, a prototype of a device that will eventually drive something called a “Voltage Controlled Oscillator”, or VCO. To the left is a photo of a professionally designed, tested, and built VCO. I have no need to design and build a VCO. Fun as that would be, it is a distraction – both in time and money – from my main purpose: the creation of a new kind of analog synthesizer controller. I’d rather buy the VCO and the accompanying support of the experts who not only built it but are passionate about its success. Then I can focus resources on my creativity.

The technology decision to be made is heavily informed by what kinds of resources you want to spend on it. Building a platform in-house, using your own people, is sometimes only feasible with very large companies that have the ability to staff for this, and often have custom circumstances. It’s not hard to notice that a few very successful software platforms come out of organizations in this position. These teams also must deal with all the infrastructure costs, maintenance, reliability, and complex aggregations of data and transports. These problems have indeed been solved many times over by third-party SaaS/PaaS/IaaS solutions, but sometimes the amount of customization required demands building in-house. Or the development of these systems may be so clearly aligned with the company objectives that it is a non-question (e.g. Netflix pioneering Chaos Engineering or Google producing Borg/Kubernetes).

These days, such cases are the exceptions. Most of us are with companies where this luxury and concentration of engineering ability aren’t present. It may make a ton more sense to choose an expert in the field, whose company mission is specifically to provide this need. The argument that Ops Is A Cost Center is underlined, because the focus should really be on the product. “We’re good enough we can run a custom log aggregation stack at the same time we’re developing this completely dependent but orthogonally related product” is probably an approach that should be questioned, unless of course you are prepared to buy those resources through hiring and operational expenses.

Either way, you’re building something. Recall the decision to choose Redis, and how long it took to enable that system in production. The same will apply to any in-house platform or tooling. Nevertheless, it’s a tradeoff. Buying resources to build in-house and buying a third-party SaaS each have their own layers of complexity, complication, and frustration. Yep, it might be fun to build, but is embarking on such a project really the right choice for the team and the product?

Like many in Ops, I once considered the “Elasticsearch / Logstash / Kibana” (aka ELK) stack for datacenter log aggregation. All its various pieces are a fun complex thing to put together, and it’s an extremely useful resource for gathering and displaying events. All these pieces are freely available things, but our SRE teams were already pretty busy with the task of running our own product and other custom bits of infrastructure. It would take at least a single SRE’s full time and attention to keep the stack running across multiple server farms of 10,000+ nodes. Not to mention training and documentation for others. Maintained, tested, resilience-hardened, budgeted, the list goes on. I’m definitely going to be “buying” a lot here, and it’s a messy looking BOM. ELK’s competitors at the time ranged from fairly entry-level cloud-based SaaS to “Enterprise” packages, but one vendor, in particular, would be a great solution. Was there a cost and the administrative toil of negotiating a contract and keeping a vendor relationship going? Yep. It’s also a single invoice, takes much less time from the SRE team to manage, and is much simpler to integrate with the farm.

Would you rather focus on building the instrumentation into your software and have a team of external experts guide, consult, and ultimately provide the intelligence platform to which they have dedicated their careers? Or would you rather split focus and build your own customizations into the infrastructure? It’s not black and white, either. One method may outweigh the other depending on whether you’re bare-metal or in the cloud. There may be security reasons to do one over the other. As the software product matures, these needs may blur. What needs to be bought may be simple and small or complex and large, and grow in either direction.

So let’s be clear: a third-party platform isn’t automatically “more expensive” than creating an equally performant service or fulfilling a particular infrastructure reliability need in-house. Do the research and compare, then make the investment that makes the most sense for your team.
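One way to do that comparison is to put people time on the same ledger as invoices. The sketch below is only an illustration. The six months to production and the one full-time engineer for the in-house option echo the figures mentioned earlier, while every dollar amount and all of the vendor numbers are hypothetical placeholders to be replaced with your own.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    upfront_cost: float                   # licenses, hardware, contract setup
    monthly_cost: float                   # infrastructure spend or vendor invoice
    engineer_months_to_production: float  # people time before it is supportable
    ongoing_engineer_fraction: float      # share of one engineer's time each month


def total_cost(option, months, engineer_monthly_cost):
    """Total spend over the period, counting engineering time as money."""
    people_setup = option.engineer_months_to_production * engineer_monthly_cost
    people_ongoing = option.ongoing_engineer_fraction * engineer_monthly_cost * months
    return (
        option.upfront_cost
        + option.monthly_cost * months
        + people_setup
        + people_ongoing
    )


# Hypothetical figures, for illustration only.
in_house = Option("in-house stack", 0.0, 2_000.0, 6.0, 1.0)
vendor = Option("third-party SaaS", 0.0, 15_000.0, 1.0, 0.1)

if __name__ == "__main__":
    for option in (in_house, vendor):
        cost = total_cost(option, months=24, engineer_monthly_cost=15_000.0)
        print(f"{option.name}: {cost:,.0f}")
```

With these made-up inputs the “free” stack comes out more expensive over two years, but different numbers can easily flip the result. Neither column is ever zero, which is the real takeaway.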

Don’t be fooled, though. Either way, you’re buying it.

Musical Intuition meet Technology and Chaos

Today I read through the InfoQ eMag on Chaos Engineering, and was struck by John Allspaw’s (@allspaw) contribution because it reminded me of something I jotted down on a sticky at my desk a few days ago:

Intuition is valid because it is learned like jazz changes.

I’m pretty stubborn and refuse to accept that music is merely a hobby of mine. When people ask me if electronic music or singing is my “hobby”, I am wincing inside. So a question often on my mind is: how does the intuition I have when performing and composing music connect with the work I do as a technologist?

Some musicological background might help. One concept in learning how to improvise (jazz or otherwise) is that you have developed an intuition built around internalizing the materials and form of the piece (or genre) – like scales, chord changes, or rhythm structures. This is different from the more lizard-brainy concept of instinct. Think about a blues progression, the foundation of music you hear every day, everywhere. You know intuitively the chord progression and timing is “right”, even so much that anomalies and departures come across as emotionally significant. The rest is pop history.

But you, homo sapiens, do not have this chord sequence pre-programmed in your DNA; it isn’t something that is instinctual. By the same token, great technology leaders develop good intuition (expertise over hundreds of interviews) when hiring engineers but never rely on instinct (oh I just have a good “gut feeling”). The best DBAs have an intuitive understanding of their platform (you want to do X, but did you think of Y+Z?), but there’s nothing instinctual about it.

It is not a stretch, then, to recognize that intuition in improvised music can be directly compared to how Allspaw writes about the “mental map” that engineers develop. They each have their own subjective view on relevant (but overlapping) parts of the system and are challenged when relating each substrate to theirs. For instance, a phenomenon known as “fundamental common-ground breakdown” (Woods & Klein: Common Ground and Coordination in Joint Activity) happens when what I describe as intuitions (accumulated individual learnings about the system) are assumed knowledge among participants, good or bad. Part of the game is learning how to harmonize these separate threads of experience, avoiding costly coordination surprise and re-synchronization… and trust me, I have been in plenty of rehearsals and narrowly saved performances that fit this description!

The important point here is that a system becomes more complex as it grows dimensions, shrinking the capacity of any one person to comprehend the whole thing. Therefore we rely on shared and discovered knowledge to fully grok these fascinating systems. Take any ensemble of musicians: as it grows in membership, individuals gradually lose the ability to contain its myriad relationships in their mental map, so coordination and integration become a matter of listening and rehearsal experience (both modes of communication). Oh and it characterizes the music, too. Building intuition about how to play a part in an opera is much different than in a free improv vocal trio. Orchestrating ten thousand Linux containers in a cloud provider doesn’t compare to managing two rows of server racks at the datacenter downtown.

Technologists grapple with the task of building and sharing intuitions about a system because understanding an entire system contributes to what we know about making it more resilient. Communication is key in both musical and engineering teams, and collaboration on understanding the whole is no exception. Our mental maps should be adaptable to constant updates, and practices like Chaos Engineering that make discoveries in complex system behavior are supported by this kind of cross-pollination and proliferation of our combined understanding.

A quote from Allspaw’s article highlights it well:

Maybe the process of designing a chaos experiment is just as valuable as the actual performance of the experiment.

– John Allspaw, Recalibrating Mental Models through Design of Chaos Experiments

The use of the term “performance” is apt. We’re familiar with this concept: practice makes perfect. Taken further, the experience of practice is necessary such that the result is merely an extension of practice. It takes meticulous work to understand a piece of music to the level of having an intuition about how it operates, and the same goes for building experimentation that challenges what you think you know about complex software. The results of the “performance” can be enhanced by a focus on understanding the system’s design and steady state (i.e. nominal condition), what we would call the language of the musical work. It is as if the performance of the event naturally evolves from learnings gained preparing for it.

Imagine you are a jazz musician. You have gone through years of studying scales and changes and charts and recordings of a particular artist, and have built a capability for understanding how the language of their music works. One evening at a local club, your dreams are fulfilled: you’re in the audience and invited up for a set with them. You intuitively know how this person plays their music, as it has been a guide for your own. But when you’re jamming together, they do something indeterminately that informs your intuition in a way you would have never discovered yourself. Not only has the process of designing your inevitable collaboration been valuable to understand what you thought you needed to know to play like your biggest influence, but it also served as the basis for learning something new and unexpected.

Whether it is free improvisation or interpreting a through-composed piece of music (and everything in between), there is a certain amount of experience and training informing the performance. Eventually, when we’ve practiced enough, the music itself steps out of the way and intuition takes over. I think this is where my musical performance connection with technology starts: once you understand the fundamentals of the system, let the presentation of the system get out of the way, and you’re in a better place to evolve your mental map and gain further intuition through disciplines like Chaos Engineering.