Weirdness of Experiment

My job in ops has always been to keep things running. I never considered myself to be “working in software”, but have recently begun embracing the fact that I do. What I accomplish as an operations and infrastructure engineer is part of the system; it isn’t dislocated from its composition.

Relatedly, I have been considering the nature of the experiment in Chaos Engineering, and how continuous verification is becoming a crucial part of the complex systems we build because there really is no end. Developing a software system isn’t just about writing it; it’s every bit as much about running it. Unless there is some kind of evil catastrophic end-game planned from a volcano island hideout, most of us want to keep our systems running.

I’m big on experimental music. You probably know what I mean when I say that, but you might not because genres, in general, are horrible overgeneralizations. Similarly, after the composer John Cage had written his “silent piece” in 1952 (see also Living 4’33”), he seemed to struggle with the idea of calling the music composed by him and others he admired “experimental”.

In science, we often think of an experiment as a method to (dis)prove a hypothesis. We perform experiments to answer a question or assertion, often during the process of reaching an end goal. To Cage, this implied that calling something an “experiment” meant it was not complete, not finished. That there is a final state determined by products of the experimentation, and he thought that his (and others’) music was complete when performed. There was no “final state” that was decided as a result of an experiment either succeeding or failing, if it was itself called experimental.

Cage revised this view, however. He began embracing the term and actually ended up preferring it. The reason for this is the way he evolved to think about the context of sound. At the beginning of the decade, he experienced an anechoic chamber (an “echoless” room) and the non-presence of total silence, because he could hear both high and low sounds — explained to him by the engineer as his nervous system in operation and blood circulating, respectively. Whether or not that is physiologically probable, he had the now famous revelation that it is impossible to remove sound completely. 4’33” and an entire philosophy about the nature of sound and silence in music was not far behind.

To him then, the moniker experimental came to mean that which is undiscovered, because even if a piece of music requires certain sounds, environmental sounds are impossible to predict. This experimental music isn’t about the search for failure or success, but an experience of discovery, where questions become more interesting than answers. When applied to composition, each performance of a musical work is always new and different due to its context and sonic environment. Indeed, it is impossible to know ahead of time any structure of the interpenetrating sounds both intentional and not, themselves independent and unique (whether or not they are consonant). It is in fact in a total state of chaos, each and every time.

When complex systems run, they do so at the hand of indeterminacy and randomness. There certainly is a “steady state”, but it is continuously in need of verification. Just like Cage observing that no performance of a musical work is a repeat, the nature (structure and form) of distributed systems we operate cannot in truth be predicted with any kind of regularity.

So while it is useful to be very specific in defining and running our Chaos experiments, the nature of what we’re doing is more about asking questions and making discoveries, not testing for answers we already think we know or think we can guess. The “breaking things in production” mantra implies we are interested in failure when what we’re really interested in is what was determined and what questions arose, good or bad.
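To make "being very specific" a little more concrete, here is a minimal sketch of an experiment framed as a question plus a steady-state check. Everything in it is hypothetical: the `Experiment` class, the toy `system` dictionary, and the 100 ms threshold are illustrations invented for this example, not the API of any real Chaos Engineering tool. Note that the result is a list of observations rather than a pass/fail verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical minimal chaos experiment: a question and a steady-state check."""
    question: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # introduces the turbulence
    rollback: Callable[[], None]      # removes the turbulence again
    observations: List[str] = field(default_factory=list)

    def run(self) -> List[str]:
        # Only inject turbulence if the system is healthy to begin with.
        if not self.steady_state():
            self.observations.append("steady state not met before injection; skipped")
            return self.observations
        self.inject()
        try:
            held = self.steady_state()
            self.observations.append(
                "steady state held under turbulence" if held
                else "steady state deviated under turbulence"
            )
        finally:
            # Always undo the injection, even if the check itself raised.
            self.rollback()
        restored = self.steady_state()
        self.observations.append(
            "steady state restored after rollback" if restored
            else "steady state NOT restored after rollback"
        )
        return self.observations


# Toy stand-in for a real system; the numbers are placeholders.
system = {"latency_ms": 40}

exp = Experiment(
    question="What happens to request latency when a cache node goes away?",
    steady_state=lambda: system["latency_ms"] < 100,
    inject=lambda: system.update(latency_ms=250),
    rollback=lambda: system.update(latency_ms=40),
)
notes = exp.run()
for note in notes:
    print(note)
```

In a real system the observations would come from monitoring rather than a dictionary, and each one is a prompt for the next question, not a verdict.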

Appendix

Here are a couple of PDFs taken from Cage’s writing that highlight his viewpoint on the subject of “experimental music” as a title of what he did.

  • Experimental Music: Doctrine (1955) ::: This article, there titled Experimental Music, first appeared in The Score and I. M. A. Magazine, London, issue of June 1955. The inclusion of a dialogue between an uncompromising teacher and an unenlightened student, and the addition of the word “doctrine” to the original title, are references to the Huang-Po Doctrine of Universal Mind.
  • Experimental Music (1957) ::: The following statement was given as an address to the convention of the Music Teachers National Association in Chicago in the winter of 1957. It was printed in the brochure accompanying George Avakian’s recording of my twenty-five-year retrospective concert at Town Hall, New York, in 1958.

The photo above is the album cover from a release by Craque called Meat Hacker.

To Build or To Buy, That is the Contradiction.

It’s dead simple. Focus on your team and product.

Yes! An easy tweetable answer! Except that it’s loaded with questions and assumptions. Instead, what I will talk about here are technology decisions. Those that really matter are about things that elevate your ability to build what you are building at the velocity you need.

In Operations and SRE, we are not only tasked with evaluating and testing technologies, we are often requested (or sometimes instructed) to support the decisions others make. This isn’t an ideal situation, and in some cases it is impractical because it feels like policing. In fact, it might sound a little selfish to think that SRE needs the ability to say “yea” or “nay” to a decision. If you think that feels slightly siloed, you’d be right. I prefer working with all vested parties to come up with a solution that fits the problem, not just make blind leaps of faith on a particular platform. Flexibility and cooperative evaluation work really, really well, believe it or not.

In my experience, it takes a minimum of six months for a team to work up a brand new (to them) technology or platform and have it be supportable in production. Even then, if the software team building it won’t be permanently on call for it (which is something else I believe), they should at least be on the hook for a minimum of six months further as the kinks are worked out. This seems fairly straightforward for something like a development framework or a particular platform specific to operating the software system in question. One example is adopting Redis as a DBRE-supported platform in production. There are clear reasons why the software needs this kind of ultraspeedy k/v store, and building up widespread support in SRE for it means other software teams benefit. That is a highly opinionated decision and one that’s fairly easy to make, especially because it’s free!

But is it? In contrast, what happens when you need to fulfill the needs of infrastructure? This is where Operations tires really hit the road. This doesn’t mean it’s a siloed decision; introspecting the system, for example, is as important to reliability engineering as it is to the folks developing the software. The Redis example above shows how technology decisions become a crucial part of the collaboration among everyone involved in building your product, whether it be dev or sec or ops or everything in between. However, it doesn’t quite reflect the reality of infrastructure requirements, particularly tooling that supports the needs of reliability. “We need monitoring and we need it yesterday!”

So, I posit: There is no “Build or Buy”. There is only Buy.

I build electronic instruments. My product is the music I make, and one of the central philosophies in my approach to improvisation and performance is the concept of “original sound”. What music does a plucked cactus make when amplified, or how can a mercury tilt-switch and photoresistor create sound from electrons? The top picture here is of a breadboard, a prototype of a device that will eventually drive something called a “Voltage Controlled Oscillator”, or VCO. To the left is a photo of a professionally designed, tested, and built VCO. I have no need to design and build a VCO. It would be fun, but it is a distraction, both in time and money, from my main purpose: the creation of a new kind of analog synthesizer controller. I’d rather buy the VCO and the accompanying support of the experts who not only built it but are passionate about its success. Then I can focus resources on my creativity.

The technology decision to be made is heavily informed by what kinds of resources you want to spend on it. Building a platform in-house, using your own people, is sometimes only feasible for very large companies that have the ability to staff for this, and that often have custom circumstances. It’s not hard to notice that a few very successful software platforms come out of organizations in this position. These teams also must deal with all the infrastructure costs, maintenance, reliability, and complex aggregations of data and transports. These problems have indeed been solved many times over by third-party SaaS/PaaS/IaaS solutions, but sometimes the amount of customization required demands building in-house anyway. Or the development of these systems may be so clearly aligned with the company objectives that it is a non-question (e.g. Netflix pioneering Chaos Engineering or Google producing Borg/Kubernetes).

These days, such cases are the exceptions. Most of us are with companies where this luxury and concentration of engineering ability aren’t present. It may make a ton more sense to choose an expert in the field, whose company mission is specifically to provide this need. The argument that Ops Is A Cost Center is underlined, because the focus should really be on the product. “We’re good enough we can run a custom log aggregation stack at the same time we’re developing this completely dependent but orthogonally related product” is probably an approach that should be questioned, unless of course you are prepared to buy those resources through hiring and operational expenses.

Either way, you’re building something. Recall the decision to choose Redis, and how long it took to enable that system in production. The same will apply to any in-house platform or tooling. Nevertheless, it’s a tradeoff. Buying resources to build in-house and buying a third-party SaaS each have their own layers of complexity, complication, and frustration. Yep, it might be fun to build, but is embarking on such a project really the right choice for the team and the product?

Like many in Ops, I once considered the “Elasticsearch / Logstash / Kibana” (aka ELK) stack for datacenter log aggregation. All its various pieces are a fun, complex thing to put together, and it’s an extremely useful resource for gathering and displaying events. All these pieces are freely available, but our SRE teams were already pretty busy with the task of running our own product and other custom bits of infrastructure. It would take at least a single SRE’s full time and attention to keep the stack running across multiple server farms of 10,000+ nodes, not to mention training and documentation for others. The stack would also have to be maintained, tested, resilience-hardened, budgeted; the list goes on. I’m definitely going to be “buying” a lot here, and it’s a messy looking BOM. ELK’s competitors at the time ranged from fairly entry-level cloud-based SaaS to “Enterprise” packages, but one vendor, in particular, would be a great solution. Was there a cost, plus the administrative toil of negotiating a contract and keeping a vendor relationship going? Yep. It’s also a single invoice, takes much less time from the SRE team to manage, and is much simpler to integrate with the farm.
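The comparison above can be roughed out as simple arithmetic. The sketch below is only an illustration: the `Option` class, the `annual_cost` helper, and every dollar figure are hypothetical placeholders, not numbers from my own evaluation. The only input taken from the story is that the in-house stack would consume at least one SRE’s full time.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    engineer_fraction: float  # share of one engineer's year spent running it
    other_annual_cost: float  # invoices, hardware, licenses, and so on


def annual_cost(option: Option, loaded_engineer_cost: float) -> float:
    """People cost plus everything else, for one year."""
    return option.engineer_fraction * loaded_engineer_cost + option.other_annual_cost


# Placeholder figures only; substitute your own research.
LOADED_ENGINEER_COST = 200_000

in_house = Option("in-house log stack", engineer_fraction=1.0, other_annual_cost=30_000)
vendor = Option("third-party vendor", engineer_fraction=0.25, other_annual_cost=120_000)

for option in (in_house, vendor):
    cost = annual_cost(option, LOADED_ENGINEER_COST)
    print(f"{option.name}: {cost:,.0f} per year")
```

The point is not the totals, which depend entirely on the inputs, but that the “free” option has a people line that belongs on the same sheet as the vendor invoice.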

Would you rather focus on building the instrumentation into your software and have a team of external experts guide, consult, and ultimately provide the intelligence platform to which they have dedicated their careers? Or would you rather split focus and build your own customizations into the infrastructure? It’s not black and white, either. One method may outweigh the other depending on whether you’re bare-metal or in the cloud. There may be security reasons to do one over the other. As the software product matures, these needs may blur. What needs to be bought may be simple and small or complex and large, and grow in either direction.

So let’s be clear: a third-party platform isn’t automatically “more expensive” than creating an equally performant service or fulfilling a particular infrastructure reliability need in-house. Do the research and compare, then make the investment that makes the most sense for your team.

Don’t be fooled, though. Either way, you’re buying it.