For a while now I’ve sensed the emergence of a new species of software engineer, adding a third node to the DevOps dichotomy (though from a skills/capabilities perspective it’s actually more of a continuum).
The third vertex joining the Dev and Ops elements is Data. DataOps has a nice ring to it, though DevData (or DataDev?) unfortunately doesn’t roll off the tongue quite so nicely. But more importantly, I think there is substance to it.
(Disclaimer: this is stream of consciousness, so a bit unrefined, and I loathe that I haven’t included a single citation or link in this, but I decided to get it out there now anyway, to start somewhere.)
Software is complex and becoming more complex. When software and infrastructure become intertwined, the complexity is compounded. This complexity makes it very hard to craft systems that conform to our understanding of how they should behave.
Our understanding of systems is frequently challenged in the area of performance. This is something that I wrestle with in my current role (where technically I am a “Software Engineer”). Sure we have benchmarks and best practices and back-of-envelope estimates and intuition about which locks will cause pain, and yes we have end-to-end fine-grained event-based performance monitoring, but as our computer systems increasingly behave like biological systems it can all seem frustratingly imprecise and inconsistent. There’s always that one system memory spike that reproduces 90% of the time when you flip some switch in automation, but the remaining 10% makes you wonder whether you really know what’s going on when you flip that switch.
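That “90% of the time” is itself a data problem: with only a handful of trials, the uncertainty around a reproduction rate is surprisingly wide. A minimal sketch (hypothetical trial counts, standard library only) using a Wilson score interval to put error bars on a flaky repro rate:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - half, centre + half)

# Hypothetical: the spike reproduced in 18 of 20 switch flips -> "90%",
# but the plausible range for the true rate is much wider than that.
lo, hi = wilson_interval(18, 20)
print(f"repro rate somewhere in {lo:.2f} .. {hi:.2f}")
```

With 20 trials the interval spans roughly 0.70 to 0.97, which is a concrete way of saying you don’t yet know what flipping that switch really does.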
Just like a developer needs to pick up new skills and tool knowledge to graduate to DevOps, there are skills and tools a developer can pick up to graduate to DataDev. Those skills and tools let the engineer design experiments, set up data collection from automation and the field, think through how to organise that data, figure out the right things to plot along the way, and, if you get all that right, eventually wind up with a usable set of data to run through that logistic regression you had in mind when you started it all months ago.
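To make the end state of that pipeline concrete, here is a hedged sketch of the kind of model it feeds. Everything here is hypothetical — the feature (queue depth at switch-flip) and the outcome (did the memory spike happen) are invented — and the fit is a deliberately bare-bones logistic regression by gradient descent, standard library only:

```python
import math
import random

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit y ~ sigmoid(w*x + b) to 1-D data by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted P(spike)
            gw += (p - y) * x                     # gradient wrt w
            gb += (p - y)                         # gradient wrt b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Hypothetical collected runs: queue depth vs. whether the spike occurred.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1 if (x > 5) == (random.random() < 0.9) else 0 for x in xs]
w, b = fit_logistic(xs, ys)
print(w, b)  # w > 0 would read: deeper queues, likelier spike
```

In practice you would reach for a real library rather than hand-rolled gradient descent, but the point is the shape of the exercise: months of collection and plotting reduce, at the end, to a small model you can interrogate.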
It all feels a bit hand-wavey which is why it’s taken me this long to write anything down about this, but today I noticed something about myself which might suggest this is a real thing.
My Pull Requests look very odd compared to everyone else’s. The diff is small, but the description is quite large, with numerous plots. Basically the description reads like a small tech report. Typically my colleagues don’t even have a description for their feature PRs, believing (rightly IMHO) that the code in tandem with their commit messages should be self-explanatory.
But when you attack a problem with data, the resulting code does not speak for itself. The data must speak for the code.
So if you find yourself making PRs with plots and graphs and descriptions about sample counts and outliers, and excuses about how and why some plots don’t conform to your hypotheses, then maybe you have the Data vertex too.