When I was a PhD student, we had a regular internal seminar series for postgrads to present anything relevant to their work, most typically their intermediate findings. At one such session, a mobile telecommunications researcher was presenting his findings on energy consumption. He had built a simulated annealing model, crunched some data, and presented some graphs. One professor in the audience was apparently paying closer attention than most. He leaned in, squinted, then observed: “So basically what you found was that with two batteries the thing lasts twice as long?”
Upon reflection, the audience realized that was basically what the presentation boiled down to. The researcher tried to draw attention back to his model, and to be fair, knowing how to set up a simulated annealing model is a non-trivial skill (I personally know nothing about simulated annealing beyond an awareness that the term exists), but the professor was not swayed. The finding was banal.
In this case we can all appreciate the insignificance of the finding because we all have first-hand experience with batteries and battery-operated appliances. But it’s not hard to imagine a slightly different situation, in some domain with established wisdom: a green data scientist talks up models and shows graphs, only to present a finding that gets dismissed with “Well, yeah, that’s been industry best-practice knowledge for 3 years now.”
Or perhaps one might focus on the wrong issue. Recently someone posted an interesting problem to the Big Data Malaysia discussion group (a closed Facebook group): she wants to categorize news articles as having a positive or negative effect on the prices of NASDAQ-listed instruments. The majority of the feedback from the group has (so far) focused on the sentiment analysis part of the problem, that is, whether an article says nice things or not-nice things about a particular instrument. But as anyone in the habit of tracking news articles to manage their own investments can tell you, the link between a news article and stock prices is not so simple, at least not for those who invest based on fundamentals. (Automated traders are perhaps a simpler species, but there too the complexity of the algorithms is always increasing, and unlike aggregate human behavior, algorithmic behavior can change very suddenly.) The common initial reaction, to start with sentiment analysis, may therefore be misplaced; accurate attribution may be the more compelling problem to address in this domain.
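To make the distinction concrete, here is a minimal sketch of the kind of lexicon-based sentiment scoring the group’s feedback gravitated toward. The word lists and the example headline are invented for illustration; a real system would use a trained model and far richer features, but even a perfect sentiment score leaves the attribution question untouched.

```python
# Deliberately naive lexicon-based sentiment scoring (illustrative only).
# The word lists below are hypothetical, not from any real sentiment lexicon.
POSITIVE = {"surge", "beat", "record", "growth", "upgrade"}
NEGATIVE = {"miss", "lawsuit", "recall", "decline", "downgrade"}

def sentiment_score(text: str) -> int:
    """Count positive words minus negative words; >0 reads positive, <0 negative."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

article = "chipmaker posts record growth after analyst upgrade"
print(sentiment_score(article))  # a clearly positive tone...
# ...but the score says nothing about WHICH instrument's price the article
# actually moves, or whether the market has already priced the news in.
# That is the attribution problem, and sentiment analysis does not touch it.
```

The design point is not that the scorer is crude (it is), but that no amount of refining it answers the harder question of linking an article to a price movement.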
There is a school of thought in data science that one should “follow the data”. There is obvious merit to this line of reasoning, but to divorce research from domain knowledge is, in my opinion, career-limiting. This is because, in some shape or form, presenting insight is an inescapable aspect of the role of a data scientist.
When people hear the word “presentation”, the caricature that most often comes to mind is PowerPoint-ing a boardroom, but that’s just a small part of the broader world of professional presentation. I present insight every week, in the form of a couple of sentences in email or Skype: “The data suggests this phenomenon only affects SSD nodes, and won’t repro on HDD nodes”, or “the bottleneck seems to be X rather than Y”, or “that bug repros most easily on loads with redirects but also sometimes without redirects, so maybe it’s a race in component A”… no PowerPoint, and usually I don’t even share the graphs. Yet it all counts as presentation.
Less is always more, provided the context is precise and accurate, and context comes from domain knowledge.
Code and models are forgiving (especially in automation). But human beings, especially management, are not. Confidence and trust are very fragile things.
To embrace domain knowledge is to de-risk data science.