Tuesday, August 30, 2016

Do You Trust Your Machine or Your Mind?

Data science is the new buzzword.  The promise of machine learning is to be able to predict anything and everything.  Yet, It seems like the more data we have, the harder the truth is to find.  We hear about some data that doesn't sound right to us.  We ask questions and find out that there are assumptions and biases all over the data.  Even if the data was true, once it is analyzed, it becomes contaminated in some way.  With such things, how can we possibly trust it?  Instead, as Adam Savage put it, the best course of action seems: "I reject your reality and substitute my own."

The reality of your mind is: "Your mind is crazy and tells you lies."  Your brain has to do the same thing the machine does in assembling data into a complete picture that a data analysis process does.  (An analogy would be assembling the building blocks to the right into a single creation like a castle or whale.)  It can do it, but the reality is it takes a lot of skill and a lot of thought.

Pieces for a mind to assemble into a single picture.

The downside to doing it in your brain is:

  • There is no documentation of how the picture was formed from the data
  • There is no record of what data your mind included and excluded as it assembled its picture
  • It is much harder to question the process your mind used in creating it's picture
  • Is is very hard to maintain consistency so that that the picture your mind creates today is the one it will create a year from now given the same data
Your mind is a black box.  As Andy Ellis put it, "Systems are becoming too complex for risk analysis to be performed by System 1." (gut instinct).  He termed it "The Approaching Complexity Apocalypse".

This doesn't mean data doesn't have it's faults.  No data is the knowledge it represents.  All data requires analysis to produce the picture from the data.  All data has underlying assumptions and biases.  You should expect your data sources to:

  • Publish the methodologies they use to product the pictures from the data
  • The provenance of the data
  • The known assumptions and biases, both of the data and of the methodology
Also, data science is not quite classic science.  Classically, science follows the scientific method.  In classic science, a hypothesis is first established and then tests are created collect data to disprove that hypothesis. If the tests fail, they hypothesis is accepted.  Normally in data science, we start with the data and use it to identify hypotheses that are true.  XKCD highlighted the issue with this nicely:


There will always be unknown assumptions and biases in data, but if you use them to ignore the data you put yourself at a disadvantage.  If you conduct 100 studies, none of which are statistically significant, but all predicting the same thing, you have strong evidence that the thing is true.

On the other hand, this does not mean you should accept all data-based conclusions that come your way.  As multiple speakers in the bSides Las Vegas Ground Truth track suggested, machines and minds should work together.  The mind can help identify potential biases and assumptions, as well as potential improvements in the machine's methodology.  The machine can produce reproducible results to inform the mind's decisions.

The worst thing you can do is identify biases, assumptions, and flaws in the machine and then use them to justify the validity of your mind.  If you were to do so, you would need to document the methodology of your mind and subject it to the same scrutiny for biases, assumptions, and flaws.  At which point, the methodology would then be in the machine.

And if you can't make your mind and the machine agree, my preference is to trust whichever system is most thoroughly documented, investigated, and validated.  And that tends to be the machine.