Thursday, August 10, 2017

PowerBI vs Tableau vs R

Yesterday at the Nashville Analytics Summit I had the pleasure of demonstrating the strengths, weaknesses, similarities, and differences between Microsoft PowerBI, Tableau, and R.

The Setup

Last year when I spoke at the summit, I provided a rather in-depth review of of the DBIR data workflow.  One thing I noticed is the talk was further along in the data science process from most attendees who were still working in Tableau or even trying to decide what tool to use for their organization.  This year I decided to try and address that gap.

I recruited Kindall (a daily PowerBI user) and Ian (a daily Tableau user) to help me do a bake-off.  Eric, our moderator, would give us all a dataset we'd never seen (and it turned out, in a domain we don't work in) and some questions to answer.  We'd get them at 8:30 in the morning and then spend the day up until our talk at 4:15 analyzing the dataset and answering the questions.  (I got the idea from the fuzzing vs reverse engineering panel at Defcon a few years ago.)

The dataset was about 100,000 rows and 50 or so columns (about half medications given) related to medical stays involving diabetes.  The features were primarily factors of various sorts with a continuous feature for time in the hospital (the main variable of interest).

The Results

I'll skip most of the findings from the data as that wasn't really the point.  Instead I'll focus on the tools.  At a basic level, all three tools can create bar charts very quickly including color and alpha.  Tableau and PowerBI were very similar so I'll start there.

Tableau and PowerBI Similarities

  • Both are dashboard based
  • Both are driven from the mouse, dragging and dropping features into the dashboard
  • Both have a set of visualization types pre-defined that can be used
  • Both allow interactivity out of the box with clicking one chart subsetting others

Tableau and PowerBI Differences:

  • PowerBI is a bit more web-based.  It was easy to move from local to cloud and back.
  • PowerBI has more robust integration with other MS tools and will be familiar to excel users (though the formulas have some differences compared to excel as they are written in DAX).
    PowerBI keeps a history of actions that allow you to go backwards and see how you got where you are.
  • To share a dashboard in PowerBI you simply share a link to it.
  • Finally, PowerBI is pretty easy to use for free until you need to share dashboards.
  • Tableau Is more desktop application based.
  • You can publish dashboards to a server if you have the enterprise version or you can install the Tableau viewer app (however that still requires the receiver install software).  Also, sharing the actual workbook basically removes any security associated with your data.
  • Tableau dashboards can also be exported as PDFs but it is not the primary approach.
  • Tableau allows good organization of data within the GUI to help facilitate building the dashboard.
  • Tableau lacks the history though so there is no good way of telling how you did what you did.

Differences between R and Tableau/PowerBI

Most differences came between R and the other two tools
  • While PowerBI and Tableau are driven by the mouse and interact with a GUI, R is driven from the keyboard and interacts with a command-line.
  • In PowerBI or Tableau, initial investigation basically involves throwing features on the x and y axis and looking at the result.  Both provide the ability to look at the data table behind the dashboard but it's not really part of the workflow.  In R, you normally start at the data with something like `dplyr::glimpse()`, `summary()`, or `str()` which give you some summary statistics about the actual data.
  • In R you can build a dashboard similar to PowerBI or Tableau using the Shiny package, but it is _much_ harder.  Rather than be drag-and-drop, it is very manual.  To share the dashboard, the other person either needs Rstudio to run the app or you need a shiny server. (Shiny servers are free for a single concurrent user but cost money beyond that.)
  • R dashboards allow interaction, but it is again, more laborious.
  • R, however, you can actually do pretty much anything you want.  As an example, we discussed plotting the residuals of a regression.  In R it's a few lines.  In Tableau and PowerBI there was no straight-forward method at all.  The only options were to create a plot with a trend line (but no access to the underlying trend line model).  We discussed building more robust models such as a decision tree for classification.  Kindall found an option for it in PowerBI, but when she clicked it, it was basically just a link to R code.  Finally, the concept of tidyr::gather() (which combines a set of columns into two columns, 1 for the column names, and one for the column values) was both unknown and very appealing to Ian but unavailable in Tableau.)
  • R can install packages.  As far as we could tell, Tableau and PowerBI do not.  That means someone can add Joy plots to R on a whim.
  • In R, making the initial image is harder.  It's at least data plus an aesthetic plus a geom.  To get it to match the basic figure in PowerBI and Tableau is a lot harder, potentially adding theme information, possibly additional geoms for labeling columns, etc.  However, the amount of work to improve a figure in R scales linearly.  After you have matching figures across all three tools, if you wanted to, say, put a plot of points in the background with a lower opacity, that's a single line similar to `geom_jitter(alpha=0.01) + `.  Thats about the same amount of work as to make any other change.  In Tableau or PowerBI, it would be hours of messing with things to make such simple additions or modifications (if it's possible at all).  This is due to R's use of the Grammar of Graphics for figure generation.
  • Using the Grammar of Graphics, R can make incredible reports.  PDFs can be consumer quality. (Figures for the DBIR are mostly created in R with only minor updates to most figures by the layout team.)


  1. The most important takeaway is that R is appropriate if you verbalize what you want to do, Tableau/PowerBI are appropriate if you can visualize the final outcome but don't know how to get there.  
    •  For example "I want to select subjects over 30, group them by gender, and calculate average age."  That can quickly be translated to R/dplyr verbs and implemented.  Regardless of how many things you want to do, if you can verbalize them, you can probably do them. 
    •  If you can visualize your final figure, you can drag and drop parts until you get to something close to what you want to do.  It's trial and error, but it's quick and  easy.  On the other hand, it only works for fairly straight-forward outcomes.
  2. PowerBI and Tableau are useful to quickly explore data.  R is useful if you want to dig deeper.
  3. Anything you can do in PowerBI and Tableau, you can do in R.  It's just going to be a lot harder.
  4. On the other hand, VERY quickly you hit things that R can do but Tableau or PowerBI cannot (at least directly).  The solution is that PowerBI and Tableau both support running R code internally.  This has it's own issues:
    • It requires a bit of setup.
    • If you learn the easy stuff in PowerBI or Tableau, but try to do the hard stuff in R, it'll be even harder because you don't know how to do the basics in R.
    • That said, once you've done the setup, you can probably just find how someone else has solved the problem in R and copy and paste it into your dashboard
    • Then, after the fact, you can go back through and teach yourself how the code actually did whatever hard thing you had it do.
  5. From a data model perspective, R is like excel while PowerBI and Tableau are like a database.  Let me demonstrate what I mean by example: 
    • When we started analyzing, the first thing the other two did was add a unique key to the data.  The reason is that without a key they aren't able to reference rows individually.  They tend toward bar charts because their tools automatically aggregate data.  They don't even think that they are summing/averaging the groups they are dragging in as it's done automatically
    • For myself using R, each row is inherently an observation.  As such as I only group explicitly and first create visualizations that are scatter plots, density plots, etc given my categorical variables by a single continuous variable.  On the other hand, Tableau and PowerBI make it very simple to link multiple tables and use all columns across all tables in a single figure.  In R, if you want to combine two data frames, you have to manually join them.
  6. Output: Tableau and PowerBI are designed primarily to produce dashboards.  Everything else is tacked on.  R is the opposite.  It's designed to produce reports of various types.  Dashboards and interactivity are tacked on.  That said, there is a lot of work going on to make R more interactive through the production of javascript-based visualizations.  I think are likely to see good dashboards in R with easy modification before we see easy high-quality report generation from PowerBI and Tableau.

Final Thoughts

This was a very good experience.  It was not a competition but an opportunity to see and discuss how the tools differed.  It was a lot of fun (though in some ways felt like a CTF, sitting in the vendor area doing the analysis.  Being under some pressure as you don't want to embarrass your tool (and by extension its other users).  I really wish I'd included Kibana or Splunk as I think they would have been different enough from PowerBI/Tableau or R to provide a unique perspective.  Ultimately I'm hoping it's something that I or the conference can do again as it was a great learning opportunity!

Wednesday, May 3, 2017

Elasticsearch. Logstash. R. What?!


At bsidesNash, Chris Sanders gave a great talk on threat hunting.  One of his recommendations was to try out an ELK (Elasticsearch, Logstash, Kibana) stack for searching for threats in log data.  ELK is an easy way to stand up a distributed, scalable, stack capable of storing and searching full text records.  The benefit is it's easy ingestion (Logstash), schema-agnostic storage ability, (Elasticsearch), and robust search and dashboards (Kibana) makes it easy platform for threat hunters.

However, because of it's ease, ELK tends to be a one-size-fits-all solution for many tasks.  I had asked Chris about using other tools for analysis such as R by way of Rstudio and dplyr or Microsoft Power BI.  Chris hadn't tried it and, at the time, neither had I.  (My day job is mostly historic data analysis rather than operational monitoring.)


However, the DBIR Cover Challenge presented an opportunity.  For those who are unaware, each year there is a code or codes hidden on the DBIR cover.  That code then leads to a puzzle challenge which has resulted in some nice rewards for the winners; (iPad minis, auto-follow telescopes, Yeti coolers, quadcopters, 3D printers, and more).  The challenge has multiple puzzles of which players must complete 8.  So that they check their answers as they go, the site is a dynamic webapp hosted at Heroku.  Because it is dynamic, I can add my own log messages into the endpoint functions.

But I needed a place to store and search the logs.  Heroku provides some great plugins for this, but, given the conversation with Chris, I figured I'd try to roll my own, starting with ELK.  The first hurdle was that, though there is a lot of hosted Elasticsearch and Kibana, there was much less hosted Logstash (the part I really needed).  Elastic cloud didn't have it.  AWS had their own tools.  Finally I found which works perfectly.  They provide a full ELK stack as a cloud service for around $20 at the low end with a 14 day trial. I signed up for the trail and was up-and-running in minutes.  They even have an easy one-line instruction on how to set up a Heroku drain to send logs to a Logstash endpoint.  From there, it is automatically stored in Elasticsearch and searchable through Kibana.

Going beyond ELK

The problem I quickly found out, was that Kibana didn't have the robust manipulation I was used to using R.  While it could find entries and make basic dashboards, I simply couldn't cut the data like I wanted once I'd found the subset of data I was interested in.  I tried passing the data to PowerBI, but on first blush, the streaming API setup was too limited to ingest a heroku drain using the basic setup tools.  Finally, I decided to try and keep the Logstash and Elasticsearch underpinnings, but switch to R for analysis.  R allows for simple pipeline analysis of data as well as robust charting.

Doin it with R

The first step was to install the packages I'd need:
install.packages("dplyr") # for simple piped data processing
install.packages("elastic") # for talking to the Elasticsearch store
install.packages(flexdashboard) # for creating a dashboard to monitor
install.packages("DT") # for displaying a HTML data table in the dashboard
install.packages("stringr") # simple string manipulation
install.packages("ggmaps", "viridis", "rgeolocate", "leaflet") # geocoding IPs and displaying them on a map
install.packages("devtools", "treemap") # create treemaps
devtools::install_github("Timelyportfolio/d3treeR") # create treemaps
After installing packages, the next step was to set up the Elasticsearch connection:
elastic::connect(es_host="<my ES endpoint>", es_port=443, es_path="", es_transport_schema = 'https', headers=list(apikey="<my api key>"))
I also manually visited: "https://<my ES endpoint>/_cat/indices?v&apikey=<my API key>&pretty=true" to see what indexes Logstash was creating.  It appears to create an index per day and keep four indexes in the default setup.  I stored them into a variable and then ran a query, in this case for the line log line indicating a player had submitted a specific key:
indexes <- c("logstash-2017.04.28", "logstash-2017.04.29", "logstash-2017.04.30", "logstash-2017.05.01") # I should be able to get this from `elastic::cat_indices()`, but it did not apply my apikey correctly
query <- elastic::Search(index=indexes, q="logplex_message:submitted", size=10000)$hits$hits
The following thing we need to do is remove only the fields we want from the query.  The result is a list of query results, each itself a list of key:value pairs.  I used the `lapply` function to extract _just_ the logplex_message field.  (`lapply` takes a function and applies it to each item of a list in R.)  `lapply` returns a list and so I `unlist` the results and make them a column in a dataframe:
submissions <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$logplex_message))
In our puzzle challenge, we have 'trainers' who use 'keys' to indicate they've caught Breachemon.  I can use my normal R skills to separate the trainer name and key from the log message and count how many times each trainer has submitted each key:
submissions <- submissions %>%
    mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract 'trainer'
    mutate(key = gsub(".*submitted key (.*) to the bank.$", "\\1", text)) %>% # extract 'key'
    group_by(trainer, key) %>% # group each trainer-key pair
    tally() # short cut for `summarize(n=n())`.  For each trainer-key pair, create a column 'n' with the number of times that pair occurred
From there we can visualize the table with:
We could also visualize the total submissions per trainer:
submitters <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$logplex_message)) %>% # extract the log message and produce a dataframe
mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract the trainer
group_by(trainer) %>% # create a group per trainer
tally() # shortcut for `summarize(n=n())`. Count the events per group
d3treeR::d3tree2(treemap::treemap(submitters, "trainer", "n", aspRatio=5/3, draw = FALSE)) # produce a treemap of submissions per person

Dashboard Time

To wrap this all together, I decided to make a simple dashboard.  In the Rstudio menu, File->New File->R Markdown...  In the menu, choose 'From Template' and then Template: 'Flex Dashboard'.  You'll get something like:
title: "Untitled"
    orientation: columns
    vertical_layout: fill
```{r setup, include=FALSE}
Column {data-width=650}
### Chart A
Column {data-width=350}
### Chart B
### Chart C
Lets add our two charts:
title: "Breachemon"
    orientation: columns
    vertical_layout: fill
```{r setup, include=FALSE}
elastic::connect(es_host="<my ES endpoint>", es_port=443, es_path="", es_transport_schema = 'https', headers=list(apikey="<my api key>"))
query <- elastic::Search(index=indexes, q="logplex_message:submitted", size=10000)$hits$hits
Column {data-width=650}
### Submissions
```{r fig.keep='none'}
submitters <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$logplex_message)) %>% # extract the log message and produce a dataframe
mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract the trainer
group_by(trainer) %>% # create a group per trainer
tally() # shortcut for summarize(n=n()).  Count the events per group
d3treeR::d3tree2(treemap::treemap(submitters, "trainer", "n", aspRatio=5/3, draw = FALSE)) # produce a treemap of submissions per person
### Submitters
data.frame(text = unlist(lapply(query, function(l) {l$`_source`$logplex_message}))) %>%
    mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract 'trainer'
    mutate(key = gsub(".*submitted key (.*) to the bank.$", "\\1", text)) %>% # extract 'key'
    group_by(trainer, key) %>% # group each trainer-key pair
    tally() # short cut for `summarize(n=n())`.  For each trainer-key pair, create a column 'n' with the number of times that pair occurred
Column {data-width=350}

### Map
ips <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$msg_fwd))
geo <- rgeolocate::db_ip(as.character(unique(ips$text)), "<my free api key>") # geocode unique IPs, returns a list
geo <-, geo) # bind the list together as a dataframe
names(geo) <- c("IP", "Country", "State", "City") # set the dataframe column names
geo <- ips %>%
    group_by(text) %>%
    tally() %>% # count per IP
    rename(IP = text) %>%
    right_join(geo, by="IP") # join with geolocation
cities <- unique(as.character(geo$City)) # unique list of cities
cities <- cbind(ggmap::geocode(cities), cities) # geo code the cities
geo <- right_join(geo, cities, by=c("City" = "cities")) #join it back together
pal <- leaflet::colorFactor(viridis::viridis_pal(option = "C")(2), domain = geo$n) # create a color range
leaflet::leaflet(geo) %>% # make a map
  leaflet::addTiles() %>% # add some default shapes to it
  leaflet::addCircleMarkers(color = ~pal(n)) # add a circle with a color based on the count of submissions for each IP
Resulting in:

The last block pulls the msg_fwd field which contains the source IP adddress, splits it (as some have multiple), and stores it in a dataframe.  It then geolocates the IPs and binds the cities.  After that it geocodes latitude and longitude and joins it.  Finally it places the geolocated and coded IPs as dots on a map.


That's not to say there aren't hang-ups.  You _are_ pulling the data from the remote cluster to your local machine which is a relatively costly action.  (The queries I ran returned in a fraction of the second, but I can imagine querying a billion record store, returning tens of thousands of hits, would be slower.)  However, as Chris noted during his talk, not being selective in what you retrieve to search is one of the signs of a junior analyst.  Also, I have not automated retrieval of more than 10,000 records or the automatic tracking of indexes as they are created.  Finally, the dashboard must be refreshed manually.  There's a little button to do so in the Rstudio browser, however I think it may make more sense to provide a Shiny button to use to update all or selected portions instead.  Unfortunately, most of this goes beyond the few hours I was willing to put into this. proof of concept.

In the end, it was well worth the experimentation.  It required no hardware and brings the robust slicing and dicing of data that the R ecosystem provides to the easy and scalable storage of ELK. Though the service doesn't allow direct configurability of most of the ELK stack, they seem responsive to requests.  I'm actually not sure that the ES portion of ELK is really necessary.  If you are working with a limited number of well-defined data sources, a structured store such as Postgres, or a key:value store such as hive/hbase might make more sense.  R has nearly the repository of packages that Python does.  On my mac pro I can work with datasets in the 10's of millions of records, providing all sorts of complex analysis.  All in an easily-documentable and repeatable way.

In the future, I'd love to see the same thing done with MS PowerBI.  It's not a platform I know, but I think it would definitely be an interesting one to explore.  If anyone has any ideas on how to stream data to it, please let me know!

Tuesday, November 29, 2016

How to Handle Being Questioned

In my post, How to Converse Better in Infosec, I laid out some rules for better infosec discussions.  A key tenent of that blog post was asking questions.  But what if you are on the receiving end of that?

To the questioned:

When expressing a view, being questioned feels like a challenge.  For me, it feels as if the other person doesn't believe me and is trying to catch me in a lie.  Frankly, maybe I did embellish a bit.  Maybe I made a statement based on something I thought I remembered hearing but don't quite remember where I heard it.  Or maybe I feel the statement is so obvious, the only reason someone would question it is if the other person wanted to try and take me down a rung.

It's OK.  If, as speakers, we feel we are in the right, we can treat all questions as if the questioner doesn't know the answer and is seeking help learning, or there is some ambiguity in the questioner's mind and they are just trying to help clarify it.  (Remember, for topics we are knowledgeable on, it is hard to see the subject from the perspective of a less-informed person.)  Answer with the intent of being as genuinely helpful as possible.  Have fun!  This is our chance to help someone out!

And if we don't have the answer, we can be polite and say so.  "I honestly can't demonstrate it right now.  If you'll allow me the time, I'll collect the information for you and get back to you.  And, in the event I can't, I'll let you know."  Everyone is wrong at some point.  Big people can admit it and only weak people don't accept it from others.

And to the questioner:

Be aware that you may be unintentionally putting the questioned person in an emotionally defensive position.  They may have all the answers and be able to clearly explain it.  They may be right, but need time to collect the evidence to demonstrate it.  They may be flat out wrong but not prepared to say so.

Be a good participant in the social dynamic.  If the other person can't answer, is evasive, or is demonstrating some technique to avoid answering, give them an out.  Say, "It's OK, let's pick this up again later."  Or "If you find/remember the answer, please message it to me."  If the question is unimportant to you, you lose nothing by letting it go until the questioned person brings it up to you again.  And if it is truly relevant to you, you can look it up yourself.  If you feel you can't let it go, ask yourself if you're truly practicing the principle of charity.

In conclusion

Remember, a conversation involves multiple people. You're all in it together. Either everyone wins or everyone loses. So help everyone win.

Tuesday, November 22, 2016

What is most important in infosec?

"To crush your enemies -- See them driven before you, and to hear the lamentation of their women!" - Conan the Barbarian

Maybe not.


Recently I asked if vulnerabilities were the most important aspect of infosec.  Most people said 'no', and the most common answer instead was risk.  Risk is likelihood and consequence (impact). (Or here for a more infosec'y reference.)  And as FAIR points out, likelihood is threat and vulnerability. (Incidentally, this is a good time to point out, when we say 'vulnerability', we aren't always saying the same thing.)  While in reality, as @SpireSec points outthreat is probably more important, I suspect most orgs make it a constant 'TRUE' in which case 'likelihood' simply becomes 'vulnerability' in disguise.  I doubt many appreciate the economic relationship between vulnerability and threat.  As many people pointed out, the impact of the risk is also important.  Yet as with 'threat', I suspect it is rarely factored into risk in more than a subjective manner.  There were other aspects of risk such as vulnerable configurationsasset management and user vulnerability.  And there were other opinions such as communication, education and law.


The first big take-away is that, while we agree conceptually that risk is complex and that all its parts are important, practically we reduce 'risk' down to 'vulnerability' by not dynamically managing 'threat' or 'impact'.  While most organizations may say they're managing risk, very likely they're really just managing vulnerabilities.  At best, when we say 'managing', we probably mean 'patching'.  At worst, it's buying and blindly trusting a tool of some kind.  Because, without understanding how those vulnerabilities fit into the greater attack-surface of our organization, all we can do is patch and buy.  Which leads to the second take-away...

Attack Surface

The second take-away "I think we need to change the discussion from vulns to attack surface." Without understanding its attack surface, an organization can never move beyond swatting flies.  If an organization is a city and they want to block attackers coming in, what we do is like blocking one lane of every road in.  Sure, you shut down a lot of little roads, but the interstates still have three lanes open.  And what about the airport, busses, and beaches?

Our Challenges

Unfortunately, if we can't move from vulns to full risk, our chances of moving beyond simple risk to attack surface are slim.  At least in FAIR, we have the methodology to manage based on full risk, if not attack surface.  However, while vulnerabilities are the data is not easy to collect.  It's not easy to combine and clean.  And it's not easy to analyze and act upon.  (All the things vulnerability data is.)  We don't even have national strategic initiatives for threat and impact, let alone attack surface the way we do for vulnerabilities, (for example bug bounties, and I Am The Cavalry).

In Conclusion

Yet we continue to spend our money and patch vulnerabilities with little understanding of the risk it addressed, let alone how that risk fits into our overall attack surface.  But for those willing to put in the work, the tools do exist.  And eventually we will make assessing attack surface as easy as a vulnerability assessment.  Until then though, we will continue to waste our our infosec resources, wandering blindly in the dark.


The third and final take-away is that the whole discussion completely ignores operations, (the DFIR type vs the installing-patches type).  In reality, it may be a strategic decision, but the trade-offs between risk and operations based security are better left for another day blog.

Tuesday, October 18, 2016

Why Phishing Works

Why Phishing Works

I've been asked many times why old attacks like phishing or use of stolen credentials still work.  It's a good, simple, question.  We are fully aware of these types of attacks and we have good ways of solving them.  Unfortunately, there's just as simple an answer:
"The reason attackers use the same methods of attack is we assume they won't work."
 We conduct phishing training.  We install mail filters. And when something gets through, we treat it as an anomaly.  A trouble ticket.  Yet, from the 2016 DBIR, about 12% of recipients clicked the attachment or link in a phishing email.  Imagine if that happened in airplanes; for example, if 12% of bolts in an airplane failed every flight.  They wouldn't simply take the plane in for repairs when bolts failed.  They'd build the plane to fly even if the bolts failed.

This leads to a fundamental tenant of information security:

"Your security strategy CANNOT assume perfection.  Not in people. Not in processes. Not in tools.  Not in defended systems."

When you assume anything will work perfectly and treat failures as a trouble ticket, you cede an advantage to the attacker.  They are well aware that if they fire off 100 phishing emails, 10 will hit the mark.

What To Do

Do what engineers have been doing for generations, engineer resilience and graceful degradation into the system.  Assume phishing, credential theft, malware, and other common attacks WILL succeed and plan accordingly.  Build around an operational methodology.  Work under the assumption that phishing has succeeded in your organization, that credentials have been stolen, that malware is present, and that your job is to find the attacker before they find what they're looking for.

Attackers are just some other guy or gal, sitting in their version of a cube, somewhere else in the world.  They want their attacks to happen quickly and with as little additional effort as possible.  They take advantage of the fact that we treat their initial action succeeding as an anomaly.  If we assume that initial action will be partially successful and force them to exert additional effort and actively work to remain undetected, we decrease their efficiency and improve the economics of infosec in our favor.

Thursday, September 22, 2016

How to Converse Better in Infosec

In a previous blog, I spoke a bit about what to do when the data doesn't seem to agree with what we think.  But what if it's not data you disagree with, but another person?

We've grown up in a world where the only goal in a conversation is to simply be right. It is all around us and, unfortunately, drives how we converse with other professionals.  Whether it's a twitter thread or questions at the end of a conference talk, we tend to look to tear down others to build ourselves up.  The mantra "Defense has to be perfect, offense only has to succeed once" pushes us to expect it in our technical dialog even though no one and no thing is perfect.

Let's change that.  The next time you are on twitter, at a conference, or engaging in discussion with colleagues, try and follow the Principle of Charity.  I highly recommend you read the link, but the basic premise is:
Accept what the other says if it could be true.
Now, obviously it's more complex than that. It's more like "dato non concesso" which means "given, not conceded". You are accepting their statements where logic otherwise does not prevent you from doing so, not because you believe they are true, but simply because you believe they were given in good faith. It also means interpreting statements in the way most likely to be true.
If the other says something that sounds conditionally untrue, ask questions that would help clarify that it is true.
It doesn't mean you have to accept statements that can't be true. It doesn't mean you can't confirm your interpretation. And it doesn't mean you can't ask clarifying questions.  If the other's statement could be conditionally true, ask questions that help clarify that the conditions are those that make the statement true.
Do not ask questions or make statements to try and prove the other's assertion false.
It does, however, mean not nitpicking.  It does mean not taking statements out of context or requiring all edge cases be true.  If the other's position truly is false, you will simply fail at clarifying it as true.

And if we do we should be doing this, we should do one more thing:
Expect others to follow the same principles.
We should not, as a community, accept members not following this principle.  Conversations contradictory to the Principle of Charity bring our community down and they inhibit growth.  However, we will only root it out if we take a stand and speak out against it.  Whether at conferences, in blogs, in podcasts, on twitter, or anywhere else, it improves us none to tear down rather than build up.  I challenge you to adopt the Principle of Charity in your conversations, starting today, and make it a goal for the entire year!

Update: Also check out the follow-on blog: How to Handle Being Questioned!

Tuesday, August 30, 2016

Do You Trust Your Machine or Your Mind?

Data science is the new buzzword.  The promise of machine learning is to be able to predict anything and everything.  Yet it seems like the more data we have, the harder the truth is to find.  We hear about some data that doesn't sound right to us.  We ask questions and find out that there are assumptions and biases all over the data.  Even if the data were true, once it is analyzed, it becomes contaminated in some way.  With such things, how can we possibly trust it?  Instead, as Adam Savage put it, the best course of action seems: "I reject your reality and substitute my own."

The reality of your mind is: "Your mind is crazy and tells you lies."  Your brain has to do the same thing a data analysis process does: assemble data into a complete picture.  (An analogy would be assembling a pile of building blocks into a single creation like a castle or whale.)  It can do it, but the reality is it takes a lot of skill and a lot of thought.

Pieces for a mind to assemble into a single picture.

The downside to doing it in your brain is:

  • There is no documentation of how the picture was formed from the data
  • There is no record of what data your mind included and excluded as it assembled its picture
  • It is much harder to question the process your mind used in creating its picture
  • It is very hard to maintain consistency, so that the picture your mind creates today is the one it will create a year from now given the same data
Your mind is a black box.  As Andy Ellis put it, "Systems are becoming too complex for risk analysis to be performed by System 1." (gut instinct).  He termed it "The Approaching Complexity Apocalypse".

This doesn't mean data doesn't have its faults.  No data is the knowledge it represents.  All data requires analysis to produce the picture from the data.  All data has underlying assumptions and biases.  You should expect your data sources to:

  • Publish the methodologies they use to produce the pictures from the data
  • Document the provenance of the data
  • Disclose the known assumptions and biases, both of the data and of the methodology
Also, data science is not quite classic science.  Classically, science follows the scientific method.  In classic science, a hypothesis is first established and then tests are created to collect data to disprove that hypothesis.  If the tests fail to disprove it, the hypothesis is accepted.  Normally in data science, we start with the data and use it to identify hypotheses that appear to be true.  XKCD highlighted the issue with this nicely:

There will always be unknown assumptions and biases in data, but if you use them to ignore the data you put yourself at a disadvantage.  If you conduct 100 studies, none of which are statistically significant, but all predicting the same thing, you have strong evidence that the thing is true.
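The intuition behind combining many weak studies can be sketched with Fisher's method, a standard way of pooling independent p-values.  This is a minimal illustration, not part of the original post: it assumes 100 hypothetical independent studies, each with a one-sided p-value of 0.2 (far from significant on its own), and shows the combined statistic easily clearing a 99% significance threshold.

```python
import math

# Fisher's method: combine k independent p-values into one statistic.
# X = -2 * sum(ln p_i) follows a chi-squared distribution with 2k
# degrees of freedom if every study's null hypothesis were true.
p_values = [0.20] * 100          # 100 studies, none individually significant
k = len(p_values)
fisher_stat = -2 * sum(math.log(p) for p in p_values)

# Wilson-Hilferty approximation to the 99th-percentile critical value
# of a chi-squared distribution with df degrees of freedom (z = 2.326).
df = 2 * k
z = 2.326
critical_99 = df * (1 - 2 / (9 * df) + z * math.sqrt(2 / (9 * df))) ** 3

print(f"Fisher statistic: {fisher_stat:.1f}")            # ~321.9
print(f"99% critical value (df={df}): {critical_99:.1f}")  # ~249.5
print("Combined evidence significant:", fisher_stat > critical_99)
```

Note the method assumes the studies are independent and testing the same directional claim; correlated studies would overstate the evidence.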

On the other hand, this does not mean you should accept all data-based conclusions that come your way.  As multiple speakers in the bSides Las Vegas Ground Truth track suggested, machines and minds should work together.  The mind can help identify potential biases and assumptions, as well as potential improvements in the machine's methodology.  The machine can produce reproducible results to inform the mind's decisions.

The worst thing you can do is identify biases, assumptions, and flaws in the machine and then use them to justify the validity of your mind.  If you were to do so, you would need to document the methodology of your mind and subject it to the same scrutiny for biases, assumptions, and flaws.  At which point, the methodology would then be in the machine.

And if you can't make your mind and the machine agree, my preference is to trust whichever system is most thoroughly documented, investigated, and validated.  And that tends to be the machine.