Information Security Analytics Blog: August 2017

Several years ago, I blogged about Balkanizing the Internet. More than ever it appears that a digital feudalism is emerging. A driver that I didn't necessarily consider is the automation of security.

Automation in Infosec

The future of security is speed and persuasiveness. Whoever accomplishes the OODA loop (or additive factors if you like) first has an incredible advantage. In information security, that means automation and machine learning making contextual decisions faster than humans ever could. It will be defense's algorithms against offense's. The second part is probably more interesting. Machine learning is output generated from input. In essence, humans are a much less predictable version of the same. As such, any actor or algorithm, offensive or defensive, that can figure out what input to the opposing side produces the outcome it wants, and provide that input before losing, will win. Because it needs to happen at speed, it's also likely to be algorithmic. We already train adversarial models to do this.

Infosec 1%'ers

The need for speed and persuasiveness driving automation and artificial intelligence in information security is its own blog post. I touch on it here because, in reality, it only describes the infosec 1%'ers. While a Google or Microsoft may be able to guard their interests with robust automation and machine learning, the local app developer, law office, or grocery store will not.

Which brings us to the recent malware. It should be a wake-up call to all information security professionals. It utilizes no new knowledge, but it provides a datapoint in the trend of automation. While the 1%, or even 50%, defender might not be affected, the publicly known level of automation in infosec attack is easily ahead of a large portion of the internet and appears to be growing faster than defensive automation, which is slowed by adherence to engineering practices for system management. Imagine malware automating the analysis process in BloodHound. Imagine that an attack graph, knowledgeable about how to turn emails/credentials/vulnerabilities into attacks/malware, and malware/attacks into emails/credentials, were built into a piece of malware, causing it to spread unhindered as it creeps across the trust relationships that connect everyone on the planet. This could easily be implemented as a plugin for a tool such as Armitage.

Balkanization

This brings us back to the Balkanization of the Internet. In the near future, the only way to defend systems may be to cede control, regardless of the obligations, to the infosec 1%'ers. The only people protected will be those who allow automated systems to guard, modify, and manage their systems. Your choice may be to allow Google to monitor all traffic on your internal network so that their models can defend it, or to quickly fall victim to roving automated threats. The internet will have devolved into roaming threats, only kept at bay by feudal lords able to oppose them.

Yesterday at the Nashville Analytics Summit I had the pleasure of demonstrating the strengths, weaknesses, similarities, and differences between Microsoft PowerBI, Tableau, and R.

The Setup

Last year when I spoke at the summit, I provided a rather in-depth review of the DBIR data workflow. One thing I noticed is that the talk was further along in the data science process than most attendees, who were still working in Tableau or even trying to decide what tool to use for their organization. This year I decided to try to address that gap.

I recruited Kindall (a daily PowerBI user) and Ian (a daily Tableau user) to help me do a bake-off. Eric, our moderator, would give us all a dataset we'd never seen (and, it turned out, in a domain we don't work in) and some questions to answer. We'd get them at 8:30 in the morning and then spend the day up until our talk at 4:15 analyzing the dataset and answering the questions. (I got the idea from the fuzzing vs reverse engineering panel at Defcon a few years ago.)

The dataset was about 100,000 rows and 50 or so columns (about half medications given) related to medical stays involving diabetes. The features were primarily factors of various sorts with a continuous feature for time in the hospital (the main variable of interest).

The Results

I'll skip most of the findings from the data as that wasn't really the point. Instead I'll focus on the tools. At a basic level, all three tools can create bar charts very quickly including color and alpha. Tableau and PowerBI were very similar so I'll start there.

Tableau and PowerBI Similarities

Both are dashboard based
Both are driven from the mouse, dragging and dropping features into the dashboard
Both have a set of visualization types pre-defined that can be used
Both allow interactivity out of the box with clicking one chart subsetting others

Tableau and PowerBI Differences

PowerBI is a bit more web-based. It was easy to move from local to cloud and back.
PowerBI has more robust integration with other MS tools and will be familiar to Excel users (though the formulas have some differences compared to Excel as they are written in DAX).
PowerBI keeps a history of actions that allows you to go backwards and see how you got where you are.
To share a dashboard in PowerBI you simply share a link to it.
Finally, PowerBI is pretty easy to use for free until you need to share dashboards.
Tableau is more desktop-application based.
You can publish dashboards to a server if you have the enterprise version, or you can install the Tableau viewer app (however, that still requires the receiver to install software). Also, sharing the actual workbook basically removes any security associated with your data.
Tableau dashboards can also be exported as PDFs but it is not the primary approach.
Tableau allows good organization of data within the GUI to help facilitate building the dashboard.
Tableau lacks the history, though, so there is no good way of telling how you did what you did.

Differences between R and Tableau/PowerBI

Most differences came between R and the other two tools.

While PowerBI and Tableau are driven by the mouse and interact with a GUI, R is driven from the keyboard and interacts with a command-line.
In PowerBI or Tableau, initial investigation basically involves throwing features on the x and y axis and looking at the result. Both provide the ability to look at the data table behind the dashboard but it's not really part of the workflow. In R, you normally start at the data with something like `dplyr::glimpse()`, `summary()`, or `str()` which give you some summary statistics about the actual data.
In R you can build a dashboard similar to PowerBI or Tableau using the Shiny package, but it is _much_ harder. Rather than being drag-and-drop, it is very manual. To share the dashboard, the other person either needs RStudio to run the app or you need a Shiny server. (Shiny servers are free for a single concurrent user but cost money beyond that.)
R dashboards allow interaction, but it is, again, more laborious.
In R, however, you can actually do pretty much anything you want. As an example, we discussed plotting the residuals of a regression. In R it's a few lines. In Tableau and PowerBI there was no straightforward method at all. The only option was to create a plot with a trend line (but with no access to the underlying trend line model). We discussed building more robust models such as a decision tree for classification. Kindall found an option for it in PowerBI, but when she clicked it, it was basically just a link to R code. Finally, the concept of `tidyr::gather()` (which combines a set of columns into two columns, one for the column names and one for the column values) was both unknown and very appealing to Ian but unavailable in Tableau.
R can install packages. As far as we could tell, Tableau and PowerBI cannot. That means someone can add Joy plots to R on a whim.
In R, making the initial image is harder. It's at least data plus an aesthetic plus a geom. Getting it to match the basic figure in PowerBI and Tableau is a lot harder, potentially adding theme information, possibly additional geoms for labeling columns, etc. However, the amount of work to improve a figure in R scales linearly. After you have matching figures across all three tools, if you wanted to, say, put a plot of points in the background with a lower opacity, that's a single line similar to `geom_jitter(alpha=0.01) + `. That's about the same amount of work as any other change. In Tableau or PowerBI, it would be hours of messing with things to make such simple additions or modifications (if it's possible at all). This is due to R's use of the Grammar of Graphics for figure generation.
Using the Grammar of Graphics, R can make incredible reports. PDFs can be consumer quality. (Figures for the DBIR are mostly created in R with only minor updates to most figures by the layout team.)
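
A few of the points in the list above can be sketched in a handful of lines. This is only an illustration, not the analysis from the bake-off: the `stays` data frame and every column in it are made-up stand-ins, and it assumes the tidyr and ggplot2 packages are installed.

```r
library(tidyr)
library(ggplot2)

# Hypothetical stand-in data; every column name and value here is an assumption.
set.seed(42)
stays <- data.frame(
  age       = sample(20:90, 200, replace = TRUE),
  gender    = sample(c("F", "M"), 200, replace = TRUE),
  insulin   = sample(c("No", "Steady", "Up"), 200, replace = TRUE),
  metformin = sample(c("No", "Steady"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)
stays$days <- 1 + 0.05 * stays$age + rnorm(200)

# Start at the data rather than at a chart.
str(stays)
summary(stays)

# Residuals of a regression: fit a model, then plot residuals against fitted values.
fit <- lm(days ~ age, data = stays)
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")

# gather(): collapse the medication columns into name/value pairs.
long <- gather(stays, key = "medication", value = "dose", insulin, metformin)

# Grammar of Graphics: data + aesthetic + geom, then one more line per extra layer.
p <- ggplot(stays, aes(x = gender, y = days)) +
  geom_jitter(alpha = 0.2) +   # raw points in the background
  geom_boxplot(alpha = 0.5)    # summary layer drawn on top
print(p)
```

Newer tidyr releases also provide `pivot_longer()` for the same reshaping; `gather()` is used here only because it is the function discussed above.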

Take-Aways

The most important takeaway is that R is appropriate if you can verbalize what you want to do; Tableau/PowerBI are appropriate if you can visualize the final outcome but don't know how to get there.

For example: "I want to select subjects over 30, group them by gender, and calculate average age." That can quickly be translated to R/dplyr verbs and implemented. Regardless of how many things you want to do, if you can verbalize them, you can probably do them.
If you can visualize your final figure, you can drag and drop parts until you get to something close to what you want to do. It's trial and error, but it's quick and easy. On the other hand, it only works for fairly straight-forward outcomes.
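
As a rough sketch of how the verbalized request above maps onto dplyr verbs (the `subjects` data frame and its columns are invented for illustration, and the dplyr package is assumed to be installed):

```r
library(dplyr)

# Hypothetical data; the column names and values are assumptions.
subjects <- data.frame(
  id     = 1:6,
  gender = c("F", "M", "F", "M", "F", "M"),
  age    = c(25, 34, 41, 29, 52, 60),
  stringsAsFactors = FALSE
)

avg_age <- subjects %>%
  filter(age > 30) %>%               # "select subjects over 30"
  group_by(gender) %>%               # "group them by gender"
  summarize(mean_age = mean(age))    # "calculate average age"

print(avg_age)
```

Each clause of the sentence becomes one verb in the pipeline, which is why being able to say what you want tends to be most of the work.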

PowerBI and Tableau are useful to quickly explore data. R is useful if you want to dig deeper.
Anything you can do in PowerBI and Tableau, you can do in R. It's just going to be a lot harder.
On the other hand, VERY quickly you hit things that R can do but Tableau or PowerBI cannot (at least directly). The solution is that PowerBI and Tableau both support running R code internally. This has its own issues:

It requires a bit of setup.
If you learn the easy stuff in PowerBI or Tableau, but try to do the hard stuff in R, it'll be even harder because you don't know how to do the basics in R.
That said, once you've done the setup, you can probably just find how someone else has solved the problem in R and copy and paste it into your dashboard.
Then, after the fact, you can go back through and teach yourself how the code actually did whatever hard thing you had it do.

From a data model perspective, R is like Excel while PowerBI and Tableau are like a database. Let me demonstrate what I mean by example:

When we started analyzing, the first thing the other two did was add a unique key to the data. The reason is that without a key they aren't able to reference rows individually. They tend toward bar charts because their tools automatically aggregate data. They don't even think about the fact that they are summing/averaging the groups they are dragging in, as it's done automatically.
For myself using R, each row is inherently an observation. As such, I only group explicitly, and I first create visualizations such as scatter plots, density plots, etc., plotting my categorical variables against a single continuous variable. On the other hand, Tableau and PowerBI make it very simple to link multiple tables and use all columns across all tables in a single figure. In R, if you want to combine two data frames, you have to manually join them.
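
To make the R side of that contrast concrete, here is a minimal sketch of an explicit join followed by explicit grouping. The two tables and their columns are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical tables; names and values are assumptions for illustration.
stays <- data.frame(
  stay_id    = 1:4,
  patient_id = c(10, 11, 10, 12),
  days       = c(3, 7, 2, 5)
)
patients <- data.frame(
  patient_id = c(10, 11, 12),
  gender     = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

# The join is explicit: nothing links the two tables until you say how.
joined <- left_join(stays, patients, by = "patient_id")

# Aggregation is explicit too: each row stays an observation until grouped.
by_gender <- joined %>%
  group_by(gender) %>%
  summarize(mean_days = mean(days))

print(by_gender)
```

Until the `group_by()` step, `joined` still has one row per stay, which is what makes scatter and density plots the natural first figures.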

Output: Tableau and PowerBI are designed primarily to produce dashboards. Everything else is tacked on. R is the opposite. It's designed to produce reports of various types. Dashboards and interactivity are tacked on. That said, there is a lot of work going on to make R more interactive through the production of JavaScript-based visualizations. I think we are likely to see good dashboards in R with easy modification before we see easy high-quality report generation from PowerBI and Tableau.

Final Thoughts

This was a very good experience. It was not a competition but an opportunity to see and discuss how the tools differed. It was a lot of fun (though in some ways it felt like a CTF: sitting in the vendor area doing the analysis, under some pressure because you don't want to embarrass your tool and, by extension, its other users). I really wish I'd included Kibana or Splunk, as I think they would have been different enough from PowerBI/Tableau or R to provide a unique perspective. Ultimately I'm hoping it's something that I or the conference can do again, as it was a great learning opportunity!

Thursday, August 24, 2017

The Haves and the Have-Nots - Automation of Infosec