Monday, September 18, 2017

Building a ggplot2 stat layer

Introduction

This blog will be about my experience building a simple stat layer for ggplot2.  (For those not familiar, ggplot2 is a plotting library for the R programming language that is highly flexible and extensible.  Layers are things like bars or lines.)
Figure 1: Example exploratory analysis figure

When we write the DBIR, we have automated analysis reports that slice the data and generate figures that we look at to identify interesting concepts to write about. (See Figure 1.  And Jay, if you're reading this, I know, I haven't changed it much from what you originally did.) (Also, for those wondering, 'ignored' is anything that shouldn't be in the sample size.  That includes 'unknown', which we treat as 'unmeasured', and possibly 'NA' for 'not applicable'.)

In Figure 1, you'll notice light blue confidence bars (Wilson binomial confidence intervals at the 95% level). If you look at 'Former employee', 'Activist', and 'Nation-state', you can see they overlap a bit.  What I wanted to do was visualize Tukey groups (similar to multcomp::plot.cld) to make it easy for myself and the other analysts to tell if we could say something like "Former employee was more common than Activist" (implying the difference is statistically significant).
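
For readers who have not met the Wilson interval, here is a minimal base R sketch of how such a confidence bar can be computed.  The function name `wilson_ci()` and the counts are made up for illustration; this is not the DBIR code.  Overlapping intervals are only a visual hint, which is why a separate independence test is used for the bands later on.

```r
# Wilson score interval for a binomial proportion (s successes out of n trials).
# wilson_ci() is a hypothetical helper written for this example.
wilson_ci <- function(s, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- s / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

# Invented counts, not DBIR data.
former_employee <- wilson_ci(12, 80)
activist <- wilson_ci(7, 80)

# Do the two confidence bars overlap?
overlap <- former_employee[["lower"]] <= activist[["upper"]] &&
  activist[["lower"]] <= former_employee[["upper"]]
print(rbind(former_employee, activist))
print(overlap)
```

Base R's `prop.test(s, n, correct = FALSE)` returns the same interval, so the helper is only there to make the formula visible.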

First step: find documentation

The first step was actually rather hard.  I looked for good blogs of others creating ggplot2 stat layers, but didn't turn anything up.  (I'm not implying they don't exist.  I'm terrible at googling.)  Thankfully someone on Twitter pointed me to the vignette on extending ggplot2 that ships with the ggplot2 package.  It's probably the best resource but it really wasn't enough for me to decide how to attack the problem.  I also picked up ggplot2: Elegant Graphics for Data Analysis in the hopes of using it to understand some of the ggplot2 internals.  It primarily dealt with scripting ggplot2 functionality rather than extending it, so I settled on using the example in the vignette as a reference.  With that, I made my first attempt:


### Attempt at making a ggplot stat for independence bars
#' Internal stat to support stat_ind
#' 
#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  required_aes=c("x", "y", "n"), 
  compute_group = function(data, scales) {
    band_spacing <- -max(data$y/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    bands <- testIndependence(data$y, data$n, ind.p=0.05, ind.method="fisher")

    bands <- tibble::as.tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=T)]) %>%
      tidyr::gather("band", "value", -x) %>%
      dplyr::filter(value) %>%
      dplyr::select(-value)
    y_locs <- data.frame("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  }
)


#' ggplot layer to produce independence bars
#' 
#' @rdname stat_ind
#' @inheritParams ggplot2::stat_identity
#' @param na.rm Whether to remove NAs
#' @export
stat_ind <- function(mapping = NULL, data = NULL, geom = "line", 
                     position = "identity", na.rm = FALSE, show.legend = NA,
                     inherit.aes = TRUE, ...) {
  ggplot2::layer(
    stat = StatInd, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}


While it didn't work, it turns out, I was fairly close.  I just didn't know it.

Since it didn't work, I looked into multiple non-layer approaches including those in the book, as well as simply drawing on top of the layer.  Comparing the three options I had:

  • There were approaches that didn't involve actually building a stat layer that probably would have worked; however, they were all effectively 'hacks' for what a stat layer would have done.  
  • The problem with simply drawing on the layer was that the labels for the bars would not be adjusted. 
  • The problem with using a stat (other than it not currently working) was I was effectively drawing a discrete value (band #) onto a continuous axis (the bar height) on the other side of the axis line.  
Ultimately I decided to stick with the stat, though it's technically anti-ggplot to mix axes.  (In the Tukey plot, the 'band' is technically a categorical variable; however, we are plotting it on the long side of the bars, which is continuous.)

Fixing the stat layer

I decided to look at some of the stats that exist within ggplot2.  I'd previously looked at stat_identity() to no avail.  It turned out that was a mistake as stat_identity() basically does nothing.  When I looked at stat_sum() I found what I needed.  In retrospect, it was in the vignette, but I missed it.  Since I was returning 'band' as a major feature, I needed to define `default_aes=ggplot2::aes(color=..band..)`.  (At first I used `c(color=..band..)` though I quickly learned that doesn't work.)


StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  default_aes=ggplot2::aes(color=..band..),
  required_aes=c("x", "y", "n"),
  compute_panel = function(data, scales) {
    band_spacing <- -max(data$y/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    bands <- testIndependence(data$y, data$n, ind.p=0.05, ind.method="fisher")
    bands <- tibble::as.tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=T)]) %>%
      tidyr::gather("band", "value", -x) %>%
      dplyr::filter(value) %>%
      dplyr::select(-value)
    y_locs <- tibble::tibble("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  }
)


With that, I had a working stat.

Figure 2: It works!

Fixing the exploratory analysis reports

Unfortunately, our (DBIR) exploratory reports are a bit more complicated.  When I added `stat_ind()` to our actual figure function (which adds multiple other ggplot2 pieces), I got:


This led to a ridiculous amount of hunting.  I fairly quickly narrowed it down to this line in the analysis report:
gg <- gg + ggplot2::scale_y_continuous(expand = c(0, 0), limits=c(0, yexp), labels=scales::percent) # yexp = slight increase in width
Specifically, the `limits=c(0, yexp)` portion.  Unfortunately, I got stuck there.  Things I tried:
  • Changing the limits (to multiple different things, both values and NA)
  • Setting debug in the stat_ind() `compute_panel()` function
  • Setting debug on the scale_y_continuous() line and trying to step through it
  • Rereading the vignette
  • Rereading other stat_* functions
  • Adding the required aesthetics to the default aesthetics 
  • Adding group to the default aesthetics
What finally worked was this:
  1. Google the error message
  2. Find it in the remove_missing() function in ggplot2
  3. Find what calls remove_missing(): compute_layer()
  4. Copy compute_layer() from the stat prototype into stat_ind().  Copy over internal ggplot2 functions that are needed by the default compute_layer() function.  Hook it for debug.
  5. Looking at the data coming into the compute_layer() function, I see the figure below with 'y' all 'NA'.  Humm.  That's odd...
  6. Look at the data coming in without the ylimit set.  This time 'y' data exists.
  7. Go say 'hi' to my 2yo daughter.  While carrying her around the house, realize that 'y' is no longer within the limits ....
Figure 3: Data parameter into compute_layer() with ylimits set
Figure 4: Data parameter into compute_layer() without ylimits set
So, what happened is, when `limits=c(0, yexp)` was applied, the 'y' data was replaced with 'NA' because the large integer values of 'y' were not within the 0-1 limits.  `compute_layer()` was then called, which called `remove_missing()`, which removes all the rows with `NA`.  This caused the removed rows.
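
The censoring step can be reproduced on its own.  By default, continuous scales hand out-of-range values to `scales::censor()`, which replaces them with `NA`.  A minimal sketch with invented values (the counts and the 0-1 range are stand-ins for the real data and for `c(0, yexp)`):

```r
# Raw counts such as 40 or 25 fall outside limits of c(0, 1) and become NA;
# a proportion such as 0.5 survives.  Values are made up for illustration.
y_counts <- c(40, 25, 0.5)
censored <- scales::censor(y_counts, range = c(0, 1))
print(censored)

# remove_missing() would then drop the rows whose 'y' is NA.
print(sum(is.na(censored)))
```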

The reason this was happening is I'd accidentally overloaded the 'y' aesthetic.  `y` meant something different to ggplot2 than it did to testIndependence() (the internal function which calculated the bands).  The solution was to replace 'y' with 's' as a required aesthetic.  Now the function looks like this:


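StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  # Color each band by the computed 'band' variable
  default_aes=ggplot2::aes(color=..band..), 
  # 's' replaces 'y' so the y scale limits no longer apply to the raw values
  required_aes=c("x", "s", "n"), 
  compute_panel = function(data, scales) {
    # Negative spacing places the bands on the other side of the axis line
    band_spacing <- -max(data$s/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    # testIndependence() is our internal function; %>% needs magrittr (or dplyr) attached
    bands <- testIndependence(data$s, data$n, ind.p=0.05, ind.method="fisher")
    bands <- tibble::as_tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=TRUE)]) %>%
      tidyr::gather("band", "value", -x) %>% # one row per (x, band) pair
      dplyr::filter(value) %>%               # keep only the bands each x belongs to
      dplyr::select(-value)
    # Spread the bands evenly between the axis line and band_spacing
    y_locs <- tibble::tibble("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  } 
)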

With that it finally started working!
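
Since testIndependence() is internal and not shown here, the following is a stripped-down, self-contained sketch of the same pattern that anyone can run.  Everything named `StatDemoBand` or `stat_demo_band` is hypothetical, the band assignment is a simple threshold rather than a real significance test, and `after_stat(band)` is the newer ggplot2 spelling of `..band..`.

```r
library(ggplot2)

# Hypothetical stat: assigns each x to a band by a threshold on s / n.
# The threshold stands in for a real independence test.
StatDemoBand <- ggproto("StatDemoBand", Stat,
  required_aes = c("x", "s", "n"),
  # Map the computed 'band' column to colour
  default_aes = aes(colour = after_stat(band)),
  compute_panel = function(data, scales) {
    rate <- data$s / data$n
    data.frame(
      x = data$x,
      y = -max(rate, na.rm = TRUE) / 5, # below the axis line
      band = ifelse(rate >= 0.25, "band1", "band2"),
      group = data$group
    )
  }
)

# Layer constructor following the pattern from the extending-ggplot2 vignette.
stat_demo_band <- function(mapping = NULL, data = NULL, geom = "point",
                           position = "identity", na.rm = FALSE,
                           show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatDemoBand, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

# Invented data: 's' is the count in the sample, 'n' the sample size.
df <- data.frame(actor = c("A", "B", "C", "D"), s = c(40, 30, 10, 5), n = 80)
p <- ggplot(df, aes(x = actor, s = s, n = n)) + stat_demo_band()
built <- layer_data(p)
print(built[, c("x", "y", "band", "colour")])
```

Note that the counts are passed as `s`, not `y`, so no y scale limits are ever applied to them; only the computed `y` positions reach the y scale.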

Figure 5: Success!

Conclusion

Ultimately, I really stumbled through this.  Given how many new ggplot2 geoms are always popping up, I expected this to be much more straightforward.  I expected to find multiple tutorials but really came up short.  There was a lot of reading ggplot2 source code.  There was a lot of trial and error.  In the end though, the code isn't complicated.  Much of the documentation, however, is within the code, or is the code itself.  Still, I'm hoping that the first is the roughest, and the next layer I create will come more smoothly.

Wednesday, September 13, 2017

The end of risk

Introduction

I think risk is hurting infosec and may need to go.

First, a quick definition of risk:

Risk is the probable frequency and probable magnitude of future loss.
(Note: If this is not your definition of risk, the rest of the blog is probably going to make much less sense.  Unfortunately why this is the definition of risk is outside the scope of this blog so will have to wait.)

In practicality, risk is a way to measure security by measuring the likelihood and impact of something bad happening that could have been prevented by information security.

Outcomes

That last line is a bit nebulous though right?  Risk measures the opposite of what we're doing.  So let's better define what we're doing.  Let's call what we're doing an outcome: the end result of our structure and processes (for example, in healthcare, heart disease is a general term for negative cardiovascular outcomes).  Next, let's define what we want:
(A statistically insignificant and heavily biased survey. Obviously.  Why is this a good outcome? See the addendum.)

Measures, Markers, and Key Risk Indicators

Where we can directly measure this outcome, we call it a 'measure' (for example ejection fraction for heart disease) and life is a lot easier.  For risk we have to use surrogate markers (for example cholesterol for heart disease), sometimes called Key Risk Indicators in risk terms.  Now when we say 'risk', we normally mean the indicators we use to predict risk.  The most well respected methodology is FAIR, though if you are currently using an Excel spreadsheet with a list of questions for your risks, you can easily improve simply by switching to Binary Risk Assessment.

The problems with risk.

The first problem with risk is not with risk per se, but with the surrogate markers we measure to predict risk.  In other fields such as medicine, before using a surrogate marker, there would be non-controversial studies linking the surrogate marker and the outcome.  In security, I'm not aware of any study which shows that the surrogate markers we measure to determine risk, actually predict the outcome in a holistic way.  In my opinion, there's a specific reason:
Because there are more legitimate targets than attackers, what determines an organization's outcome (at least on the attack side) is attacker choice.
You can think of it like shooting fish in a barrel.  Your infosec actions may take you out of the barrel, but you'll never truly know if you're out of the barrel or you're in the barrel and just weren't targeted.   This, I think, is the major weakness in risk as a useful marker of outcomes.

This ignores multiple additional issues with risk such as interrelationships between risks, the impact of rational actors, and difficulty in capturing context, let alone problems in less mature risk processes that are solved in mature processes such as FAIR.

The second problem with risk is related to defense:
Risk does not explicitly help measure minimization of defense.  It tells us nothing about how to decrease (or minimally increase) the use of resources in infosec defense.   
I suspect we then implicitly apply an equation that Tim Clancy brought to my attention as foundational in the legal field: Burden < Cost of Injury × Probability of occurrence (i.e. if Burden < Risk, pay the burden). It sounds good in theory, but is fraught with pitfalls.  The most obvious pitfall is that it doesn't scale.  Attacks are paths, except the paths are not in isolation.  At any given step, the attacker can choose to go left or right, in effect creating more paths than can be counted.  As such, while one burden might be affordable, the sum of all burdens would bankrupt the organization.
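
To make the scaling problem concrete, here is a small arithmetic sketch.  Every number in it (path count, burden, probability, loss, budget) is invented, and `hand_rule()` is a hypothetical helper that simply encodes Burden < Cost of Injury × Probability.

```r
# Pay the burden when it is smaller than probability * loss.
hand_rule <- function(burden, probability, loss) burden < probability * loss

n_paths <- 1000                        # invented number of attack paths
burden <- rep(20000, n_paths)          # cost to mitigate each path
probability <- rep(0.05, n_paths)      # chance each path is used
loss <- rep(500000, n_paths)           # loss if the path succeeds

worth_paying <- hand_rule(burden, probability, loss) # 20000 < 25000 on every path
total_burden <- sum(burden[worth_paying])
budget <- 2e6                          # invented security budget

print(all(worth_paying))
print(total_burden)
print(total_burden > budget)
```

Each path passes the test on its own, yet paying for all of them costs ten times the assumed budget.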

What happens when markers aren't linked to outcomes?

As discussed in this thread, I think a major amount of infosec spending is socially driven.  Either the purchaser is making a purchase to signal success ("We only use the newest next gen products!"), signal inclusion in a group ("I'm a good CISO"), or is purchasing due to herd mentality ("Everyone else buys this so it must be the best option to buy").  Certainly, as we've shown above, the spending is not related to the outcome.  This begs the question of who sets the trends for the group or steers the herd.  I like Marcus Carey's suggestion: The analysts and Value Added Resellers. Maybe this is why we see so much marketing money in infosec.

The other major driver is likely other surrogate markers.  Infosec decision makers are starved for concrete truth, and so almost any number is clung to.  The unfortunate fact is that, like risk, most of these numbers have no demonstrable connection to the outcome.  Take a hypothetical threat intelligence solution, for example.  This solution includes all IPv4 addresses as indicators.  It has a 100% success rate in identifying threats.  (It also has a near 100% false positive rate.)  I suspect, with a few minor tweaks, it would be readily purchased even though it adds no value.
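
The arithmetic behind that hypothetical feed is quick to check.  The count of truly malicious addresses below is an invented figure; only the size of the IPv4 space is real.

```r
total_ipv4 <- 2^32
malicious <- 1e6 # assumed number of truly malicious addresses (invented)

true_positives <- malicious               # every malicious address is on the list
false_positives <- total_ipv4 - malicious # so is every benign address

detection_rate <- true_positives / malicious
precision <- true_positives / (true_positives + false_positives)
share_of_alerts_that_are_false <- 1 - precision

print(detection_rate)                 # every threat is "identified"
print(precision)                      # but almost no alert is a real threat
print(share_of_alerts_that_are_false)
```

The headline number (100% of threats identified) is true and says nothing about whether the feed helps.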

What can we do?

There are three questions we need to ask ourselves when evaluating metrics/measures/surrogate markers/KRI/however you refer to them.  From here on out, I'll refer to 'them' as 'metrics'.
  1. What is the outcome?
  2. Why is this the right outcome? (Is it actionable?)
  3. How do you know the metric is predicting this outcome?
For any metric you are considering using to base your security strategy on (the metric you use to make decisions of projects, purchases, etc with), you should be able to answer these three questions definitively.  (In fact, this blog answers the first question and part of the second above in the first section.)  I think there are at least three potential areas for future research that may yield acceptable metrics.

Operational metrics

I believe operational metrics have a lot of potential.  They are easy to collect with a SIEM.  They are actionable.  They can directly predict the outcome above.  ("The ideal outcome of infosec is minimizing infosec, attack and defense.")  Our response process:

  1. Prevent
  2. Detect
  3. Respond
  4. Recover

should minimize infosec.  With that in mind we can measure it:

  • Absolute count of detections. (Should go down with mitigations.)
  • Time to detect (Should go down with improved detection.) (Technically, absolute count of detections should go up with improved detection as well, but should also be correlated with an improved time to detect)
  • Percent detected in under time T.  (Where time T is set such that, above time T, the attack likely succeeded.)
  • Percent responded to. (Depending on the classification of the incidents, this can tell you both how much time you are wasting responding to false positives and what portion of true attacks you are resolving.)
  • Time to respond. (Goal of responding in under T where T represents time necessary for the attack to succeed.)
  • Successful response. (How many attacks are you preventing from having an impact)
The above metrics are loosely based on those Sandia National Labs uses in physical security assessments.  You can capture additional resource-oriented metrics:
  • Time spent on attacks by type.  (This can help identify where your resources are being spent so you can prioritize projects to improve resource utilization)
  • Recovery resources used. (This can help assess the impact of failure in the Detect and Respond metrics.)
  • Metrics on escalated incidents. (Time spent at tier 1, type, etc.  This may suggest projects to minimize tier 2 use and, therefore, overall resource utilization.)

Combine this with data from infosec projects and other measurable infosec resource costs and the impact of infosec on the organization (both attack and defense) can be measured.
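
As a rough sketch of how several of these metrics could be computed from an incident log, assuming a table with one row per detection: the column names, the values, and the threshold `T_hours` are all invented, and a real SIEM export will look different.

```r
# Hypothetical incident log; times are hours since the attack started.
incidents <- data.frame(
  id = 1:5,
  detected_at = c(0.5, 2, 30, 1, 4),
  responded_at = c(1, 5, 48, NA, 6), # NA = never responded to
  true_positive = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)
T_hours <- 8 # assumed time an attack needs to succeed

responded <- !is.na(incidents$responded_at)
time_to_respond <- incidents$responded_at - incidents$detected_at

metrics <- list(
  detections = nrow(incidents),
  median_time_to_detect = median(incidents$detected_at),
  pct_detected_under_T = mean(incidents$detected_at < T_hours),
  pct_responded = mean(responded),
  median_time_to_respond = median(time_to_respond, na.rm = TRUE),
  # True attacks that were responded to before time T elapsed
  successful_responses = sum(incidents$true_positive & responded &
                               incidents$responded_at < T_hours)
)
str(metrics)
```

Tracking these per month, rather than as one snapshot, is what would show whether a mitigation or detection project moved them.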

Relative risk

Risk has a lot of good qualities.  One way to get around its pitfalls may be to not track risk in absolute terms (probability of impact size in FAIR's case), but in relative terms.  Unfortunately, it removes the ability to give the business a single, "this is your risk" score except in terms relative to other organizations.  But relative may be enough.  For implementing a security strategy where the goal is to pick the most beneficial course of action, relative risk may be enough to choose a course of action.  The actions can even have defensive costs associated with them as well.  The problem is that the defensive costs and the relative risk are not in the same units, making it hard to understand if purchasing a course of action is a net benefit.

Attacker cost

Finally, I think attacker cost may be a worthwhile area of research.  However, I don't think it is an area that has been well explored.  As such, a connection between maximizing attacker cost and "minimizing infosec" (from our outcome) has not been demonstrated.  I suspect a qualified economist could easily show that as attacker costs go up, some attackers will be priced out of the market, and those that still could afford to attack, will choose less expensive sources to fulfill their needs.  However a qualified economist, I am not.  Second, I don't know that we have a validated way to measure attack 'cost'.  It makes intuitive sense that we could estimate these costs.  (We know how they are done and, as such, can estimate what it would cost us to accomplish the attack and any differences between us and attackers.) But before this is accepted, academic research in pricing attacks will be necessary.

Conclusion

So, from this blog, I want you to take two things:
  1. The fact that attackers pick targets means no-one really knows if their mitigations make them secure.
  2. Three easy questions to ask when looking at metrics to guide your organization's security strategy
With a well-defined outcome, and good metric(s) to support it, you can truly build a data-driven security strategy.  But that's a talk for another day.

Addendum

Why is this the right outcome?  Good question.  It captures multiple things at once.  Breaches can be considered a cost associated with infosec and so are captured.  However, it'd be naive to think that all costs attackers cause are associated with breaches (or even incidents).  The generality of the definition allows it to be inclusive.  It also captures the flip side: the goal of minimizing defenses. This is easy to miss, but critical to organizations.  There is no benefit to infosec if the cost of stopping attacks is worse than the cost of the attacks.  Ideally, stopping attacks would have zero cost of resources (though that is practically impossible).  This outcome is also vague about the unit to be minimized allowing flexibility.  (It doesn't say 'minimize cost' or 'minimize time'.) Ultimately it's up to the organization to measure this outcome.  How they choose to do so will determine their success.

Thursday, August 24, 2017

The Haves and the Have-Nots - Automation of Infosec

Several years ago, I blogged about Balkanizing the Internet. More than ever it appears that a digital feudalism is emerging.  A driver that I didn't necessarily consider is the automation of security.

Automation in Infosec

The future of security is speed and persuasiveness.  Whoever accomplishes the OODA loop (or additive factors if you like) first has an incredible advantage.  In information security, that means automation and machine learning making contextual decisions faster than humans ever could.  It will be defense's algorithms against offense's.  The second part is probably more interesting.  Machine learning is output generated from input.  In essence, humans are a much less predictable version of the same.  As such, any actor or algorithm, offensive or defensive, that can figure out what input to the opposing side produces the outcome they want, and provide that input before losing will win.  Because it needs to happen at speed, it's also likely to be algorithmic.  We already train adversarial models to do this.

Infosec 1%'ers

The need for speed and persuasiveness driving automation and artificial intelligence in information security is its own blog.  I touch on it here because, in reality, it only describes the infosec 1%'ers. While a Google or Microsoft may be able to guard their interests with robust automation and machine learning, the local app developer, law office, or grocery store will not.

Which brings us to the recent malware.  It should be a wake-up call to all information security professionals.  It utilizes no new knowledge, but it provides a datapoint in the trend of automation. While the 1%, or even 50% defender might not be affected, the publicly known level of automation in infosec attack is easily ahead of a large portion of the internet and appears to be growing faster than defensive automation due to adherence to engineering practices for system management.  Imagine malware automating the analysis process in BloodHound.  Imagine an attack graph, knowledgeable about how to turn emails/credentials/vulnerabilities into attacks/malware, and malware/attacks into email/credentials, built into a piece of malware, causing it to spread unhindered as it creeps across the trust relationships that connect everyone on the planet.  This could easily be implemented as a plugin for a tool such as Armitage.

Balkanization

This brings us back to the Balkanization of the Internet.  In the near future, the only way to defend systems may be to cede control, regardless of the obligations, to the infosec 1%'ers.  The only people protected will be those who allow automated systems to guard, modify, and manage their systems.  Your choice may be to allow Google to monitor all traffic on your internal network to allow their models to defend it, or quickly fall victim to roving automated threats.  The internet will have devolved into roaming threats, only kept at bay by feudal lords able to oppose them.


Thursday, August 10, 2017

PowerBI vs Tableau vs R

Yesterday at the Nashville Analytics Summit I had the pleasure of demonstrating the strengths, weaknesses, similarities, and differences between Microsoft PowerBI, Tableau, and R.

The Setup

Last year when I spoke at the summit, I provided a rather in-depth review of the DBIR data workflow.  One thing I noticed is the talk was further along in the data science process than most attendees, who were still working in Tableau or even trying to decide what tool to use for their organization.  This year I decided to try and address that gap.

I recruited Kindall (a daily PowerBI user) and Ian (a daily Tableau user) to help me do a bake-off.  Eric, our moderator, would give us all a dataset we'd never seen (and it turned out, in a domain we don't work in) and some questions to answer.  We'd get them at 8:30 in the morning and then spend the day up until our talk at 4:15 analyzing the dataset and answering the questions.  (I got the idea from the fuzzing vs reverse engineering panel at Defcon a few years ago.)

The dataset was about 100,000 rows and 50 or so columns (about half medications given) related to medical stays involving diabetes.  The features were primarily factors of various sorts with a continuous feature for time in the hospital (the main variable of interest).

The Results

I'll skip most of the findings from the data as that wasn't really the point.  Instead I'll focus on the tools.  At a basic level, all three tools can create bar charts very quickly including color and alpha.  Tableau and PowerBI were very similar so I'll start there.

Tableau and PowerBI Similarities

  • Both are dashboard based
  • Both are driven from the mouse, dragging and dropping features into the dashboard
  • Both have a set of visualization types pre-defined that can be used
  • Both allow interactivity out of the box with clicking one chart subsetting others

Tableau and PowerBI Differences:

  • PowerBI is a bit more web-based.  It was easy to move from local to cloud and back.
  • PowerBI has more robust integration with other MS tools and will be familiar to Excel users (though the formulas have some differences compared to Excel as they are written in DAX).
  • PowerBI keeps a history of actions that allows you to go backwards and see how you got where you are.
  • To share a dashboard in PowerBI you simply share a link to it.
  • Finally, PowerBI is pretty easy to use for free until you need to share dashboards.
  • Tableau is more desktop application based.
  • You can publish dashboards to a server if you have the enterprise version or you can install the Tableau viewer app (however that still requires the receiver to install software).  Also, sharing the actual workbook basically removes any security associated with your data.
  • Tableau dashboards can also be exported as PDFs but it is not the primary approach.
  • Tableau allows good organization of data within the GUI to help facilitate building the dashboard.
  • Tableau lacks the history though so there is no good way of telling how you did what you did.

Differences between R and Tableau/PowerBI

Most differences came between R and the other two tools:
  • While PowerBI and Tableau are driven by the mouse and interact with a GUI, R is driven from the keyboard and interacts with a command-line.
  • In PowerBI or Tableau, initial investigation basically involves throwing features on the x and y axis and looking at the result.  Both provide the ability to look at the data table behind the dashboard but it's not really part of the workflow.  In R, you normally start at the data with something like `dplyr::glimpse()`, `summary()`, or `str()` which give you some summary statistics about the actual data.
  • In R you can build a dashboard similar to PowerBI or Tableau using the Shiny package, but it is _much_ harder.  Rather than being drag-and-drop, it is very manual.  To share the dashboard, the other person either needs RStudio to run the app or you need a Shiny server. (Shiny servers are free for a single concurrent user but cost money beyond that.)
  • R dashboards allow interaction, but it is again, more laborious.
  • In R, however, you can actually do pretty much anything you want.  As an example, we discussed plotting the residuals of a regression.  In R it's a few lines.  In Tableau and PowerBI there was no straightforward method at all.  The only option was to create a plot with a trend line (with no access to the underlying trend line model).  We discussed building more robust models such as a decision tree for classification.  Kindall found an option for it in PowerBI, but when she clicked it, it was basically just a link to R code.  Finally, the concept of tidyr::gather() (which combines a set of columns into two columns, one for the column names and one for the column values) was both unknown and very appealing to Ian but unavailable in Tableau.
  • R can install packages.  As far as we could tell, Tableau and PowerBI do not.  That means someone can add Joy plots to R on a whim.
  • In R, making the initial image is harder.  It's at least data plus an aesthetic plus a geom.  To get it to match the basic figure in PowerBI and Tableau is a lot harder, potentially adding theme information, possibly additional geoms for labeling columns, etc.  However, the amount of work to improve a figure in R scales linearly.  After you have matching figures across all three tools, if you wanted to, say, put a plot of points in the background with a lower opacity, that's a single line similar to `geom_jitter(alpha=0.01) + `.  That's about the same amount of work as any other change.  In Tableau or PowerBI, it would be hours of messing with things to make such simple additions or modifications (if it's possible at all).  This is due to R's use of the Grammar of Graphics for figure generation.
  • Using the Grammar of Graphics, R can make incredible reports.  PDFs can be consumer quality. (Figures for the DBIR are mostly created in R with only minor updates to most figures by the layout team.)
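
To show what "a few lines" means for the residual plot mentioned in the list above, here is a minimal sketch in base R.  The built-in `mtcars` data and the `mpg ~ wt` model are stand-ins; the conference dataset is not reproduced here.

```r
# First look at the data, then fit a regression and plot its residuals.
str(mtcars[, c("mpg", "wt")])

fit <- lm(mpg ~ wt, data = mtcars)
res <- data.frame(fitted = fitted(fit), residual = resid(fit))

plot(res$fitted, res$residual, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) # reference line at zero
```

Because the model object `fit` is an ordinary R value, the same residuals can be fed into any further analysis, which is the access the trend-line features in the other tools did not offer.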

Take-Aways

  1. The most important takeaway is that R is appropriate if you can verbalize what you want to do, while Tableau/PowerBI are appropriate if you can visualize the final outcome but don't know how to get there.  
    •  For example "I want to select subjects over 30, group them by gender, and calculate average age."  That can quickly be translated to R/dplyr verbs and implemented.  Regardless of how many things you want to do, if you can verbalize them, you can probably do them. 
    •  If you can visualize your final figure, you can drag and drop parts until you get to something close to what you want to do.  It's trial and error, but it's quick and  easy.  On the other hand, it only works for fairly straight-forward outcomes.
  2. PowerBI and Tableau are useful to quickly explore data.  R is useful if you want to dig deeper.
  3. Anything you can do in PowerBI and Tableau, you can do in R.  It's just going to be a lot harder.
  4. On the other hand, VERY quickly you hit things that R can do but Tableau or PowerBI cannot (at least directly).  The solution is that PowerBI and Tableau both support running R code internally.  This has its own issues:
    • It requires a bit of setup.
    • If you learn the easy stuff in PowerBI or Tableau, but try to do the hard stuff in R, it'll be even harder because you don't know how to do the basics in R.
    • That said, once you've done the setup, you can probably just find how someone else has solved the problem in R and copy and paste it into your dashboard
    • Then, after the fact, you can go back through and teach yourself how the code actually did whatever hard thing you had it do.
  5. From a data model perspective, R is like Excel while PowerBI and Tableau are like a database.  Let me demonstrate what I mean by example: 
    • When we started analyzing, the first thing the other two did was add a unique key to the data.  The reason is that without a key they aren't able to reference rows individually.  They tend toward bar charts because their tools automatically aggregate data.  They don't even think that they are summing/averaging the groups they are dragging in as it's done automatically
    • For myself using R, each row is inherently an observation.  As such as I only group explicitly and first create visualizations that are scatter plots, density plots, etc given my categorical variables by a single continuous variable.  On the other hand, Tableau and PowerBI make it very simple to link multiple tables and use all columns across all tables in a single figure.  In R, if you want to combine two data frames, you have to manually join them.
  6. Output: Tableau and PowerBI are designed primarily to produce dashboards.  Everything else is tacked on.  R is the opposite.  It's designed to produce reports of various types.  Dashboards and interactivity are tacked on.  That said, there is a lot of work going on to make R more interactive through the production of javascript-based visualizations.  I think are likely to see good dashboards in R with easy modification before we see easy high-quality report generation from PowerBI and Tableau.
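To make the first takeaway concrete, here's a minimal sketch of how the spoken request from takeaway 1 maps onto dplyr verbs, one verb per clause.  The `subjects` data frame and its values are made up for illustration; they aren't data from the event.

```r
library(dplyr)

# Hypothetical data, invented purely for illustration
subjects <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  age    = c(34, 28, 45, 52, 31)
)

avg_age <- subjects %>%
  filter(age > 30) %>%             # "select subjects over 30"
  group_by(gender) %>%             # "group them by gender"
  summarize(mean_age = mean(age))  # "and calculate average age"

print(avg_age)
```

Each clause of the sentence becomes one line of the pipeline, which is why being able to say what you want usually gets you most of the way to the code.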

Final Thoughts

This was a very good experience.  It was not a competition but an opportunity to see and discuss how the tools differed.  It was a lot of fun, though in some ways it felt like a CTF: sitting in the vendor area doing the analysis, under some pressure because you don't want to embarrass your tool (and by extension its other users).  I really wish I'd included Kibana or Splunk as I think they would have been different enough from PowerBI/Tableau or R to provide a unique perspective.  Ultimately I'm hoping it's something that I or the conference can do again as it was a great learning opportunity!

Wednesday, May 3, 2017

Elasticsearch. Logstash. R. What?!

Motivation

At bsidesNash, Chris Sanders gave a great talk on threat hunting.  One of his recommendations was to try out an ELK (Elasticsearch, Logstash, Kibana) stack for searching for threats in log data.  ELK is an easy way to stand up a distributed, scalable stack capable of storing and searching full-text records.  The benefit is that its easy ingestion (Logstash), schema-agnostic storage (Elasticsearch), and robust search and dashboards (Kibana) make it an easy platform for threat hunters.

However, because of its ease, ELK tends to be a one-size-fits-all solution for many tasks.  I had asked Chris about using other tools for analysis such as R by way of RStudio and dplyr or Microsoft Power BI.  Chris hadn't tried it and, at the time, neither had I.  (My day job is mostly historic data analysis rather than operational monitoring.)

Opportunity

However, the DBIR Cover Challenge presented an opportunity.  For those who are unaware, each year there is a code or codes hidden on the DBIR cover.  That code then leads to a puzzle challenge which has resulted in some nice rewards for the winners (iPad minis, auto-follow telescopes, Yeti coolers, quadcopters, 3D printers, and more).  The challenge has multiple puzzles of which players must complete 8.  So that players can check their answers as they go, the site is a dynamic webapp hosted at Heroku.  Because it is dynamic, I can add my own log messages into the endpoint functions.

But I needed a place to store and search the logs.  Heroku provides some great plugins for this, but, given the conversation with Chris, I figured I'd try to roll my own, starting with ELK.  The first hurdle was that, though there is a lot of hosted Elasticsearch and Kibana, there was much less hosted Logstash (the part I really needed).  Elastic cloud didn't have it.  AWS had their own tools.  Finally I found logit.io which works perfectly.  They provide a full ELK stack as a cloud service for around $20 at the low end with a 14-day trial. I signed up for the trial and was up-and-running in minutes.  They even have an easy one-line instruction on how to set up a Heroku drain to send logs to a logit.io Logstash endpoint.  From there, it is automatically stored in Elasticsearch and searchable through Kibana.

Going beyond ELK

The problem, I quickly found out, was that Kibana didn't have the robust manipulation I was used to in R.  While it could find entries and make basic dashboards, I simply couldn't cut the data like I wanted once I'd found the subset of data I was interested in.  I tried passing the data to PowerBI, but at first blush, the streaming API setup was too limited to ingest a Heroku drain using the basic setup tools.  Finally, I decided to try and keep the Logstash and Elasticsearch underpinnings, but switch to R for analysis.  R allows for simple pipeline analysis of data as well as robust charting.

Doin it with R

The first step was to install the packages I'd need:
install.packages(c("dplyr", "purrr")) # for simple piped data processing and pulling fields out of lists
install.packages("elastic") # for talking to the Elasticsearch store
install.packages("flexdashboard") # for creating a dashboard to monitor
install.packages("DT") # for displaying a HTML data table in the dashboard
install.packages("stringr") # simple string manipulation
install.packages(c("ggmap", "viridis", "rgeolocate", "leaflet")) # geocoding IPs and displaying them on a map
install.packages(c("devtools", "treemap")) # create treemaps
devtools::install_github("Timelyportfolio/d3treeR") # create treemaps
After installing packages, the next step was to set up the Elasticsearch connection:
elastic::connect(es_host="<my ES endpoint>", es_port=443, es_path="", es_transport_schema = 'https', headers=list(apikey="<my api key>"))
I also manually visited: "https://<my ES endpoint>/_cat/indices?v&apikey=<my API key>&pretty=true" to see what indexes Logstash was creating.  It appears to create an index per day and keep four indexes in the default logit.io setup.  I stored them into a variable and then ran a query, in this case for the log line indicating a player had submitted a specific key:
indexes <- c("logstash-2017.04.28", "logstash-2017.04.29", "logstash-2017.04.30", "logstash-2017.05.01") # I should be able to get this from `elastic::cat_indices()`, but it did not apply my apikey correctly
query <- elastic::Search(index=indexes, q="logplex_message:submitted", size=10000)$hits$hits
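Since Logstash appears to create one index per day, the index names could also be generated from dates rather than typed by hand.  Here's a minimal base-R sketch, assuming the `logstash-YYYY.MM.DD` naming seen above; the `index_names` helper is my own illustration, not part of the `elastic` package.

```r
# Hypothetical helper: build daily Logstash index names for a run of days.
# Assumes the "logstash-YYYY.MM.DD" naming convention shown above.
index_names <- function(first_day, n_days = 4) {
  days <- seq(as.Date(first_day), by = "day", length.out = n_days)
  paste0("logstash-", format(days, "%Y.%m.%d"))
}

indexes <- index_names("2017-04-28")
print(indexes)
```

Swapping the fixed start date for something like `Sys.Date() - 3` would be one way to follow the indexes as new ones are created, though I'd check the result against what the cluster actually holds.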
The next thing we need to do is extract only the fields we want from the query.  The result is a list of query results, each itself a list of key:value pairs.  I used the `lapply` function to extract _just_ the logplex_message field.  (`lapply` takes a function and applies it to each item of a list in R.)  `lapply` returns a list and so I `unlist` the results and make them a column in a dataframe.  (`purrr::map_chr`, which I use further down, does the same thing in one step.)
submissions <- data.frame(text = unlist(lapply(query, function(l) {l$`_source`$logplex_message})))
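To see what that extraction is doing without a live cluster, here's a small sketch on a mocked-up hits list.  The nesting (a list of hits, each with a `_source` list of fields) follows the description above; the ids and log messages are invented, shaped only to fit the patterns used later in this post.

```r
# Mocked-up stand-in for elastic::Search(...)$hits$hits.
# Every value here is invented for illustration.
query <- list(
  list(`_id` = "1", `_source` = list(logplex_message = "Trainer ash submitted key ABC123 to the bank.")),
  list(`_id` = "2", `_source` = list(logplex_message = "Trainer misty submitted key XYZ789 to the bank.")),
  list(`_id` = "3", `_source` = list(logplex_message = "Trainer ash submitted key ABC123 to the bank."))
)

# lapply() returns a list with one message per hit ...
messages <- lapply(query, function(hit) hit$`_source`$logplex_message)

# ... and unlist() flattens that list into a character vector for the dataframe
submissions <- data.frame(text = unlist(messages), stringsAsFactors = FALSE)
print(submissions)
```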
In our puzzle challenge, we have 'trainers' who use 'keys' to indicate they've caught Breachemon.  I can use my normal R skills to separate the trainer name and key from the log message and count how many times each trainer has submitted each key:
library(dplyr) # provides %>%, mutate(), group_by() and tally()
submissions <- submissions %>%
    mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract 'trainer'
    mutate(key = gsub(".*submitted key (.*) to the bank.$", "\\1", text)) %>% # extract 'key'
    group_by(trainer, key) %>% # group each trainer-key pair
    tally() # short cut for `summarize(n=n())`.  For each trainer-key pair, create a column 'n' with the number of times that pair occurred
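If the regular expressions are hard to read, it may help to run them on a few sample lines.  This is a base-R sketch using made-up messages (I'm assuming the "Trainer <name> submitted key <key> to the bank." shape implied by the patterns); `table()` stands in for the `group_by()`/`tally()` pair.

```r
# Invented sample messages, shaped to fit the patterns above
text <- c(
  "Trainer ash submitted key ABC123 to the bank.",
  "Trainer ash submitted key ABC123 to the bank.",
  "Trainer misty submitted key XYZ789 to the bank."
)

# Keep only the captured group: the first run of non-space characters after "Trainer "
trainer <- gsub("Trainer ([^[:space:]]*).*$", "\\1", text)
# Keep only whatever sits between "submitted key " and " to the bank."
key <- gsub(".*submitted key (.*) to the bank.$", "\\1", text)

# Count each trainer-key pair and drop the pairs that never occurred
counts <- as.data.frame(table(trainer = trainer, key = key),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]
print(counts)
```

One thing to watch: `gsub` returns a line unchanged when the pattern doesn't match, so an unexpected log format shows up as a whole message sitting in the trainer or key column rather than as an error.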
From there we can visualize the table with:
DT::datatable(submissions)
We could also visualize the total submissions per trainer:
submitters <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$logplex_message)) %>% # extract the log message and produce a dataframe
mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract the trainer
group_by(trainer) %>% # create a group per trainer
tally() # shortcut for `summarize(n=n())`. Count the events per group
d3treeR::d3tree2(treemap::treemap(submitters, "trainer", "n", aspRatio=5/3, draw = FALSE)) # produce a treemap of submissions per person

Dashboard Time

To wrap this all together, I decided to make a simple dashboard.  In the RStudio menu, choose File->New File->R Markdown...  In the dialog, choose 'From Template' and then Template: 'Flex Dashboard'.  You'll get something like:
---
title: "Untitled"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
```
Column {data-width=650}
-----------------------------------------------------------------------
### Chart A
```{r}
```
Column {data-width=350}
-----------------------------------------------------------------------
### Chart B
```{r}
```
### Chart C
```{r}
```
Let's add our two charts, plus a map:
---
title: "Breachemon"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
library(dplyr)
elastic::connect(es_host="<my ES endpoint>", es_port=443, es_path="", es_transport_schema = 'https', headers=list(apikey="<my api key>"))
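indexes <- c("logstash-2017.04.28", "logstash-2017.04.29", "logstash-2017.04.30", "logstash-2017.05.01") # same index list as above; the dashboard needs its own copy
query <- elastic::Search(index=indexes, q="logplex_message:submitted", size=10000)$hits$hits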
```
Column {data-width=650}
-----------------------------------------------------------------------
### Submissions
```{r fig.keep='none'}
submitters <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$logplex_message)) %>% # extract the log message and produce a dataframe
mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract the trainer
group_by(trainer) %>% # create a group per trainer
tally() # shortcut for summarize(n=n()).  Count the events per group
d3treeR::d3tree2(treemap::treemap(submitters, "trainer", "n", aspRatio=5/3, draw = FALSE)) # produce a treemap of submissions per person
```
### Submitters
```{r}
data.frame(text = unlist(lapply(query, function(l) {l$`_source`$logplex_message}))) %>%
    mutate(trainer = gsub("Trainer ([^[:space:]]*).*$", "\\1", text)) %>% # extract 'trainer'
    mutate(key = gsub(".*submitted key (.*) to the bank.$", "\\1", text)) %>% # extract 'key'
    group_by(trainer, key) %>% # group each trainer-key pair
    tally() %>% # short cut for `summarize(n=n())`.  For each trainer-key pair, create a column 'n' with the number of times that pair occurred
  DT::datatable()
```
Column {data-width=350}
-----------------------------------------------------------------------

### Map
```{r}
ips <- data.frame(text = purrr::map_chr(query, ~ .$`_source`$msg_fwd))
geo <- rgeolocate::db_ip(as.character(unique(ips$text)), "<my free db-ip.com api key>") # geocode unique IPs, returns a list
geo <- do.call(rbind.data.frame, geo) # bind the list together as a dataframe
names(geo) <- c("IP", "Country", "State", "City") # set the dataframe column names
geo <- ips %>%
    group_by(text) %>%
    tally() %>% # count per IP
    rename(IP = text) %>%
    right_join(geo, by="IP") # join with geolocation
cities <- unique(as.character(geo$City)) # unique list of cities
cities <- cbind(ggmap::geocode(cities), cities) # geo code the cities
geo <- right_join(geo, cities, by=c("City" = "cities")) #join it back together
pal <- leaflet::colorFactor(viridis::viridis_pal(option = "C")(2), domain = geo$n) # create a color range
leaflet::leaflet(geo) %>% # make a map
  leaflet::addTiles() %>% # add some default shapes to it
  leaflet::addCircleMarkers(color = ~pal(n)) # add a circle with a color based on the count of submissions for each IP
```
Resulting in:

The last block pulls the msg_fwd field, which contains the source IP address, and stores it in a dataframe.  (Some entries have multiple addresses and need splitting first; that step isn't in the block as shown.)  It then geolocates the IPs and binds the cities.  After that it geocodes latitude and longitude and joins it.  Finally it places the geolocated and coded IPs as dots on a map.
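For entries that hold more than one address, here's a minimal sketch of how the splitting and per-IP counting could look.  I'm assuming multiple addresses are comma-separated; the real format of msg_fwd isn't shown in this post, and the addresses below come from the reserved documentation ranges rather than real logs.

```r
# Invented msg_fwd values; the second one holds two addresses.
# Assumption: multiple addresses are separated by a comma and optional space.
msg_fwd <- c("192.0.2.10", "198.51.100.7, 203.0.113.5", "192.0.2.10")

# strsplit() returns one character vector per entry; unlist() flattens them
ip_list <- strsplit(msg_fwd, ",\\s*")
ips <- data.frame(text = unlist(ip_list), stringsAsFactors = FALSE)

# Count submissions per IP (base-R equivalent of group_by(text) %>% tally())
ip_counts <- as.data.frame(table(IP = ips$text),
                           responseName = "n", stringsAsFactors = FALSE)
print(ip_counts)
```

The resulting `ips$text` column could then be handed to the geolocation step in place of the unsplit field.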

Wrapup

That's not to say there aren't hang-ups.  You _are_ pulling the data from the remote cluster to your local machine which is a relatively costly action.  (The queries I ran returned in a fraction of a second, but I can imagine querying a billion record store, returning tens of thousands of hits, would be slower.)  However, as Chris noted during his talk, not being selective in what you retrieve to search is one of the signs of a junior analyst.  Also, I have not automated retrieval of more than 10,000 records or the automatic tracking of indexes as they are created.  Finally, the dashboard must be refreshed manually.  There's a little button to do so in the RStudio browser, however I think it may make more sense to provide a Shiny button to use to update all or selected portions instead.  Unfortunately, most of this goes beyond the few hours I was willing to put into this proof of concept.

In the end, it was well worth the experimentation.  It required no hardware and brings the robust slicing and dicing of data that the R ecosystem provides to the easy and scalable storage of ELK. Though the logit.io service doesn't allow direct configurability of most of the ELK stack, they seem responsive to requests.  I'm actually not sure that the ES portion of ELK is really necessary.  If you are working with a limited number of well-defined data sources, a structured store such as Postgres, or a key:value store such as Hive/HBase might make more sense.  R has nearly the repository of packages that Python does.  On my Mac Pro I can work with datasets in the tens of millions of records, providing all sorts of complex analysis.  All in an easily-documentable and repeatable way.

In the future, I'd love to see the same thing done with MS PowerBI.  It's not a platform I know, but I think it would definitely be an interesting one to explore.  If anyone has any ideas on how to stream data to it, please let me know!