Monday, September 18, 2017

Building a ggplot2 stat layer

Introduction

This blog will be about my experience building a simple stat layer for ggplot2.  (For those not familiar, ggplot2 is a plotting library for the R programming language that is highly flexible and extensible.  Layers are things like bars or lines.)
Figure 1: Example exploratory analysis figure

When we write the DBIR, we have automated analysis reports that slice and dice the data and generate figures that we look at to identify interesting concepts to write about. (See Figure 1.  And Jay, if you're reading this, I know, I haven't changed it much from what you originally did.) (Also, for those wondering, 'ignored' is anything that shouldn't be in the sample size.  That includes 'unknown', which we treat as 'unmeasured', and possibly 'NA' for 'not applicable'.)

In Figure 1, you'll notice light blue confidence bars (Wilson binomial confidence intervals at a 95% confidence level).  If you look at 'Former employee', 'Activist', and 'Nation-state', you can see they overlap a bit.  What I wanted to do was visualize Tukey groups (similar to multcomp::plot.cld) to make it easy for myself and the other analysts to tell whether we could say something like "Former employee was more common than Activist" (implying the difference is statistically significant).
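As an aside, intervals like these are cheap to compute in base R.  Here is a sketch of the idea (not the report's actual helper): prop.test() without continuity correction returns the Wilson score interval.

# Sketch of the confidence bars: Wilson score interval via prop.test()
wilson_ci <- function(successes, n, conf.level = 0.95) {
  stats::prop.test(successes, n, conf.level = conf.level, correct = FALSE)$conf.int
}
wilson_ci(30, 100)  # hypothetical: 30 'Former employee' breaches out of n = 100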

First step: find documentation

The first step was actually rather hard.  I looked for good blog posts by others who had created ggplot2 stat layers, but didn't turn anything up.  (I'm not implying they don't exist.  I'm terrible at googling.)  Thankfully someone on Twitter pointed me to the vignette on extending ggplot2 in the ggplot2 package.  It's probably the best resource, but it really wasn't enough for me to decide how to attack the problem.  I also picked up ggplot2: Elegant Graphics for Data Analysis in the hopes of using it to understand some of the ggplot2 internals.  It primarily deals with scripting ggplot2 functionality rather than extending it, so I settled on using the example in the vignette as a reference.  With that, I made my first attempt:


### Attempt at making a ggplot stat for independence bars
#' Internal stat to support stat_ind
#' 
#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  required_aes=c("x", "y", "n"), 
  compute_group = function(data, scales) {
    band_spacing <- -max(data$y/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    bands <- testIndependence(data$y, data$n, ind.p=0.05, ind.method="fisher")

    bands <- tibble::as.tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=T)]) %>%
      tidyr::gather("band", "value", -x) %>%
      dplyr::filter(value) %>%
      dplyr::select(-value)
    y_locs <- data.frame("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  }
)


#' ggplot layer to produce independence bars
#' 
#' @rdname stat_ind
#' @inheritParams ggplot2::stat_identity
#' @param na.rm If FALSE (the default), missing values are removed with a warning; if TRUE, they are silently removed
#' @export
stat_ind <- function(mapping = NULL, data = NULL, geom = "line", 
                     position = "identity", na.rm = FALSE, show.legend = NA,
                     inherit.aes = TRUE, ...) {
  ggplot2::layer(
    stat = StatInd, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}
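
For context, this is roughly how I intended the layer to be used.  The data frame and its columns are made up for illustration:

# Hypothetical data: a success count per actor variety out of a sample size n
df <- data.frame(
  actor = c("Former employee", "Activist", "Nation-state"),
  count = c(30, 22, 18),
  n     = c(100, 100, 100)
)

# In the real reports, stat_ind() is added on top of the bar-chart layers
ggplot2::ggplot(df, ggplot2::aes(x = actor, y = count, n = n)) +
  stat_ind()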


It didn't work but, as it turns out, I was fairly close.  I just didn't know it.

Since it didn't work, I looked into multiple non-layer approaches, including those in the book, as well as simply drawing on top of the layer.  Comparing the three options I had:

  • The approaches that didn't involve actually building a stat layer probably would have worked; however, they were all effectively 'hacks' for what a stat layer would have done.
  • The problem with simply drawing on top of the layer was that the labels for the bars would not be adjusted.
  • The problem with using a stat (other than it not currently working) was that I was effectively drawing a discrete value (the band number) onto a continuous axis (the bar height) on the other side of the axis line.
Ultimately I decided to stick with the stat even though it's technically anti-ggplot to mix axes.  (In the Tukey plot, the 'band' is technically a categorical variable; however, we are plotting it on the long side of the bars, which is continuous.)

Fixing the stat layer

I decided to look at some of the stats that ship with ggplot2.  I'd previously looked at stat_identity() to no avail.  It turned out that was a mistake, as stat_identity() basically does nothing.  When I looked at stat_sum(), I found what I needed.  In retrospect, it was in the vignette, but I missed it.  Since I was returning 'band' as a major feature, I needed to define `default_aes=ggplot2::aes(color=..band..)`.  (At first I used `c(color=..band..)`, though I quickly learned that doesn't work.)


StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  default_aes=ggplot2::aes(color=..band..),
  required_aes=c("x", "y", "n"),
  compute_panel = function(data, scales) {
    band_spacing <- -max(data$y/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    bands <- testIndependence(data$y, data$n, ind.p=0.05, ind.method="fisher")
    bands <- tibble::as.tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=T)]) %>%
      tidyr::gather("band", "value", -x) %>%
      dplyr::filter(value) %>%
      dplyr::select(-value)
    y_locs <- tibble::tibble("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  }
)


With that, I had a working stat.

Figure 2: It works!

Fixing the exploratory analysis reports

Unfortunately, our (DBIR) exploratory reports are a bit more complicated.  When I added `stat_ind()` to our actual figure function (which adds multiple other ggplot2 pieces), I got an error about rows being removed.

This led to a ridiculous amount of hunting.  I fairly quickly narrowed it down to this line in the analysis report:

gg <- gg + ggplot2::scale_y_continuous(expand = c(0, 0), limits = c(0, yexp), labels = scales::percent) # yexp = slight increase in width

Specifically, the `limits = c(0, yexp)` portion.  Unfortunately, I got stuck there.  Things I tried:
  • Changing the limits (to multiple different things, both values and NA)
  • Setting debug in the stat_ind() `compute_panel()` function
  • Setting debug on the scale_y_continuous() line and trying to step through it
  • Rereading the vignette
  • Rereading other stat_* functions
  • Adding the required aesthetics to the default aesthetics 
  • Adding group to the default aesthetics
What finally worked was this:
  1. Google the error message.
  2. Find it in the remove_missing() function in ggplot2.
  3. Find what calls remove_missing(): compute_layer().
  4. Copy compute_layer() from the Stat prototype into stat_ind(), copy over the internal ggplot2 functions the default compute_layer() needs, and hook it for debugging (sketched below).
  5. Looking at the data coming into the compute_layer() function, I see the figure below with 'y' all 'NA'.  Hmm.  That's odd...
  6. Look at the data coming in without the y limit set.  This time the 'y' data exists.
  7. Go say 'hi' to my 2yo daughter.  While carrying her around the house, realize that 'y' is no longer within the limits...
Figure 3: Data parameter into compute_layer() with ylimits set
Figure 4: Data parameter into compute_layer() without ylimits set
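
Step 4 looked roughly like the sketch below.  Rather than copying the whole default implementation, the same debugging effect can be had by delegating back to the parent with ggproto_parent():

# Debugging sketch: hook compute_layer() so we can inspect the incoming data.
# compute_panel is unchanged from the version above and is elided here.
StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  default_aes = ggplot2::aes(color = ..band..),
  required_aes = c("x", "y", "n"),
  compute_layer = function(self, data, params, layout) {
    browser()  # with ylimits set, data$y arrives as all NA (Figure 3)
    ggplot2::ggproto_parent(ggplot2::Stat, self)$compute_layer(data, params, layout)
  }
)
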
So what happened is: when `limits=c(0, yexp)` was applied, the 'y' data was replaced with 'NA' because the large integer values of 'y' were not within the 0-1 limits.  `compute_layer()` was then called, which called `remove_missing()`, which removed all the rows containing `NA`.  Hence the removed rows.
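
You can see the mechanism in isolation: scales::censor() is the default out-of-bounds behavior for continuous scales, and it is what turned the counts into NAs.

# Values outside the scale limits are censored to NA before the stat runs;
# remove_missing() then drops those rows
scales::censor(c(0.25, 30, 250), range = c(0, 1))
# [1] 0.25   NA   NA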

The reason this was happening is that I'd accidentally overloaded the 'y' aesthetic.  `y` meant something different to ggplot2 than it did to testIndependence() (the internal function which calculates the bands).  The solution was to replace 'y' with 's' as a required aesthetic.  Now the function looks like this:


StatInd <- ggplot2::ggproto("StatInd", ggplot2::Stat,
  default_aes=ggplot2::aes(color=..band..), 
  required_aes=c("x", "s", "n"), 
  compute_panel = function(data, scales) {
    band_spacing <- -max(data$s/data$n, na.rm=TRUE) / 5
    # band_spacing <- 0.1
    bands <- testIndependence(data$s, data$n, ind.p=0.05, ind.method="fisher")
    bands <- tibble::as.tibble(bands) # Necessary to preserve column names in the cbind below
    bands <- cbind("x"=data[!is.na(data$n), "x"], bands[, grep("^band", names(bands), value=T)]) %>%
      tidyr::gather("band", "value", -x) %>%
      dplyr::filter(value) %>%
      dplyr::select(-value)
    y_locs <- tibble::tibble("band"=unique(bands$band), "y"=(1:dplyr::n_distinct(bands$band))/dplyr::n_distinct(bands$band) * band_spacing) # band spacing v2
    bands <- dplyr::left_join(bands, y_locs, by="band")
    bands[ , c("x", "y", "band")]
  } 
)

With that it finally started working!
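
The call-side change is just the aesthetic name.  With the same hypothetical data as before:

# The success count now maps to 's', leaving 'y' free for ggplot2's own scale
# (and its limits)
ggplot2::ggplot(df, ggplot2::aes(x = actor, s = count, n = n)) +
  stat_ind()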

Figure 5: Success!

Conclusion

Ultimately, I really stumbled through this.  Given how many new ggplot2 geoms are always popping up, I expected this to be much more straightforward.  I expected to find multiple tutorials but really came up short.  There was a lot of reading ggplot2 source code.  There was a lot of trial and error.  In the end, though, the code isn't complicated.  Much of the documentation, however, lives in the code's comments, or is the code itself.  Still, I'm hoping that the first layer is the roughest, and the next one I create will come more smoothly.

Wednesday, September 13, 2017

The end of risk

Introduction

I think risk is hurting infosec and may need to go.

First, a quick definition of risk:

Risk is the probable frequency and probable magnitude of future loss.
(Note: if this is not your definition of risk, the rest of this blog is probably going to make much less sense.  Unfortunately, why this is the definition of risk is outside the scope of this blog, so it will have to wait.)

In practice, risk is a way to measure security by measuring the likelihood and impact of something bad happening that could have been prevented by information security.

Outcomes

That last line is a bit nebulous though, right?  Risk measures the opposite of what we're doing.  So let's better define what we're doing.  Let's call what we're doing an outcome: the end result of our structure and processes (for example, in healthcare, heart disease is a general term for negative cardiovascular outcomes).  Next, let's define what we want:

The ideal outcome of infosec is minimizing infosec, attack and defense.

(Per a statistically insignificant and heavily biased survey. Obviously.  Why is this a good outcome? See the addendum.)

Measures, Markers, and Key Risk Indicators

Where we can directly measure this outcome, we call it a 'measure' (for example, ejection fraction for heart disease) and life is a lot easier.  For risk, we have to use surrogate markers (for example, cholesterol for heart disease), sometimes called Key Risk Indicators in risk terms.  Now when we say 'risk', we normally mean the indicators we use to predict risk.  The most well-respected methodology is FAIR, though if you are currently using an Excel spreadsheet with a list of questions for your risks, you can easily improve simply by switching to Binary Risk Assessment.

The problems with risk

The first problem with risk is not with risk per se, but with the surrogate markers we measure to predict risk.  In other fields, such as medicine, before using a surrogate marker there would be non-controversial studies linking the surrogate marker to the outcome.  In security, I'm not aware of any study showing that the surrogate markers we measure to determine risk actually predict the outcome in a holistic way.  In my opinion, there's a specific reason:
Because there are more legitimate targets than attackers, what determines an organization's outcome (at least on the attack side) is attacker choice.
You can think of it like shooting fish in a barrel.  Your infosec actions may take you out of the barrel, but you'll never truly know whether you're out of the barrel or you're in the barrel and just weren't targeted.  This, I think, is the major weakness in risk as a useful marker of outcomes.

This ignores multiple additional issues with risk such as interrelationships between risks, the impact of rational actors, and difficulty in capturing context, let alone problems in less mature risk processes that are solved in mature processes such as FAIR.

The second problem with risk is related to defense:
Risk does not explicitly help measure minimization of defense.  It tells us nothing about how to decrease (or minimally increase) the use of resources in infosec defense.   
I suspect we then implicitly apply an equation that Tim Clancy brought to my attention as foundational in the legal field: Burden < Cost of Injury × Probability of Occurrence (i.e. if Burden < Risk, pay the burden).  It sounds good in theory, but it is fraught with pitfalls.  The most obvious pitfall is that it doesn't scale.  Attacks are paths, but the paths are not in isolation.  At any given step, attackers can choose to go left or right, in effect creating more paths than can be counted.  As such, while any one burden might be affordable, the sum of all burdens would bankrupt the organization.
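
With made-up numbers, the test for a single attack path looks like this:

# Hypothetical numbers: pay the burden only when it is less than the risk
# (probable frequency x probable magnitude of loss)
burden <- 10e3       # annual cost of a mitigation
p      <- 0.05       # probable frequency of the loss event per year
impact <- 1e6        # probable magnitude of the loss
burden < p * impact  # TRUE -> the mitigation is 'worth it' for this one path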

What happens when markers aren't linked to outcomes?

As discussed in this thread, I think a major amount of infosec spending is socially driven.  Either the purchaser is making a purchase to signal success ("We only use the newest next-gen products!"), to signal inclusion in a group ("I'm a good CISO"), or is purchasing due to herd mentality ("Everyone else buys this so it must be the best option to buy").  Certainly, as discussed above, the spending is not related to the outcome.  This raises the question of who sets the trends for the group or steers the herd.  I like Marcus Carey's suggestion: the analysts and Value Added Resellers.  Maybe this is why we see so much marketing money in infosec.

The other major driver is likely other surrogate markers.  Infosec decision makers are starved for concrete truth, and so almost any number is clung to.  The unfortunate fact is that, like risk, most of these numbers have no demonstrable connection to the outcome.  Take a hypothetical threat intelligence solution, for example.  This solution includes all IPv4 addresses as indicators.  It has a 100% success rate in identifying threats.  (It also has a near-100% false positive rate.)  I suspect, with a few minor tweaks, it would be readily purchased even though it adds no value.

What can we do?

There are three questions we need to ask ourselves when evaluating metrics/measures/surrogate markers/KRIs/however you refer to them.  From here on out, I'll refer to them as 'metrics'.
  1. What is the outcome?
  2. Why is this the right outcome? (Is it actionable?)
  3. How do you know the metric is predicting this outcome?
For any metric you are considering basing your security strategy on (the metric you use to make decisions about projects, purchases, etc.), you should be able to answer these three questions definitively.  (In fact, this blog answers the first question, and part of the second, in the first section above.)  I think there are at least three potential areas of future research that may yield acceptable metrics.

Operational metrics

I believe operational metrics have a lot of potential.  They are easy to collect with a SIEM.  They are actionable.  They can directly predict the outcome above ("The ideal outcome of infosec is minimizing infosec, attack and defense.").  Our response process:

  1. Prevent
  2. Detect
  3. Respond
  4. Recover

should minimize infosec.  With that in mind we can measure it:

  • Absolute count of detections. (Should go down with mitigations.)
  • Time to detect. (Should go down with improved detection.) (Technically, the absolute count of detections should go up with improved detection as well, but it should also be correlated with an improved time to detect.)
  • Percent detected in under time T. (Where T is set such that, above T, the attack likely succeeded.)
  • Percent responded to. (Depending on the classification of the incidents, this can tell you both how much time you are wasting responding to false positives and what portion of true attacks you are resolving.)
  • Time to respond. (Goal of responding in under time T, where T represents the time necessary for the attack to succeed.)
  • Successful response. (How many attacks are you preventing from having an impact?)
The above metrics are loosely based on those Sandia National Labs uses in physical security assessments.  You can capture additional resource-oriented metrics:
  • Time spent on attacks by type. (This can help identify where your resources are being spent so you can prioritize projects to improve resource utilization.)
  • Recovery resources used. (This can help assess the impact of failure in the Detect and Respond metrics.)
  • Metrics on escalated incidents. (Time spent at tier 1, type, etc.  These may suggest projects to minimize tier 2 use and, therefore, overall resource utilization.)

Combine this with data from infosec projects and other measurable infosec resource costs, and the impact of infosec on the organization (both attack and defense) can be measured.
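
As a sketch of how cheap these are to compute, here is an entirely hypothetical incident log with a few of the metrics above.  (The column names and thresholds are made up.)

library(dplyr)

incidents <- tibble::tibble(
  occurred_at = as.POSIXct(c("2017-09-01 08:00", "2017-09-01 20:00", "2017-09-03 13:50")),
  detected_at = as.POSIXct(c("2017-09-01 10:00", "2017-09-02 09:30", "2017-09-03 14:00")),
  responded   = c(TRUE, TRUE, FALSE)
)

T_detect <- 4 * 60 * 60  # hypothetical time T, in seconds, above which the attack likely succeeded

incidents %>%
  mutate(time_to_detect = as.numeric(detected_at - occurred_at, units = "secs")) %>%
  summarise(
    detections    = n(),
    median_ttd_hr = median(time_to_detect) / 3600,  # time to detect
    pct_under_T   = mean(time_to_detect < T_detect),
    pct_responded = mean(responded)
  )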

Relative risk

Risk has a lot of good qualities.  One way to get around its pitfalls may be to track risk not in absolute terms (probability of a loss of a given magnitude, in FAIR's case), but in relative terms.  Unfortunately, that removes the ability to give the business a single "this is your risk" score, except in terms relative to other organizations.  But relative may be enough.  For implementing a security strategy where the goal is to pick the most beneficial course of action, relative risk may be enough to choose one.  The actions can even have defensive costs associated with them as well.  The problem is that the defensive costs and the relative risk are not in the same units, making it hard to understand whether purchasing a course of action is a net benefit.
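
As a toy illustration (every number here is hypothetical), relative risk alone is enough to rank courses of action, but the units problem remains:

# Hypothetical courses of action: relative risk vs. doing nothing (= 1.0)
actions <- tibble::tibble(
  action        = c("MFA", "EDR", "awareness training"),
  relative_risk = c(0.60, 0.75, 0.95),
  annual_cost   = c(50e3, 120e3, 20e3)
)
actions[order(actions$relative_risk), ]  # rank by risk reduction
# relative_risk and annual_cost are in different units, so whether any action
# is a net benefit still can't be read off this table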

Attacker cost

Finally, I think attacker cost may be a worthwhile area of research.  However, I don't think it's an area that has been well explored.  As such, a connection between maximizing attacker cost and "minimizing infosec" (from our outcome) has not been demonstrated.  I suspect a qualified economist could easily show that as attacker costs go up, some attackers will be priced out of the market, and those that can still afford to attack will choose less expensive sources to fulfill their needs.  However, a qualified economist I am not.  Second, I don't know that we have a validated way to measure attack 'cost'.  It makes intuitive sense that we could estimate these costs.  (We know how attacks are done and, as such, can estimate what it would cost us to accomplish the attack, plus any differences between us and attackers.)  But before this is accepted, academic research in pricing attacks will be necessary.

Conclusion

So, from this blog, I want you to take away two things:
  1. The fact that attackers pick targets means no one really knows whether their mitigations make them secure.
  2. Three easy questions to ask when looking at metrics to guide your organization's security strategy.
With a well-defined outcome, and good metric(s) to support it, you can truly build a data-driven security strategy.  But that's a talk for another day.

Addendum

Why is this the right outcome?  Good question.  It captures multiple things at once.  Breaches can be considered a cost associated with infosec and so are captured.  However, it'd be naive to think that all costs attackers cause are associated with breaches (or even incidents).  The generality of the definition allows it to be inclusive.  It also captures the flip side: the goal of minimizing defenses.  This is easy to miss but critical to organizations.  There is no benefit to infosec if the cost of stopping attacks is worse than the cost of the attacks.  Ideally, stopping attacks would cost zero resources (though that is practically impossible).  The outcome is also vague about the unit to be minimized, allowing flexibility.  (It doesn't say 'minimize cost' or 'minimize time'.)  Ultimately it's up to the organization to measure this outcome.  How they choose to do so will determine their success.