Friday, September 7, 2018

Data Analysis Template

This is just a quick blog to share my jupyter notebook analysis template.  I analyze a lot of different datasets in a short period, so having the analysis consistent is very helpful.  I'll walk through the sections quickly to share a bit about my process.

Title Section

In the title section, I have a block for any ideas to explore, specific things I intend to do, anything I need to request to be updated in the data, and any notes about the data.  These are all bulleted text boxes.

This section is VERY helpful for working on multiple datasets.  it's easy to forget what you were going to do or what you've done and the summary up front helps get you back in place.

Preparation

next is preparing the data.  No data comes ready for analysis.  Here I have blocks to read in the data, clean the created dataframe, save it to an R data (Rda) object on disk, and then, the next time I need it, I just load the Rda and skip the cleaning.

Analysis

The analysis section is basically filled with mini experiments.  each chuck is one.  As such, it's important that each have a bit of information in comments at the top of it:

  1. A description of the hypothesis being tested or explored.  Something like "looking at the distribution of the periodicity of events".
  2. Once it's done, describe the results.  Yes, the results should describe the results but you'll thank past you if you write down what you got from the analysis when you did it.  Something like "it looks like the periodicity is bimodal with one mode representing X and another representing Y."
  3. Add a comment with a UUID.  Seriously.  Every. Single. Block.  If it's something interesting you're going to put it in a document or a blog or something.  You want to be able to track it from beginning to end.  (Ours track from the report, through several drafts of the report, through drafts of the sections, to a figures rmarkdown file that generates all the figures, to an exploratory report where we created the original analysis.)  Seriously.  If you like it then you shoulda put a UUID on it.
  4. Now you can actually write the analysis code

Appendixes

This is where I put all of the extra stuff.

Testing

I always have a testing block.  Throughout the analysis, you'll spend a lot time testing stuff to make it work, (or simply looking up things like the dimensions of your data and the column names).  Putting those in a testing block keeps you from coming back later and wondering what the block in your analysis was there for.

Lookups

Sometimes you have big, ugly, lookups.  putting them at the top clogs the Preparation section, so I tend to put them at the bottom.  You'll remember you forgot to run them when your analysis fails.

Backup

Really a parking lot for anything you don't want in another section, but don't want to delete.


Ultimately, if I were doing full modeling, I'd probably want a template that follows the process outlined in Modern Dive.  However, for someone just getting into analysis, hopefully this helps!

No comments:

Post a Comment