Crushing Monolithic Data Results ("Rock") Into "Gravel": dnsbq New -g Volume-Across-Time Option
By Joe St Sauver
Earlier this year, we published a three-part blog series on DNSDB volume-over-time analyses aimed at DNSDB Export (aka DNSDB "on-premises") customers. Those customers have direct file-level access to the data in DNSDB MTBL files. You can find the three blogs here: Finding Top FQDNs Per Day in DNSDB Export MTBL Files (Part 1), Volume-Over-Time Data From DNSDB Export MTBL Files, Part 2 and Analyzing DNSDB Volume-Over-Time Time Series Data With R and ggplot2 Graphics, Part 3.
Unfortunately, unless DNSDB users were fortunate enough to have file-level access to DNSDB Export data files, they couldn't get the data they needed to evaluate volume-across-time trends. This has now changed. Now "regular" DNSDB API users can begin to ask for "deaggregated" counts "across time," too.
dnsdbq and DNSDB Scout have both been extended to support this new functionality. Since this feature was already demonstrated using DNSDB Scout in the original announcement, this article will focus solely on
dnsdbq's implementation of its volume-across-time implementation, aka the
-g ("gravel") option.
2. Classic Aggregated DNSDB Counts
Normally DNSDB returns a single aggregated count. Consider a sample query and classic aggregated output, such as the following:
$ dnsdbq -r www.reed.edu/A/reed.edu ;; record times: 2010-06-24 17:12:52 .. 2019-05-29 16:37:26 ;; count: 971009; bailiwick: reed.edu. www.reed.edu. A 220.127.116.11 ;; record times: 2019-05-29 16:39:54 .. 2019-09-09 17:56:45 ;; count: 26747; bailiwick: reed.edu. www.reed.edu. A 18.104.22.168
Farsight's sensor network saw www.reed.edu resolve to the IPv4 address 22.214.171.124 a total of 971,009 times between 17:12:52 UTC on 2010-06-24 and 16:37:26 UTC on 2019-05-29.
At 16:39:54 UTC on 2019-05-29, that domain changed and began to resolve to 126.96.36.199, and has continued to do so 26,747 times through 17:56:45 UTC on 2019-09-09.
Those aggregated count ran from:
- The first time a given unique (RRname/RRtype/Bailiwick/Rdata/Sensor-or-Zone file origin) combination was seen up through
- The last time that unique (RRname/RRtype/Bailiwick/Rdata/Sensor-or-Zone file origin) combination was seen.
That aggregated count was useful, but sometimes you need finer-grained details. Being able to drill down and allocate the count over smaller time periods may be of interest to you if you're attempting to determine if:
- According to the time-between-dates site, there are roughly 3,261 days between 2010-06-24 and 2019-05-29. Were the 971,009 sensor hits distributed roughly evenly over that time, 971009/3261 ==> ~298 hits per day, day after day after day?
- Or, perhaps, did the query volume "start low" but then steadily increase in volume year after year?
- Or was volume very low ("down in the weeds"), only to suddenly have a big spike, after which it dropped back "down into the weeds" again?
Without volume across time data, there was simply no way to answer these sort of questions using DNSDB API.
3. Gravel'd ("de-aggregated") Results in dnsdbq
With the new DNSDB API volume across time functionality, a user can request a result from EACH underlying MTBL file. The duration of each MTBL file will vary according to the "roll-up" status of the data:
- For data from past years (currently 2010-2018), the MTBL files will have yearly-aggregated data.
- Within the current year, because the current year is still "in progress" (and thus a consolidated 2019 yearly file can't yet be created), you'll see monthly data.
- Within the current month, because the month hasn't yet completed, daily count data will be provided.
- Within the current day, you'll see hourly data, etc.
Let's rerun our sample query for www.reed.edu/A/reed.edu, this time adding a
-g option. Suddenly we get far more granular results:
$ dnsdbq -r www.reed.edu/A/reed.edu -g ;; record times: 2010-06-24 17:12:52 .. 2010-12-31 18:09:34 <== Through 2010 ;; count: 20833; bailiwick: reed.edu. www.reed.edu. A 188.8.131.52 ;; record times: 2010-12-28 11:56:00 .. 2011-12-31 18:45:54 <== Roughly 2011 ;; count: 57285; bailiwick: reed.edu. www.reed.edu. A 184.108.40.206 ;; record times: 2011-12-31 14:51:10 .. 2012-12-31 23:39:57 <== Roughly 2012 ;; count: 76954; bailiwick: reed.edu. www.reed.edu. A 220.127.116.11 ;; record times: 2012-12-31 22:04:40 .. 2013-12-31 22:24:43 <== Roughly 2013 ;; count: 82726; bailiwick: reed.edu. www.reed.edu. A 18.104.22.168 ;; record times: 2013-12-31 21:52:26 .. 2014-12-31 16:58:22 <== Roughly 2014 ;; count: 75379; bailiwick: reed.edu. www.reed.edu. A 22.214.171.124 ;; record times: 2014-12-31 12:41:43 .. 2015-12-31 21:24:51 <== Roughly 2015 ;; count: 100858; bailiwick: reed.edu. www.reed.edu. A 126.96.36.199 ;; record times: 2015-12-31 17:33:54 .. 2016-12-31 21:16:24 <== Roughly 2016 ;; count: 167684; bailiwick: reed.edu. www.reed.edu. A 188.8.131.52 ;; record times: 2016-12-31 10:07:17 .. 2017-12-31 21:20:05 <== Roughly 2017 ;; count: 182531; bailiwick: reed.edu. www.reed.edu. A 184.108.40.206 ;; record times: 2017-12-31 14:02:42 .. 2018-12-31 19:23:57 <== Roughly 2018 ;; count: 156543; bailiwick: reed.edu. www.reed.edu. A 220.127.116.11 ;; record times: 2018-12-31 16:01:04 .. 2019-01-31 22:38:31 <== Roughly Jan 2019 ;; count: 10742; bailiwick: reed.edu. www.reed.edu. A 18.104.22.168 </b> ;; record times: 2019-01-31 17:20:34 .. 2019-02-28 23:38:07 <== Roughly Feb 2019 ;; count: 10491; bailiwick: reed.edu. www.reed.edu. A 22.214.171.124 ;; record times: 2019-02-28 18:41:19 .. 2019-03-31 23:44:30 <== Roughly Mar 2019 ;; count: 10091; bailiwick: reed.edu. www.reed.edu. A 126.96.36.199 ;; record times: 2019-03-31 18:36:58 .. 2019-04-30 23:32:50 <== Roughly Apr 2019 ;; count: 9904; bailiwick: reed.edu. www.reed.edu. A 188.8.131.52 ;; record times: 2019-04-30 18:05:53 .. 2019-05-29 16:37:26 <== Roughly May 2019 ;; count: 8988; bailiwick: reed.edu. www.reed.edu. A 184.108.40.206 [the remaining results, e.g., for www.reed.edu/A/reed.edu ==> 220.127.116.11, omitted here]
You may wonder why the dates shown above don't "perfectly" coincide with their respective time periods (e.g., why are the time periods only "roughly" a year or "roughly" a month). The answer is that unique RRname/RRtype/Bailiwick/RRname data values are cached during processing, and sometimes cached entries will span a time boundary, resulting in the epoch-spanning time boundaries shown in the above MTBL reports.
4. Visualizing Our Gravel'd Results With Excel
While the results shown in section 2 contain the raw numerical gravel'd output data, it's hard to visualize what's going on when we're just looking at a table of raw numbers. Let's see what our results look like when we graph them. We begin by extracting a CSV file containing just:
- The time first-seen
- The difference between the time last seen and the time first seen (the "delta time") and
- The count
for our IP of interest using standard Unix command line tools plus
$ dnsdbq -g -r www.reed.edu/A/reed.edu -s -k first -j | grep "18.104.22.168" | jq -r '"\(.time_first), \(.time_last - .time_first), \(.count)"' > reed.csv $ head -5 reed.csv 1277399572, 16419402, 20833 1293537360, 31819794, 57285 1325343070, 31654127, 76954 1356991480, 31537203, 82726 1388526746, 31518356, 75379
We could then plot that data in a graphing package of your choice. For example, we could use
- We begin by opening a new sheet in Excel 16 for the Mac.
- Import the data from reed.csv by saying File –> Import –> CSV File –> Import. Select reed.csv. Hit "Get Data." Select Delimited Data. Hit Next. Select commas as delimiters. Hit Next. Leave the column data format as General. Put the data in the existing sheet at =$A$1
- Convert the time_first data (in Unix "tick time") to Excel data values. To do so, insert a new column between column A and column B by selecting column B and then right clicking on Insert. In the new blank cell B1, enter the formula: =(A1/86400)+DATE(1970,1,1)
- Now Edit–>Copy B1, then Edit–> Fill –> Down to put that formula into column B for the remaining values.
- Choose a suitable date time format for column B. For example, Select column B, then go to Format –> Cells –> Date and pick 3/14/12 as the format.
- Select Column B and then Command Click on Column D to select the two columns of interest.
- Insert –> Chart –> Line. Expand the chart to suit your preferences, insert a title and axis legends, and otherwise tweak the graph to taste. The result should be something like what's shown in Figure 1.
Figure 1. Count of www.reed.edu/A/reed.edu –> 22.214.171.124 Over Time Using Excel
5. An Alternative Visualization With R and ggplot
We can also use
ggplot to plot that graph. We assume you already have R and ggplot installed.
If you need help getting
ggplot installed, please visit here for more information.
Once you have
ggplot installed, create the Rscript commands needed to plot the data:
$ cat plot-volume-over-time.R #!/usr/local/bin/Rscript args <- commandArgs(trailingOnly = TRUE) filename <- args mydata <- read.table(file = filename, header = FALSE, sep = ",") colnames(mydata) <- c("x", "deltax", "y") mydata$datetime <- as.Date(as.POSIXct(mydata$x, origin="1970-01-01", tz="GMT")) outputfile <- args sink(file = outputfile, append = FALSE, type = c("output", "message"), split = FALSE) graphfile <- paste(outputfile, ".pdf", sep = "", collapse = NULL) pdf(graphfile, width = 10, height = 7.5) library("ggplot2") library("scales") mytitle <- paste("\n\n", "www.reed.edu/A/reed.edu", "\nCounts Over Time", sep = "") mydatebreaks = as.Date(c("2010-01-01","2011-01-01", "2012-01-01","2013-01-01","2014-01-01","2015-01-01", "2016-01-01","2017-01-01","2018-01-01","2019-01-01", "2020-01-01")) p <- ggplot()+ geom_point(aes(x = mydata$datetime, y = mydata$y)) + geom_line (aes(x = mydata$datetime, y = mydata$y )) + labs(title=mytitle, x = "Date\n\n", y = "\n\nCounts") + scale_x_date(limits= as.Date(c("2010-01-01", "2020-01-01")), breaks = mydatebreaks, date_minor_breaks = "1 month", date_labels="%m/%y") + scale_y_continuous(labels = comma, sec.axis = sec_axis(trans=~., name="\n\n", breaks=NULL)) print(p)
To run that program, enter:
$ chmod a+rx plot-volume-over-time.R $ ./plot-volume-over-time.R reed.csv reed.output [output will be in reed.output.pdf, and should look like the graph shown on the next page]
You may see a couple of warning messages, such as:
In min(x) : no non-missing arguments to min; returning Inf
In max(x) : no non-missing arguments to max; returning -Inf
You may safely disregard these. The graphic output in reed.output.pdf looks like Figure 2, below:
Figure 2. Count of www.reed.edu/A/reed.edu –> 126.96.36.199 Over Time Using R and ggplot
Figure 2 – even better than Figure 1 – makes it obvious that while the first measurements are annual totals, the last results are monthly results. As such, it should not be surprising that these are just a fraction of the size of the older annual results.
We hope that this article will serve to help "bootstrap" your use of the
-g ("gravel") option in
The Farsight Security Sales Team can be reached at email@example.com or give them a call at +1-650-489-7919.
Joe St Sauver Ph.D. is a Distinguished Scientist with Farsight Security®, Inc.