Farsight's Real-time DNSDB, Part One




I remember back in 2008 when Paul Vixie introduced PassiveDNS replication to me, a real-time stream of names and answers scrolling by in a terminal window. Every now and then I could pause the terminal and see what obviously looked like a "pharma" domain or some kind of phishing. It was exciting and magical, but it wasn't quite useful – yet. Until an indexed database was available, it wasn't easy to make associations between the same IP addresses a criminal used for one campaign to identify other campaigns. We also wanted to ask questions like "What other hostnames are being used inside this domain?" Until we were able to build an index based on the live data, all we could do was "grep".

We've had friends try to use standard SQL databases, and they've had difficulty keeping up with inserts from the incoming flood of data while still being able to perform queries. That inspired us to look at time-delimited NoSQL solutions. We've gone through several iterations of NoSQL database design on the back-end ranging from:

  • a simple DB4 file (disk was too slow), to
  • hybrid CDB sorted indices (not scalable long-term), to
  • Cassandra clusters (reliability/speed issues), to
  • TokyoCabinet on PCIe SSD (generally good performance), to
  • developing a generic MTBL and specific DNStable implementation of sorted string tables.

Each iteration took advantage of technology available at the time to make lookups as efficient as possible. The last two were developed to take advantage of SSD to write data once and let clients read as many times as needed without any spinning media bottlenecks. We generated hourly databases from live data that were merge-sorted into daily databases and then monthly databases and yearly databases. A "fileset" gives an access client the list of databases to open in parallel to query for their answer. The process scales well on RAID arrays of generic 2.5" SSD drives, and we can replicate linearly if needed.
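The roll-up from hourly to daily files is, at heart, a k-way merge of inputs that are already sorted. Here is a minimal sketch of that idea in Python. It is not the MTBL implementation: it assumes each database can be read as an iterator of (key, count) pairs in key order, and the helper name merge_sorted and the "sum the counts" merge rule are illustrative choices.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted(*sources):
    """Merge key-sorted (key, count) iterables into one key-sorted stream.

    Hypothetical helper: entries that share a key are combined by summing
    their counts, standing in for whatever merge function a real sorted
    string table would apply.
    """
    merged = heapq.merge(*sources, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


# Two made-up "hourly" tables, each already sorted by key.
hour_00 = [("1f.ru.", 4), ("ziegast.tumblr.com.", 1)]
hour_01 = [("1f.ru.", 6), ("example.com.", 2)]

daily = list(merge_sorted(hour_00, hour_01))
print(daily)
```

Because every input is read sequentially and written once, this style of merge suits storage where sequential reads are cheap and files are never modified after they are written.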

If researchers wanted or needed to perform lookups for data in the last few minutes, we usually directed them toward the raw Passive DNS real-time data feed that's available on the Security Information Exchange (SIE). They could develop their own methods to utilize the real-time data, but they found it complex and time-consuming to both use a database and write their own processing scripts for the real-time data.

As announced last week, we've upgraded our DNSDB Export service to deliver data in real time. Users no longer have to wait for the next hourly update to get more information: we now make updates available for DNSDB Export every minute. We developed a TLS-based download manager to speed up transfers and manage consistency on the client side based on what local files are available.

How do I use it?

First, understand that with PassiveDNS replication, we gather matched questions and answers between recursive nameservers from all over the Internet. We have a waterfall computing model where raw uploads from sensors are deduplicated by their query and answer, deduplicated again based on where in the DNS hierarchy the answer arrived from, and then filtered to remove superfluous data. The process is documented in Passive DNS Architecture. The live data includes information that looks like the following:

count: 1
time_first: 2015-10-22 03:20:19
time_last: 2015-10-22 03:20:19
bailiwick: tumblr.com.
rrname: ziegast.tumblr.com.
rrclass: IN (1)
rrtype: A (1)
rrttl: 30

count: 10
time_first: 2015-10-22 07:16:26
time_last: 2015-10-22 18:56:16
bailiwick: ru.
rrname: 1f.ru.
rrclass: IN (1)
rrtype: NS (2)
rrttl: 345600
rdata: ns3.nic.ru.
rdata: ns4.nic.ru.
rdata: ns8.nic.ru.
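
To make the count, time_first and time_last fields concrete, here is a rough Python sketch of how repeated observations could collapse into one record. It is an illustration only, not our production pipeline: the function name deduplicate, the single "time" field on each observation, and the timestamps in the example are all assumptions.

```python
def deduplicate(observations):
    """Collapse repeated observations of the same answer into one entry.

    Each observation is a dict with bailiwick, rrname, rrtype, rdata (a list)
    and a made-up 'time' field holding Unix epoch seconds.
    """
    merged = {}
    for obs in observations:
        # The order of answers does not matter in DNS, so sort rdata
        # before using it as part of the key.
        key = (obs["bailiwick"], obs["rrname"], obs["rrtype"],
               tuple(sorted(obs["rdata"])))
        entry = merged.get(key)
        if entry is None:
            merged[key] = {"count": 1,
                           "time_first": obs["time"],
                           "time_last": obs["time"]}
        else:
            entry["count"] += 1
            entry["time_first"] = min(entry["time_first"], obs["time"])
            entry["time_last"] = max(entry["time_last"], obs["time"])
    return merged


# Two sightings of the same NS set, with answers in a different order.
observations = [
    {"bailiwick": "ru.", "rrname": "1f.ru.", "rrtype": "NS",
     "rdata": ["ns3.nic.ru.", "ns4.nic.ru.", "ns8.nic.ru."], "time": 1000},
    {"bailiwick": "ru.", "rrname": "1f.ru.", "rrtype": "NS",
     "rdata": ["ns8.nic.ru.", "ns3.nic.ru.", "ns4.nic.ru."], "time": 2000},
]

result = deduplicate(observations)
for key, entry in result.items():
    print(key, entry)
```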

The entries are inserted as tuples into the database DNStable. The indices are built to enable queries based mostly on rrname and rdata. We can make simple direct queries like:

  • "What is the history of NS records for 1f.ru?"
  • "What other names are hosted at"
  • "What other domains are hosted by ns3.nic.ru?"

Rdata types like IP addresses have their indices optimized for CIDR lookups, and rrname or rdata names have their indices optimized for wildcard searches. As such, we can quickly provide answers to:

  • "What other names are in the *.tumblr.com domain?"
  • "What other names point their addresses into"

The databases are created each and every minute. For example, all of the new data from Oct 22, 2015 at 18:51 UTC is stored in a file named: dns.20151022.1851.m.mtbl. We merge-sort databases into combined databases at 10-minute, 1-hour, 1-day, 1-month and 1-year intervals. The collection of files forms a set that the dnstable library can open and access in parallel to gather answers.
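
If you script against these files, it helps to compute the per-minute file name from a timestamp. Here is a small Python sketch based only on the one name quoted above. The function name minute_filename is made up, and the names of the 10-minute, hourly and larger files are not shown in this article, so the sketch does not attempt them.

```python
from datetime import datetime, timezone


def minute_filename(ts):
    """Per-minute file name for a timezone-aware timestamp.

    Follows the dns.YYYYMMDD.HHMM.m.mtbl pattern from the example above,
    always expressed in UTC.
    """
    return ts.astimezone(timezone.utc).strftime("dns.%Y%m%d.%H%M.m.mtbl")


name = minute_filename(datetime(2015, 10, 22, 18, 51, tzinfo=timezone.utc))
print(name)
```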

A command line lookup tool, dnstable_lookup, reads the DNSTABLE_FNAME environment variable to look up answers in a single database file, or the DNSTABLE_SETFILE environment variable to look up answers across the list of files named in a set file.

Another command line tool, dnstable_dump, can take the binary format stored in the databases and convert it to rows of JSON.

We'll provide examples of both commands below.

Brand name / counterfeit example

Back in April I wrote a blog post about how to look up counterfeit names using SIE access and enhancing it with DNSDB lookups. This time, we'll just use our DNSDB Export files.

Consider the Burberry line of clothing and accessories. As a popular luxury brand, it is often targeted by counterfeiters. Counterfeiters often rely on freshly created domain names: their wares tend to be taken down from established online sales platforms (Amazon, eBay, etc.), and they cannot keep long-lived domain names because rights holders can easily take those down with tools like the U.S. DMCA and ICANN’s UDRP. The examples below show freshly created domain names that would appear at first glance to fit into this pattern.

Let's look at the latest minute…

  $ dnstable_dump -r dns.20151022.1941.m.mtbl | grep burberry | grep -v ';'
  burberrybags808.tumblr.com. IN A
  burberrybags808.tumblr.com. IN A
  burberryoutletstores.xyz. IN NS f1g1ns1.dnspod.net.
  burberryoutletstores.xyz. IN NS f1g1ns2.dnspod.net.
  burberryoutletstores.xyz. IN NS f1g1ns1.dnspod.net.
  burberryoutletstores.xyz. IN NS f1g1ns2.dnspod.net.
  burberryoutletstores.xyz. IN SOA f1g1ns1.dnspod.net.
  freednsadmin.dnspod.com. 1444295154 3600 180 1209600 180
  www.burberryoutletstores.xyz. IN CNAME burberryoutletstores.xyz.

Looking up over the last year, we can find other merchandise hosted there. Using a larger set of DNSDB history, here's another lookup:

  $ ls dns.2015* > dns.fileset
  $ export DNSTABLE_SETFILE=dns.fileset
  $ dnstable_lookup rrset burberryoutletstores.xyz A
  ;;  bailiwick: burberryoutletstores.xyz.
  ;;      count: 5
  ;; first seen: 2015-10-04 03:40:30 -0000
  ;;  last seen: 2015-10-07 17:14:44 -0000
  burberryoutletstores.xyz. IN A
  ;;  bailiwick: burberryoutletstores.xyz.
  ;;      count: 18
  ;; first seen: 2015-10-10 21:15:43 -0000
  ;;  last seen: 2015-10-21 19:46:38 -0000
  burberryoutletstores.xyz. IN A
  ;;; Dumped 2 entries.

Looking up prior addresses finds other trademark names being hosted on the same servers now:

  $ dnstable_lookup rdata ip
  louisvuittonoutletonline.pw. IN A
  raybaneyeglasses.us.com. IN A
  3gp-ds.ytconv.net. IN A
  coachoutletonline.top. IN A
  michaelkorshandbags.xyz. IN A
  burberryoutletstores.xyz. IN A
  ;;; Dumped 6 entries.
  $ dnstable_lookup rdata ip
  raybaneyeglasses.us.com. IN A
  abercrombieandfitchoutletsonline.com. IN A
  furlaoutletsonline.in.net. IN A
  burberryoutlet.top. IN A
  discountnfljerseys.top. IN A
  burberryoutletonline.top. IN A
  burberryoutletstores.top. IN A
  guccioutletonline.xyz. IN A
  burberryoutletonline.xyz. IN A
  burberryoutletstores.xyz. IN A
  ;;; Dumped 10 entries.

We don't have to use command line tools to look at the data. There are C and Python bindings for easily doing lookups against DNStable files.

Consider the following script, lookup_ip.py:


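#!/usr/bin/env python
# lookup_ip.py: print every record whose rdata matches the given IP address.
import sys
import dnstable

# Open the database files listed in dns.fileset.
d = dnstable.reader('dns.fileset')
# Build an rdata-by-IP query from the first command line argument.
q = dnstable.query(dnstable.RDATA_IP, sys.argv[1])

# Emit each matching entry as one JSON object per line.
for res in d.query(q):
    print(res.to_json())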

Running it returns JSON tuples for all of the names that reference that IP address:

  $ ./lookup_ip.py
  {"rrtype": "A", "time_last": 1445413846, "time_first": 1444238290,
   "count": 15, "rrname": "raybaneyeglasses.us.com.", "rdata":
  {"rrtype": "A", "time_last": 1445478986, "time_first": 1443590863,
   "count": 56, "rrname": "abercrombieandfitchoutletsonline.com.", "rdata":
  {"rrtype": "A", "time_last": 1445061327, "time_first": 1444404624,
   "count": 13, "rrname": "furlaoutletsonline.in.net.", "rdata":
  {"rrtype": "A", "time_last": 1444735169, "time_first": 1444302797,
   "count": 42, "rrname": "burberryoutlet.top.", "rdata": ""}
  {"rrtype": "A", "time_last": 1444948864, "time_first": 1444948864,
   "count": 2, "rrname": "discountnfljerseys.top.", "rdata": ""}
  {"rrtype": "A", "time_last": 1445428527, "time_first": 1444273015,
   "count": 7, "rrname": "burberryoutletonline.top.", "rdata":
  {"rrtype": "A", "time_last": 1444132851, "time_first": 1444097643,
   "count": 6, "rrname": "burberryoutletstores.top.", "rdata":
  {"rrtype": "A", "time_last": 1445492367, "time_first": 1444473075,
   "count": 9, "rrname": "guccioutletonline.xyz.", "rdata": ""}
  {"rrtype": "A", "time_last": 1445424613, "time_first": 1444273016,
   "count": 22, "rrname": "burberryoutletonline.xyz.", "rdata":
  {"rrtype": "A", "time_last": 1445456798, "time_first": 1444511743,
   "count": 18, "rrname": "burberryoutletstores.xyz.", "rdata":

rrname is the name that was queried, and rrtype is the DNS type ("A", "NS", "MX", etc.) found in the answer.

The tuple of time_first, time_last and count shows how many times the name was seen within a given period. The time values are Unix epoch seconds (the number of seconds since midnight Jan 1 1970 UTC). A count of "0" means it was seen once in an INSERTION record. Actual counts are made on EXPIRATION records.
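
Epoch seconds are easy to compare but hard to read. As a minimal sketch, the Python below parses one JSON row and converts its timestamps to UTC datetimes. The helper name summarize is made up, and the sample row reuses the field values from the output above with the rdata field left out.

```python
import json
from datetime import datetime, timezone


def summarize(line):
    """Parse one JSON row and return readable first/last-seen times."""
    rec = json.loads(line)
    first = datetime.fromtimestamp(rec["time_first"], tz=timezone.utc)
    last = datetime.fromtimestamp(rec["time_last"], tz=timezone.utc)
    return {
        "rrname": rec["rrname"],
        "count": rec["count"],
        "first_seen": first,
        "last_seen": last,
        "span": last - first,  # how long the name was observed
    }


sample = ('{"rrtype": "A", "time_last": 1444735169, '
          '"time_first": 1444302797, "count": 42, '
          '"rrname": "burberryoutlet.top."}')

info = summarize(sample)
print(info["rrname"], info["first_seen"].isoformat(), info["span"])
```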

The bailiwick is the place in the DNS hierarchy from which we received an answer. Sometimes a registry nameserver and the domain's authoritative nameserver can be out of sync. If they are out of sync, they will list different bailiwick and rdata for the same rrname and rrtype.
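
One way to spot that situation in exported records is to group them by rrname and rrtype and compare what each bailiwick reported. The Python sketch below is a simplification under stated assumptions: records are dicts with the fields described here, the function name find_disagreements is hypothetical, the example.com records are invented, and only the most recently processed rdata per bailiwick is kept.

```python
from collections import defaultdict


def find_disagreements(records):
    """Return {(rrname, rrtype): {bailiwick: rdata}} where bailiwicks differ."""
    by_name = defaultdict(dict)
    for rec in records:
        key = (rec["rrname"], rec["rrtype"])
        # frozenset ignores answer order; a later record for the same
        # bailiwick overwrites an earlier one.
        by_name[key][rec["bailiwick"]] = frozenset(rec["rdata"])
    return {
        key: views
        for key, views in by_name.items()
        if len(set(views.values())) > 1
    }


# Invented records: the registry and the domain's own servers disagree.
records = [
    {"rrname": "example.com.", "rrtype": "NS", "bailiwick": "com.",
     "rdata": ["ns1.example.net.", "ns2.example.net."]},
    {"rrname": "example.com.", "rrtype": "NS", "bailiwick": "example.com.",
     "rdata": ["ns1.example.org.", "ns2.example.org."]},
    {"rrname": "1f.ru.", "rrtype": "NS", "bailiwick": "ru.",
     "rdata": ["ns3.nic.ru.", "ns4.nic.ru.", "ns8.nic.ru."]},
]

out_of_sync = find_disagreements(records)
for key, views in out_of_sync.items():
    print(key, sorted(views))
```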

The rdata is an array of answers returned for the given rrname/rrtype/bailiwick during the timeframe. In DNS, order of answers doesn't matter, so it may make sense to make sure the answers are sorted before importing to a database.
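
A minimal Python sketch of that normalization step follows. The function name normalize is made up, and the record is a plain dict modeled on the JSON rows shown earlier.

```python
def normalize(record):
    """Return a copy of the record with its rdata answers sorted."""
    normalized = dict(record)
    normalized["rdata"] = sorted(record["rdata"])
    return normalized


# The same NS set seen in two different answer orders.
a = {"rrname": "1f.ru.", "rrtype": "NS",
     "rdata": ["ns8.nic.ru.", "ns3.nic.ru.", "ns4.nic.ru."]}
b = {"rrname": "1f.ru.", "rrtype": "NS",
     "rdata": ["ns3.nic.ru.", "ns4.nic.ru.", "ns8.nic.ru."]}

print(normalize(a) == normalize(b))
```

With the answers sorted, two rows that differ only in answer order compare equal, which avoids duplicate rows in the database you import into.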


In the next article, I will provide more use-case examples.

Eric Ziegast is a Senior Distributed Systems Engineer for Farsight Security, Inc.

Read the next part in this series: Farsight's Real-time DNSDB, Part Two