Building a Reputation System From Available Data




Last week, you learned what reputation systems are and what kind of data they consume. This week, I will expand on the data reputation systems consume. We will look at the data that most mail server operators have and can use to create their own reputation systems.


To gain access to a reputation system, you could purchase or license one from any of a number of for-profit companies, but what if you or your employer don't want to foot the sometimes hefty bill for such a system? Perhaps there's a cheaper and more interesting way: as a mail server operator for a medium-sized network, you likely have enough data available to develop your own.

In this article I'll walk you through the data you could easily collect (or may already be collecting) to feed a homegrown system (the actual code and hardware to build the reputation system are left as an exercise to the reader).

Before we begin, a pointer: Bradley Taylor wrote a wonderful primer on how to build your own email reputation system. Although almost 10 years old, the paper is still accurate and relevant.

SMTP data

The first place to look is the mail server logs. You'll want to parse, collect, and/or check for the following:

  • HELOs. Some botnets use the same HELO for every message they spam. Also, do the HELO and reverse DNS (rDNS) match? A matching HELO and rDNS is a good indication of clued-in mail server operation and would benefit the sender's score.
  • Matching forward DNS and rDNS for the sender. This would also benefit the sender's score.
  • Is there a Sender Policy Framework (SPF) record for the domain? Is it valid? A record that is present and valid would benefit the sender's score; one that is absent or invalid would hurt it.
  • Is the sender IP or domain listed in a blocklist? Look especially for IPs listed in the Spamhaus PBL. A listing would significantly hurt the sender's score.
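
As a rough illustration of how the checks above might be turned into inputs, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the signal names, and especially the weights are hypothetical, and the DNSBL zone is a placeholder you would replace with the zone of a list you are permitted to query. The DNS lookups are passed in as functions so the logic can be tried without touching the network.

```python
import ipaddress
import socket

# Hypothetical weights; tune them against your own mail flow.
WEIGHTS = {
    "helo_matches_rdns": 1.0,
    "forward_confirmed_rdns": 1.0,
    "spf_valid": 1.0,
    "spf_missing_or_invalid": -1.0,
    "blocklisted": -5.0,
}


def normalize(host):
    """Lower-case a hostname and drop surrounding spaces and trailing dots."""
    return host.strip().rstrip(".").lower()


def helo_matches_rdns(helo, rdns_names):
    """True if the HELO name equals one of the IP's PTR names."""
    return normalize(helo) in {normalize(name) for name in rdns_names}


def forward_confirmed(ip, rdns_names, forward_lookup):
    """True if any PTR name resolves back to the connecting IP.

    forward_lookup(name) must return an iterable of IP strings.
    """
    return any(ip in forward_lookup(name) for name in rdns_names)


def dnsbl_query_name(ip, zone):
    """Build the name to query for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + normalize(zone)


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the DNSBL returns an answer for this IP.

    Any lookup failure, including NXDOMAIN, is treated as "not listed".
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except OSError:
        return False
    return True


def smtp_score(signals):
    """Sum the weights of every signal that is present and true."""
    return sum(WEIGHTS[name] for name, present in signals.items() if present)
```

In practice you would also inspect the address a blocklist returns, since many lists encode the reason for a listing (or an error condition) in the answer; check the documentation of whichever list you use before trusting a bare "listed" result.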

Other Data Sources

With a little scripting, there is other information you can glean from logs:

  • How many domain names have resolved to the IP within a particular quantum of time? A large number of domains resolving to a single IP might be cause for concern and could hurt the sender's score.
  • Has the IP sent spam to you in the past? Clearly, this would significantly hurt the sender's score, probably weighted by a coefficient that decays with the time since it last sent you spam.
  • Has the domain been seen in spam in the past? This would probably have a similar effect and be governed by a similar back-off as the previous item.
  • What ASN does the IP belong to? Do IPs from that ASN send spam frequently? This would probably have a similar effect and be governed by a similar back-off as the previous two items.
  • Does the TTL of the domain name fluctuate? How frequently? By how much? Because a fluctuating TTL can be indicative of Fast Flux DNS, this could hurt the sender's score.
  • Data from an intrusion detection system such as Snort, Bro, or Suricata. This can be very useful in some circumstances. Detecting IPs that try to send malware into your network is obviously A Good Thing (TM). Detecting the IPs that were the targets of that malware is also good, but may give unintended results when used in a reputation system: if an attack on an internal IP is intercepted or prevented, "punishing" that internal IP with a degraded score is probably not what you want. Test early and test often. You might find that data from an intrusion prevention system is useful as well, because that data is historical rather than contemporaneous, but again, test your rules thoroughly before using them in production.
  • Reliable DNSBLs are always useful. Consider the CBL from Spamhaus and the SpamCop Blocklist, as well as SURBL for domains.
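
Two of the items above lend themselves to a short sketch: the decaying penalty for past spam, and the count of distinct domains seen for an IP within a window of time. The Python below is only one possible shape for these, under stated assumptions: the exponential half-life form of the decay, the 30-day default, the function names, and the (timestamp, ip, domain) tuple layout are all hypothetical choices for illustration, not something prescribed by this article.

```python
from datetime import datetime, timedelta


def decayed_penalty(base_penalty, last_spam, now, half_life_days=30.0):
    """Halve the penalty for every half_life_days since the last spam seen.

    The half-life form and the 30-day default are illustrative assumptions.
    """
    age_days = max((now - last_spam).total_seconds(), 0.0) / 86400.0
    return base_penalty * 0.5 ** (age_days / half_life_days)


def distinct_domains_in_window(observations, ip, now, window):
    """Count distinct domains seen for an IP between now - window and now.

    observations is an iterable of (timestamp, ip, domain) tuples, for
    example rows parsed out of your own logs.
    """
    cutoff = now - window
    return len({
        domain.rstrip(".").lower()
        for seen_at, seen_ip, domain in observations
        if seen_ip == ip and cutoff <= seen_at <= now
    })


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    rows = [
        (datetime(2024, 1, 30), "192.0.2.1", "a.example"),
        (datetime(2024, 1, 30), "192.0.2.1", "b.example"),
        (datetime(2023, 6, 1), "192.0.2.1", "old.example"),
    ]
    print(decayed_penalty(10.0, now - timedelta(days=30), now))
    print(distinct_domains_in_window(rows, "192.0.2.1", now, timedelta(days=7)))
```

The same decay function could be applied to the domain and ASN histories by keying the "last seen in spam" timestamp on the domain or the ASN instead of the IP.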

Farsight Security Datafeeds

Additionally, Farsight provides unique sources of information that can be used as inputs to a reputation system:


With this data, you can be off to a good start on designing an in-house reputation system for your own network. In the next and final installment of this series, we'll look at the axes on which we might evaluate this data. More specifically, we'll look at what exactly goes into a reputation score and how that might change with your individual goals.

Kelly Molloy is a Senior Program Manager for Farsight Security, Inc.

Read the next part in this series: Optimizing Reputation System Input Data