Malformed Internationalized Domain Name (IDN) Leads to Discovery of Vulnerability in IDN Libraries
By Mike Schiffman
As part of our research for "Farsight Security Global Internationalized Domain Name (IDN) Homograph, Q2 2018 Report", Farsight Security discovered a bug in the popular libidn and libidn2 C libraries, which are used to build Internationalized Domain Name in Applications (IDNA)-aware software. Depending on how the code is written, this bug could lead to a security vulnerability in trusting applications. It occurs in the Punycode decoder when pathological inputs decode to illegal Unicode code point values.
Although we worked closely with the vendor to report and patch the vulnerability, it remains important for application programmers to patch their code and for end users to update to the fixed library versions.
To get the most from this article, the reader should be familiar with the following technologies:
The functions responsible for decoding Punycode into Unicode in both libidn and libidn2 can be coerced to generate invalid Unicode code point values yet return successfully. These resultant code point values are larger than the maximum valid Unicode code point of 0x10FFFF (1,114,111). Depending on how they are subsequently treated by application code, these values may result in a program crash or other undefined behavior, including possible arbitrary code execution.
The simplest Punycode string that triggers this behavior is "xn--0000h", which decodes to a single "code point" value of U+127252 (1,208,914) that is not a legal Unicode code point. This is shown below using a simple test program, "punydecode" (available in Appendix A).
$ echo "xn--0000h" | punydecode -
0000h:1:U+127252
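As one hypothetical illustration of how such a value could misbehave downstream (not taken from any specific application), consider a UTF-8 encoder that trusts its input to be at most 0x10FFFF. The function names below are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive UTF-8 encoder (hypothetical): assumes cp <= 0x10FFFF, never checks. */
static size_t naive_utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* Everything else is treated as a 4-byte sequence, valid or not. */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Guarded variant: returns 0 (nothing written) for out-of-range input. */
static size_t checked_utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp > 0x10FFFF)
        return 0;
    return naive_utf8_encode(cp, buf);
}
```

Fed 0x127252, the naive encoder emits the bytes F4 A7 89 92. That sequence is ill-formed UTF-8, because a lead byte of F4 may only be followed by a byte in the range 80 to 8F. The guarded variant refuses the value instead.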
The libidn and libidn2 libraries are open source implementations of IDNA (libidn implements IDNA2003 while libidn2 implements IDNA2008). They both provide APIs to encode and decode internationalized domain names.
Inside the latest versions of both libraries (1.35 for libidn and 2.0.5 for libidn2) are two almost identical¹ functions responsible for decoding Punycode strings into Unicode code points. Libidn calls this function punycode_decode() while libidn2 calls it _idn2_punycode_decode()².
From here on out, we will refer to both functions as simply the "Punycode decoder".
The Punycode decoder is an implementation of the algorithm described in section 6.2 of RFC 3492. As it walks the input string, the Punycode decoder fills the output array with decoded code point values. The output array is typed to hold unsigned 32-bit integers, while the Unicode code point space fits within 21 bits. This leaves 11 unused high-order bits, and any of them being set yields an invalid Unicode code point. Even within 21 bits, values from 0x110000 through 0x1FFFFF are out of range. The vulnerability is enabled by the lack of a sanity check to ensure decoded code points do not exceed the Unicode code point maximum of 0x10FFFF. As such, for offending input, unchecked decoded values are copied directly to the output array and returned to the caller.
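To make the mechanics concrete, here is a condensed sketch of a decoder in the style of RFC 3492 section 6.2. It is not the libidn or libidn2 source; the name sketch_decode and its signature are hypothetical. Like the code described above, it never compares n against 0x10FFFF before storing it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bootstring parameters for Punycode (RFC 3492, section 5). */
#define BASE 36u
#define TMIN 1u
#define TMAX 26u
#define SKEW 38u
#define DAMP 700u
#define INITIAL_BIAS 72u
#define INITIAL_N 0x80u

/* Map a basic code point to its digit value, or BASE if it is not a digit. */
static uint32_t digit_value(char c)
{
    if (c >= '0' && c <= '9') return (uint32_t)(c - '0') + 26u;
    if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a');
    if (c >= 'A' && c <= 'Z') return (uint32_t)(c - 'A');
    return BASE;
}

/* Bias adaptation (RFC 3492, section 6.1). */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;
    delta = first ? delta / DAMP : delta / 2u;
    delta += delta / numpoints;
    for (; delta > ((BASE - TMIN) * TMAX) / 2u; k += BASE)
        delta /= BASE - TMIN;
    return k + (BASE - TMIN + 1u) * delta / (delta + SKEW);
}

/*
 * Hypothetical sketch: decode a label (ACE prefix already stripped) into
 * out[0..max). Returns the code point count, or -1 on malformed input or
 * integer overflow. Deliberately has NO upper-bound check on n.
 */
static int sketch_decode(const char *in, uint32_t *out, size_t max)
{
    size_t len = strlen(in), pos = 0, count = 0;
    uint32_t n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
    const char *delim = strrchr(in, '-');

    /* Copy the basic code points that precede the last delimiter. */
    if (delim != NULL && delim > in) {
        for (; in + pos < delim; pos++) {
            if ((unsigned char)in[pos] >= 0x80 || count >= max) return -1;
            out[count++] = (unsigned char)in[pos];
        }
        pos++; /* skip the delimiter itself */
    }

    while (pos < len) {
        uint32_t oldi = i, w = 1, k, t, digit;
        /* Decode one generalized variable-length integer into i. */
        for (k = BASE; ; k += BASE) {
            if (pos >= len) return -1;
            digit = digit_value(in[pos++]);
            if (digit >= BASE || digit > (UINT32_MAX - i) / w) return -1;
            i += digit * w;
            t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
            if (digit < t) break;
            if (w > UINT32_MAX / (BASE - t)) return -1;
            w *= BASE - t;
        }
        bias = adapt(i - oldi, (uint32_t)count + 1u, oldi == 0);
        if (i / ((uint32_t)count + 1u) > UINT32_MAX - n) return -1;
        n += i / ((uint32_t)count + 1u);
        i %= (uint32_t)count + 1u;
        if (count >= max) return -1;
        /* n is stored without being compared against 0x10FFFF. */
        memmove(out + i + 1, out + i, (count - i) * sizeof *out);
        out[i++] = n;
        count++;
    }
    return (int)count;
}
```

With this sketch, decoding the label "0000h" stores the single value 0x127252 and reports success, while the well-formed label "8a" yields 0xA2.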
The bug can be fixed simply by checking for excessive code point values prior to insertion into the output array. Something as simple as the following will work:
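A minimal sketch of such a check follows. It is an illustration rather than the exact upstream patch; the helper name store_checked and the UNICODE_MAX macro are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu /* highest valid Unicode code point */

/*
 * Hypothetical helper: store n at out[idx] only if it is in range.
 * Returns 0 on success, -1 if n exceeds the Unicode maximum.
 */
static int store_checked(uint32_t *out, size_t idx, uint32_t n)
{
    if (n > UNICODE_MAX)
        return -1; /* reject instead of silently reporting success */
    out[idx] = n;
    return 0;
}
```

Inside a decoder, the same comparison would sit immediately before the code point is written to the output array, returning the decoder's "bad input" status on failure.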
A similar patch has been pushed to the libidn and libidn2 repositories and should be readily available.
For the remediation and disclosure of this security condition, Farsight worked directly with Tim Rühsen, the maintainer of libidn and libidn2. We would like to thank him for his prompt and detailed responses at every point in the process.
Finally, Farsight did not discover this vulnerability through a code audit, but rather, through an encounter with a malformed IDN in the wild. While we won't (currently) release details on the domain in question, we feel it's important to inform others that there are live hostnames out there that may trigger this bug, and thus that it is important to upgrade dependent libidn / libidn2 packages.
Appendix A: Punycode Decode Test Program
The following program can be used to check Punycode input strings for overflow. It expects input as single Punycode-encoded labels with or without the ACE prefix and can read from a file or a pipeline.
If there is no error, the output is colon separated as per the following:
input punycode:code point count:code points.
For conforming inputs punydecode will prepend a lowercase u+ before each code point:
$ echo "xn--8a" | punydecode -
8a:1:u+00a2
For offending inputs it will prepend an uppercase U+ before each code point:
$ echo "xn--0000h" | punydecode -
0000h:1:U+127252
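The output convention described above can be sketched as a small formatting routine. This is not the punydecode source. The function name format_result is hypothetical, and two details are assumptions: multiple code points are separated by a single space, and the uppercase prefix is applied to every code point once any one of them exceeds 0x10FFFF.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical formatter: writes "label:count:code points" into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_result(const char *label, const uint32_t *cps, size_t n,
                         char *buf, size_t size)
{
    const char *prefix = "u+";
    size_t used, j;
    int w;

    /* Any out-of-range value marks the whole input as offending. */
    for (j = 0; j < n; j++)
        if (cps[j] > 0x10FFFF)
            prefix = "U+";

    w = snprintf(buf, size, "%s:%zu:", label, n);
    if (w < 0 || (size_t)w >= size)
        return -1;
    used = (size_t)w;

    for (j = 0; j < n; j++) {
        /* Assumed separator: one space between code points. */
        w = snprintf(buf + used, size - used, "%s%s%04lx",
                     j ? " " : "", prefix, (unsigned long)cps[j]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Given the label "8a" and the single value 0xA2, this produces 8a:1:u+00a2. Given "0000h" and 0x127252, it produces 0000h:1:U+127252, matching the two sample runs.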
Additionally, the program tests the reversibility of the input Punycode string and will emit an "encode mismatch" error if the decoded code points don't encode to the original Punycode.
To build punydecode.c, you'll need "idn2.h", "puny_decode.c", "puny_encode.c", and "punycode.h" from libidn2 in the same directory. You can build with something like:
gcc -Wall -O0 -ggdb punydecode.c puny_decode.c puny_encode.c -o punydecode
¹ The only difference is libidn's support for case-awareness. Since IDNA2008 removes support for uppercase characters, libidn2 has no such support.
² This function is ostensibly private and not directly usable through the libidn2 API. In fact, access to it is "protected" by a call to u8_to_u32(), which validates the Punycode before handing it off to _idn2_punycode_decode(). However, the function is not static in scope and is externally accessible. According to the libidn2 README, the library is intended to be a drop-in replacement for libidn:
"This library is backwards (API) compatible with the libidn library. Replacing the idna.h header with idn2.h into a program is sufficient to switch the application from IDNA2003 to IDNA2008 as supported by this library."
As such, if an application programmer upgrades from libidn to libidn2 and has an IDNA-based application that directly calls punycode_decode() and does something like the following, the program will be vulnerable to the overflow:
Furthermore, if an application programmer is concerned about bloat and/or performance, the Punycode source files might be cherry-picked directly from the library, bypassing any protections afforded by u8_to_u32().
Mike Schiffman is an IDNA2020 Hopeful for Farsight Security, Inc.