SQL Injection: Code-Level Defenses – Canonicalization

Canonicalization

A difficulty with input validation and output encoding is ensuring that the data being evaluated or transformed is in the form in which it will ultimately be interpreted by its final consumer. A common technique for evading input validation and output encoding controls is to encode the input before it is sent to the application in such a way that it is later decoded and interpreted to suit the attacker's aims. For example, Table 1 lists alternative ways of encoding the single-quote character.

Table 1. Example Single-Quote Representations
Representation   Type of encoding
%27              URL encoding
%2527            Double URL encoding
%%327            Nested double URL encoding
%u0027           Unicode representation
%u02b9           Unicode representation
%ca%b9           Unicode representation
&apos;           HTML entity
&#39;            Decimal HTML entity
&#x27;           Hexadecimal HTML entity
%26apos;         Mixed URL/HTML encoding

In some cases these are alternative encodings of the character (%27 is the URL-encoded representation of the single quote); in other cases they are double-encoded on the assumption that the data will be explicitly decoded by the application (%2527, when URL-decoded, becomes %27, as does %%327); and in others they are various Unicode representations, either valid or invalid. Not all of these representations will be interpreted as a single quote under normal circumstances; in most cases they rely on certain conditions being in place (such as decoding at the application, application server, WAF, or Web server level), and it is therefore very difficult to predict whether your application will interpret them this way.
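
To see why, consider how a double-encoded value survives a single round of decoding. The following minimal Java sketch (variable names are illustrative) uses java.net.URLDecoder to decode %2527 twice:

import java.net.URLDecoder;

public class DoubleDecodeDemo {
    public static void main(String[] args) throws Exception {
        String input = "%2527";
        // First decode: %25 becomes '%', leaving "%27" -- still encoded.
        String once = URLDecoder.decode(input, "UTF-8");
        // Second decode: %27 becomes the single-quote character.
        String twice = URLDecoder.decode(once, "UTF-8");
        System.out.println(once);   // prints %27
        System.out.println(twice);  // prints '
    }
}

If your application decodes once but another component in front of or behind it decodes again, the attacker's quote arrives intact.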

For these reasons, it is important to consider canonicalization as part of your input validation approach. Canonicalization is the process of reducing input to a standard or simple form. For the single-quote representations in Table 1, this would normally be a single-quote character (').

Canonicalization Approaches

So, what alternatives for handling unusual input should you consider? One method, often the easiest to implement, is to reject any input that is not already in canonical form. For example, you can refuse to accept any HTML- or URL-encoded input. This is one of the most reliable methods in situations where you are not expecting encoded input, and it is often what you get by default with whitelist input validation, since unusual encodings of characters will simply fail to match the known-good pattern. At the very least, this could mean disallowing the characters used to encode data (such as %, &, and # from the examples in Table 1).
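
As an illustration of this reject-by-default behavior, the following Java sketch (the pattern and names are hypothetical, and would be tailored to the field being validated) accepts only a known-good character set, so %, &, and # never get through:

import java.util.regex.Pattern;

public class WhitelistValidator {
    // Accept only letters, digits, spaces, and a few safe punctuation marks.
    // The encoding metacharacters %, &, and # can never match, so encoded
    // forms are rejected before any decoding takes place.
    private static final Pattern SAFE_INPUT =
            Pattern.compile("\\A[A-Za-z0-9 .,_-]{1,100}\\z");

    public static boolean isValid(String input) {
        return input != null && SAFE_INPUT.matcher(input).matches();
    }
}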

If rejecting input that may contain encoded forms is not possible, you need to look at ways to decode or otherwise make safe the input you receive. This may involve several decoding steps, such as URL decoding and HTML decoding, potentially repeated several times. That approach can be error-prone, however, as you will need to check after each decoding step whether the input still contains encoded data. A more practical approach may be to decode the input once and then reject it if it still contains encoded characters. This assumes that genuine input will not contain double-encoded values, which should be a valid assumption in most cases.
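
A minimal Java sketch of the decode-once-and-reject approach follows (the residual-encoding pattern and names are illustrative, not exhaustive):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class CanonicalizingValidator {
    // Matches data that still looks encoded after one round of decoding:
    // URL escapes (%27), %u Unicode escapes, or HTML entities (&apos;, &#39;).
    private static final Pattern STILL_ENCODED =
            Pattern.compile("%[0-9A-Fa-f]{2}|%u[0-9A-Fa-f]{4}|&#?\\w+;");

    public static String canonicalize(String input) throws UnsupportedEncodingException {
        // URLDecoder itself rejects malformed % escapes with an
        // IllegalArgumentException, which is acceptable behavior here.
        String decoded = URLDecoder.decode(input, "UTF-8");
        if (STILL_ENCODED.matcher(decoded).find()) {
            // Genuine input should not be double-encoded; treat it as hostile.
            throw new IllegalArgumentException("input contains double-encoded data");
        }
        return decoded;
    }
}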

Working with Unicode

When working with Unicode input such as UTF-8, one approach is normalization of the input. This converts the Unicode input into its simplest form, following a defined set of rules. Unicode normalization differs from canonicalization in that a Unicode character may have multiple normal forms, depending on which set of rules is followed. The recommended normalization form for input validation purposes is NFKC (Normalization Form KC: Compatibility Decomposition followed by Canonical Composition). You can find more information on normalization forms at www.unicode.org/reports/tr15.

The normalization process decomposes each Unicode character into its constituent components, and then reassembles the character in its simplest form. In most cases this transforms double-width and other Unicode encodings into their ASCII equivalents, where they exist.

You can normalize input in Java with the Normalizer class (since Java 6) as follows:

String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);

You can normalize input in C# with the Normalize method of the String class as follows:

string normalized = input.Normalize(NormalizationForm.FormKC);

You can normalize input in PHP with the PEAR::I18N_UnicodeNormalizer package from the PEAR repository, as follows:

$normalized = I18N_UnicodeNormalizer::toNFKC($input, 'UTF-8');
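
To illustrate what NFKC buys you, this minimal Java sketch normalizes the fullwidth apostrophe (U+FF07), one compatibility character among many, down to the plain ASCII single quote:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String fullwidth = "\uFF07"; // FULLWIDTH APOSTROPHE
        String normalized = Normalizer.normalize(fullwidth, Normalizer.Form.NFKC);
        System.out.println(normalized.equals("'")); // true: folded to U+0027
    }
}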

Another approach is to first check that the Unicode is valid (not a malformed or invalid representation), and then to convert the data into a predictable format, such as a Western European character set like ISO-8859-1, using the data in that format within the application from that point on. This is a deliberately lossy approach: Unicode characters that cannot be represented in the target character set will normally be lost. For the purpose of making input validation decisions, however, it can be useful in situations where the application is not localized into languages outside Western Europe.

You can check the validity of UTF-8-encoded Unicode by applying the set of regular expressions shown in Table 2. If every byte sequence in the input matches one of these patterns, the input is validly encoded UTF-8; otherwise it is not valid UTF-8 and should be rejected. For other Unicode encodings, consult the documentation of the framework you are using to determine whether functionality is available for testing the validity of input.

Table 2. UTF-8 Parsing Regular Expressions
Regular expression                  Description
[\x00-\x7F]                         ASCII
[\xC2-\xDF][\x80-\xBF]              Two-byte representation
\xE0[\xA0-\xBF][\x80-\xBF]          Three-byte representation (excluding overlongs)
[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   Three-byte representation
\xED[\x80-\x9F][\x80-\xBF]          Three-byte representation (excluding surrogates)
\xF0[\x90-\xBF][\x80-\xBF]{2}       Four-byte representation, planes 1 through 3
[\xF1-\xF3][\x80-\xBF]{3}           Four-byte representation, planes 4 through 15
\xF4[\x80-\x8F][\x80-\xBF]{2}       Four-byte representation, plane 16
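
By way of illustration, the rows of Table 2 can be combined into one expression and applied to the raw bytes. The following Java sketch (assuming Java 7 or later for StandardCharsets; names are illustrative) decodes the bytes as ISO-8859-1 first, so that each byte maps to exactly one character for the regular expression engine:

import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // One alternative per row of Table 2; the whole input must consist of
    // a sequence of well-formed UTF-8 byte sequences.
    private static final String WELL_FORMED_UTF8 =
            "(?:[\\x00-\\x7F]"                          // ASCII
          + "|[\\xC2-\\xDF][\\x80-\\xBF]"               // two-byte
          + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"          // three-byte, no overlongs
          + "|[\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}"  // three-byte
          + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"          // three-byte, no surrogates
          + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"       // planes 1 through 3
          + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"            // planes 4 through 15
          + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}"       // plane 16
          + ")*";

    public static boolean isValidUtf8(byte[] input) {
        // Decode the bytes as ISO-8859-1 so each byte maps to exactly one
        // char, allowing the byte-oriented patterns to be applied directly.
        String bytesAsChars = new String(input, StandardCharsets.ISO_8859_1);
        return bytesAsChars.matches(WELL_FORMED_UTF8);
    }
}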

Now that you have checked that the input is validly formed, you can convert it to a predictable format—for example, converting a Unicode UTF-8 string to another character set such as ISO-8859-1 (Latin 1).

In Java, you can use the CharsetEncoder class, or the simpler String method getBytes(), as follows:

byte[] ascii = utf8.getBytes("ISO-8859-1");
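
If you need explicit control over how invalid and unmappable data is handled, the CharsetDecoder and CharsetEncoder classes mentioned above can perform the validity check and the lossy conversion in one pass. A sketch, assuming Java 7 or later and illustrative names:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetConverter {
    public static String toLatin1(byte[] utf8Bytes) throws CharacterCodingException {
        // Decode as UTF-8, rejecting malformed byte sequences outright.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(utf8Bytes));

        // Re-encode as ISO-8859-1; characters outside Latin 1 are replaced
        // (the deliberately lossy step described above).
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer latin1 = encoder.encode(chars);

        byte[] out = new byte[latin1.remaining()];
        latin1.get(out);
        return new String(out, StandardCharsets.ISO_8859_1);
    }
}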

In C#, you can use the Encoding.Convert method as follows:

ASCIIEncoding ascii = new ASCIIEncoding();
UTF8Encoding utf8 = new UTF8Encoding();
byte[] utf8Bytes = utf8.GetBytes(input);
byte[] asciiBytes = Encoding.Convert(utf8, ascii, utf8Bytes);

In PHP, you can do this with utf8_decode as follows:

$ascii = utf8_decode($utf8string);