URI Percent Encoding by default

URI Percent Encoding by default

Tim Keanini

At the previous SCAP Developer Days meeting, I raised the concern that percent encoding, as defined in section 5.4 of the CPE standard version 2.2, is problematic and should worry most implementers.  I'd like to clarify that point with some examples.  I strongly suggest that everyone reread section 5.4, this time with an eye toward any security issue that could be exploited through encoding/decoding, and with a bias toward using this facility as the exception and not the rule.  I have some suggestions for rectifying the situation, but I will hold them for now and concentrate on illustrating the concern.

 

While the RFCs describing URI syntax (rfc3986, rfc2396, rfc2141, rfc1738) are clear on URI characters and escape sequences, escaping should in general practice be the exception and never the rule.  In the CPE standard we have made it the rule, knowing full well that a large set of reserved characters will have to be escaped.  Section 5.1 of the CPE standard version 2.2 begins with the sentence "A CPE Name is a percent-encoded URI…".  RFC3986 goes into more detail on the risks in section 7 (Security Considerations), and I feel that, as a standard within the Security Content Automation Protocol, we should choose designs that avoid the need to encode/decode in the identifier itself, keeping the non-legal metadata OUT of the identifier and in what is being identified.

 

Please let me offer two examples: one general and one specific; both pointing to the plethora of unforeseen problems related to percent encoding within the URI.

 

First example: In this general example, the concern is ambiguity among alternative transcriptions of the same URI, which should be avoided because it depends on implementation-specific encoding and decoding.  The CPE 2.2 specification recognizes this problem to some degree in the sentence "The percent-encoding mechanism is a frequent source of variance among otherwise identical URIs".  And where behavior that ought to be deterministic varies, there is fertile ground for security-related issues:

http://www.technicalinfo.net/papers/URLEmbeddedAttacks.html

http://capec.mitre.org/data/definitions/64.html

http://www.dracos.co.uk/code/apache-rewrite-problem/

One only needs to search for terms like "Percent Encoding Problems" to see the scope of the problem.

If we offer up an identifier that does NOT require percent encoding, we sidestep this problem and can relate metadata to this ID using forms that are not URIs.
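
To make the variance concrete, here is a minimal sketch in Python using the standard urllib.parse module. The component values are hypothetical and chosen only for illustration; they are not taken from the CPE dictionary. It shows that transcriptions which look different as raw strings can decode to the same value, and that decoding the same string twice can silently change its meaning.

```python
from urllib.parse import unquote

# Three transcriptions that different encoders might plausibly emit for
# the same component (hypothetical values, for illustration only).
variants = ["xerces-c%2B%2B", "xerces-c%2b%2b", "xerces-c++"]

# Compared as raw strings, these are three distinct identifiers.
raw_distinct = len(set(variants))

# After percent-decoding, they collapse to a single value.
decoded_distinct = len({unquote(v) for v in variants})

# Decoding is not idempotent: a literal "%" that was encoded as "%25"
# changes meaning if a component happens to be decoded twice.
once = unquote("100%2525")
twice = unquote(once)

print(raw_distinct, decoded_distinct)  # 3 1
print(once, twice)                     # 100%25 100%
```

Whether two names are "the same" therefore depends on whether, and how many times, each tool in the chain decodes them.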

 

Second example: Back in March of 2009, I was building a domain ontology using CPE names as an authoritative way to reference a platform.  The promise of a CPE Name in this context is very similar to its promise in SCAP: in a universe of Linked Data, one could use it as a "key" to compute equivalence.  To do this, the CPE Name would have to be usable in RDF and, more specifically, survive translation to and from RDF/XML.  As Jeremy Carroll (an expert in W3C Semantic Technology) points out in my threaded discussion, the syntax allowed by the CPE specification is problematic for the semantic web qname convention and as such will be a problem for RDF libraries.  The discussion can be seen here:

http://groups.google.com/group/topbraid-users/browse_thread/thread/81ed88db9df55ec4/deb88b8d1703122d?lnk=gst&q=CPE&fwc=1

While each tool may be able to find workarounds, we should not favor designs that require workarounds. 

 

Of the Common Enumerations, the platform enumeration is the only one to suffer this problem and I don’t think I need to explain why this is the case. 

I hope this offers enough information for you to agree, or agree to disagree.  Once that is done, we can discuss the multiple ways of getting out of this problem.  I was pleased to see the DoD's posting on ISO/IEC 19770-2.  In terms of setting expectations, we will need something as good (or we can just incorporate it), and I look forward to the discussion.

 

--tk

 

 

--

Tim "TK" Keanini, CTO

nCircle Inc.

mbl (415) 328-2722

 



Re: URI Percent Encoding by default

Brant Cheikes

Tim,

I'm supportive of the proposal that we try to eliminate (or at least minimize) the need for percent encoding within CPE names, starting if possible in v2.3.  This is something the CPE Core Team is considering right now.  Let's discuss options and see if we can reach consensus on one of them.

As we have noted, a minor release such as v2.3 must preserve backward compatibility.  The current dictionary contains many entries with embedded percent encodings of reserved characters, e.g., Xerces-C++ (the plus), AVG Email Server Edition for Linux/Freebsd 8.5 (the slash), Cisco Secure Access Control Server 3.3(1) (the parens), etc., so the need is there.  In v2.3, names with embedded substrings like "%28" must therefore still be legal, since we cannot require that dictionary content developed under the v2.2 rules be modified in order to remain legal under v2.3.  Substrings that *look* like percent-encoded characters are here to stay, at least until 3.0.  For reference, the current grammar (v2.2 spec, Appendix A) defines the affected components (vendor, product, version, update, edition) as STRINGs, such that

STRING = +( ALPHA / DIGIT / PUNC )
ALPHA =  %x41-5A / %x61-7A   ; A-Z / a-z
DIGIT =  %x30-39  ; 0-9
PUNC = ( "-" / "." / "_" / "~" / "%" )
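
As a rough illustration, the STRING rule quoted above can be checked with a regular expression. This is a sketch only: the names V22_STRING and is_v22_string are made up for the example, the tilde is assumed to be the plain ASCII "~", and the sample components are the percent-encoded forms one would expect for the dictionary entries mentioned earlier rather than values copied from the dictionary.

```python
import re

# STRING = +( ALPHA / DIGIT / PUNC ), with PUNC = "-" / "." / "_" / "~" / "%"
V22_STRING = re.compile(r"[A-Za-z0-9\-._~%]+")

def is_v22_string(component: str) -> bool:
    """Return True if the whole component consists only of STRING characters."""
    return V22_STRING.fullmatch(component) is not None

print(is_v22_string("xerces-c%2b%2b"))  # True: the plus signs are encoded
print(is_v22_string("xerces-c++"))      # False: "+" is not in ALPHA/DIGIT/PUNC
print(is_v22_string("3.3%281%29"))      # True: the parens are encoded
print(is_v22_string("3.3(1)"))          # False
```

Note that this grammar accepts any "%" at all, not just "%" followed by two hex digits, which is part of why the options below can treat "%" as an ordinary character.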

What are our design options for v2.3?

Option 1:  "Same look, different feel."  Leave the grammar effectively unchanged, but eliminate the percent encoding/decoding requirements and interpretations from the specification.

Taking this option, in v2.3 we won't say anything about percent encoding or decoding.  (Or at most we'll describe percent encoding as a recommended practice if there is a desire to embed special characters in component values.)  Instead, we'll restrict values of (vendor, product, version, update, edition) to STRINGs as defined above, and we will explicitly disallow any interpretation of ("%" HEXDIG HEXDIG) substrings as percent-encoded characters.  In effect, we'll disallow the use of most special characters in component names, and we will also eliminate the requirement stating, "When matching is performed, each component should be decoded when used during comparisons."  That is, when doing matching we'll perform a straight character-by-character comparison blind to percent encoding.  (NB: we'll ensure that all ALPHAs are normalized to lowercase before comparison.)

Pros:  All existing names (should) remain valid; no impact on matching.  We also gain forward compatibility, meaning that a v2.3 name is consumable by a v2.2 tool.
Cons:  We don't have a supported way of embedding special characters in components.
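
A small sketch of what the matching change in Option 1 would mean in practice. Both function names are hypothetical, and match_v22 is only a rough stand-in for "decode each component before comparing" (the case folding in it is an assumption for the sake of a fair comparison), not the matching algorithm from the specification.

```python
from urllib.parse import unquote

def match_v22(a: str, b: str) -> bool:
    """Rough stand-in for v2.2 behavior: decode both components, then compare."""
    return unquote(a).lower() == unquote(b).lower()

def match_option1(a: str, b: str) -> bool:
    """Option 1: lowercase the ALPHAs, then compare character by character.
    A "%" is just another character; nothing is decoded."""
    return a.lower() == b.lower()

# Identical encoded strings match either way.
print(match_v22("3.3%281%29", "3.3%281%29"), match_option1("3.3%281%29", "3.3%281%29"))  # True True

# Hex digits differing only in case still match under Option 1,
# because lowercasing also folds "%2B" to "%2b".
print(match_option1("c%2B%2B", "c%2b%2b"))  # True

# A needlessly encoded "-" matches only when decoding is performed.
print(match_v22("a%2Db", "a-b"), match_option1("a%2Db", "a-b"))  # True False
```

The last case is the kind of name pair whose match result would change under Option 1, so it would be worth checking whether the dictionary contains any such pairs.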

Option 2:  "Slightly relaxed look and feel."  Here's the thing: in v2.3, under certain conditions we may actually want to allow certain characters to appear embedded in components.  For example, suppose we want to allow limited embedded wildcards, namely, "?" (single-character wildcard) and "*" (multiple-character wildcard).  This is basically Option 1, but with PUNC extended to be:

PUNC = ( "-" / "." / "_" / "~" / "%" / EXTRA)
EXTRA = ("?" / "*" )

Would adding those EXTRA characters break anything?  Are there other characters we could add to EXTRA that would not break existing implementations?

Pros:  Ability to embed a selective set of EXTRA characters directly in components.
Cons:  Potential for v2.2 tools to choke on v2.3 names (more on this below).
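
To illustrate the compatibility concern, here is a sketch that checks a component against both character sets. The pattern names, the helper accepted_by, and the sample value "8.*" are hypothetical; the tilde is assumed to be ASCII "~".

```python
import re

# v2.2 STRING characters, and the same set extended with EXTRA = "?" / "*"
V22_STRING = re.compile(r"[A-Za-z0-9\-._~%]+")
OPTION2_STRING = re.compile(r"[A-Za-z0-9\-._~%?*]+")

def accepted_by(pattern: re.Pattern, component: str) -> bool:
    """Return True if the whole component is made of characters the pattern allows."""
    return pattern.fullmatch(component) is not None

# A component with no EXTRA characters is legal under both grammars.
print(accepted_by(V22_STRING, "8.5"), accepted_by(OPTION2_STRING, "8.5"))  # True True

# A component with an embedded wildcard is legal only under Option 2,
# so a strictly validating v2.2 tool would reject it.
print(accepted_by(V22_STRING, "8.*"), accepted_by(OPTION2_STRING, "8.*"))  # False True
```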

Option 3:  "So completely relaxed I could be on Percocet."  This option involves allowing virtually any special character to appear explicitly (unencoded) in the affected components.

Here the idea is to say, in v2.3 let's forget about that URI stuff.  Let's simply redefine a CPE name as a structured string with a syntax

STRING = +(UNRESERVED / SPECIAL)
UNRESERVED = LCALPHA / DIGIT / "-" / "." / "_" / "~" / "%"
SPECIAL = "/" / "?" / "#" / "[" / "]" / "@" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
RESERVED = ":" / "<" / ">" / DQUOTE
LCALPHA =  %x61-7A   ; a-z (lowercase only)
DIGIT =  %x30-39  ; 0-9
DQUOTE =  %x22  ; " (Double Quote)

Pros: This completely opens up the character set usable for affected name components.  Only the RESERVED characters would be disallowed.
Cons: V2.2 tools could choke on V2.3 names.  This is an operational problem.  How would we deal with this?  Any new names created under v2.3 either couldn't appear in a v2.2 dictionary, or we'd have to maintain two different versions of the dictionary, with names created under the more relaxed v2.3 rules mechanically transformed (read: percent-encoded) before being added to the v2.2 dictionary.  This also raises the question of the potential non-interoperability between v2.2 and v2.3 tools (e.g., a v2.2 vulnerability management tool interoperating with a v2.3 asset inventory tool).
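
The mechanical transformation mentioned above could look something like the following sketch. The function name to_v22_component is hypothetical, the SPECIAL set is taken from the grammar proposed under Option 3 (with the apostrophe assumed to be ASCII "'"), and emitting lowercase hex digits is an assumption.

```python
# SPECIAL characters from the Option 3 grammar.
SPECIAL = "/?#[]@!$&'()*+,;="

def to_v22_component(component: str) -> str:
    """Percent-encode each SPECIAL character and leave everything else untouched."""
    return "".join(
        f"%{ord(ch):02x}" if ch in SPECIAL else ch
        for ch in component
    )

print(to_v22_component("xerces-c++"))  # xerces-c%2b%2b
print(to_v22_component("3.3(1)"))      # 3.3%281%29
print(to_v22_component("8.5"))         # 8.5 (nothing to encode)
```

One caveat with this sketch: because "%" remains in UNRESERVED, a relaxed name that already contains a literal "%28" is passed through unchanged and becomes indistinguishable from an encoded "(", so the mapping is not reversible unless "%" is handled specially.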

Reactions?  Other options?

/Brant

 

Brant A. Cheikes
The MITRE Corporation
202 Burlington Road, M/S K302
Bedford, MA 01730-1420
Tel. 781-271-7505; Cell. 617-694-8180; Fax. 781-271-2352

 

From: Tim Keanini [mailto:[hidden email]]
Sent: Wednesday, April 14, 2010 11:57 AM
To: cpe-discussion-list CPE Community Forum
Subject: [CPE-DISCUSSION-LIST] URI Percent Encoding by default

 
