Consuming CybOX data


Consuming CybOX data

Patrick Henry
Hi,
  I am confused about the consumption of CybOX data files. The
standard talks about the use of CybOX to send data between tools, and
one of the use cases listed is "Attack Detection" - one mode of
operation here might be that a tool only produces CybOX observables
based on a list of rules or signatures in some other format, but it
seems that the preferred mode of operation is one in which an
automated tool consumes threat indicators in CybOX format, and uses
those to test for the presence of the indicators listed in the CybOX
elements.

Unfortunately, the data type definitions seem to be geared towards
human consumption of data rather than automated processing. Is
consumption by automated tools really supported?

An example of this human consumption bias is found in an
ObjectAttributeGroup with condition="FitsPattern" and
pattern_type="Regex" - the RegexSyntaxEnum lists 15 regex syntaxes
that could be used to parse the value_set attribute. Any tool seeking
to detect the presence of CybOX observables would need to understand
and implement all 15 syntaxes, or else be unable to detect observables
defined using a non-implemented syntax. A human reading this could
simply look up any unfamiliar syntax, but I am not aware of any tool
or library currently available that implements all of these syntaxes,
and any attempt to link to existing libraries would almost certainly
run into license incompatibilities. Mitigations such as translating
from one syntax to another will result in multiple GUIDs describing
the same indicator, which just makes everybody's job that much harder.
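
For instance, a consuming tool would need a dispatch step along these lines
(a rough Python sketch; the enum string and helper name are my own
illustration, not taken from the schema):

    import re

    def match_pcre_like(pattern, value):
        # Python's re module is close to, but not identical to, PCRE.
        return re.search(pattern, value) is not None

    # Only the syntaxes this tool actually implements; everything else is declined.
    SUPPORTED_SYNTAXES = {
        "PCRE": match_pcre_like,
    }

    def evaluate_fits_pattern(regex_syntax, pattern, observed_value):
        matcher = SUPPORTED_SYNTAXES.get(regex_syntax)
        if matcher is None:
            # We cannot safely interpret this pattern; better to report it
            # than to silently fail to detect the observable.
            raise NotImplementedError("unsupported regex syntax: " + regex_syntax)
        return matcher(pattern, observed_value)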

Another example is in string representation. Many data types are given
as Common:StringObjectAttributeType, a variant of
Common:BaseObjectAttributeType. But only Common:ExtractedStringType
includes an encoding attribute. Everything else must presumably be
re-encoded in whatever the encoding of the enclosing XML document is
(commonly UTF-8). A single observable might contain several
FileObj:FileObjectType elements, each with a File_Name in a different
code page as seen on different target systems, and the consumer of
this observable must somehow know what the original code page was in
each case, in order to perform the requisite matching.

Even within Common:ExtractedStringType, we see only three encodings:
"ANSI", "Unicode" and "Other" - the fact that none of these three is
actually a valid string encoding would probably not bother a human
consumer. Tools such as Python might take the stance that "ANSI" means
1 byte per character and "Unicode" means 2 bytes per character, but
this does not help in the interpretation of strings, or representation
of strings to the user.

The commonly used meaning of "ANSI" string encoding is whatever
happens to be the local DOS/Windows code page of the source machine,
which is fine if you know that source and all consumers share the same
code page, but useless otherwise. Python in particular is extremely
clunky when presented with text strings in a variety of code pages and
asked to represent them to a user - you have to write your own
re-encoding functions. For "Unicode" there are several incompatible
string encodings such as UTF-8 and UTF-16-LE - are we to assume that
any Unicode Common:ExtractedStringType element contains a BOM at the
start, and if so is this part of the extracted string, or was it added
after the extraction? String length is given in characters rather than
bytes, which might help one infer the true encoding scheme, but might
also be its own source of confusion.
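
The kind of re-encoding helper you end up writing yourself looks roughly like
this (a sketch only; the candidate code-page list is an assumption about the
source systems, and guessing like this is inherently heuristic):

    # Try a list of candidate code pages and normalize to a Python str.
    CANDIDATE_ENCODINGS = ["utf-8", "utf-16-le", "cp1252", "cp932"]

    def to_unicode(raw_bytes):
        for enc in CANDIDATE_ENCODINGS:
            try:
                return raw_bytes.decode(enc), enc
            except UnicodeDecodeError:
                continue
        # Last resort: latin-1 never fails, but flag the result as a guess.
        return raw_bytes.decode("latin-1"), "unknown"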

The standard also makes no mention of what to do when the encoding of
the Common:ExtractedStringType string does not match that of the
enclosing XML document, or when it contains special characters or
substrings that would disrupt XML parsing, such as '<', '"' or ']]>'.
The only safe way out of this "XML injection" problem would be to
re-encode as hexBinary or base64Binary, but that would necessitate the
use of Common:HexBinaryObjectAttributeType or
Common:Base64BinaryObjectAttributeType - otherwise the consumer would
not know whether to match on the string of encoded bytes, or decode
those bytes and then match.
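
The base64 escape hatch itself is simple enough (minimal Python sketch); the
real issue is that producer and consumer must agree that matching happens on
the decoded bytes rather than on the base64 text:

    import base64

    captured = b'<script>alert("x")</script>]]>'   # would break naive XML embedding
    wire_value = base64.b64encode(captured).decode("ascii")  # safe inside the document

    # Consumer side: decode first, then compare against bytes seen on the system.
    assert base64.b64decode(wire_value) == captured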

Even if all of these difficulties are resolved, I worry that there is
no upper bound on the complexity of the CybOX schemata, and an attack
detection tool would need to implement most of the schemata. Already
we have overlaps such as Common:LanguageTypeEnum and
CodeObj:CodeLanguageEnum (identical except for the inclusion of
"Other"), and CybOX was built with extensibility in mind. Isn't there
a point at which the complexity will exceed that of a general purpose
programming language? We might be better served by embedding a simple
language such as Lua, with a few libraries to probe OS and filesystem
APIs. The result could certainly be more readable, particularly when
we examine logical composition of operators.
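
Purely to illustrate what I mean by readable composition, here is the kind of
directly scripted check I am imagining, written in Python rather than Lua
(paths and hash value are made up):

    import hashlib
    import os

    KNOWN_BAD_SHA256 = "0" * 64    # placeholder, not a real indicator

    def file_sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def indicator_present():
        dropper = r"C:\Windows\Temp\dropper.exe"
        marker = r"C:\Windows\Temp\marker.dat"
        return ((os.path.exists(dropper)
                 and file_sha256(dropper) == KNOWN_BAD_SHA256)
                or os.path.exists(marker))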

I am very interested to hear the group's opinions on these issues.
Many thanks,

Patrick

Ponte Technologies LLC
[hidden email]
443-355-4535

RE: Consuming CybOX data

Aharon
We have learned a lot, not by analyzing the schemata, but through implementation of the language into our automation processes. The first thing we learned was that we are going to find problems with CybOX and STIX. The beauty of the existing approach is that I can pick up a phone or send an email to Mitre, and they resolve the issue in a matter of hours or days. Not that Mitre is perfect, but they are paid to support and maintain the language, which is cheaper for me since I don't have to pay to maintain it.

The regex example you give does pose a challenge, but it also shows the flexibility of the language. In the financial sector scenario, we already have sharing agreements in place, and another one of our sharing documents will at some point include supported STIX/CybOX functionality. I think there is already discussion on defining various levels of CybOX understanding within sharing communities and tools, which will also help this case. I don't think providing 15 ways to do a regex is a deal breaker, and if this is causing me a problem then I must be receiving a lot of structured threat data, which is a problem I want to have. Right now I am trying to get us from 0% to 75%, using a method that I know will get us to 99% when we have the capabilities.

Your encoding example could be completely valid; maybe it should be submitted as a bug report or feature request?

Aharon

DTCC Non-Confidential (White)
---------------------------------------------------
Michael "Aharon" Cherniin
Security Automation
DTCC Tampa
[hidden email]
Office: 813-470-2173



Re: Consuming CybOX data

Patrick Henry
Aharon,
  Thanks for the comments. When you look at CybOX in the context of
peer sharing agreements, it makes sense that some of the flexibility
(e.g. regex syntaxes) would be more tightly restricted by those
agreements than by the CybOX standard itself. I hadn't thought of it
like that - I was thinking more in terms of a public forum or
repository such as those springing up around OpenIOC. There you would
not have nearly as much control.

You make a good point that this problem only becomes a real concern
once you've achieved your goal of getting lots of observable patterns
from multiple sources, at which point it is perhaps worth the price.
This is assuming that you're writing all your own CybOX software of
course. The multiplicity of options is still a barrier to any
commercial tool that wishes to claim compliance to the standard.

Having thought some more about the encoding issue, I think that in
most cases the encoding itself is not a showstopper. If you see
multiple file paths in different codepages, simply convert each of
them to UTF-8 and write your CybOX observable pattern in UTF-8. Then
any test for that file path could take file paths seen on the system
and convert them to UTF-8 before doing a string/pattern match.
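
For example (a minimal Python sketch; the code page of each observed path is
something the producer has to know out of band, and the names here are made up):

    PATTERN = "\\Temp\\вирус.exe"      # authored as a Unicode/UTF-8 pattern

    def path_matches(observed_bytes, source_codepage):
        observed = observed_bytes.decode(source_codepage)   # e.g. "cp1251"
        return observed.endswith(PATTERN)

    path_matches("C:\\Windows\\Temp\\вирус.exe".encode("cp1251"), "cp1251")  # True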

I still think that the encoding in ExtractedStringsType does not make
sense, and that there needs to be some way to safely handle XML or
other markup fragments embedded in captured data - for example when
including an HTML formatted email, or trying to characterize a
specific malicious use of XMLRPC. I'll think on it some more and
perhaps submit a feature request.

Thanks,

Patrick

Ponte Technologies LLC



RE: Consuming CybOX data

Moreau, Dennis
This may not be in the scope of the original question (so feel free to ignore this observation), but I worry that conversion to UTF-8 for subsequent application of pattern matching would not be sufficient to address general recognition and comparison of URLs with representations from multiple different codepages, whenever character ranges, sorting or ordering are needed in the patterns.

My concern is that regexps that include any character range or ordering comparison would seem to presume a specific collation. Wouldn't I need to also track the original collation, applying the appropriate collation-specific regexp to that URL?
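
As a tiny Python illustration of the decoding side of this, the same source
byte falls inside or outside a character class depending on the code page it
is decoded with, so a range only means what the producer intended under one
interpretation:

    import re

    raw = b"\xe9"
    print(re.match(r"[a-z\u00e0-\u00ff]", raw.decode("cp1252")))  # 'é' -> matches
    print(re.match(r"[a-z\u00e0-\u00ff]", raw.decode("cp1251")))  # 'й' -> None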

I think that, in general, concurrently processing sets of strings from multiple collations seems to create significant headaches for databases (one collation per table), and in many other indexing, querying, sorting and clustering scenarios.

Dennis

Dennis R. Moreau, Ph.D.
RSA Security, Office of the CTO
Senior Technology Strategist
[hidden email]
719.964.0836



Re: Consuming CybOX data

Patrick Henry
Dennis,
  I've been thinking about how to solve this problem. The one source
of data validation that's not going away is the XML parser. We want to
be able to make use of existing parser libraries, so we have to
produce valid XML - in particular, if the XML document is encoded as
UTF-8 (as it often is) then all data within the document must be
UTF-8. If we know that the data was originally represented in a
different encoding, then perhaps an optional @originalEncoding
attribute would do the trick.
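
Something like this, perhaps (a Python sketch of the idea; the
originalEncoding attribute is just my proposal, it is not part of the schema
today):

    import xml.etree.ElementTree as ET

    original_bytes = "C:\\Temp\\файл.txt".encode("cp1251")   # as seen on the source box

    elem = ET.Element("File_Name", originalEncoding="windows-1251")
    elem.text = original_bytes.decode("cp1251")      # valid UTF-8 in the XML document

    # Consumer side: recover the bytes exactly as they appeared on the target system.
    recovered = elem.text.encode(elem.get("originalEncoding"))
    assert recovered == original_bytes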

What do you think?

Patrick

Ponte Technologies LLC
[hidden email]
443-355-4535



Re: Consuming CybOX data

Moreau, Dennis
This would certainly provide more explicit contextual cueing. Subsequent processing could then be parameterized, when appropriate, by that original encoding information to get coherent searching, sorting and comparison across alternative source encodings.

Note: I don't believe that there is a simple solution to the (non-CybOX) problem of consistent, comprehensive analysis across different original encodings of paths in the same forensic repository. We may well have to settle for partitioning some analyses into equivalence classes of syntactically consistent paths (i.e. paths from the same collation).

(If only all malware were written to interact consistently with an asset's internationalization context ... a standards opportunity if ever I saw one! :-) )

Dennis

