Data Collision handling - what are others doing?


Gruman, Francis (Frank)
We are trying to come up with a way to ensure that we are not duplicating too much data in our system and are finding it a bit difficult to work around data collisions.  In our case, a collision is where the same data values are passed in from different sources with different GUIDs.  Ultimately, we are trying to avoid duplicate data in our system as well as duplicating data in other systems.

Here is one scenario where we are running into this problem:

We have an agreement with ORG1 to receive some information sets.  They send us their information and we get something like IP:192.168.168.168 with their unique GUID.
We also have an agreement with ORG2 to receive some of their information.  They send us their information and we also get IP:192.168.168.168 with _their_ unique GUID.

These two organizations do not yet communicate with each other (but may, in the future).

One of the ways we have considered handling this situation in our system is to always generate our own GUID and keep track of the external GUID values from external sources.  Then we can send the external source's GUID back to them with our own reporting that ties into their data.  As a total system, however, we still run the risk of duplicate data.  If ORG1 and ORG2 eventually decide to share data then they now have their own data collisions.
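
For illustration, here is a minimal Python sketch of that bookkeeping: one internal GUID per distinct data value, with each source's external GUID kept alongside it. The `Registry` and `Record` names, the lower-casing used as normalization, and the one-GUID-per-source dict are assumptions made for the example, not part of CybOX or any existing tool.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Record:
    """One deduplicated data value plus every external ID it arrived under."""
    internal_id: str
    kind: str
    value: str
    # Maps a source name (e.g. "ORG1") to the GUID that source assigned.
    external_ids: dict = field(default_factory=dict)


class Registry:
    """Hypothetical in-memory store keyed on content, not on incoming GUIDs."""

    def __init__(self):
        self._by_content = {}  # (kind, normalized value) -> Record

    def ingest(self, source, external_id, kind, value):
        """Return the record for this value, creating it on first sight."""
        key = (kind, value.strip().lower())
        record = self._by_content.get(key)
        if record is None:
            # First time this value is seen: mint our own GUID for it.
            record = Record(str(uuid.uuid4()), kind, key[1])
            self._by_content[key] = record
        # Remember which GUID this source uses, so it can be sent back to them.
        record.external_ids[source] = external_id
        return record


registry = Registry()
a = registry.ingest("ORG1", "ORG1:guid-a54deb9c", "ip", "192.168.168.168")
b = registry.ingest("ORG2", "ORG2:guid-9d4e9844", "ip", "192.168.168.168")
print(a is b, sorted(a.external_ids))
```

Both reports resolve to the same internal record, and the source-specific GUIDs stay available for reporting back to each organization. It does nothing about the cross-organization case described above, which is what the schema question below is about.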

POSSIBLE SOLUTION:
Would it be possible to extend the CybOX schema to support "also-seen-as" references with GUID and date attributes?  As an example, I have extended a File object with a MultipleReferenceType that I have created in my mind.

<cybox:Object id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7" type="File">
                <cybox:Description>
                    <common:Text>The word document contains flash, which downloads a corrupted mp4
                        file. The mp4 file itself is not anything special but an 0C filled (22kb)
                        mp4 file with a valid mp4 header.</common:Text>
                </cybox:Description>
                <cybox:Defined_Object xsi:type="FileObj:FileObjectType">
                    <FileObj:File_Name datatype="String">Iran's Oil and Nuclear
                        Situation.doc</FileObj:File_Name>
                    <FileObj:Size_In_Bytes datatype="UnsignedLong">106604</FileObj:Size_In_Bytes>
                    <FileObj:Hashes>
                        <common:Hash>
                            <common:Type datatype="String">MD5</common:Type>
                            <common:Simple_Hash_Value condition="Equals" datatype="hexBinary"
                                >E92A4FC283EB2802AD6D0E24C7FCC857</common:Simple_Hash_Value>
                        </common:Hash>
                    </FileObj:Hashes>
                </cybox:Defined_Object>
                <common:MultipleReferenceType id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce" firstObserved="03-MAR-2012 11:21:12345" />
                <common:MultipleReferenceType id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb" firstObserved="21-OCT-2011 15:32:54321" />
            </cybox:Object>

This would help to 1) support independence of individual data owners while 2) working to minimize duplication of data elements.

Any and all feedback here would be appreciated.  If extending the schema is not the answer, then how are others working on deduplication (if at all)?

Regards,
Frank Gruman
Systems Engineer, DC3 DCCI, contractor
410-981-1142 (work)
NIPR: [hidden email]
SIPR: [hidden email]



Re: Data Collision handling - what are others doing?

John Howie
Hi Frank,

One situation you will likely run into is where the data received,
although the same as other data received, pertains to two or more
completely different threats/events/incidents. Whether or not you should
correlate the reports should probably be determined by other
data, especially when considerable time has passed between the reports.
That does not prevent you from recording also-seen-as references, but no
inference should be automatically made about the association.
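
As a rough Python sketch of that distinction: the cross-reference is always recorded, while the time gap only produces a hint for a human reviewer and never merges anything. The `cross_reference` function, its field names, and the 30-day threshold are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def cross_reference(first_seen_a, first_seen_b, max_gap=timedelta(days=30)):
    """Describe how two reports of the same value relate, without merging them."""
    gap = abs(first_seen_a - first_seen_b)
    return {
        # The also-seen-as reference itself is always kept.
        "also_seen_as": True,
        "gap_days": gap.days,
        # Only a hint for an analyst; never an automatic association.
        "review_hint": "recent" if gap <= max_gap else "stale",
    }


print(cross_reference(datetime(2012, 3, 3), datetime(2011, 10, 21)))
```

Whether two sightings belong to the same threat, event, or incident would still have to come from other data.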

Regards,

John


On 3/4/13 10:23 AM, "Gruman, Francis (Frank) Contractor DC3"
<[hidden email]> wrote:

RE: Data Collision handling - what are others doing?

Back, Greg
I agree with John.

I'm working through similar issues developing a tool for storing and sharing
CybOX content. Within CybOX, I believe that uniqueness of objects (IP,
Domain, etc.) should be handled by direct comparison of the content, not
trying to match IDs. For files, uniqueness is based on hashes (which can be
a bit more difficult since different organizations may report different
hashes). It gets even harder for more complex types, but those cover 95% of
what I'm interested in.

I plan to use my own internal unique identifier to track that object in my
system, and track each "occurrence" of the object separately (including each
source organization that has reported that object, along with the ID they
refer to it by). These "occurrences" are tracked only within the system, and
not (yet) shared with others.

My gut is telling me that coordinating and communicating "occurrences" is
better left to a standard like STIX, but I'm admittedly much less familiar
with STIX than CybOX. I would only use the object GUIDs to determine whether
I'd seen the same object from the same organization before, and thus not add
a new "occurrence" if nothing else has changed.
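
A minimal Python sketch of that scheme, assuming a normalized content string is available as the match key. The `ObjectStore` and `Occurrence` names and the integer internal IDs are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Occurrence:
    source: str        # reporting organization
    source_id: str     # the ID that organization uses for the object
    content_key: str   # normalized content the match is based on


@dataclass
class ObjectStore:
    next_id: int = 1
    ids: dict = field(default_factory=dict)        # content_key -> internal ID
    occurrences: set = field(default_factory=set)  # every distinct sighting

    def report(self, source, source_id, content_key):
        """Record a sighting; return (internal_id, is_new_occurrence)."""
        # Uniqueness of the object is decided by content, not by the sender's ID.
        if content_key not in self.ids:
            self.ids[content_key] = self.next_id
            self.next_id += 1
        # The sender's ID only answers "have I seen this from them before?".
        occurrence = Occurrence(source, source_id, content_key)
        is_new = occurrence not in self.occurrences
        self.occurrences.add(occurrence)
        return self.ids[content_key], is_new


store = ObjectStore()
print(store.report("ORG1", "guid-1", "ip:192.168.168.168"))
print(store.report("ORG1", "guid-1", "ip:192.168.168.168"))
print(store.report("ORG2", "guid-2", "ip:192.168.168.168"))
```

Because the content key is part of the occurrence, a repeat report with the same source and ID but changed content is stored as a new occurrence.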

Just my $.02. I have a feeling that as more organizations start sharing
CybOX and STIX content, these issues will become very apparent.

Thanks for your feedback.

Greg


>-----Original Message-----
>From: [hidden email] [mailto:owner-cybox-
>[hidden email]] On Behalf Of John Howie
>Sent: Monday, March 04, 2013 2:55 PM
>To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
>Observable Expression/CybOX Discussi
>Subject: Re: Data Collision handling - what are others doing?

RE: Data Collision handling - what are others doing?

Gruman, Francis (Frank)
Perhaps I should have sent this to the STIX discussion list as well.  If
anyone is already on both lists, would you mind throwing it over that fence?
I will sign up soon...

I am not trying to overcome unique identification within each organization's
system.  We, too, are assessing uniqueness of an object based on the
object's data elements.  So we are planning on using it as a hint to say that
we think the object you sent us is "X".  If it is not "X" then we write it
out as "Y".  But it seems to me that overlooking GUID values is essentially
ignoring the purpose of the "Globally Unique" part of the value in the first
place.

To John's example of a File, it is possible to have multiple files with the
same hash.  So the hash by itself is not unique enough.  Multiple File
objects can point to the same hash value.  So you send us your file with
your hash and, even if we have seen the hash before, we will create a new
File object.  I think that is something all of our systems should take into
account.

I guess my point in suggesting the "MultipleReferenceType" or "Also-Seen-As"
is to enable sharing of the additional references we get from others (if
allowed to do so).  In our case, internally we put security caveats on the
additional reference that matches the originally sourced data.  If the sharing
scheme is adopted, we would only send out to those who have the authority to
view the additional references.  This would enable Greg and me to share
elements with each other that might have been seen as (possibly) something
else.

Thanks for the feedback, guys.

Regards,
Frank

-----Original Message-----
From: Back, Greg [mailto:[hidden email]]
Sent: Thursday, March 07, 2013 1:00 PM
To: John Howie; Gruman, Francis (Frank) Contractor DC3;
cybox-discussion-list Cyber Observable Expression/CybOX Discussi
Subject: RE: Data Collision handling - what are others doing?


Re: Data Collision handling - what are others doing?

Osorno, Marcos
Hi Frank,

I've been using a combination of meaningful attributes for UIDs. Please
forgive the cross-post, this is what I expressed on the IDXWG mailing list
(slightly edited):

"I should clarify more of what I find to be the problem. I think maybe we
are conflating (myself included) the concepts of fingerprint, globally
unique ID, and name which seem like a tradeoff between:

* Uniqueness - how unique is it within the system, globally, or given a
set of pre-conditions about the entity being identified
* Meaningfulness - how meaningful is it to a machine or human attempting
to make sense of the ID
* Calculability - how hard is it to calculate
* Repeatability - how consistent is the ID

I'm curious to find IDs more descriptive than "firefox.exe" but less
opaque than GUIDs. More akin to names, these IDs would make fewer
guarantees of being globally unique for the benefit of being more
meaningful. It has been interesting to see how URLs have moved
from: site.whatever/this.aspx?barf=32234234242342&this=1&ord=asc to things
like site.whatever/catalog/pants/fancy/1234 because the 2nd URL is more
meaningful. Granted, that takes the domain name system to work which
requires registration and routing both of which are susceptible to attacks.

At any rate, I'd be curious to hear about best practices in generating
meaningful names in addition to what we know about GUIDs."

File hashes are a good example. I think a file hash is a valid UID for the
abstract hash of that sequence of bytes representing theoretical code on a
computer. However, I think it's a bad UID for an instance of that file on
a particular system. I think from an ontological perspective the abstract
hash and the concrete file are two separate, but related entities. So, my
watch list has an observable entity with the hash as both the UID and an
attribute, but a file has a UID that consists of the hash, the major
device, the minor device, and the inode (for Linux systems) to read
HASH-MAJOR-MINOR-INODE. This entity also has the hash, but as an
attribute, not as a UID. What I'm working on is using the RelatedObjects
types to associate the two. I'm also then working on linking the file
instance to a machine whose UID is a combination of the uname information,
MAC addresses, and a guess at the OS install date. I'm staying away from
GUIDs since I find them not meaningful at all and in my mind they are like
arbitrary internal database index keys which should only be very minimally
exposed. My approach has its own issues in terms of theoretical
collisions, but I prefer to have meaningful labels even if it does mean
there is some chance of collision. I reduce the chance of collision by
associating that object with a few other observables so that it's not just
represented by a hash alone.
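
As a rough illustration of the HASH-MAJOR-MINOR-INODE label, here is a minimal Python sketch. The function names and the SHA-256 default are assumptions made for the example, and `os.major` / `os.minor` are only available on Unix-like systems.

```python
import hashlib
import os


def file_hash(path, algorithm="sha256"):
    """Hash the file's bytes; this identifies the abstract content only."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def file_instance_uid(path):
    """Build a HASH-MAJOR-MINOR-INODE label for one file on one system."""
    info = os.stat(path)
    return "-".join([
        file_hash(path),
        str(os.major(info.st_dev)),  # major device number
        str(os.minor(info.st_dev)),  # minor device number
        str(info.st_ino),            # inode number
    ])
```

Two copies of the same bytes on one filesystem share the hash but get different labels, since their inodes differ. Hard links to the same inode would still share a label, which is one of the theoretical collisions mentioned above.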

Sincerely,

Marcos


On 3/14/13 12:58 AM, "Gruman, Francis (Frank) Contractor DC3"
<[hidden email]> wrote:

>>>POSSIBLE SOLUTION:
>>>Would it be possible to extend the CybOX schema to support
>>>"also-seen-as"
>>>references with GUID and date attributes?  As an example, I have
>>>extended
>>>a File object with a  MultipleReferenceType that I have created in my
>>>mind.
>>>
>>><cybox:Object id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7"
>>>type="File">
>>>                <cybox:Description>
>>>                    <common:Text>The word document contains flash, which
>>>downloads a corrupted mp4
>>>                        file. The mp4 file itself is not anything
>>>special
>>>but an 0C filled (22kb)
>>>                        mp4 file with a valid mp4 header.</common:Text>
>>>                </cybox:Description>
>>>                <cybox:Defined_Object xsi:type="FileObj:FileObjectType">
>>>                    <FileObj:File_Name datatype="String">Iran's Oil and
>>>Nuclear
>>>                        Situation.doc</FileObj:File_Name>
>>>                    <FileObj:Size_In_Bytes
>>>datatype="UnsignedLong">106604</FileObj:Size_In_Bytes>
>>>                    <FileObj:Hashes>
>>>                        <common:Hash>
>>>                            <common:Type
>>>datatype="String">MD5</common:Type>
>>>                            <common:Simple_Hash_Value condition="Equals"
>>>datatype="hexBinary"
>>>
>>>>E92A4FC283EB2802AD6D0E24C7FCC857</common:Simple_Hash_Value>
>>>                        </common:Hash>
>>>                    </FileObj:Hashes>
>>>                </cybox:Defined_Object>
>>> <common:MultipleReferenceType
>>>id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce"
>>>firstObserved="03-MAR-2012 11:21:12345" />
>>> <common:MultipleReferenceType
>>>id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb"
>>>firstObserved="21-OCT-2011 15:32:54321" />
>>>            </cybox:Object>
>>>
>>>This would help to 1) support independence of individual data owners
>>>while 2) working to minimize duplication of data elements.
>>>
>>>Any and all feedback here would be appreciated.  If extending the schema
>>>is not the answer then how are others working on deduplication (if at
>>>all).
>>>
>>>Regards,
>>>Frank Gruman
>>>Systems Engineer, DC3 DCCI, contractor
>>>410-981-1142 (work)
>>>NIPR: [hidden email]
>>>SIPR: [hidden email]
>>>
>>>
>>>
>>>**********************************************************
>>************
>>>This email and any files transmitted with it are confidential and
>>>intended solely for the use of the individual or entity to whom they
>>>are addressed. If you have received this email in error please notify
>>>the system manager.
>>>
>>>Scanned by the Clearswift SECURE Email Gateway.
>>>
>>>**********************************************************
>>************


RE: Data Collision handling - what are others doing?

Back, Greg
In reply to this post by Gruman, Francis (Frank)
Hmmm, Frank brings up an interesting point...

Assume I work for Organization A, Frank works for Organization B and John works for Organization C, and we are discussing a FileObject X which we universally agree represents the same "thing" (if we don't agree on what constitutes its uniqueness, it gets a lot more complicated).

If we each discover the file independently (rather than learning about it ONLY from another), then we might give it the IDs A-X, B-X, and C-X respectively (where each is actually a random GUID, so we can't correlate on the IDs themselves; instead we determine that it is the same file via a hash or similar). If Frank shares a CybOX/STIX document containing B-X, I can correlate to my A-X and say "Frank also saw this, and he's calling it B-X". My interpretation of what Frank is suggesting is that I could send something to John which says "I've seen this thing I'm calling A-X, and by the way, Frank calls it B-X", and John could add both A-X and B-X to his "also-known-as" list for C-X.
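
For illustration only, the forwarding of "also-known-as" names described above could be sketched in Python roughly as follows. Nothing here is part of CybOX or STIX: `AliasRegistry`, its methods, and the organization labels are hypothetical names, and the content key is assumed to be the MD5 from Frank's example.

```python
from collections import defaultdict


class AliasRegistry:
    """Hypothetical per-organization store of 'also-known-as' names.

    Objects are keyed by their content (here a file hash), and each key maps
    to the set of (organization, id) pairs under which it has been reported.
    """

    def __init__(self):
        self._aliases = defaultdict(set)

    def record(self, content_key, org, object_id):
        # Remember that `org` refers to this content as `object_id`.
        self._aliases[content_key].add((org, object_id))

    def merge_shared(self, content_key, shared_aliases):
        # Fold in an also-known-as list received from another organization.
        for org, object_id in shared_aliases:
            self.record(content_key, org, object_id)

    def also_known_as(self, content_key, exclude_org=None):
        # Sorted list of known names, optionally leaving out one organization.
        known = self._aliases.get(content_key, set())
        return sorted(alias for alias in known if alias[0] != exclude_org)


MD5 = "E92A4FC283EB2802AD6D0E24C7FCC857"

# Greg (Organization A) has his own name and has learned Frank's name.
greg = AliasRegistry()
greg.record(MD5, "A", "A-X")
greg.record(MD5, "B", "B-X")

# John (Organization C) has only his own name until Greg shares his list.
john = AliasRegistry()
john.record(MD5, "C", "C-X")
john.merge_shared(MD5, greg.also_known_as(MD5))

print(john.also_known_as(MD5, exclude_org="C"))
# [('A', 'A-X'), ('B', 'B-X')]
```

The sketch matches on content first and treats the IDs purely as labels attached afterwards; whether a given alias may be forwarded at all would still depend on the sharing agreements in place.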

Assuming all the sharing agreements are in place to permit sharing the "also-known-as" lists, it seems like a less complex solution would be for Frank to share the name B-X with John directly. Are there use cases where this direct sharing between Frank and John would not be allowed, but sharing "through" me would be? On a related note, TAXII is a standard for sharing and communicating threat information (as STIX documents), and in my opinion it handles multi-party sharing pretty well.

Thanks for the discussion!

Greg

>-----Original Message-----
>From: [hidden email] [mailto:owner-cybox-
>[hidden email]] On Behalf Of Gruman, Francis (Frank)
>Contractor DC3
>Sent: Wednesday, March 13, 2013 9:58 AM
>To: cybox-discussion-list Cyber Observable Expression/CybOX Discussi
>Subject: RE: Data Collision handling - what are others doing?
>
>Perhaps I should have sent this to the STIX discussion, too.  If anyone
>is already on both lists, would you mind throwing it over that fence, too?  I
>will sign up soon...
>
>I am not trying to overcome unique identification within each organization's
>system.  We, too, are assessing uniqueness of an object based on the
>object's data elements.  So we are planning on using it as a hint to say that
>we think the object you sent us is "X".  If it is not "X" then we write it
>out as "Y".
>But it seems to me that overlooking GUID values is essentially ignoring the
>purpose of the "Globally Unique" part of the value in the first place.
>
>To John's example of a File, it is possible to have multiple files with the
>same hash.  So the hash by itself is not unique enough.  Multiple File
>objects can point to the same hash value.  So you send us your file with
>your hash and, even if we have seen the hash before, we will create a new
>File object.  I think that is something all of our systems should take into
>account.
>
>I guess my point in suggesting the "MultipleReferenceType" or
>"Also-Seen-As" is to enable sharing of the additional references we get from
>others (if allowed to do so).  In our case, internally we put security
>caveats on the additional reference that matches the originally sourced
>data.  If the sharing scheme is adopted, we would only send out to those who
>have the authority to view the additional references.  This would enable
>Greg and me to share elements with each other that might have been seen as
>(possibly) something else.
>
>Thanks for the feedback, guys.
>
>Regards,
>Frank
>
>-----Original Message-----
>From: Back, Greg [mailto:[hidden email]]
>Sent: Thursday, March 07, 2013 1:00 PM
>To: John Howie; Gruman, Francis (Frank) Contractor DC3;
>cybox-discussion-list Cyber Observable Expression/CybOX Discussi
>Subject: RE: Data Collision handling - what are others doing?
>
>I agree with John.
>
>I'm working through similar issues developing a tool for storing and sharing
>CybOX content. Within CybOX, I believe that uniqueness of objects (IP,
>Domain, etc.) should be handled by direct comparison of the content, not
>trying to match IDs. For files, uniqueness is based on hashes (which can be
>a bit more difficult since different organizations may report different
>hashes). It gets even harder for more complex types, but those cover 95% of
>what I'm interested in.
>
>I plan to use my own internal unique identifier to track that object in my
>system, and track each "occurrence" of the object separately (including each
>source organization that has reported that object, along with the ID they
>refer to it by). These "occurrences" are tracked only within the system, and
>not (yet) shared with others.
>
>My gut is telling me that coordinating and communicating "occurrences" is
>better left to a standard like STIX, but I'm admittedly much less familiar
>with STIX than CybOX. I would only use the object GUIDs to determine whether
>I'd seen the same object from the same organization before, and thus not add
>a new "occurrence" if nothing else has changed.
>
>Just my $.02. I have a feeling that as more organizations start sharing
>CybOX and STIX content, these issues will become very apparent.
>
>Thanks for your feedback.
>
>Greg

RE: Data Collision handling - what are others doing?

Back, Greg
In reply to this post by Osorno, Marcos
Marcos-

I think what you're hitting on is the distinction between an "instance" of an observable (in your example, a particular file at a particular place on a particular drive) and an observable "pattern" (in your example, the hash itself, regardless of where it appears). This distinction is not clearly made by the CybOX standard, but for individual use cases (particularly when including Observables in STIX Indicators, which is where I typically use CybOX) keeping this distinction is crucial.

I also agree that randomly generated GUIDs are useful only for (global) uniqueness, and do not provide any real meaning. For this reason, I've been using them only to represent relationships within the same document (to link observables), and to identify an "instance" of a STIX document so that recipients do not process the same document more than once. GUIDs can be useful as primary keys (or unique constraints) within databases, but as you mentioned, it's important to determine what you're calling "unique" (an instance or a pattern).
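
As a rough illustration of the instance/pattern distinction, the Python sketch below builds one key for the pattern (the hash alone) and one for an instance, borrowing the HASH-MAJOR-MINOR-INODE composite from Marcos's message quoted below. The function names are hypothetical, MD5 is assumed only because it is the hash used elsewhere in this thread, and `instance_uid_for_path` relies on `os.major`/`os.minor`, which are available on Unix-like systems only.

```python
import hashlib
import os


def pattern_uid(data: bytes) -> str:
    # Pattern key: identifies the byte sequence, wherever it appears.
    return hashlib.md5(data).hexdigest().upper()


def instance_uid(file_hash: str, major: int, minor: int, inode: int) -> str:
    # Instance key: the same hash, qualified by where the file lives.
    return f"{file_hash}-{major}-{minor}-{inode}"


def instance_uid_for_path(path: str) -> str:
    # Unix-only helper (assumption): derive the instance key from a real file.
    with open(path, "rb") as handle:
        file_hash = pattern_uid(handle.read())
    info = os.stat(path)
    return instance_uid(
        file_hash, os.major(info.st_dev), os.minor(info.st_dev), info.st_ino
    )


print(instance_uid("E92A4FC283EB2802AD6D0E24C7FCC857", 8, 1, 42))
# E92A4FC283EB2802AD6D0E24C7FCC857-8-1-42
```

Two copies of the same file would share a `pattern_uid` but get different instance keys, which is the property a database would need when deciding what its unique constraint actually covers.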

Thanks,
Greg

>-----Original Message-----
>From: [hidden email] [mailto:owner-cybox-
>[hidden email]] On Behalf Of Osorno, Marcos
>Sent: Wednesday, March 13, 2013 9:09 PM
>To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
>Observable Expression/CybOX Discussi
>Subject: Re: Data Collision handling - what are others doing?
>
>Hi Frank,
>
>I've been using a combination of meaningful attributes for UIDs. Please
>forgive the cross-post, this is what I expressed on the IDXWG mailing list
>(slightly edited):
>
>"I should clarify more of what I find to be the problem. I think maybe we
>are conflating (myself included) the concepts of fingerprint, globally
>unique ID, and name which seem like a tradeoff between:
>
>* Uniqueness - how unique is it within the system, globally, or given a
>set of pre-conditions about the entity being identified
>* Meaningfulness - how meaningful is it to a machine or human attempting
>to make sense of the ID
>* Calculability - how hard is it to calculate
>* Repeatability - how consistent is the ID
>
>I'm curious to find IDs more descriptive than "firefox.exe" but less
>opaque than GUIDs. More akin to names, these IDs would make fewer
>guarantees of being globally unique for the benefit of being more
>meaningful. It has been interesting to see how URLs have moved from
>site.whatever/this.aspx?barf=32234234242342&this=1&ord=asc to things like
>site.whatever/catalog/pants/fancy/1234 because the second URL is more
>meaningful. Granted, that takes the domain name system to work, which
>requires registration and routing, both of which are susceptible to attacks.
>
>At any rate, I'd be curious to hear about best practices in generating
>meaningful names in addition to what we know about GUIDs."
>
>File hashes are a good example. I think a file hash is a valid UID for the
>abstract hash of that sequence of bytes representing theoretical code on a
>computer. However, I think it's a bad UID for an instance of that file on
>a particular system. I think from an ontological perspective the abstract
>hash and the concrete file are two separate, but related entities. So, my
>watch list has an observable entity with the hash as both the UID and an
>attribute, but a file has a UID that consists of the hash, the major
>device, the minor device, and the inode (for Linux systems) to read
>HASH-MAJOR-MINOR-INODE. This entity also has the hash, but as an
>attribute, not as a UID. What I'm working on is using the RelatedObjects
>types to associate the two. I'm also then working on linking the file
>instance to a machine whose UID is a combination of the uname information,
>MAC addresses, and a guess at the OS install date. I'm staying away from
>GUIDs since I find them not meaningful at all and in my mind they are like
>arbitrary internal database index keys which should only be very minimally
>exposed. My approach has its own issues in terms of theoretical
>collisions, but I prefer to have meaningful labels even if it does mean
>there is some chance of collision. I reduce the chance of collision by
>associating that object with a few other observables so that it's not just
>represented by a hash alone.
>
>Sincerely,
>
>Marcos

RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Gruman, Francis (Frank)
Sorry for the delayed response. I am still trying to catch up on a huge pile
of email traffic.

Comments inline below

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Gruman,
Francis (Frank) Contractor DC3
Sent: Monday, March 04, 2013 1:23 PM
To: cybox-discussion-list Cyber Observable Expression/CybOX Discussi
Subject: Data Collision handling - what are others doing?

We are trying to come up with a way to ensure that we are not duplicating too
much data in our system and are finding it a bit difficult to work around
data collisions.  In our case, a collision is where the same data values are
passed in from different sources with different GUIDs.  Ultimately, we are
trying to avoid duplicate data in our system as well as duplicating data in
other systems.

[Barnum, Sean D.] Just for clarity, are you trying to avoid duplicate data
for space reasons, or are you really just saying that you want to clearly
understand where data from multiple sources is the same? If it is for space
reasons, I would be curious about the driver behind the concern, as most orgs
we have been talking with are more concerned about capturing everything
(given the low cost of storage today) and carefully understanding
correlations than about compressing things down and potentially losing some
context.

Here is one scenario where we are running into this problem:

We have an agreement with ORG1 to receive some information sets.  They send
us their information and we get something like IP:192.168.168.168 with their
unique GUID.
We also have an agreement with ORG2 to receive some of their information.
They send us their information and we also get IP:192.168.168.168 with
_their_ unique GUID.

These two organizations do not yet communicate with each other (but may, in
the future).

One of the ways we have considered handling this situation in our system is
to always generate our own GUID and keep track of the external GUID values
from external sources.  Then we can send the external source's GUID back to
them with our own reporting that ties into their data.

[Barnum, Sean D.] This is definitely the consensus approach that we have
been hearing from the community.

As a total system, however, we still run the risk of duplicate data.  If
ORG1 and ORG2 eventually decide to share data then they now have their own
data collisions.

[Barnum, Sean D.] With this approach you will likely be storing some data
multiple times, but you will have a clear understanding of the data, who has
reported it, and how to dereference it back to the source. Again, is space
the concern here?
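
A minimal Python sketch of that approach (generate your own GUID, match on content, and keep each source's GUID alongside it) might look like the following. `StoredObject`, `ObjectStore`, and the external IDs are hypothetical names invented for the example, not anything defined by CybOX.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    internal_id: str                  # our own GUID for this content
    content_key: tuple                # e.g. ("IP", "192.168.168.168")
    external_refs: dict = field(default_factory=dict)  # source -> set of their IDs


class ObjectStore:
    def __init__(self):
        self._by_content = {}

    def ingest(self, source, external_id, object_type, value):
        # Match on content, never on the incoming GUID.
        key = (object_type, value)
        obj = self._by_content.get(key)
        if obj is None:
            obj = StoredObject(internal_id=f"guid-{uuid.uuid4()}", content_key=key)
            self._by_content[key] = obj
        # Keep the source's own ID so reports can be sent back in its terms.
        obj.external_refs.setdefault(source, set()).add(external_id)
        return obj

    def ids_for(self, source, object_type, value):
        # The IDs a given source uses for this content, if any.
        obj = self._by_content.get((object_type, value))
        return sorted(obj.external_refs.get(source, ())) if obj else []


store = ObjectStore()
first = store.ingest("ORG1", "ORG1:guid-1111", "IP", "192.168.168.168")
second = store.ingest("ORG2", "ORG2:guid-2222", "IP", "192.168.168.168")

print(first.internal_id == second.internal_id)            # True
print(store.ids_for("ORG1", "IP", "192.168.168.168"))     # ['ORG1:guid-1111']
```

A shared record here only says that the values match. As John points out earlier in the thread, it should not by itself be taken to mean the two reports concern the same threat, event, or incident.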

POSSIBLE SOLUTION:
Would it be possible to extend the CybOX schema to support "also-seen-as"
references with GUID and date attributes?  As an example, I have extended a
File object with a  MultipleReferenceType that I have created in my mind.

<cybox:Object id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7" type="File">
    <cybox:Description>
        <common:Text>The word document contains flash, which downloads a corrupted mp4
            file. The mp4 file itself is not anything special but an 0C filled (22kb)
            mp4 file with a valid mp4 header.</common:Text>
    </cybox:Description>
    <cybox:Defined_Object xsi:type="FileObj:FileObjectType">
        <FileObj:File_Name datatype="String">Iran's Oil and Nuclear Situation.doc</FileObj:File_Name>
        <FileObj:Size_In_Bytes datatype="UnsignedLong">106604</FileObj:Size_In_Bytes>
        <FileObj:Hashes>
            <common:Hash>
                <common:Type datatype="String">MD5</common:Type>
                <common:Simple_Hash_Value condition="Equals" datatype="hexBinary">E92A4FC283EB2802AD6D0E24C7FCC857</common:Simple_Hash_Value>
            </common:Hash>
        </FileObj:Hashes>
    </cybox:Defined_Object>
    <common:MultipleReferenceType id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce" firstObserved="03-MAR-2012 11:21:12345" />
    <common:MultipleReferenceType id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb" firstObserved="21-OCT-2011 15:32:54321" />
</cybox:Object>

[Barnum, Sean D.] Well, the approach currently taken in CybOX to support
this use case (at least what I think you are saying) of correlating multiple
objects is to use the Related_Objects structure to reference the other
objects using the idref attribute (yielding something very similar to your
example above). The advantage of this approach is that it uses the same
structure as all other object correlations and thus avoids adding new
complexity. I see two potential downsides currently. The first is easily
overcome: it does not look like we currently have an entry in
ObjectRelationshipEnum for something like 'asserted_same_as', but we could
easily add one. The other potential downside is that for references to
Objects not defined in the system you would not have the associated date
captured. For Objects defined in the system you would have this info within
the Observable_Source structure. The currently supported capability appears
very similar to what you are describing here with the above caveats. Do you
believe that the current capability is insufficient for what you are looking
to do?
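
A minimal sketch of the reference-based approach described in the comment above: each object keeps its own ID and points at other objects by ID with a relationship label, rather than gaining a new reference element. This is plain Python, not the CybOX schema or any CybOX library. The class names, the `first_observed` field, and the use of `asserted_same_as` as a label (which, as noted, is not currently in ObjectRelationshipEnum) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelatedRef:
    """A by-reference link to another object (hypothetical model)."""
    idref: str                            # ID of the referenced object
    relationship: str                     # e.g. "asserted_same_as" (assumed label)
    first_observed: Optional[str] = None  # unknown for objects defined elsewhere


@dataclass
class ObservedObject:
    """Local object that keeps its own ID and lists related objects."""
    id: str
    related: List[RelatedRef] = field(default_factory=list)

    def assert_same_as(self, idref: str, first_observed: Optional[str] = None) -> None:
        # Skip a reference that is already recorded.
        if any(r.idref == idref and r.relationship == "asserted_same_as"
               for r in self.related):
            return
        self.related.append(RelatedRef(idref, "asserted_same_as", first_observed))


local = ObservedObject("cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7")
local.assert_same_as("ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce")
local.assert_same_as("ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb")
local.assert_same_as("ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce")  # repeat is ignored
```

The missing date for externally defined objects shows up here as `first_observed` staying `None`, which mirrors the second downside mentioned above.
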
[Barnum, Sean D.] The other thing to keep in mind is that the intent of STIX
is that this sort of information sharing of observables would appropriately
be done within STIX-Indicators rather than just as raw observables. The
Indicators enable sharing of the observable pattern of what is relevant
(sharing a bunch of observed object instance data seems far less likely) and
the context around it. This gives you not only the other valuable context
(e.g. valid time window, confidence, indicated TTP (meaning), handling
guidance, etc.) but gives two things I believe are very relevant to the use
case you describe above. First, there is a RelatedIndicators construct that
allows the assertion of correlation between indicators (similar to the sort
of thing described above at the Object level) with a descriptor attribute
for the nature of the relationship (e.g. 'asserted_same_as' or
'also_seen_as') and the ability to specify Confidence in the asserted
correlation relationship (this can be very powerful for this sort of
correlation where things are not always black and white). Second, the
Indicator construct provides a Sightings structure such that if you identify
an Indicator within your context but receive via sharing what you believe to
be an identical Indicator before you have shared yours you can simply report
a sighting of the Indicator that was shared with you rather than sending out
an unnecessary complete duplicate. If you have something extra to add to the
shared Indicator you could share your own superset Indicator but specify
that it is related to the original and specify some sort of extends
relationship.
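
To make the sighting-versus-duplicate idea above a little more concrete, here is a rough Python sketch. It is not the STIX Indicator schema or any STIX API; the `Indicator` class, the string `pattern` field, the `extends` field, and the indicator IDs are all hypothetical stand-ins, and equality of patterns is reduced to simple string comparison.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Indicator:
    id: str
    pattern: str                                        # simplified stand-in for the observable pattern
    sightings: List[str] = field(default_factory=list)  # who reported seeing it
    extends: Optional[str] = None                       # ID of the indicator this one extends


def handle_local_finding(shared: Dict[str, Indicator], local: Indicator, reporter: str) -> str:
    """Return "sighting" if an identical shared indicator exists, else "new"."""
    for known in shared.values():
        if known.pattern == local.pattern:
            known.sightings.append(reporter)  # report a sighting, do not duplicate
            return "sighting"
    shared[local.id] = local                  # nothing equivalent was shared: publish ours
    return "new"


shared = {"ORG1:indicator-1": Indicator("ORG1:indicator-1", "ip == 192.168.168.168")}
same = Indicator("US:indicator-9", "ip == 192.168.168.168")
first = handle_local_finding(shared, same, reporter="US")       # identical: sighting only
superset = Indicator("US:indicator-10", "ip == 192.168.168.168 and port == 443",
                     extends="ORG1:indicator-1")
second = handle_local_finding(shared, superset, reporter="US")  # adds detail: shared as new
```

In practice the decision that two indicators are "identical" would need context and confidence, not a string match; the sketch only shows the bookkeeping.
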

This would help to 1) support independence of individual data owners while
2) working to minimize duplication of data elements.

Any and all feedback here would be appreciated.  If extending the schema is
not the answer then how are others working on deduplication (if at all).

[Barnum, Sean D.] Please do not take any of my comments above to indicate
(pun intended ;-) ) a dismissal of your concerns. On the contrary, we are
VERY interested to hear your perspective on these sorts of issues. Where
CybOX and STIX are today is a result of these sorts of dialogues with the
community and we are interested in any gaps that may exist in current
approaches.
[Barnum, Sean D.] Thank you.

Regards,
Frank Gruman
Systems Engineer, DC3 DCCI, contractor
410-981-1142 (work)
NIPR: [hidden email]
SIPR: [hidden email]





RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by John Howie
If I am understanding John here I believe I would agree.
Capturing it all and defining correlations based on context would likely
yield the most accurate and flexible understanding.
Automatic and context-less correlations can lead to misinterpretations.
Similarly, any deduping needs to be done very carefully to avoid losing
context.

I am not suggesting that Frank is advocating any of these things but in the
context of the conversation it bears emphasizing these points.

sean

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of John Howie
Sent: Monday, March 04, 2013 2:55 PM
To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
Observable Expression/CybOX Discussi
Subject: Re: Data Collision handling - what are others doing?

Hi Frank,

One situation you will likely run into is where the data received,
although the same as other data received, pertains to two or more
completely different threats/events/incidents. Whether or not you should
correlate the reports should probably be determined by other
data, especially when considerable time has passed between the reports.
That does not prevent you from recording also-seen-as references, but no
inference should be automatically made about the association.

Regards,

John




RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Back, Greg
Comments inline below.

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Back, Greg
Sent: Thursday, March 07, 2013 1:00 PM
To: John Howie; Gruman, Francis (Frank) Contractor DC3;
cybox-discussion-list Cyber Observable Expression/CybOX Discussi
Subject: RE: Data Collision handling - what are others doing?

I agree with John.

I'm working through similar issues developing a tool for storing and sharing
CybOX content. Within CybOX, I believe that uniqueness of objects (IP,
Domain, etc.) should be handled by direct comparison of the content, not
trying to match IDs.

[Barnum, Sean D.] Agree

For files, uniqueness is based on hashes (which can be
a bit more difficult since different organizations may report different
hashes). It gets even harder for more complex types, but those cover 95% of
what I'm interested in.

I plan to use my own internal unique identifier to track that object in my
system, and track each "occurrence" of the object separately (including each
source organization that has reported that object, along with the ID they
refer to it by). These "occurrences" are tracked only within the system, and
not (yet) shared with others.

 [Barnum, Sean D.] Again, I think this is the most common approach we are
hearing from the community.

My gut is telling me that coordinating and communicating "occurrences" is
better left to a standard like STIX, but I'm admittedly much less familiar
with STIX than CybOX. I would only use the object GUIDs to determine whether
I'd seen the same object from the same organization before, and thus not add
a new "occurrence" if nothing else has changed.

[Barnum, Sean D.] Agreed.
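
The approach described above (content-based uniqueness, an internal identifier, and per-source "occurrences") could be sketched roughly as follows. This is an illustrative Python model, not Greg's tool; `ObjectStore`, `TrackedObject`, the `local-` ID format, and the lowercase/strip normalization are assumptions, and the "if nothing else has changed" check is left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class TrackedObject:
    internal_id: str
    # Each occurrence is (source organization, the ID that source uses).
    occurrences: Set[Tuple[str, str]] = field(default_factory=set)


class ObjectStore:
    def __init__(self) -> None:
        # Objects are keyed by content, never by an external ID.
        self._by_content: Dict[Tuple[str, str], TrackedObject] = {}

    def ingest(self, obj_type: str, value: str, source: str, source_id: str) -> TrackedObject:
        key = (obj_type, value.strip().lower())  # naive normalization (assumption)
        tracked = self._by_content.get(key)
        if tracked is None:
            tracked = TrackedObject(internal_id=f"local-{uuid.uuid4()}")
            self._by_content[key] = tracked
        # The external ID only tells us whether this source already reported it.
        tracked.occurrences.add((source, source_id))
        return tracked


store = ObjectStore()
a = store.ingest("ip", "192.168.168.168", "ORG1", "ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce")
b = store.ingest("ip", "192.168.168.168", "ORG2", "ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb")
store.ingest("ip", "192.168.168.168", "ORG1", "ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce")
```

Both sources end up attached to one internal object, and the repeated ORG1 report does not add a new occurrence.
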

Just my $.02. I have a feeling that as more organizations start sharing
CybOX and STIX content, these issues will become very apparent.

Thanks for your feedback.

Greg




RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Gruman, Francis (Frank)


-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Gruman,
Francis (Frank) Contractor DC3
Sent: Wednesday, March 13, 2013 9:58 AM
To: cybox-discussion-list Cyber Observable Expression/CybOX Discussi
Subject: RE: Data Collision handling - what are others doing?

Perhaps I should have also sent this to the STIX discussion, too.  If anyone
is already on both lists, would you mind throwing it over that fence, too?  I
will sign up soon...

I am not trying to overcome unique identification within each organization's
system.  We, too, are assessing uniqueness of an object based on the
object's data elements.  So we are planning on using it as a hint to say that
we think the object you sent us is "X".  If it is not "X" then we write it
out as "Y".  But it seems to me that overlooking GUID values is essentially
ignoring the purpose of the "Globally Unique" part of the value in the first
place.

To John's example of a File, it is possible to have multiple files with the
same hash.  So the hash by itself is not unique enough.  Multiple File
objects can point to the same hash value.  So you send us your file with
your hash and, even if we have seen the hash before, we will create a new
File object.  I think that is something all of our systems should take into
account.

 [Barnum, Sean D.] Sounds reasonable. Though I would add a suggestion that
these objects are given appropriate relationship assertions to maintain
context.

I guess my point in suggesting the "MultipleReferenceType" or "Also-Seen-As"
is to enable sharing of the additional references we get from others (if
allowed to do so).  In our case, internally we put security caveats on the
additional reference that matches the originally sourced data.  If the sharing
scheme is adopted, we would only send out to those who have the authority to
view the additional references.  This would enable Greg and me to share
elements with each other that might have been seen as (possibly) something
else.

[Barnum, Sean D.] One nice thing about capturing them all as individual
entities and relating them is that this sort of data marking based on source
can be inherent in the source item itself.

Thanks for the feedback, guys.

Regards,
Frank



RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Osorno, Marcos
Comments inline below

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Osorno,
Marcos
Sent: Wednesday, March 13, 2013 9:09 PM
To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
Observable Expression/CybOX Discussi
Subject: Re: Data Collision handling - what are others doing?

Hi Frank,

I've been using a combination of meaningful attributes for UIDs. Please
forgive the cross-post, this is what I expressed on the IDXWG mailing list
(slightly edited):

"I should clarify more of what I find to be the problem. I think maybe we
are conflating (myself included) the concepts of fingerprint, globally
unique ID, and name which seem like a tradeoff between:

* Uniqueness - how unique is it within the system, globally, or given a
set of pre-conditions about the entity being identified
* Meaningfulness - how meaningful is it to a machine or human attempting
to make sense of the ID
* Calculability - how hard is it to calculate
* Repeatability - how consistent is the ID

I'm curious to find IDs more descriptive than "firefox.exe" but less
opaque than GUIDs. More akin to names, these IDs
would make fewer guarantees of being globally unique for the benefit of
being more meaningful. It has been interesting to see how URLs have moved
from: site.whatever/this.aspx?barf=32234234242342&this=1&ord=asc to things
like site.whatever/catalog/pants/fancy/1234 because the 2nd URL is more
meaningful. Granted, that takes the domain name system to work which
requires registration and routing both of which are susceptible to attacks.

At any rate, I'd be curious to hear about best practices in generating
meaningful names in addition to what we know about GUIDs."

File hashes are a good example. I think a file hash is a valid UID for the
abstract hash of that sequence of bytes representing theoretical code on a
computer. However, I think it's a bad UID for an instance of that file on
a particular system. I think from an ontological perspective the abstract
hash and the concrete file are two separate, but related entities. So, my
watch list has an observable entity with the hash as both the UID and an
attribute, but a file has a UID that consists of the hash, the major
device, the minor device, and the inode (for Linux systems) to read
HASH-MAJOR-MINOR-INODE. This entity also has the hash, but as an
attribute, not as a UID. What I'm working on is using the RelatedObjects
types to associate the two. I'm also then working on linking the file
instance to a machine whose UID is a combination of the uname information,
MAC addresses, and a guess at the OS install date. I'm staying away from
GUIDs since I find them not meaningful at all and in my mind they are like
arbitrary internal database index keys which should only be very minimally
exposed. My approach has its own issues in terms of theoretical
collisions, but I prefer to have meaningful labels even if it does mean
there is some chance of collision. I reduce the chance of collision by
associating that object with a few other observables so that it's not just
represented by a hash alone.
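
A small sketch of how a HASH-MAJOR-MINOR-INODE label like the one described above might be computed on a Linux system. The function name is hypothetical, and the choice of MD5 is an assumption (the paragraph above does not say which hash is used; MD5 is simply the hash type that appears in the example earlier in this thread). `os.major` and `os.minor` are only available on Unix-like platforms.

```python
import hashlib
import os
import tempfile


def file_instance_uid(path: str) -> str:
    """Build a HASH-MAJOR-MINOR-INODE label for one file on this system."""
    digest = hashlib.md5()  # hash choice is an assumption
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    st = os.stat(path)
    return "-".join([
        digest.hexdigest().upper(),
        str(os.major(st.st_dev)),  # major device number
        str(os.minor(st.st_dev)),  # minor device number
        str(st.st_ino),            # inode number
    ])


# Usage on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
uid = file_instance_uid(tmp.name)
os.unlink(tmp.name)
```

Two copies of the same bytes on different devices or inodes get different labels but share the leading hash, which is what lets the file instance be related back to the abstract hash entity.
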

[Barnum, Sean D.] This sort of balance and flexibility is what we were
looking for when we decided to utilize Qualified Names as our ID type.
[Barnum, Sean D.] The idea is that the QName prefix should be a unique
namespace identifying the producer of that piece of data and the postfix
would be some sort of identifier that is globally unique within the
producer's context. This approach guarantees global uniqueness but also
implicitly supports correlation to the producing org as well as leaves
flexibility in the hands of each org to define their own postfix ID format.
Given this flexibility, you could define postfix formats like Marcos
describes above based on your context. The base requirement for the ID is
simply to uniquely identify that resource not to give further illuminating
context but that does not mean that it cannot do so at the same time.
[Barnum, Sean D.] So, Marcos could use an ID something like
'http://jhuapl.edu:HASH-MAJOR-MINOR-INODE' to identify a file.
[Barnum, Sean D.] You could also blend some descriptive textual components
with an actual GUID if desired. This is the sort of approach you may have
seen in some of the example content we have created (e.g.
MITRE:observable-6f45f0aa-30c8-11e2-8011-000c291a73d5,
MITRE:object-6dcae276-30c8-11e2-8011-000c291a73d5,
MITRE:Indicator-ba1d406e-937c-414f-9231-6e1dbe64fe8b where MITRE is a
namespace abbreviation declared in the file header).
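
As a rough illustration of the prefix-plus-postfix ID shape described above, the following Python sketch builds and splits IDs of the form `MITRE:observable-<uuid>`. The helper names `make_id` and `split_id` are hypothetical, and the sketch assumes the prefix is a declared namespace abbreviation that contains no colon; it does not resolve the abbreviation to a namespace or validate QName syntax.

```python
import uuid
from typing import Tuple


def make_id(prefix: str, kind: str) -> str:
    """Return an ID like 'MITRE:observable-<uuid>'."""
    return f"{prefix}:{kind}-{uuid.uuid4()}"


def split_id(qname_id: str) -> Tuple[str, str]:
    """Split an ID into (producer prefix, postfix) at the first colon."""
    prefix, sep, postfix = qname_id.partition(":")
    if not sep or not prefix or not postfix:
        raise ValueError(f"not a prefixed ID: {qname_id!r}")
    return prefix, postfix


new_id = make_id("MITRE", "observable")
producer, local_part = split_id("MITRE:observable-6f45f0aa-30c8-11e2-8011-000c291a73d5")
```

Because the postfix format is left to each producer, the same helpers would work for a descriptive postfix such as HASH-MAJOR-MINOR-INODE as long as the prefix stays a plain abbreviation.
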

[Barnum, Sean D.] Do you think this approach in the language gives you the
capability and flexibility to define IDs that work for your contexts?

Sincerely,

Marcos


On 3/14/13 12:58 AM, "Gruman, Francis (Frank) Contractor DC3"
<[hidden email]> wrote:

>Perhaps I should have also sent this to the STIX discussion, too.  If
>anyone
>is already on both lists, would you mind throwing it over that fence,
>too?  I
>will sign up soon...
>
>I am not trying to overcome unique identification within each
>organization's
>system.  We, too, are assessing uniqueness of an object based on the
>object's data elements.  So we are planning on using it as a hint to say
>that
>we think the
>object you sent us is "X".  If it is not "X" then we write it out as "Y".
>But it seems to me that overlooking GUID values is essentially ignoring
>the
>purpose of the "Globally Unique" part of the value in the first place.
>
>To John's example of a File, it is possible to have multiple files with
>the
>same hash.  So the hash by itself is not unique enough.  Multiple File
>objects can point to the same hash value.  So you send us your file with
>your hash and, even if we have seen the hash before, we will create a new
>File object.  I think that is something all of our systems should take
>into
>account.
>
>I guess my point in suggesting the "MultipleReferenceType" or
>"Also-Seen-As"
>is to enable sharing of the additional references we get from others (if
>allowed to do so).  In our case, internally we put security caveats on
>the
>additional reference that matches the originally sourced data.  If the
>sharing
>scheme is adopted, we would only send out to those who have the authority
>to
>view the additional references.  This would enable Greg and I to share
>elements with each other that might have been seen as (possibly)
>something
>else.
>
>Thanks for the feedback, guys.
>
>Regards,
>Frank
>
>-----Original Message-----
>From: Back, Greg [mailto:[hidden email]]
>Sent: Thursday, March 07, 2013 1:00 PM
>To: John Howie; Gruman, Francis (Frank) Contractor DC3;
>cybox-discussion-list Cyber Observable Expression/CybOX Discussi
>Subject: RE: Data Collision handling - what are others doing?
>
>I agree with John.
>
>I'm working through similar issues developing a tool for storing and sharing
>CybOX content. Within CybOX, I believe that uniqueness of objects (IP,
>Domain, etc.) should be handled by direct comparison of the content, not
>trying to match IDs. For files, uniqueness is based on hashes (which can be
>a bit more difficult since different organizations may report different
>hashes). It gets even harder for more complex types, but those cover 95% of
>what I'm interested in.
>
>I plan to use my own internal unique identifier to track that object in my
>system, and track each "occurrence" of the object separately (including each
>source organization that has reported that object, along with the ID they
>refer to it by). These "occurrences" are tracked only within the system, and
>not (yet) shared with others.
>
>My gut is telling me that coordinating and communicating "occurrences" is
>better left to a standard like STIX, but I'm admittedly much less familiar
>with STIX than CybOX. I would only use the object GUIDs to determine whether
>I'd seen the same object from the same organization before, and thus not add
>a new "occurrence" if nothing else has changed.
>
>Just my $.02. I have a feeling that as more organizations start sharing
>CybOX and STIX content, these issues will become very apparent.
>
>Thanks for your feedback.
>
>Greg
>
>
>>-----Original Message-----
>>From: [hidden email] [mailto:owner-cybox-
>>[hidden email]] On Behalf Of John Howie
>>Sent: Monday, March 04, 2013 2:55 PM
>>To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
>>Observable Expression/CybOX Discussi
>>Subject: Re: Data Collision handling - what are others doing?
>>
>>Hi Frank,
>>
>>One situation you will likely run into is where the data received,
>>although the same as other data received, pertains to two or more
>>completely different threats/events/incidents. Whether or not you should
>>correlate the reports should probably be determined by other
>>data, especially when considerable time has passed between the reports.
>>That does not prevent you from recording also-seen-as references, but no
>>inference should be automatically made about the association.
>>
>>Regards,
>>
>>John
>>
>>
>>On 3/4/13 10:23 AM, "Gruman, Francis (Frank) Contractor DC3"
>><[hidden email]> wrote:
>>
>>>We are trying to come up with a way to ensure that we are not duplicating
>>>too much data in our system and are finding it a bit difficult to work
>>>around data collisions.  In our case, a collision is where the same data
>>>values are passed in from different sources with different GUIDs.
>>>Ultimately, we are trying to avoid duplicate data in our system as well
>>>as duplicating data in other systems.
>>>
>>>Here is one scenario where we are running into this problem:
>>>
>>>We have an agreement with ORG1 to receive some information sets.  They
>>>send us their information and we get something like IP:192.168.168.168
>>>with their unique GUID.
>>>We also have an agreement with ORG2 to receive some of their information.
>>>They send us their information and we also get IP:192.168.168.168 with
>>>_their_ unique GUID.
>>>
>>>These two organizations do not yet communicate with each other (but may,
>>>in the future).
>>>
>>>One of the ways we have considered handling this situation in our system
>>>is to always generate our own GUID and keep track of the external GUID
>>>values from external sources.  Then we can send the external source's
>>>GUID back to them with our own reporting that ties into their data.  As a
>>>total system, however, we still run the risk of duplicate data.  If ORG1
>>>and ORG2 eventually decide to share data then they now have their own
>>>data collisions.
>>>
>>>POSSIBLE SOLUTION:
>>>Would it be possible to extend the CybOX schema to support "also-seen-as"
>>>references with GUID and date attributes?  As an example, I have extended
>>>a File object with a MultipleReferenceType that I have created in my
>>>mind.
>>>
>>><cybox:Object id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7" type="File">
>>>    <cybox:Description>
>>>        <common:Text>The word document contains flash, which downloads a corrupted mp4
>>>            file. The mp4 file itself is not anything special but an 0C filled (22kb)
>>>            mp4 file with a valid mp4 header.</common:Text>
>>>    </cybox:Description>
>>>    <cybox:Defined_Object xsi:type="FileObj:FileObjectType">
>>>        <FileObj:File_Name datatype="String">Iran's Oil and Nuclear Situation.doc</FileObj:File_Name>
>>>        <FileObj:Size_In_Bytes datatype="UnsignedLong">106604</FileObj:Size_In_Bytes>
>>>        <FileObj:Hashes>
>>>            <common:Hash>
>>>                <common:Type datatype="String">MD5</common:Type>
>>>                <common:Simple_Hash_Value condition="Equals" datatype="hexBinary">E92A4FC283EB2802AD6D0E24C7FCC857</common:Simple_Hash_Value>
>>>            </common:Hash>
>>>        </FileObj:Hashes>
>>>    </cybox:Defined_Object>
>>>    <common:MultipleReferenceType id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce" firstObserved="03-MAR-2012 11:21:12345" />
>>>    <common:MultipleReferenceType id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb" firstObserved="21-OCT-2011 15:32:54321" />
>>></cybox:Object>
>>>
>>>This would help to 1) support independence of individual data owners
>>>while 2) working to minimize duplication of data elements.
>>>
>>>Any and all feedback here would be appreciated.  If extending the schema
>>>is not the answer, then how are others working on deduplication (if at
>>>all)?
>>>
>>>Regards,
>>>Frank Gruman
>>>Systems Engineer, DC3 DCCI, contractor
>>>410-981-1142 (work)
>>>NIPR: [hidden email]
>>>SIPR: [hidden email]

RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Back, Greg
Comments inline below

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Back, Greg
Sent: Friday, March 15, 2013 9:39 AM
To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
Observable Expression/CybOX Discussi
Subject: RE: Data Collision handling - what are others doing?

Hmmm, Frank brings up an interesting point...

Assume I work for Organization A, Frank works for Organization B and John
works for Organization C, and we are discussing a FileObject X which we
universally agree represents the same "thing" (if we don't agree on what
constitutes its uniqueness, it gets a lot more complicated).

If we each discover the file independently (rather than learning about it
ONLY from another), then we might each give the file IDs of A-X, B-X, and
C-X independently (where each is actually a random GUID such that we can't
correlate on X, however we determine it is the same file via hash or
similar). If Frank shares a CybOX/STIX document containing B-X, I can
correlate to my A-X and say "Frank also saw this, and he's calling it B-X".
My interpretation of what Frank is suggesting is that I could send something
to John which says "I've seen this thing I'm calling A-X, and by the way,
Frank calls it B-X", and John could add both A-X and B-X to his
"also-known-as" list for C-X.

 [Barnum, Sean D.] You should be able to do this currently using Indicators
and the RelatedIndicators construct.
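
The "also-known-as" exchange Greg describes above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not anything defined by CybOX or STIX: the class name `AliasRegistry` and its methods are hypothetical, and the IDs A-X, B-X, and C-X are the placeholders used in the message.

```python
class AliasRegistry:
    """Maps each locally assigned ID to the external IDs known to name the same object."""

    def __init__(self):
        self._aliases = {}  # local ID -> set of external IDs
        self._owner = {}    # external ID -> local ID it was linked to

    def add_alias(self, local_id, external_id):
        """Record that external_id names the object we track as local_id."""
        owner = self._owner.setdefault(external_id, local_id)
        if owner != local_id:
            # The same external ID must not silently point at two local objects.
            raise ValueError(f"{external_id} is already linked to {owner}")
        self._aliases.setdefault(local_id, set()).add(external_id)

    def merge_shared(self, local_id, shared_ids):
        """Fold in an also-known-as list received from a sharing partner."""
        for external_id in shared_ids:
            self.add_alias(local_id, external_id)

    def aliases(self, local_id):
        return sorted(self._aliases.get(local_id, set()))


# John's view: he calls the file C-X and learns two other names for it from Greg.
johns_registry = AliasRegistry()
johns_registry.merge_shared("C-X", ["A-X", "B-X"])
print(johns_registry.aliases("C-X"))  # ['A-X', 'B-X']
```

Whether John is allowed to store B-X at all is a policy question that the sketch deliberately leaves out.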

Assuming all the sharing agreements are in place to permit sharing the
"also-known-as" lists, it seems like a less complex solution would be for
Frank to share the name B-X with John directly. Are there use cases where
this direct sharing between Frank and John would not be allowed, but sharing
"through" me would be?

[Barnum, Sean D.] I would certainly think so.

On a related note, TAXII is a companion standard for sharing and communicating
threat information (as STIX documents), and in my opinion it handles
multi-party sharing pretty well.

Thanks for the discussion!

Greg

RE: Data Collision handling - what are others doing?

Barnum, Sean D.
In reply to this post by Back, Greg
Comments inline below.

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Back, Greg
Sent: Friday, March 15, 2013 9:46 AM
To: Osorno, Marcos; cybox-discussion-list Cyber Observable Expression/CybOX
Discussi
Subject: RE: Data Collision handling - what are others doing?

Marcos-

I think what you're hitting on is the distinction between an "instance" of an
observable (in your example, a particular file at a particular place on a
particular drive) and an observable "pattern" (in your example, the hash
itself, regardless of where it appears). This distinction is not clearly
made by the CybOX standard, but for individual use cases (particularly when
including Observables in STIX Indicators, which is where I typically use
CybOX) keeping this distinction is crucial.

 [Barnum, Sean D.] Agreed. We are currently thinking through ways to make
the distinction between a CybOX Observable pattern and CybOX Observable
instance more clear. The reality is that sharing observable instances among
parties is not the typical use case. Most sharing will typically be sharing
of observable patterns (as Indicators) based on the appropriate properties
observed from one or more instance observables. Basically it is saying
"something that looks like this may be of interest".

I also agree that randomly generated GUIDs are useful only for (global)
uniqueness, and do not provide any real meaning. For this reason, I've been
using them only to represent relationships within the same document (to link
observables), and to represent an "instance" of a STIX document so that
recipients do not process the same document more than once. GUIDs can be
useful as primary keys (or unique constraints) within databases, but as you
mentioned, it's important to determine what you're calling "unique" (an
instance or a pattern).
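
One way to keep the instance/pattern distinction explicit in a local data model is to give the two notions separate types. The Python sketch below assumes hypothetical names (`HashPattern`, `FileInstance`) and made-up hosts and paths; only the MD5 value comes from the example earlier in this thread. It is not how CybOX itself models the distinction.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HashPattern:
    """A pattern: "any file whose hash looks like this", independent of location."""
    algorithm: str
    value: str

    def matches(self, instance):
        observed = instance.hashes.get(self.algorithm, "")
        return observed.lower() == self.value.lower()


@dataclass
class FileInstance:
    """An instance: one concrete file seen on one host at one path."""
    host: str
    path: str
    hashes: dict = field(default_factory=dict)


pattern = HashPattern("MD5", "E92A4FC283EB2802AD6D0E24C7FCC857")
seen_on_a = FileInstance("host-a", "/tmp/sample.doc",
                         {"MD5": "e92a4fc283eb2802ad6d0e24c7fcc857"})
seen_on_b = FileInstance("host-b", "/home/user/sample.doc",
                         {"MD5": "E92A4FC283EB2802AD6D0E24C7FCC857"})

# Both instances satisfy the same pattern, yet they remain distinct instances.
print(pattern.matches(seen_on_a), pattern.matches(seen_on_b))  # True True
print(seen_on_a == seen_on_b)  # False
```

With this split, "unique" has to be stated per type: the pattern is unique by its hash, while an instance is unique by where it was seen.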

Thanks,
Greg

>-----Original Message-----
>From: [hidden email] [mailto:owner-cybox-
>[hidden email]] On Behalf Of Osorno, Marcos
>Sent: Wednesday, March 13, 2013 9:09 PM
>To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
>Observable Expression/CybOX Discussi
>Subject: Re: Data Collision handling - what are others doing?
>
>Hi Frank,
>
>I've been using a combination of meaningful attributes for UIDs. Please
>forgive the cross-post, this is what I expressed on the IDXWG mailing list
>(slightly edited):
>
>"I should clarify more of what I find to be the problem. I think maybe we
>are conflating (myself included) the concepts of fingerprint, globally
>unique ID, and name which seem like a tradeoff between:
>
>* Uniqueness - how unique is it within the system, globally, or given a
>set of pre-conditions about the entity being identified
>* Meaningfulness - how meaningful is it to a machine or human attempting
>to make sense of the ID
>* Calculability - how hard is it to calculate
>* Repeatability - how consistent is the ID
>
>I'm curious to find IDs more descriptive than "firefox.exe" but less
>opaque than GUIDs. More akin to names, these IDs would make fewer
>guarantees of being globally unique for the benefit of being more
>meaningful. It has been interesting to see how URLs have moved from
>site.whatever/this.aspx?barf=32234234242342&this=1&ord=asc to things
>like site.whatever/catalog/pants/fancy/1234 because the 2nd URL is more
>meaningful. Granted, that takes the domain name system to work, which
>requires registration and routing, both of which are susceptible to attacks.
>
>At any rate, I'd be curious to hear about best practices in generating
>meaningful names in addition to what we know about GUIDs."
>
>File hashes are a good example. I think a file hash is a valid UID for the
>abstract hash of that sequence of bytes representing theoretical code on a
>computer. However, I think it's a bad UID for an instance of that file on
>a particular system. I think from an ontological perspective the abstract
>hash and the concrete file are two separate, but related entities. So, my
>watch list has an observable entity with the hash as both the UID and an
>attribute, but a file has a UID that consists of the hash, the major
>device, the minor device, and the inode (for Linux systems) to read
>HASH-MAJOR-MINOR-INODE. This entity also has the hash, but as an
>attribute, not as a UID. What I'm working on is using the RelatedObjects
>types to associate the two. I'm also then working on linking the file
>instance to a machine whose UID is a combination of the uname information,
>MAC addresses, and a guess at the OS install date. I'm staying away from
>GUIDs since I find them not meaningful at all and in my mind they are like
>arbitrary internal database index keys which should only be very minimally
>exposed. My approach has its own issues in terms of theoretical
>collisions, but I prefer to have meaningful labels even if it does mean
>there is some chance of collision. I reduce the chance of collision by
>associating that object with a few other observables so that it's not just
>represented by a hash alone.
>
>Sincerely,
>
>Marcos
>
>
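
The HASH-MAJOR-MINOR-INODE label Marcos describes can be approximated on a Unix-like system with the Python standard library. This is a minimal sketch under stated assumptions: the function name `file_instance_uid` is hypothetical, SHA-256 is used here although the message does not say which hash he uses, and `os.major`/`os.minor` are only available on Unix.

```python
import hashlib
import os


def file_instance_uid(path):
    """Build a HASH-MAJOR-MINOR-INODE label for one file on this system."""
    digest = hashlib.sha256()  # assumption: the thread does not name the algorithm
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)

    info = os.stat(path)
    major = os.major(info.st_dev)  # major device number of the containing device
    minor = os.minor(info.st_dev)  # minor device number of the containing device
    return f"{digest.hexdigest()}-{major}-{minor}-{info.st_ino}"
```

Two copies of the same file share the hash prefix but differ in the inode part, which is the separation between the abstract hash and the concrete file that the message argues for.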

Re: Data Collision handling - what are others doing?

PAT MARONEY-2
In reply to this post by Barnum, Sean D.
Musings:  Pivot/uniqueness could be on the Indicator itself, provided it is "Actionable" rather than "Informational" (e.g., IP Address, Domain Name, or Hash vs. Filename, File Size, etc.).  Each organization will have its own "Record of Authority" with its own GUID for any Indicators it sources.  For externally sourced Indicators one could adopt the Source GUID of the first external instance ingested (or generate its own GUID).  Compound Indicators get a little trickier (e.g., FileName AND FileSize).  In this case the GUID for the Compound Indicator "Group" could be used as the discriminator (Organization Generated GUID if self-sourced, or GUID of first external instance received).

Each set of internal/external observations could then be linked to that "Actionable" Indicator.
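
A rough Python sketch of that "Record of Authority" idea follows. Everything named here is an assumption for illustration (the `IndicatorRegistry` class, the `ingest` method, the "OURORG" prefix, and the exact GUID string format); the IP value and the ORG1/ORG2 GUIDs are the ones from Frank's original example.

```python
import uuid


class IndicatorRegistry:
    """Keeps one authoritative GUID per indicator value and links every observation to it."""

    def __init__(self, org_prefix):
        self._org = org_prefix
        self._records = {}  # (kind, value) -> {"guid": ..., "observations": [...]}

    def ingest(self, kind, value, source, source_guid=None):
        """Return the authoritative GUID for this indicator, creating it on first sight."""
        key = (kind, value)
        record = self._records.get(key)
        if record is None:
            # First sighting: adopt the external GUID if there is one,
            # otherwise generate our own (self-sourced indicator).
            guid = source_guid or f"{self._org}:guid-{uuid.uuid4()}"
            record = {"guid": guid, "observations": []}
            self._records[key] = record
        # Every sighting is kept, together with the ID the source used for it.
        record["observations"].append((source, source_guid))
        return record["guid"]

    def observations(self, kind, value):
        return list(self._records[(kind, value)]["observations"])


registry = IndicatorRegistry("OURORG")
first = registry.ingest("ip", "192.168.168.168", "ORG1",
                        "ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce")
second = registry.ingest("ip", "192.168.168.168", "ORG2",
                         "ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb")
print(first == second)  # True
print(len(registry.observations("ip", "192.168.168.168")))  # 2
```

Adopting the first external GUID keeps the record traceable to its origin, at the cost of making the authoritative ID depend on arrival order.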

"Informational" Indicators:

An argument can be made that pivoting on individual "Informational" indicators (e.g., FileName) can be useful when trying to do attribution or identify potential affiliations. There are often "non-unique discriminators" (oxymoron?) that can be graphed to find relations.  For example, a given Adversary may always name files for Scheduled Tasks 1.bat, 2.bat, 3.bat.  Another may move cmd.exe to a specific system directory.  There are also unique "styles" of Command Line Syntax that can be used to form correlations to specific individual Actors (or groups of actors following the same scripts/doctrine).  The point is that graphing clusters of loose correlations over time can reveal valuable insights.
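
One way to picture the graphing of loose correlations is sketched below, assuming each report has already been reduced to a set of (kind, value) "Informational" traits. The function name `shared_trait_graph` and the report layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_trait_graph(reports):
    """Link reports that share at least one informational trait.

    reports: mapping of report id -> set of (kind, value) traits.
    Returns {frozenset({a, b}): set of traits shared by reports a and b}.
    """
    # Inverted index: trait -> reports in which it appears.
    by_trait = defaultdict(set)
    for report_id, traits in reports.items():
        for trait in traits:
            by_trait[trait].add(report_id)

    # Each pair of reports sharing a trait becomes an edge labeled with that trait.
    edges = defaultdict(set)
    for trait, report_ids in by_trait.items():
        for a, b in combinations(sorted(report_ids), 2):
            edges[frozenset((a, b))].add(trait)
    return dict(edges)
```

The number of shared traits on an edge can serve as a rough weight; no single trait is treated as proof of a relationship.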


Patrick Maroney

President
Integrated Networking Technologies, Inc.
PO Box 569
Marlton, NJ 08053
Office: (856)983-0001
Cell: (609)841-5104
Fax: (856)983-0001
[hidden email]



On Mar 15, 2013, at 11:13 AM, "Barnum, Sean D." <[hidden email]> wrote:
Comments inline below


-----Original Message-----
From: [hidden email] On Behalf Of Osorno, Marcos
Sent: Wednesday, March 13, 2013 9:09 PM
To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
Observable Expression/CybOX Discussi
Subject: Re: Data Collision handling - what are others doing?

Hi Frank,

I've been using a combination of meaningful attributes for UIDs. Please
forgive the cross-post, this is what I expressed on the IDXWG mailing list
(slightly edited):

"I should clarify more of what I find to be the problem. I think maybe we
are conflating (myself included) the concepts of fingerprint, globally
unique ID, and name which seem like a tradeoff between:

* Uniqueness - how unique is it within the system, globally, or given a
set of pre-conditions about the entity being identified
* Meaningfulness - how meaningful is it to a machine or human attempting
to make sense of the ID
* Calculability - how hard is it to calculate
* Repeatability - how consistent is the ID

I'm curious to find IDs more descriptive than "firefox.exe" but less
opaque than GUIDs. More akin to names, these IDs
would make fewer guarantees of being globally unique for the benefit of
being more meaningful. It has been interesting to see how URLs have moved
from: site.whatever/this.aspx?barf=32234234242342&this=1&ord=asc to things
like site.whatever/catalog/pants/fancy/1234 because the 2nd URL is more
meaningful. Granted, that takes the domain name system to work which
requires registration and routing both of which are susceptible to attacks.

At any rate, I'd be curious to hear about best practices in generating
meaningful names in addition to what we know about GUIDs."

File hashes are a good example. I think a file hash is a valid UID for the
abstract hash of that sequence of bytes representing theoretical code on a
computer. However, I think it's a bad UID for an instance of that file on
a particular system. I think from an ontological perspective the abstract
hash and the concrete file are two separate, but related entities. So, my
watch list has an observable entity with the hash as both the UID and an
attribute, but a file has a UID that consists of the hash, the major
device, the minor device, and the inode (for Linux systems) to read
HASH-MAJOR-MINOR-INODE. This entity also has the hash, but as an
attribute, not as a UID. What I'm working on is using the RelatedObjects
types to associate the two. I'm also then working on linking the file
instance to a machine whose UID is a combination of the uname information,
MAC addresses, and a guess at the OS install date. I'm staying away from
GUIDs since I find them not meaningful at all and in my mind they are like
arbitrary internal database index keys which should only be very minimally
exposed. My approach has its own issues in terms of theoretical
collisions, but I prefer to have meaningful labels even if it does mean
there is some chance of collision. I reduce the chance of collision by
associating that object with a few other observables so that it's not just
represented by a hash alone.
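
A minimal sketch of the HASH-MAJOR-MINOR-INODE identifier described above, assuming a Unix-like system where `os.major` and `os.minor` are available. The function names and the SHA-256 default are assumptions; the message does not say which hash algorithm is used.

```python
import hashlib
import os


def file_hash(path, algorithm="sha256", chunk_size=65536):
    """Hash the file contents in chunks and return an upper-case hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def file_instance_id(path, algorithm="sha256"):
    """Build HASH-MAJOR-MINOR-INODE for one file instance on this machine."""
    info = os.stat(path)
    return "-".join([
        file_hash(path, algorithm),      # the abstract content hash
        str(os.major(info.st_dev)),      # major device number
        str(os.minor(info.st_dev)),      # minor device number
        str(info.st_ino),                # inode number
    ])
```

Two copies of the same bytes share the first component but differ in the inode (and possibly device) components, which keeps the abstract hash and the concrete file instance distinct.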

[Barnum, Sean D.] This sort of balance and flexibility is what we were
looking for when we decided to utilize Qualified Names as our ID type.
[Barnum, Sean D.]The idea is that the QName prefix should be a unique
namespace identifying the producer of that piece of data and the postfix
would be some sort of identifier that is globally unique within the
producer's context. This approach guarantees global uniqueness but also
implicitly supports correlation to the producing org as well as leaves
flexibility in the hands of each org to define their own postfix ID format.
Given this flexibility, you could define postfix formats like Marcos
describes above based on your context. The base requirement for the ID is
simply to uniquely identify that resource not to give further illuminating
context but that does not mean that it cannot do so at the same time.
[Barnum, Sean D.]So, Marcos could use an ID something like
'http://jhuapl.edu:HASH-MAJOR-MINOR-INODE' to identify a file.
[Barnum, Sean D.]You could also blend some descriptive textual components
with an actual GUID if desired. This is the sort of approach you may have
seen in some of the example content we have created (e.g.
MITRE:observable-6f45f0aa-30c8-11e2-8011-000c291a73d5,
MITRE:object-6dcae276-30c8-11e2-8011-000c291a73d5,
MITRE:Indicator-ba1d406e-937c-414f-9231-6e1dbe64fe8b where MITRE is a
namespace abbreviation declared in the file header).
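
As a rough illustration of the prefix-plus-postfix ID format described above, the sketch below builds and splits IDs such as MITRE:observable-6f45f0aa-30c8-11e2-8011-000c291a73d5. The helper names `make_id` and `split_id` are hypothetical, and the pattern is a simplification rather than the full XML QName grammar.

```python
import re
import uuid

# Simplified pattern: a namespace prefix, a colon, then a producer-defined postfix.
_ID_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z_][\w.-]*):(?P<local>[\w.-]+)$")


def make_id(prefix, kind, unique_part=None):
    """Build an ID such as 'ORG1:object-<uuid>'; the postfix format is up to the producer."""
    unique_part = unique_part or str(uuid.uuid4())
    return f"{prefix}:{kind}-{unique_part}"


def split_id(qname_id):
    """Return (producer prefix, postfix) so content can be traced to its producer."""
    match = _ID_PATTERN.match(qname_id)
    if match is None:
        raise ValueError(f"not a prefix:postfix identifier: {qname_id!r}")
    return match.group("prefix"), match.group("local")
```

The postfix could just as well be a descriptive string like HASH-MAJOR-MINOR-INODE; only uniqueness within the producer's namespace is required.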

[Barnum, Sean D.]Do you think this approach in the language gives you the
capability and flexibility to define IDs that work for your contexts?

Sincerely,

Marcos


On 3/14/13 12:58 AM, "Gruman, Francis (Frank) Contractor DC3"
<[hidden email]> wrote:

Perhaps I should also have sent this to the STIX discussion.  If anyone
is already on both lists, would you mind throwing it over that fence,
too?  I will sign up soon...

I am not trying to overcome unique identification within each
organization's system.  We, too, are assessing uniqueness of an object
based on the object's data elements.  So we are planning on using it as a
hint to say that we think the object you sent us is "X".  If it is not "X"
then we write it out as "Y".  But it seems to me that overlooking GUID
values is essentially ignoring the purpose of the "Globally Unique" part
of the value in the first place.

To John's example of a File, it is possible to have multiple files with
the same hash.  So the hash by itself is not unique enough.  Multiple File
objects can point to the same hash value.  So you send us your file with
your hash and, even if we have seen the hash before, we will create a new
File object.  I think that is something all of our systems should take
into account.

I guess my point in suggesting the "MultipleReferenceType" or
"Also-Seen-As" is to enable sharing of the additional references we get
from others (if allowed to do so).  In our case, internally we put
security caveats on the additional reference that matches the originally
sourced data.  If the sharing scheme is adopted, we would only send out to
those who have the authority to view the additional references.  This
would enable Greg and me to share elements with each other that might have
been seen as (possibly) something else.

Thanks for the feedback, guys.

Regards,
Frank

-----Original Message-----
From: Back, Greg [mailto:gback@mitre.org]
Sent: Thursday, March 07, 2013 1:00 PM
To: John Howie; Gruman, Francis (Frank) Contractor DC3;
cybox-discussion-list Cyber Observable Expression/CybOX Discussi
Subject: RE: Data Collision handling - what are others doing?

I agree with John.

I'm working through similar issues developing a tool for storing and
sharing CybOX content. Within CybOX, I believe that uniqueness of objects
(IP, Domain, etc.) should be handled by direct comparison of the content,
not trying to match IDs. For files, uniqueness is based on hashes (which
can be a bit more difficult since different organizations may report
different hashes). It gets even harder for more complex types, but those
cover 95% of what I'm interested in.

I plan to use my own internal unique identifier to track that object in my
system, and track each "occurrence" of the object separately (including
each source organization that has reported that object, along with the ID
they refer to it by). These "occurrences" are tracked only within the
system, and not (yet) shared with others.

My gut is telling me that coordinating and communicating "occurrences" is
better left to a standard like STIX, but I'm admittedly much less familiar
with STIX than CybOX. I would only use the object GUIDs to determine
whether I'd seen the same object from the same organization before, and
thus not add a new "occurrence" if nothing else has changed.
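
The occurrence tracking described above could look roughly like the following sketch, assuming objects are compared by their content fields. `ObjectStore`, `TrackedObject`, and the dictionary layout of `content` are hypothetical names chosen for illustration.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class TrackedObject:
    internal_id: int   # our own identifier, never taken from a source
    content: tuple     # sorted (field, value) pairs used for comparison
    occurrences: set = field(default_factory=set)  # (source org, external ID) pairs


class ObjectStore:
    def __init__(self):
        self._ids = itertools.count(1)
        self._by_content = {}

    def report(self, content, source, external_id):
        """Record a reported object; return (object, True if the occurrence is new)."""
        key = tuple(sorted(content.items()))   # uniqueness by content, not by ID
        obj = self._by_content.get(key)
        if obj is None:
            obj = TrackedObject(next(self._ids), key)
            self._by_content[key] = obj
        occurrence = (source, external_id)
        # The external GUID only tells us whether this source already reported it.
        is_new = occurrence not in obj.occurrences
        obj.occurrences.add(occurrence)
        return obj, is_new
```

Here the same IP address reported by ORG1 and ORG2 yields one tracked object with two occurrences, and a repeat delivery from ORG1 under the same ID adds nothing.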

Just my $.02. I have a feeling that as more organizations start sharing
CybOX and STIX content, these issues will become very apparent.

Thanks for your feedback.

Greg


-----Original Message-----
From: [hidden email] On Behalf Of John Howie
Sent: Monday, March 04, 2013 2:55 PM
To: Gruman, Francis (Frank) Contractor DC3; cybox-discussion-list Cyber
Observable Expression/CybOX Discussi
Subject: Re: Data Collision handling - what are others doing?

Hi Frank,

One situation you will likely run into is where the data received,
although the same as other data received, pertains to two or more
completely different threats/events/incidents. Whether or not you should
correlate the reports should probably be determined by other
data, especially when considerable time has passed between the reports.
That does not prevent you from recording also-seen-as references, but no
inference should be automatically made about the association.
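
A small sketch of that caution, under stated assumptions: the also-seen-as link is always recorded, but a correlation is only suggested when other data supports it and the reports are close in time. The 30-day threshold, the `shared_context` flag, and the function name are hypothetical; the thread gives no concrete rule.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; the thread does not specify a number.
MAX_GAP = timedelta(days=30)


def review_match(first_seen_a, first_seen_b, shared_context=False, max_gap=MAX_GAP):
    """Summarize a content match between two reports without asserting a relationship."""
    gap = abs(first_seen_a - first_seen_b)
    return {
        "also_seen_as": True,   # the reference itself is always recorded
        "gap_days": gap.days,
        # Only suggest correlation when other data agrees and the gap is small.
        "suggest_correlation": bool(shared_context) and gap <= max_gap,
    }
```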

Regards,

John


On 3/4/13 10:23 AM, "Gruman, Francis (Frank) Contractor DC3"
<[hidden email]> wrote:

We are trying to come up with a way to ensure that we are not duplicating
too much data in our system and are finding it a bit difficult to work
around data collisions.  In our case, a collision is where the same data
values are passed in from different sources with different GUIDs.
Ultimately, we are trying to avoid duplicate data in our system as well
as duplicating data in other systems.

Here is one scenario where we are running into this problem:

We have an agreement with ORG1 to receive some information sets.  They
send us their information and we get something like IP:192.168.168.168
with their unique GUID.
We also have an agreement with ORG2 to receive some of their information.
They send us their information and we also get IP:192.168.168.168 with
_their_ unique GUID.

These two organizations do not yet communicate with each other (but may,
in the future).

One of the ways we have considered handling this situation in our system
is to always generate our own GUID and keep track of the external GUID
values from external sources.  Then we can send the external source's
GUID back to them with our own reporting that ties into their data.  As a
total system, however, we still run the risk of duplicate data.  If ORG1
and ORG2 eventually decide to share data then they now have their own
data collisions.

POSSIBLE SOLUTION:
Would it be possible to extend the CybOX schema to support "also-seen-as"
references with GUID and date attributes?  As an example, I have extended
a File object with a MultipleReferenceType that I have created in my
mind.

<cybox:Object id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7" type="File">
    <cybox:Description>
        <common:Text>The word document contains flash, which downloads a
            corrupted mp4 file. The mp4 file itself is not anything special
            but an 0C filled (22kb) mp4 file with a valid mp4 header.</common:Text>
    </cybox:Description>
    <cybox:Defined_Object xsi:type="FileObj:FileObjectType">
        <FileObj:File_Name datatype="String">Iran's Oil and Nuclear Situation.doc</FileObj:File_Name>
        <FileObj:Size_In_Bytes datatype="UnsignedLong">106604</FileObj:Size_In_Bytes>
        <FileObj:Hashes>
            <common:Hash>
                <common:Type datatype="String">MD5</common:Type>
                <common:Simple_Hash_Value condition="Equals" datatype="hexBinary">E92A4FC283EB2802AD6D0E24C7FCC857</common:Simple_Hash_Value>
            </common:Hash>
        </FileObj:Hashes>
    </cybox:Defined_Object>
    <common:MultipleReferenceType id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce"
        firstObserved="03-MAR-2012 11:21:12345" />
    <common:MultipleReferenceType id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb"
        firstObserved="21-OCT-2011 15:32:54321" />
</cybox:Object>
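
To show how a consumer might read such "also-seen-as" references, here is a sketch using Python's standard `xml.etree.ElementTree`. MultipleReferenceType is the proposal from this message, not part of the CybOX schema, and the namespace URIs below are placeholders declared only so that the trimmed fragment parses.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions), used only to make the fragment well-formed.
SAMPLE = """
<cybox:Object xmlns:cybox="urn:example:cybox" xmlns:common="urn:example:common"
              id="cybox:guid-49d31c13-8d7b-4528-b8d6-ce8ed0d43ad7" type="File">
  <common:MultipleReferenceType id="ORG1:guid-a54deb9c-ac26-4562-9d43-2820f8cb34ce"
                                firstObserved="03-MAR-2012 11:21:12345"/>
  <common:MultipleReferenceType id="ORG2:guid-9d4e9844-3c33-449f-8fb8-caeb2c121afb"
                                firstObserved="21-OCT-2011 15:32:54321"/>
</cybox:Object>
"""

NS = {"common": "urn:example:common"}


def also_seen_as(xml_text):
    """Return (object id, {source prefix: (external id, firstObserved string)})."""
    root = ET.fromstring(xml_text.strip())
    refs = {}
    for ref in root.findall("common:MultipleReferenceType", NS):
        external_id = ref.get("id")
        source, _, _ = external_id.partition(":")   # prefix before the colon names the source
        # The timestamp is kept as a string; its format in the proposal is not a standard one.
        refs[source] = (external_id, ref.get("firstObserved"))
    return root.get("id"), refs
```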

This would help to 1) support independence of individual data owners
while 2) working to minimize duplication of data elements.

Any and all feedback here would be appreciated.  If extending the schema
is not the answer, then how are others working on deduplication (if at
all)?

Regards,
Frank Gruman
Systems Engineer, DC3 DCCI, contractor
410-981-1142 (work)
NIPR: [hidden email]
SIPR: [hidden email]





Re: Data Collision handling - what are others doing?

Josh Zaritsky
Patrick raises some good points with respect to "informational" indicators, and this really illustrates that the notion of equivalence when it comes to many indicators is going to be different for each organization and their intended purpose (e.g., clustering, attribution, de-duplication, etc.).  Furthermore, even for pivot/uniqueness based on the indicator itself, there will always be corner cases that can't be trivially covered by comparing the indicator attributes directly.  For example, you could have two groups referring to the same file but one uses MD5 while another uses SHA1 -- or even trying to capture a single string where each group used a different encoding to represent it.

I don't think it's realistic to expect that we would be able to solve all these different cases with the language itself.  I think what may be the most helpful is if folks could share the scripts they use to normalize/cluster/de-dup/whatever so that other members of the community can adopt or adapt those techniques in whatever way best fits their needs.  Could we potentially set up a folder on the github repository for people to share these?
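
As one example of the kind of normalization script meant here, the sketch below covers the two corner cases mentioned above: hashes reported with different algorithms or letter case, and strings captured in different encodings. The function names are hypothetical, and treating "no common hash algorithm" as undecided (`None`) is an assumption rather than an agreed rule.

```python
import unicodedata


def normalize_hash(value):
    """Hex digests compare case-insensitively, so fold to lower case."""
    return value.strip().lower()


def normalize_text(raw, encoding):
    """Decode bytes from the producer's encoding and apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", raw.decode(encoding))


def same_file(hashes_a, hashes_b):
    """Compare two {algorithm: hex digest} mappings.

    Returns True or False when at least one algorithm is shared,
    and None when there is no shared algorithm to compare.
    """
    a = {alg.upper(): normalize_hash(h) for alg, h in hashes_a.items()}
    b = {alg.upper(): normalize_hash(h) for alg, h in hashes_b.items()}
    shared = a.keys() & b.keys()
    if not shared:
        return None  # e.g. MD5-only vs. SHA1-only: hashes alone cannot decide
    return all(a[alg] == b[alg] for alg in shared)
```

The `None` case is where a group would need the file itself, or a third party that holds both hashes, to settle equivalence.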


--
Josh Zaritsky
Product Manager
CrowdStrike, Inc.
443-492-9404
[hidden email]

