Unstructured Data = Letters in a Bucket?

Unstructured Data – The Myth

There’s a lot of noise about “structured” and “unstructured” data. But what is this “unstructured data” beast we keep hearing about?

It’s almost impossible to sit through a presentation from a search, ECM, or eDiscovery vendor that doesn’t mention how “80% of a corporation’s data is unstructured.” Aside from the fact that much of this data is garbage (see my earlier postings), is it really fair to call it unstructured?

Personally, I hate the descriptor “unstructured content.” When I hear it, all I can think of is letters spilling out of a bucket – no rhyme or reason to the flow or sense to be made (unless you’re Edward Lorenz).

What we so often call unstructured content actually has plenty of structure that can be leveraged for many kinds of text analytics.

Explicit Metadata

Every file system object has metadata attributes associated with it. This is explicit metadata, which search vendors have used for years to make content easier to find. Things like “name,” “date,” and “filetype” are examples of explicit metadata.

Since the dawn of computers, explicit metadata has been used to help computers store information, and users to find it.
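
To make this concrete, here is a minimal sketch of reading explicit metadata with Python’s standard library. The function name `explicit_metadata` and the choice of fields are my own illustration, not any vendor’s API; a real search or ECM product would collect far more attributes.

```python
import os
from datetime import datetime, timezone


def explicit_metadata(path):
    """Collect a few file system attributes for one file (illustrative helper)."""
    info = os.stat(path)
    return {
        # "name": the file name without its directory
        "name": os.path.basename(path),
        # "filetype": taken from the extension only, which is an assumption;
        # the extension can be missing or wrong
        "filetype": os.path.splitext(path)[1].lstrip(".").lower(),
        "size_bytes": info.st_size,
        # "date": last modification time, reported in UTC
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }
```

Calling `explicit_metadata` on any file returns a small dictionary that could be indexed alongside the file’s text.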

Implied Metadata

This is metadata that can be defined or extracted from the content of objects in file systems. Office documents have a Properties sheet with information such as “last accessed date,” “author,” and “word count.” Music files often store album information and song length.

Even deeper, within many content objects we can identify and extract things like credit card numbers, phone numbers, dates, or names. This type of entity identification and extraction enables a rich metadata view into content, something only possible because there’s _structure_, not random letters.
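
As a rough sketch of what entity extraction can look like, the Python below pulls candidate credit card and phone numbers out of plain text. The patterns and the names `extract_entities` and `luhn_valid` are assumptions made for illustration: the phone pattern only covers one North American layout, and the card check uses the Luhn checksum to discard digit runs that cannot be card numbers. Production tools are considerably more thorough.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens (assumed format)
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# NNN-NNN-NNNN or NNN.NNN.NNNN only (assumed format)
PHONE_PATTERN = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def extract_entities(text):
    """Find candidate card and phone numbers in free text."""
    cards = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            cards.append(digits)
    return {
        "credit_cards": cards,
        "phone_numbers": PHONE_PATTERN.findall(text),
    }
```

The checksum step is the design choice worth noting: a bare digit pattern flags order numbers and tracking codes, while the Luhn check removes most of those false positives.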

So What?

It’s unfair to call an email or Word document “unstructured.” I prefer the generic “content” because when you really look, there’s rich structure that a business can exploit. Companies can use “last accessed date” to support records management (RM) retention and disposition activities, or identify credit card numbers to verify adherence to their PII strategy.
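
The retention idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a records management policy: the function name `disposition_candidates` and the `max_idle_days` threshold are hypothetical, and the file system’s access time is only trustworthy where the operating system actually records it (some volumes are configured to update it rarely or never).

```python
import os
import time

SECONDS_PER_DAY = 86400


def disposition_candidates(paths, max_idle_days, now=None):
    """Return the paths whose last access is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    # st_atime is the file system's "last accessed" timestamp
    return [p for p in paths if os.stat(p).st_atime < cutoff]
```

The result is a list of files to review, not to delete automatically; a retention schedule would still decide what happens to each one.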

Thankfully, there’s no such thing as unstructured data. Otherwise the companies I work with wouldn’t be able to use the explicit and implied metadata, and rich object structure of content in their content analytics strategies.

  1. August 10, 2011 at 11:58

    Thanks for the interesting article. I loved your comparison of unstructured data to a bucket – especially since I used to get letters in … buckets. When I worked as an archivist, people would donate papers, diaries, photos, and letters in anything imaginable. They came in boxes, lunch bags, garbage bags, and yes – even buckets. You described their state perfectly! “Letters spilling out of a bucket – no rhyme or reason to the flow or sense to be made.”

    In your blog you wrote about extracting explicit and implied metadata as one way of getting valuable information out of a bucket of letters. (I find your definitions of explicit and implied metadata useful and like them, btw.) As an archivist I would take a different approach to extracting useful information. My goal was to give a sense of meaning – to tease out the “rhyme or reason to the flow or sense to be made” when there was none readily apparent. I would focus on describing what the letters in the bucket were “about.” “About-ness,” in the sense I mean here, speaks to what brought all the specific pieces of data together. There are different techniques for doing this and I won’t bore you with the details here. The important part is that it could be done. The approach generated a different type of metadata, which in turn guided users to documents they wouldn’t otherwise have found. I don’t see this approach in the ECM world. In some cases it is not appropriate – for example, if a company was supporting only “fact-finding” search strategies, I wouldn’t use it. I found it works great when companies want to solve more process-oriented problems. IMO there is a huge opportunity here waiting to be realized by enterprises.

    Again, thanks for your blog article. Hope to hear from you soon.

    Peter Wilkerson
