Posts Tagged ‘Trusted Business Content’

Unstructured Data = Letters in a Bucket?

August 9, 2011

Unstructured Data – The Myth

There’s a lot of noise about “structured” and “unstructured” data. But what is this “unstructured data” beast we keep hearing about?

It’s almost impossible to see a presentation from a search, ECM, or eDiscovery vendor that doesn’t mention how “80% of a corporation’s data is unstructured.” Aside from the fact that much of this data is garbage (see my earlier postings), is it really fair to call it unstructured?

Personally, I hate the descriptor “unstructured content.” When I hear it, all I can think of is letters spilling out of a bucket – no rhyme or reason to the flow or sense to be made (unless you’re Edward Lorenz).

What we so often call unstructured content really has lots of structure that can be leveraged for many types of text analytics.

Explicit Metadata

Every file system object has metadata attributes associated with it. This is explicit metadata, which search vendors have used for years to make content easier to find. Attributes like “name,” “date,” and “filetype” are examples of explicit metadata.

Since the dawn of computers, explicit metadata has been used to help computers store information, and users to find it.
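To make this concrete, here’s a minimal sketch of reading explicit metadata with Python’s standard library. The function name `explicit_metadata` and the keys of the returned dictionary are my own illustrative choices, not part of any product.

```python
from datetime import datetime, timezone
from pathlib import Path


def explicit_metadata(path):
    """Return file-system-level (explicit) metadata for one file.

    Hypothetical helper: the key names are illustrative only.
    """
    p = Path(path)
    info = p.stat()  # attributes the operating system already tracks
    return {
        "name": p.name,
        # Extension without the dot, lower-cased; "" if there is none.
        "filetype": p.suffix.lower().lstrip("."),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }
```

None of this requires opening the file: the structure is already there before you look at a single word of the content.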

Implied Metadata

This is metadata that can be defined or extracted from the content of objects in file systems. Office documents have a Properties sheet with information such as “last accessed date,” “author,” and “word count.” Music files often store album information and song length.

Even deeper, within many content objects we can identify and extract things like credit card numbers, phone numbers, dates, or names. This type of entity identification and extraction enables a rich metadata view into content – something only possible because there’s _structure_, not random letters.
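As a rough illustration of entity extraction, the sketch below pulls phone numbers and dates out of plain text with regular expressions. The two patterns are deliberately narrow assumptions of mine (North American style phone numbers and YYYY-MM-DD dates); real content would need broader, locale-aware rules.

```python
import re

# Illustrative patterns only: they cover one phone format and one date format.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(text):
    """Return every match for each pattern, keyed by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


entities = extract_entities("Call 416-555-0199 before 2011-08-09.")
# entities == {"phone": ["416-555-0199"], "date": ["2011-08-09"]}
```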

So What?

It’s unfair to call an email or Word document “unstructured.” I prefer the generic “content” because when you really look, there’s rich structure that can be exploited by a business. Companies can use “last accessed date” to support records management (RM) retention and disposition activities, or identify credit card numbers to ensure adherence to their PII strategy.

Thankfully, there’s no such thing as unstructured data. Otherwise, the companies I work with wouldn’t be able to use the explicit and implied metadata, and the rich object structure of content, in their content analytics strategies.
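The retention use case can be sketched in a few lines. This assumes the file system actually records access times (some volumes are mounted so that they don’t), and the function name `stale_files` and the idle threshold are hypothetical.

```python
import time
from pathlib import Path


def stale_files(root, max_idle_days, now=None):
    """Yield files under root whose last access is older than max_idle_days.

    Hypothetical helper: relies on the file system recording access times.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    for p in Path(root).rglob("*"):
        if p.is_file() and p.stat().st_atime < cutoff:
            yield p
```

A list like this is only a candidate set for disposition review, not a decision: the retention policy itself still has to come from the business.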

Is All Your Content Equal?

January 7, 2011

Trusted Business Content

If you’re like the vast majority of the companies I speak to, more content is not your problem.  The problem you’re struggling with is “how can I trust this content” or “how do I figure out if this content is valuable to my business?”

Being awash in a sea of content doesn’t provide a unique business opportunity. Being able to trust the content you have, and leverage it for new business insights, does.

In the same manner that not all content is equal, not all content repositories are equal.  Identifying the content that matters to your business, and then doing something useful with that content, is the difference between a company that stores content, and one that leverages it.

At first blush, the problem of sifting through the masses of content you store to figure out what’s relevant can be extremely daunting. If you’re like many of the companies I speak with, you have terabytes of content across stores like FileNet, Livelink, SharePoint and shared drives.

Sadly, the reality is that there’s simply no magic wand you can wave. But there are programmatic approaches you can use to build a strategy for identifying the content that matters.

What is Trusted Business Content?

Fundamentally, Trusted Business Content (TBC) is your unstructured information that you know you can leverage to optimize your content-related business activities and enable innovation.

Trusted Business Content enables innovation. Very simply, if your IT department can deliver accurate and complete information in context to business users, those users will be able to perform the right analysis and draw out true business insights. Using bad data as a strategic asset is simply not an option.

TBC has 4 key tenets:

  1. Insightful.  Derives meaning and new understanding from your content, as the content is created and changes
  2. In Context.  Supports the real-time and on-demand delivery of relevant information in the context of business processes, applications, systems and user demands
  3. Complete.  Related information is reconciled into a single and holistic view
  4. Accurate.  Complex and disparate content is transformed, cleansed and delivered

So, where, and how, do you start?

It’s a Journey

There’s no “one-and-done” approach to developing organizational Trusted Business Content (TBC).

Just as data quality (#dataquality), data integration (#dataintegration) and data warehousing (#datawarehousing) projects are ongoing business activities, so too is the need to identify TBC. There’s low-hanging fruit in the most commonly accessed and used systems.

Like any business transformation exercise, find the low-hanging fruit to build a fast and measurable success for internal stakeholders.

Leverage What You Already Know

Every organization already has lists of the entities it deals with: customers, vendors, suppliers, employees, etc. Some companies store, control, and manage a single view of these entities in their MDM (#MDM) repositories; for others, it’s in one – or more – data warehouses.

These entities are a start to help with understanding the business value of your content. You can create entity identifiers, such as UIMA (#UIMA) dictionary annotators, for entity extraction and identification. A usable content analytics technology (#textanalytics) enables the business to quickly and easily manage the entities being extracted with feature-rich tooling.

If there’s no customer name, vendor name, supplier name, or any other business entity you report on – there’s a strong possibility it’s not relevant content.
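The dictionary idea can be illustrated without any particular framework. The sketch below is plain Python, not UIMA, and the function name `find_known_entities` and the sample customer list are made up for illustration; in practice the names would come from your MDM repository or data warehouse.

```python
import re


def find_known_entities(text, entity_names):
    """Return the known entity names that appear in text.

    Matching is case-insensitive and only accepts whole-word hits, so
    "Globex" does not match inside "Globexville".
    """
    found = []
    for name in entity_names:
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


customers = ["Acme Corp", "Globex"]  # hypothetical entity list
hits = find_known_entities("Invoice for ACME Corp, due Friday.", customers)
# hits == ["Acme Corp"]; an empty list suggests the document may not be relevant.
```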

Additionally, libraries of algorithms are available for identifying strings of characters as credit card numbers, Social Insurance Numbers, phone numbers, cities, states, countries, etc., which can be used to help identify content of business value.
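One well-known example of such an algorithm is the Luhn checksum, which payment card numbers satisfy. The sketch below is my own minimal version, not any specific library: it assumes card numbers are 13 to 19 digits long, optionally separated by spaces or hyphens, and the names `luhn_valid` and `find_card_numbers` are illustrative.

```python
import re

# Assumption: 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return digit strings in text that look like cards and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Passing the checksum does not prove a string is a real card number, but it filters out most random digit runs, which is usually enough to flag a document for a closer look.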

Use common sense.  If a document has a credit card number or a customer name – it’s probably important.

Visualize the Results

A business user doesn’t want to see a printout of a list of file names that are (or aren’t) of business value.  An effective content analytics product will be able to visualize the results and allow for the dynamic slice & dice of the results. 

Deliver visual reports to the decision makers showing how much content can be eliminated and what the makeup of the TBC is. This will best enable them to understand the value of the IT work.