Posts Tagged ‘Content Analytics’

Unstructured Data = Letters in a Bucket?

August 9, 2011

Unstructured Data – The Myth

There’s a lot of noise about “structured” and “unstructured” data. But what is this “unstructured data” beast we keep hearing about?

It’s almost impossible to see a presentation from a search, ECM, or eDiscovery vendor that doesn’t claim “80% of a corporation’s data is unstructured.” Aside from the fact that much of this data is garbage (see my earlier postings), is it really fair to call it unstructured?

Personally, I hate the descriptor “unstructured content.” When I hear it, all I can think of is letters spilling out of a bucket – no rhyme or reason to the flow, no sense to be made (unless you’re Edward Lorenz).

What we so often call unstructured content actually has plenty of structure that can be leveraged for many types of text analytics purposes.

Explicit Metadata

Every file system object has metadata attributes associated with it. This is explicit metadata, which search vendors have used for years to improve content findability. Attributes like “name,” “date,” and “filetype” are examples of explicit metadata.

Since the dawn of computers, explicit metadata has been used to help computers store information, and users to find it.
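
To make this concrete, here’s a minimal Python sketch of reading a file’s explicit metadata straight from the file system. The file name is hypothetical, and a real crawler would walk entire shares rather than a single path:

```python
import datetime
from pathlib import Path

def explicit_metadata(path: str) -> dict:
    """Collect the explicit metadata the file system already keeps for an object."""
    p = Path(path)
    stat = p.stat()
    return {
        "name": p.name,
        "filetype": p.suffix.lstrip("."),  # e.g. "docx"
        "size_bytes": stat.st_size,
        "modified": datetime.datetime.fromtimestamp(stat.st_mtime).isoformat(),
    }

# Hypothetical file; any path on disk works the same way.
print(explicit_metadata("quarterly_report.docx"))
```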

Implied Metadata

This is metadata that can be derived or extracted from the content of objects in file systems. Office documents have a Properties sheet with information such as “last accessed date,” “author,” and “word count.” Music files often store album information and song length.

Even deeper, within many content objects we can identify and extract things like credit card numbers, phone numbers, dates, or names. This type of entity identification and extraction enables a rich metadata view into content – something only possible because there’s _structure_, not random letters.
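
As a rough illustration (not any vendor’s extractor), a couple of regular expressions are enough to show how entity identification pulls implied metadata out of plain text. Real engines use far more robust patterns and validation:

```python
import re

# Simplified patterns for illustration only; production extractors
# are considerably more robust.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){14,15}\d\b"),  # 15- or 16-digit strings
    "phone": re.compile(r"\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def extract_entities(text: str) -> dict:
    """Return every pattern match found in a piece of 'unstructured' text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}

sample = "Call 647.285.2630 and bill card 4111 1111 1111 1111."
print(extract_entities(sample))
```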

So What?

It’s unfair to call an email or Word document “unstructured.” I prefer the generic “content,” because when you really look, there’s rich structure a business can exploit. Companies can use “last accessed date” to support RM retention and disposition activities, or identify credit card numbers to ensure adherence to their PII strategy.

Thankfully, there’s no such thing as truly unstructured data. Otherwise the companies I work with wouldn’t be able to use the explicit and implied metadata, and the rich object structure of content, in their content analytics strategies.

Text Analytics – An Ingredient, not a Solution

July 27, 2011

Wikipedia has a short entry on “text analytics” (http://en.wikipedia.org/wiki/Text_analytics) where the length of the definition is eclipsed by the list of vendors selling products, solutions, and tools that claim some type of text analytics capability.

After reading far too many vendor data sheets and whitepapers, and reviewing all the fun #textanalytics twitter posts, it really boils down to “deep inspection of content” for some tuned business purpose.

There are those who claim to have broad platforms for text analytics and cast a (very) wide net across the multitude of potential business use-cases – from threat and fraud detection to better search to clinical trials – but without a real “solution.” (I define a “solution” as a product or platform with domain-specific IP that addresses a business problem.)

The text analytics marketplace is truly nascent. We’re seeing huge growth in content related to the topic, and in searches about the topic. You can see for yourself with Google Insights: http://www.google.com/insights/search/#q=text%20analytics&cmpt=q

Some vendors are positioning their technologies as platforms for text analytics; however, I expect they’ll struggle as customers try to understand whether they’re selling a “dessert topping or floor wax.”

The true power, and value, of text analytics comes from leveraging its capabilities as ingredients in broader business solutions. Text analytics alone won’t reduce insurance fraud, but it can be an extremely powerful addition to an insurance fraud solution that includes ECM, BPM, and BI.

I expect that over the next two years we’ll see a decline in the positioning of platforms and a marked increase in the positioning (and vendor success) of text analytics leveraged in real business solutions and enterprise applications. The power of text analytics is real – but that power only becomes truly useful when delivered in the context of a business problem with definable ROI.
This is, I believe, why “consumer insight” is so popular in the text analytics field.

It’s intuitively obvious that happy customers are better than unhappy ones. What keeps me so interested in text analytics is the ingredient opportunities beyond consumer sentiment analytics.

Data Integration – How Text Analytics Can Help

February 8, 2011

According to a Bloor Research report from some years ago, 80% of data integration projects fail.

I’m not sure if that’s been updated recently, but I’d be very surprised to learn the percentage has changed much one way or the other.

The question we should ask ourselves is not “why.” Instead, as odd as it sounds, we should ask “what.”

Not “Why” Integrate

The reasons why to integrate data are well known and well understood – and there’s usually a solid ROI. We integrate data to create data warehouses, to build ODSs for near-line operational BI, and to migrate systems (from one vendor to another, from one database to another, from one application to another).

Not “How” to Integrate

An alphabet soup of technologies exists to answer the “how” question. Depending upon the “why,” an organization might opt for any of the usual suspects: ETL, EII, EAI, or CDC. All are excellent options with well-defined usage scenarios to meet whatever the business and IT requirements are.

It’s the “What” to Integrate

Figuring out what data to integrate is the single biggest challenge, and as more systems are implemented and data grows, it’s becoming a harder problem to solve, not easier.

In the dark old days, data architects would sit around tables with printouts of data models and try to map them all together. Today, vendors like Informatica, SAP, and IBM have business-user-focused tools that help identify the relationships between data elements.

Unfortunately, most of the technologies in place today rely on looking at column definitions (name, datatype, size, referential integrity definitions) and trying to create a logical mapping across systems from that.

When dealing with the many applications in the marketplace that use application-level code to enforce PK/FK relationships, the modelling tools above simply aren’t good enough. If they were, that 80% would be much smaller.

Text Analytics Can Help

It’s not a panacea.  It’s not a magic wand.  But – text analytics can help data integration projects succeed.

Using text analytics, organizations can index the data across many different systems and infer relationships between columns and data sets. Text analytics can give a company a view into columns that store identical, or very similar, data (such as company names, vendor names, or product names). It can even recognize that the data in DB1.table9.column99 is of type “credit card” and report all the other databases, tables, and columns that hold the same kind of data.
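
As a hedged sketch of the idea (not any particular vendor’s product), profiling a sample of values from each column and grouping columns by inferred semantic type might look like this. The database and table names, patterns, and sample values are all hypothetical:

```python
import re

# Illustrative type detectors; a real profiler would carry many more.
TYPE_PATTERNS = {
    "credit_card": re.compile(r"^(?:\d[ -]?){14,15}\d$"),
    "phone": re.compile(r"^\+?1?[-. ]?\d{3}[-. ]?\d{3}[-. ]?\d{4}$"),
}

def infer_column_type(values, threshold=0.8):
    """Label a column by the semantic type most of its sampled values match."""
    for type_name, rx in TYPE_PATTERNS.items():
        hits = sum(1 for v in values if rx.match(str(v).strip()))
        if values and hits / len(values) >= threshold:
            return type_name
    return "unknown"

# Hypothetical samples pulled from two different systems:
columns = {
    "DB1.table9.column99": ["4111-1111-1111-1111", "5500 0000 0000 0004"],
    "DB2.orders.cc_num":   ["378282246310005", "4111111111111111"],
}
by_type = {}
for col, sample in columns.items():
    by_type.setdefault(infer_column_type(sample), []).append(col)
print(by_type)  # columns holding the same kind of data end up grouped together
```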

Text analytics is another way to get a view into your structured data assets that can help support successful data integration projects. With an 80% failure rate, anything that can help turn a challenge into success is of critical importance.

Data Quality and Unstructured Content

February 6, 2011

Data quality matters to unstructured content. Just as quality is a critical requirement in structured data integration, it is intrinsic to an effective content strategy.

Unstructured content is rife with quality issues. Spelling errors, nearly random formatting of key attributes like phone numbers, and acronyms (and their variants) are just some of the quality issues confronting unstructured content – just as they confront structured data.

There’s no such thing as an irrelevant data quality issue.  To quote Ted Friedman:

“If you look at…any business function in your company, you’re going to find some direct cost there attributed to poor data quality.”  http://www.gartner.com/it/products/podcasting/asset_145611_2575.jsp

The quality of your data directly impacts the business’s ability to support effective content analytics, search, and content integration.

If you’re going to leverage content, you must be able to trust it – and that means executing quality processes as part of the semantic enrichment before analysis, search or content integration can be successful.

Data quality processes are key to an effective BI strategy. So too are they for content analytics, search, and content integration (ETL).

Smarter content means trusted content.  You can’t trust your content unless there’s a quality process around it.

Data Quality for Unstructured Content

This list isn’t intended to be comprehensive, but here are some core quality activities one must undertake to create Trusted Business Content:

Standardization – This comes in the form of spelling, managing acronyms, and content formats (a short sketch follows this list).
a) Spelling: Correcting the spellings of product, vendor, or supplier names. Product names are frequently misspelt in documents and web pages, and it’s simple yet critical to recognize and correct those errors.
b) Acronyms: You must recognize and standardize acronym usage. For example, recognizing and standardizing I.B.M and ibm and “International Business Machines” to the standard “IBM.”
c) Formats: Recognize strings such as 647.285.2630 and +1-640-285-2630 and normalize them into a consistent phone number format.
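
A minimal sketch of the acronym and format steps, assuming a toy variant table in place of the managed dictionaries a real deployment would use:

```python
import re

# Toy variant table; real deployments manage these lists with proper tooling.
ACRONYM_MAP = {
    "i.b.m": "IBM",
    "ibm": "IBM",
    "international business machines": "IBM",
}

def standardize_acronym(term: str) -> str:
    """Map known variants of a name onto one standard form."""
    return ACRONYM_MAP.get(term.strip().lower(), term)

def standardize_phone(raw: str) -> str:
    """Strip a phone string down to digits, then emit one consistent format."""
    digits = re.sub(r"\D", "", raw)[-10:]  # keep the last ten digits
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(standardize_acronym("I.B.M"))          # -> IBM
print(standardize_phone("647.285.2630"))     # -> +1-647-285-2630
print(standardize_phone("+1-640-285-2630"))  # -> +1-640-285-2630
```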

Verification – This comes in the form of validating strings and recognizing their semantic meaning. In conjunction with standardization, verification means capabilities such as recognizing that a string matching “AA######” is a potentially valid Canadian passport number, or identifying a 15-digit string and determining it’s an Amex credit card number. This verification of the semantic meaning of extracted entities enables the business both to standardize and to assess its unstructured content assets.
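
As an illustration, a shape check plus a checksum is often enough for a first pass. The sketch below pairs the “AA######” passport pattern with the standard Luhn test that major card numbers (including Amex) must satisfy:

```python
import re

PASSPORT_CA = re.compile(r"^[A-Z]{2}\d{6}$")  # the "AA######" shape

def luhn_valid(number: str) -> bool:
    """Luhn checksum: a cheap first test that a digit string could be a real card."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

print(bool(PASSPORT_CA.match("AB123456")))  # True: the shape is plausible
print(luhn_valid("378282246310005"))        # True: a published Amex test number
```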

Enriching – This means recognizing that a document contains any of the above and enriching the surrounding metadata. It enables more effective search, deeper and richer analytics, and supports content ETL processes as well. For example, identifying that a document contains a credit card number is of business value, but being able to enrich the document’s metadata with a flag indicating that a credit card number exists is critical to effective, functional assessment and empowers a richer, more effective search experience.
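
A minimal sketch of that flag-setting step, assuming a simple pattern match stands in for the full extraction pipeline:

```python
import re

CARD_RX = re.compile(r"\b(?:\d[ -]?){14,15}\d\b")  # 15- or 16-digit strings

def enrich_metadata(metadata: dict, text: str) -> dict:
    """Set a flag on the document's metadata when card-shaped data is found."""
    metadata["has_credit_card"] = bool(CARD_RX.search(text))
    return metadata

meta = enrich_metadata({"name": "invoice.docx"},
                       "Please bill card 4111 1111 1111 1111.")
print(meta)  # {'name': 'invoice.docx', 'has_credit_card': True}
```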

Data quality on unstructured content matters. Quality makes search better, content analytics effective, and integration usable. Semantic enrichment of quality-free content is akin to “lipstick on a pig.”

It’s a business imperative to make quality a component of your content strategy.

Text Analytics & Jeopardy

January 11, 2011

On February 14-16, 2011, IBM is making a big gamble, attempting to compete against Ken Jennings and Brad Rutter on Jeopardy.

Named Watson after one of the founders of IBM, DeepQA uses UIMA, the OASIS text analytics standard, on an asynchronous messaging infrastructure (presumably MQSeries) for real-time analysis of unstructured content (a question) and response (an answer).

As Stephen Baker pointed out in an article on the Huffington Post, this isn’t a “search” problem that any Tom, Dick, or Google can solve.

Take, for example, the clue “Not surprisingly, more undergrads at the University of Puget Sound are from this state than any other.” Watson needs to understand that “not surprisingly” means the answer is true, or likely, then reason out which state the University of Puget Sound is in and whether Washington is the most likely answer.

Watson is to be fully self-contained with no access to external information – just like a real contestant.

If IBM can pull this off, it will show that “search” isn’t “text analytics.” That’s not to say that text analytics can’t (and doesn’t) deliver value to search technologies, but trying to meet the Jeopardy challenge with ancient Bayesian algorithms or traditional search indexing simply can’t work.

Dave Ferrucci is the Principal Investigator leading the Watson team, and was recently interviewed on NPR.  You can read the transcript here:  http://www.npr.org/2011/01/08/132769575/Can-A-Computer-Become-A-Jeopardy-Champ

13-Jan-2011 update:

There was a test match today, which Watson won. You can see photos from the match on CNET News: http://news.cnet.com/2300-11386_3-10006289.html?tag=mncol

Is All Your Content Equal?

January 7, 2011

Trusted Business Content

If you’re like the vast majority of the companies I speak to, more content is not your problem. The problem you’re struggling with is “how can I trust this content?” or “how do I figure out if this content is valuable to my business?”

Being awash in a sea of content doesn’t provide a unique business opportunity. Being able to trust the content you have, and to leverage it for new business insights, does.

In the same manner that not all content is equal, not all content repositories are equal.  Identifying the content that matters to your business, and then doing something useful with that content, is the difference between a company that stores content, and one that leverages it.

At first blush, the problem of sifting through the masses of content you store to figure out what’s relevant can seem extremely daunting. If you’re like many of the companies I speak with, you have terabytes of content across stores like FileNet, Livelink, SharePoint, and shared drives.

Sadly, the reality is that there’s simply no magic wand you can wave. But there are programmatic approaches you can use to build a strategy for identifying the content that matters.

What is Trusted Business Content?

Fundamentally, Trusted Business Content (TBC) is the unstructured information you know you can leverage to optimize your content-related business activities and enable innovation.

Trusted Business Content enables innovation. Very simply, if your IT department can deliver accurate and complete information, in context, to a business user, then that user can perform the right analysis and draw out true business insights. Using bad data as a strategic asset is simply not an option.

TBC has four key tenets:

  1. Insightful.  Derives meaning and new understanding from your content, as the content is created and changes
  2. In Context.  Supports the real-time and on-demand delivery of relevant information in the context of business processes, applications, systems and user demands
  3. Complete.  Related information is reconciled into a single and holistic view
  4. Accurate.  Complex and disparate content is transformed, cleansed and delivered

So, where, and how, do you start?

It’s a Journey

There’s no “one-and-done” approach to developing organizational Trusted Business Content (TBC).

Just as data quality (#dataquality), data integration (#dataintegration), and data warehousing (#datawarehousing) projects are ongoing business activities, so too is the need to identify TBC. There’s low-hanging fruit in the most commonly accessed and used systems.

Like any business transformation exercise, find the low-hanging fruit to build a fast and measurable success for internal stakeholders.

Leverage What You Already Know

Every organization already has lists of the entities it deals with: customers, vendors, suppliers, employees, etc. Some companies store, control, and manage a single view of these entities in their MDM (#MDM) repositories; for others, it’s in one – or more – data warehouses.

These entities are a start to understanding the business value of your content. You can create entity identifiers, such as UIMA (#UIMA) dictionary annotators, for entity extraction and identification. A usable content analytics technology (#textanalytics) enables the business to quickly and easily manage the entities being extracted with feature-rich tooling.

If a piece of content contains no customer name, vendor name, supplier name, or any other business entity you report on, there’s a strong possibility it’s not relevant content.

Additionally, libraries of algorithms are available for identifying strings of characters as credit card numbers, Social Insurance Numbers, phone numbers, cities, states, countries, etc., which can be used to help identify content of business value.

Use common sense.  If a document has a credit card number or a customer name – it’s probably important.
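
UIMA’s dictionary annotators are Java components, but the core idea is simple enough to sketch in a few lines of Python. The entity lists and document text here are hypothetical stand-ins for lists exported from an MDM hub or data warehouse:

```python
import re

# Hypothetical entity lists exported from an MDM hub or data warehouse.
DICTIONARIES = {
    "customer": {"acme corp", "globex"},
    "vendor":   {"initech"},
}
CARD_RX = re.compile(r"\b(?:\d[ -]?){14,15}\d\b")  # 15- or 16-digit strings

def business_value_signals(text: str) -> dict:
    """Count dictionary and pattern hits as a rough relevance signal."""
    lowered = text.lower()
    hits = {name: sum(1 for term in terms if term in lowered)
            for name, terms in DICTIONARIES.items()}
    hits["credit_card"] = len(CARD_RX.findall(text))
    hits["likely_relevant"] = any(v > 0 for v in hits.values())
    return hits

print(business_value_signals("Invoice for Acme Corp, card 4111 1111 1111 1111"))
# -> {'customer': 1, 'vendor': 0, 'credit_card': 1, 'likely_relevant': True}
```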

Visualize the Results

A business user doesn’t want to see a printout of a list of file names that are (or aren’t) of business value. An effective content analytics product will visualize the results and allow users to dynamically slice and dice them.

Deliver visual reports to the decision makers showing how much content can be eliminated and what the makeup of the TBC is. This will best enable the business decision makers to understand the value of the IT work.