Archive for January, 2011

Text Analytics & Jeopardy

January 11, 2011

On February 14 – 16, 2011, IBM is making a big gamble by competing against Ken Jennings and Brad Rutter on Jeopardy.

Named Watson after one of the founders of IBM, the DeepQA system uses UIMA, the OASIS text analytics standard, on an asynchronous messaging infrastructure (presumably MQSeries) for real-time analysis of unstructured content (a question) and generation of a response (an answer).

As pointed out by Stephen Baker in an article on Huffington Post, this isn’t a “search” problem that any Tom, Dick or Google can solve.

Take, for example, the question “Not surprisingly, more undergrads at the University of Puget Sound are from this state than any other.”  Watson needs to understand that “not surprisingly” signals that the expected answer is true, or at least likely.  It must then work out which state the University of Puget Sound is in, and whether Washington is the most likely correct answer.

Watson is to be fully self-contained, with no access to external information, just like a real contestant.

If IBM can pull this off, it will show that “search” isn’t “text analytics.”  That’s not to say that text analytics can’t deliver value to search technologies (it does), but trying to meet the Jeopardy challenge with ancient Bayesian algorithms or traditional search indexing simply can’t work.

Dave Ferrucci is the Principal Investigator leading the Watson team, and was recently interviewed on NPR.  You can read the transcript here:

13-Jan-2011 update:

There was a test match today that Watson won.  You can see photos from the match on CNET News:


How Much Did “Free” Cost You?

January 10, 2011

Nothing is free – especially software.  Whether willful or not, it’s common for the costs of a successful software implementation to be grossly underestimated (a key reason 80% of data integration projects fail).  Maybe it’s because too often we think some piece of software will do everything we need out-of-the-box without realizing how much really needs to be customized. 

Even so, it was a shock to me to learn that “According to Microsoft research, the services opportunity for SharePoint alone is predicted to grow to US$6.2 billion by 2011.”

As Alan Pelz-Sharpe points out in a great blog posting, that’s what “you the customer will spend.”

If a car were #SharePoint, it would be like buying the seats, tires, windows, antenna, doors and fenders separately.


Is All Your Content Equal?

January 7, 2011

Trusted Business Content

If you’re like the vast majority of the companies I speak to, more content is not your problem.  The problem you’re struggling with is “how can I trust this content” or “how do I figure out if this content is valuable to my business?”

Being awash in a sea of content doesn’t provide a unique business opportunity.  Being able to trust the content you have, and leverage it for new business insights, does.

In the same manner that not all content is equal, not all content repositories are equal.  Identifying the content that matters to your business, and then doing something useful with that content, is the difference between a company that stores content, and one that leverages it.

At first blush, the problem of sifting through the masses of content you store to figure out what’s relevant can be extremely daunting.  If you’re like many of the companies I speak with, you have terabytes of content across stores like FileNet, Livelink, SharePoint and shared drives.

Sadly, the reality is that there’s simply no magic wand you can wave.  But there are programmatic approaches you can use to build a strategy for identifying the content that matters.

What is Trusted Business Content?

Fundamentally, Trusted Business Content (TBC) is the unstructured information that you know you can leverage to optimize your content-related business activities and enable innovation.

Trusted Business Content enables innovation.  Very simply, if your IT department can deliver accurate and complete information in context, business users will be able to perform the right analysis and draw out true business insights.  Using bad data as a strategic asset is simply not an option.

TBC has 4 key tenets:

  1. Insightful.  Derives meaning and new understanding from your content, as the content is created and changes
  2. In Context.  Supports the real-time and on-demand delivery of relevant information in the context of business processes, applications, systems and user demands
  3. Complete.  Related information is reconciled into a single and holistic view
  4. Accurate.  Complex and disparate content is transformed, cleansed and delivered

So, where, and how, do you start?

It’s a Journey

There’s no “one-and-done” approach to developing organizational Trusted Business Content (TBC).

Just as data quality (#dataquality), data integration (#dataintegration) and data warehousing (#datawarehousing) projects are ongoing business activities, so too is the need to identify TBC.  There’s low-hanging fruit in the most commonly accessed and used systems.

Like any business transformation exercise, find the low-hanging fruit to build a fast and measurable success for internal stakeholders.

Leverage What You Already Know

Every organization already has lists of the entities it deals with: customers, vendors, suppliers, employees, etc.  Some companies store, control and manage a single view of these entities in their MDM (#MDM) repositories; for others, it’s in one or more data warehouses.

These entities are a starting point for understanding the business value of your content.  You can create entity identifiers, such as UIMA (#UIMA) dictionary annotators, for entity extraction and identification.  A usable content analytics technology (#textanalytics) gives the business feature-rich tooling to quickly and easily manage the entities being extracted.

If there’s no customer name, vendor name, supplier name, or any other business entity you report on – there’s a strong possibility it’s not relevant content.
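The dictionary idea can be sketched in a few lines of plain Python.  This is not the UIMA API, just a stand-in that shows the principle: the entity names, the `ENTITY_DICTIONARY` structure and the function names are all hypothetical, and a real list would come from your MDM repository or data warehouse.

```python
import re

# Hypothetical entity lists; in practice these would be exported from MDM
# or a data warehouse rather than hard-coded.
ENTITY_DICTIONARY = {
    "customer": ["Acme Corp", "Globex"],
    "vendor": ["Initech"],
}

def build_matchers(dictionary):
    """Compile one case-insensitive, whole-word pattern per entity type."""
    matchers = {}
    for entity_type, names in dictionary.items():
        # Longest names first so a longer name wins over a shorter prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        matchers[entity_type] = re.compile(
            r"\b(?:" + alternatives + r")\b", re.IGNORECASE
        )
    return matchers

def extract_entities(text, matchers):
    """Return (entity_type, matched_text, start_offset) tuples in document order."""
    found = []
    for entity_type, pattern in matchers.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])

def is_candidate_tbc(text, matchers):
    """A document that mentions no known business entity is probably not relevant."""
    return bool(extract_entities(text, matchers))

if __name__ == "__main__":
    matchers = build_matchers(ENTITY_DICTIONARY)
    print(extract_entities("Invoice from Initech for ACME CORP", matchers))
    print(is_candidate_tbc("Lunch menu for Friday", matchers))
```

Exact string matching like this misses misspellings and abbreviations, which is one reason purpose-built annotators and tooling earn their keep.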

Additionally, libraries of algorithms are available for identifying strings of characters as credit card numbers, Social Insurance Numbers, phone numbers, cities, states, countries, etc., which can be used to help identify content of business value.

Use common sense.  If a document has a credit card number or a customer name – it’s probably important.
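As one illustration of the kind of algorithm such libraries contain, here is a minimal sketch that finds candidate credit card numbers and keeps only those that pass the Luhn checksum, the standard check-digit scheme used by payment cards.  The pattern and function names are my own assumptions, and passing the checksum only means a string is well-formed, not that it is a real account.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0

def find_card_numbers(text):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    results = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        if luhn_valid(digits):
            results.append(digits)
    return results

if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real account.
    print(find_card_numbers("Card 4111 1111 1111 1111 on file"))
```

The checksum step matters because long runs of digits (order numbers, tracking codes) are common in business documents, and a pattern match alone would flag far too many of them.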

Visualize the Results

A business user doesn’t want to see a printout of a list of file names that are (or aren’t) of business value.  An effective content analytics product will be able to visualize the results and allow for the dynamic slice & dice of the results. 

Deliver visual reports to the decision makers that show how much content can be eliminated and what the TBC is made up of.  This will best enable them to understand the value of the IT work.
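A real content analytics product would provide interactive charts, but the underlying roll-up is simple.  The sketch below assumes each document has already been labelled with a repository and a category; the labels "trusted" and "eliminate" and both function names are hypothetical.

```python
from collections import Counter

def summarize(results):
    """Count documents per (repository, category) pair.

    results: iterable of (repository, category) tuples, where category is a
    hypothetical label such as "trusted" or "eliminate".
    """
    return Counter(results)

def render_bars(summary, width=40):
    """Render the counts as proportional text bars, largest first."""
    if not summary:
        return ""
    largest = max(summary.values())
    lines = []
    for (repository, category), count in summary.most_common():
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{repository:<12} {category:<10} {bar} {count}")
    return "\n".join(lines)

if __name__ == "__main__":
    labelled = [
        ("SharePoint", "eliminate"),
        ("SharePoint", "eliminate"),
        ("SharePoint", "trusted"),
        ("FileNet", "trusted"),
    ]
    print(render_bars(summarize(labelled), width=20))
```

Keeping the counts keyed by repository and category is what makes the slice and dice possible: the same data can be regrouped by either dimension without rescanning the content.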

Low Loyalty Vertical

January 4, 2011

The Low Loyalty Vertical is a phrase I use to describe any company that operates in an industry where:

1 – The customer base is highly transient

2 – Margins are razor thin

3 – Your customer is a heartbeat away from becoming your competitors’ customer.

A great example is financial services companies.  I personally have 4 credit cards from 4 different financial institutions. But with one in particular I have various savings and chequing accounts, as well as a joint account with my wife. Now, imagine that my main provider does something I don’t like: it raises its monthly fees, for example, or increases the interest rate on a credit card. I have no loyalty and the cost to me to switch is zero dollars. I walk into a competing institution, tell them I want to switch, fill out a form or two, and it all magically happens. No effort on my part, no loyalty keeping me.

Something that’s interesting to me, as I travel around talking to many different companies in many different verticals, is how they are all struggling with the same core issues around driving customer loyalty to maintain and grow their businesses.

They talk about macro-economic challenges and business dynamics like “mergers and acquisitions” or “regulatory pressures” or “innovation and growth” but what they’re really trying to do is understand how to manage loyalty. How to keep their existing customers happy, and how to get new customers from the competition (hey, this is business and nobody should apologize for that!).

Some niche technology companies seem to argue this is the purview of their sentiment analysis products.  Products that surf around Facebook and blogs and Twitter feeds looking for comments about corporate products and services, or maybe look at the notes written by customer service representatives.

To me, that view isn’t broad enough.  It doesn’t take into account the opportunity to execute on what I call a “Moment to Impact Loyalty” where the immediate – real time – understanding of a customer as the interactions are happening can be leveraged to ensure a customer remains a customer.

The sooner a company can understand a customer is angry, and what that customer cares about, the faster they can react and impact the moment to keep the customer happy.

It’s sentiment analysis, but in a real-time complex event processing scenario.
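To make the idea concrete, here is a toy sketch of scoring each customer interaction as it arrives and raising an alert when the recent trend turns negative.  Everything in it is an assumption for illustration: the word lists are tiny stand-ins for a real sentiment model, and `LoyaltyMonitor`, its window size and its threshold are hypothetical.  A real complex event processing engine would also use time-based windows and correlate many event types.

```python
from collections import deque

# Hypothetical word lists, far smaller than any real sentiment lexicon.
NEGATIVE = {"angry", "cancel", "terrible", "unacceptable", "switch"}
POSITIVE = {"thanks", "great", "happy", "love"}

def score(message):
    """Crude sentiment score: count of positive words minus negative words."""
    words = [word.strip(".,!?").lower() for word in message.split()]
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

class LoyaltyMonitor:
    """Track each customer's recent interactions and flag at-risk moments."""

    def __init__(self, window=3, threshold=-2):
        self.window = window        # number of recent events kept per customer
        self.threshold = threshold  # alert when the windowed score falls this low
        self.recent = {}

    def on_event(self, customer_id, message):
        """Process one interaction as it arrives; return True to raise an alert."""
        scores = self.recent.setdefault(customer_id, deque(maxlen=self.window))
        scores.append(score(message))
        return sum(scores) <= self.threshold

if __name__ == "__main__":
    monitor = LoyaltyMonitor()
    print(monitor.on_event("c2", "This fee is unacceptable"))
    print(monitor.on_event("c2", "I am angry and will switch banks"))
```

The point of the per-customer window is that the alert fires during the interaction, while there is still a moment to impact, rather than in a report the following week.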