Unstructured Data – The Myth
There’s a lot of noise about “structured” and “unstructured” data. But what is this “unstructured data” beast we keep hearing about?
It’s almost impossible to see a presentation from a search, ECM or eDiscovery vendor that doesn’t talk about how “80% of a corporation’s data is unstructured.” Aside from the fact that much of this data is garbage (see my earlier postings), is it really fair to call it unstructured?
Personally I hate the descriptor “unstructured content.” When I hear it, all I can think of is letters spilling out of a bucket – no rhyme or reason to the flow, no sense to be made (unless you’re Edward Lorenz).
What we so often call unstructured content really has lots of structure that can be used and leveraged for many types of text analytics purposes.
Every file system object has metadata attributes associated with it. This is explicit metadata, which search vendors have used for years to make content more findable. “Name,” “date” and “filetype” are examples of explicit metadata.
Since the dawn of computers, explicit metadata has been used to help computers store information, and users to find it.
This is metadata that can be defined or extracted from the content of objects in file systems. Office documents have a Properties sheet with information such as “last accessed date,” “author” and “word count.” Music files often store album information and song length.
Even deeper, within many content objects we can identify and extract things like credit card numbers, phone numbers, dates, or names. This type of entity identification and extraction enables a rich metadata view into content – something only possible because there’s _structure_ not random letters.
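To make the idea concrete, here’s a minimal sketch of regex-based entity extraction. The patterns, function name and sample text are all invented for illustration – production extractors are far more sophisticated than a handful of regular expressions:

```python
import re

# Illustrative patterns only -- real extractors handle far more variation.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(text):
    """Return a dict mapping entity type to the strings found in text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}
```

The output of a function like this becomes the implicit metadata layer: structure recovered from what we too casually call “unstructured” content.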
It’s unfair to call an email or Word document “unstructured.” I prefer the generic “content” because when you really look, there’s rich structure that can be exploited by a business. Companies can use “last accessed date” to support RM retention and disposition activities, or identification of credit card numbers to ensure proper PII strategy adherence.
Thankfully, there’s no such thing as unstructured data. Otherwise the companies I work with wouldn’t be able to use the explicit and implied metadata, and rich object structure of content in their content analytics strategies.
Wikipedia has a short entry on “text analytics” http://en.wikipedia.org/wiki/Text_analytics where the length of the concept definition is eclipsed by the list of vendors selling products / solutions / tools that claim some type of text analytics capabilities.
After reading far too many vendor data sheets and whitepapers and reviewing all the fun #textanalytics twitter posts, it really boils down to “deep inspection of content” for some tuned business purpose.
There are those who claim to have broad platforms for text analytics and cast a (very) wide net across the multitude of potential business use-cases – threat and fraud to better search to clinical trials – but without a real “solution.” (I define a “solution” as a product or platform with domain specific IP to address a business problem).
The text analytics marketplace is truly nascent. We’re seeing huge growth in content related to the topic, and in searches about it. You can see for yourself with Google Insights: http://www.google.com/insights/search/#q=text%20analytics&cmpt=q
There are some vendors positioning their technologies as platforms for text analytics; however, I expect they’ll struggle as customers try to understand whether they’re selling a “dessert topping or floor wax.”
The true power, and value, of text analytics comes from leveraging the capabilities as ingredients in broader business solutions. Text analytics alone won’t reduce insurance fraud, but it can be an extremely powerful addition to an insurance fraud solution which includes ECM, BPM and BI.
I expect over the next 2 years we’ll see a decline in the positioning of platforms and a marked increase in the positioning (and vendor success) of leveraging text analytics in real business solutions and enterprise applications. The power of text analytics is real – but that power only becomes truly useful when delivered in context of a business problem with definable ROI.
This is, I believe, why “consumer insight” is so popular in the text analytics field.
It’s intuitively obvious that happy customers are better than unhappy ones. What keeps me so interested in text analytics is the ingredient opportunities beyond consumer sentiment analytics.
According to a Bloor Research report of some time ago, 80% of data integration projects fail.
I’m not sure if that’s been updated recently, but I’d be very surprised to learn the percentage has changed much one way or the other.
The question we should ask ourselves is not “why.” Instead, as odd as it sounds, we should ask “what.”
Not “Why” Integrate
The reasons “why” to integrate data are well known and well understood – and there’s usually a solid ROI. We integrate data to create data warehouses. To create ODSs for near-line operational BI. We integrate to migrate systems (from one vendor to another, from one database to another, from one application to another).
Not “How” to Integrate
An alphabet soup of technologies exists to answer the “how” question. Depending upon the “why” an organization might opt for any of the usual ETL, or EII, or EAI, or CDC suspects. All excellent options with very well defined usage scenarios to meet whatever the business & IT requirements are.
It’s the “What” to Integrate
Figuring out what data to integrate is the single biggest challenge, and as more systems are implemented and data grows – it’s becoming a harder problem to solve, not easier.
In the dark old days, data architects would sit around tables with printouts of data models and try to map them all together. Today, vendors like Informatica, SAP and IBM have business-user-focused tools for helping to identify the relationships between data elements.
Unfortunately, most of the technologies in place today rely on looking at column definitions (name, datatype, size, referential integrity definitions) and trying to create a logical mapping across systems from that.
When dealing with the many applications in the marketplace that use application-level code to enforce PK/FK relationships, the above modelling tools simply aren’t good enough. If they were, our 80% would be much smaller.
Text Analytics Can Help
It’s not a panacea. It’s not a magic wand. But – text analytics can help data integration projects succeed.
Using text analytics, organizations can index the data across many different systems and infer relationships between columns and data sets. Text analytics can provide a company with a view into columns that store identical, or very similar, data (such as company names, vendor names, product names) – even going so far as to recognize that DB1.table9.column99 holds data of type “credit card” and to report all the other databases, tables and columns that contain the same kind of data.
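A toy illustration of that idea, assuming values have already been sampled from each column – the column names, patterns and match threshold here are all hypothetical, not drawn from any vendor’s product:

```python
import re

# Simplified semantic-type patterns; real profilers use many more signals.
TYPE_PATTERNS = {
    "credit_card": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
    "phone": re.compile(r"^\d{3}[-. ]\d{3}[-. ]\d{4}$"),
}

def infer_type(values):
    """Infer a semantic type when most sampled values match one pattern."""
    for name, rx in TYPE_PATTERNS.items():
        hits = sum(1 for v in values if rx.match(v))
        if hits >= 0.8 * len(values):  # arbitrary 80% threshold
            return name
    return "unknown"

def columns_sharing_type(samples):
    """Group fully-qualified column names by their inferred semantic type."""
    groups = {}
    for col, values in samples.items():
        groups.setdefault(infer_type(values), []).append(col)
    return groups
```

Grouping columns by inferred content type, rather than by declared datatype, is exactly the view that schema-only mapping tools miss.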
Text analytics is another way to get a view into your structured data assets that can help support successful data integration projects. With an 80% failure rate – anything that can help turn a challenge into success is of critical importance.
The data warehousing market is fascinating to me. Thanks to my past experiences in real-time change-data-capture technology, I’ve managed to be exposed to some fascinating use-cases and real-world technologies.
The following is based on my experiences working with (what I call) appliance vendors like Teradata and Netezza (now IBM), as well as more OLAP style systems like Oracle, SQL Server and DB2.
The Fundamental Flaw:
Businesses are event based, and event driven. Business is based on interactions, or transactions, between people, processes and systems. Every transaction may matter, some may matter more than others; some may matter not at all. Regardless, there are always transactions, or events, that do matter and when harnessed effectively will provide organizations the ability to sense and respond to changing business dynamics and requirements more effectively than their competition.
What’s required is what’s missing – the ability to capture and store the events that matter in a way that meets the business need for immediacy whilst minimizing business and IT risk. Businesses are looking for an environment that’s capable of acting as a detailed transactional system for operational reporting, as well as the source for master information and strategic reporting activities.
A warehouse that can be all things to all business users
The detailed information about business events is more critical than ever before, especially given Master Data Management initiatives, operational intelligence requirements and the expanded use of data warehouses.
Companies that have a dynamic warehouse which contains detailed transactional information through real-time feeds of data will have a competitive advantage. These organizations will have a repository of transactional information for operational intelligence reporting as well as a complete source of information easily supporting changes to MDM definitions and report requirements.
Data warehouses with highly normalized data schemas replaced reporting against production environments and manual tabulation from hand-written documents (including spreadsheets). Data warehouses were created to aid reporting on the data contained within systems, but they have at least one fundamental, critical flaw from a pure technical perspective: it’s impossible to decompose aggregates. In the past, processing power and disk space were extremely expensive. For IT organizations to meet the business requirements (speed and completeness) around reporting, data was heavily summarized from production environments, with aggregates created as data was loaded to shrink the time and effort required to generate reports.
A critical problem with this is the inability to decompose an aggregate. If the answer is “42,” what’s the question? Maybe it’s the sum of 10 + 30 + 2, or maybe it’s 84/2. But if the answer to the question “average time to ship replacement inventory” is “42 hours,” how can you track the ebbs and flows of business process execution? How can you correlate events across business processes? You need the transactional detail from every operation of every system.
Such a warehouse contains the transactional detail created by the execution of business processes over time. The schema format is fundamentally different from the “traditional” star or snowflake forms found in First Wave data warehouses. The warehouse schema records timestamps for the execution of every operation on each source system, as well as source system identification metadata. By maintaining this metadata and transactional detail, it’s possible to recreate a production system at any point in time in the past (of potential value in heavily regulated companies) and to report on the status and temporal characteristics of business process execution.
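A minimal sketch of point-in-time reconstruction from such timestamped detail. The operation-log format here is invented for illustration; real change-data-capture feeds carry far richer metadata (source system IDs, transaction IDs, before/after images):

```python
# Each row: (timestamp, operation, key, value) captured from a source system.
LOG = [
    (1, "insert", "cust:1", {"name": "Acme", "tier": "bronze"}),
    (2, "insert", "cust:2", {"name": "Globex", "tier": "silver"}),
    (3, "update", "cust:1", {"name": "Acme", "tier": "gold"}),
    (4, "delete", "cust:2", None),
]

def state_at(log, ts):
    """Replay timestamped operations to rebuild the source system as of time ts."""
    state = {}
    for when, op, key, value in log:
        if when > ts:
            break  # log is ordered by timestamp; stop at the cutoff
        if op == "delete":
            state.pop(key, None)
        else:  # insert or update
            state[key] = value
    return state
```

Because every operation is retained rather than rolled up, any historical state – and any aggregate over it – can be reproduced on demand.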
Businesses gain a better ability to understand events, and IT can more easily adapt to changing business requirements for report generation.
No summarizations are generated to speed reporting without the associated detailed data showing how any and all aggregates can be decomposed.
Having the detailed information in the single repository allows for greater value to be derived from MDM projects. As the definition of a “gold master” (i.e. customer definition) changes over time, the past can always be reconstructed because the detailed information still exists.
The detailed data also provides value to quality processes. Over time as the business changes reporting requirements and master data definitions, quality processes will necessarily change too. Quality is no more static than a business itself. Detailed historical data allows companies to produce more value and understanding.
Operational empowerment becomes pervasive across organizations. Event propagation means more than just flowing data transactions between systems, people and processes. Companies will utilize historical context available in the data warehouse with event awareness to gain a further competitive advantage. First-fruits will probably be with customer service empowerment such as proactively alerting pre-paid cellphone customers that they’re running low on available minutes. The critical component is the utilization of the data warehouse as an area for gaining context into business events for effective, and as much as possible automated, reaction.
Predictive computational analysis through holistic system awareness expands data warehouse usage into systems management, not just business reporting.
Detailed transactional data is stored in conjunction with systems operations information such as transaction volumes, system response time, network traffic, etc. Correlating metadata about transactional operations (inserts, updates or deletes plus length of transaction) with systems information (how many active users and bandwidth usage) can be used to predict system resource requirements.
IT now uses the data warehouse to help serve business requirements better and ensure they’re meeting internal and external SLAs.
The technology industry today is between the first and second waves.
Companies I’ve worked with understand the value of operational data in warehouse appliances but are struggling to come to terms with the non-OLTP nature of existing appliance solutions and the challenges in loading data quickly and efficiently for effective reporting.
Teradata, Netezza and others in the warehouse appliance space are starting to understand the limitations of their technologies in supporting the changing requirements – limitations seemingly not faced by OLTP vendors like Oracle, Microsoft and IBM (DB2).
Data quality matters to unstructured content. Just as quality is a critical requirement in structured data integration, the need is intrinsic in an effective content strategy.
Unstructured content is rife with quality issues. Spelling errors, nearly random formatting of key attributes like phone numbers, acronyms (and their variants) are just some of the quality issues confronting unstructured content – just like structured data.
There’s no such thing as an irrelevant data quality issue. To quote Ted Friedman:
“If you look at…any business function in your company, you’re going to find some direct cost there attributed to poor data quality.” http://www.gartner.com/it/products/podcasting/asset_145611_2575.jsp
The quality of your data directly impacts the business’s ability to support effective content analytics, search and content integration.
If you’re going to leverage content, you must be able to trust it – and that means executing quality processes as part of the semantic enrichment before analysis, search or content integration can be successful.
Data quality processes are key to an effective BI strategy. So too are they for content analytics, search and content integration (ETL).
Smarter content means trusted content. You can’t trust your content unless there’s a quality process around it.
Data Quality for Unstructured Content
This list isn’t intended to be comprehensive, but here are some core quality activities one must undertake to create Trusted Business Content:
Standardization – This comes in the form of spelling, managing acronyms and content formats.
a) Spelling: Correcting the spellings of product, vendor or supplier names. Frequently in documents and web pages product names will be misspelt, and it’s simple yet critical to recognize those errors and correct them.
b) Acronyms: You must recognize and standardize acronym usage – for example, recognizing “I.B.M.”, “ibm” and “International Business Machines” and standardizing them all to “IBM.”
c) Formats: Recognize strings such as 647.285.2630 and +1-640-285-2630 and format them into a consistent form for phone numbers.
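A rough sketch of what these standardization steps might look like in code. The acronym table and normalization rules here are hypothetical stand-ins – real quality tooling drives them from curated reference data:

```python
import re

# Hypothetical lookup of known variants to a standard form.
ACRONYMS = {
    "i.b.m": "IBM",
    "ibm": "IBM",
    "international business machines": "IBM",
}

def standardize_company(name):
    """Map known spelling/acronym variants to the standard company name."""
    key = name.strip().lower().rstrip(".")
    return ACRONYMS.get(key, name)

def standardize_phone(raw):
    """Reduce any separator style to NNN-NNN-NNNN, dropping a leading +1."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return "-".join([digits[0:3], digits[3:6], digits[6:10]])
```

Once every variant collapses to one canonical form, downstream search, analytics and ETL all see consistent values.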
Verification – This comes in the form of validating strings and recognizing their semantic meaning. In conjunction with standardization, verification capabilities such as recognizing a string matching “AA######” mean there’s a potentially valid Canadian passport number. Or identifying a string of 15 digits beginning with “34” or “37” and determining it’s an Amex credit card number. This verification of the semantic meaning of extracted entities enables the business to both standardize and assess its unstructured content assets.
Enriching – This means recognizing that a document contains any of the above and enriching the surrounding metadata. It enables more effective search and deeper, richer analytics, and supports content ETL processes as well. For example, identifying that a document contains a credit card number is of business value, but being able to enrich the document’s metadata with a flag indicating a credit card number exists is critical to effective, functional assessment and empowers a richer, more effective search experience.
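To make the verification and enrichment steps concrete, here’s a small sketch using the standard Luhn checksum that payment card numbers carry, plus simplified issuer rules. The document structure and function names are invented for illustration:

```python
def luhn_valid(number):
    """Standard Luhn checksum used by payment card numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    if not digits:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def card_issuer(number):
    """Guess the issuer from length and prefix (deliberately simplified)."""
    digits = "".join(d for d in number if d.isdigit())
    if len(digits) == 15 and digits[:2] in ("34", "37"):
        return "amex"
    if len(digits) == 16 and digits[0] == "4":
        return "visa"
    return "unknown"

def enrich(doc):
    """Set a has_credit_card flag on a document's metadata dict."""
    found = [e for e in doc.get("entities", [])
             if luhn_valid(e) and card_issuer(e) != "unknown"]
    doc["metadata"]["has_credit_card"] = bool(found)
    return doc
```

The flag itself is trivial, but it’s exactly the kind of verified, enriched attribute that turns a pile of documents into an assessable content asset.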
Data quality on unstructured content matters. Quality makes search better, content analytics effective and integration usable. Semantic enrichment of quality-free content is akin to “lipstick on a pig.”
It’s a business imperative to make quality a component of your content strategy.
February 14–16, 2011, IBM is making a big gamble by attempting to compete against Ken Jennings and Brad Rutter on Jeopardy.
Named Watson after IBM’s legendary early leader Thomas J. Watson, DeepQA uses the OASIS text analytics standard UIMA on an asynchronous messaging infrastructure (presumably MQSeries) for real-time analysis of unstructured content (a question) and response (an answer).
As pointed out by Stephen Baker in an article on Huffington Post this isn’t a “search” problem that any Tom, Dick or Google can solve.
Take for example the question “Not surprisingly, more undergrads at the University of Puget Sound are from this state than any other.” Watson needs to understand that “not surprisingly” means it’s true, or likely and also be able to reason from there what state the University of Puget Sound is in and whether Washington is the most likely true answer.
Watson is to be fully self-contained with no access to external information – just like a real contestant.
If IBM can pull this off, it will show that “search” isn’t “text analytics.” That’s not to say that text analytics can’t (and doesn’t) deliver value to search technologies, but trying to meet the Jeopardy challenge with ancient Bayesian algorithms or traditional search indexing simply can’t work.
Dave Ferrucci is the Principal Investigator leading the Watson team, and was recently interviewed on NPR. You can read the transcript here: http://www.npr.org/2011/01/08/132769575/Can-A-Computer-Become-A-Jeopardy-Champ
There was a test match today that Watson won. You can see photos from the match on cnet News: http://news.cnet.com/2300-11386_3-10006289.html?tag=mncol
Nothing is free – especially software. Whether willful or not, it’s common for the costs of a successful software implementation to be grossly underestimated (a key reason 80% of data integration projects fail). Maybe it’s because too often we think some piece of software will do everything we need out-of-the-box without realizing how much really needs to be customized.
Even still, it was a shock to me to learn that “According to Microsoft research, the services opportunity for SharePoint alone is predicted to grow to US$6.2 billion by 2011.” https://partner.microsoft.com/US/program/competencies/compportalsandcollaboration
As Alan Pelz-Sharpe points out in a great blog posting, that’s what “you the customer will spend.”
If a car was #SharePoint, it would be like buying the seats, tires, windows, antenna, doors and fenders separately.