Data Quality and Unstructured Content

February 6, 2011 Paul O'Hagan

Data quality matters to unstructured content. Just as quality is a critical requirement in structured data integration, it is intrinsic to an effective content strategy.

Unstructured content is rife with quality issues. Spelling errors, inconsistent formatting of key attributes like phone numbers, and acronyms (with their variants) are just some of the quality issues confronting unstructured content, just as they confront structured data.

There’s no such thing as an irrelevant data quality issue. To quote Ted Friedman:

“If you look at…any business function in your company, you’re going to find some direct cost there attributed to poor data quality.” http://www.gartner.com/it/products/podcasting/asset_145611_2575.jsp

The quality of your data directly impacts the business's ability to support effective content analytics, search and content integration.

If you’re going to leverage content, you must be able to trust it – and that means executing quality processes as part of the semantic enrichment before analysis, search or content integration can be successful.

Data quality processes are key to an effective BI strategy. So too are they for content analytics, search and content integration (ETL).

Smarter content means trusted content. You can’t trust your content unless there’s a quality process around it.

Data Quality for Unstructured Content

This list is not intended to be comprehensive, but here are some core quality activities one must undertake to create Trusted Business Content:

Standardization – This comes in the form of spelling, managing acronyms and content formats.
a) Spelling: Correcting the spellings of product, vendor or supplier names. Frequently in documents and web pages product names will be misspelt, and it’s simple yet critical to recognize those errors and correct them.
b) Acronyms: You must recognize and standardize acronym usage. For example, recognizing and standardizing I.B.M and ibm and “International Business Machines” to the standard “IBM.”
c) Formats: Recognize strings such as 647.285.2630 and +1-640-285-2630 and format them into a consistent form for phone numbers.
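
To make the three standardization steps a little more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a production approach: the reference lists (`KNOWN_NAMES`, `ACRONYM_VARIANTS`), the function names, the fuzzy-match cutoff and the target phone format `+1-NNN-NNN-NNNN` are all hypothetical choices, and the phone pattern only covers North American numbers.

```python
import difflib
import re

# Hypothetical reference data; a real deployment would load these from a
# curated master list rather than hard-coding them.
KNOWN_NAMES = ["Microsoft", "Oracle", "Teradata"]
ACRONYM_VARIANTS = {
    "IBM": [r"I\.B\.M\.?", r"ibm", r"International Business Machines"],
}

# Optional +1 / 1 prefix, then 3-3-4 digits separated by space, dot or hyphen.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def correct_name(word, cutoff=0.8):
    """a) Spelling: return the closest known name, or the word unchanged."""
    matches = difflib.get_close_matches(word, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def standardize_acronyms(text):
    """b) Acronyms: replace every listed variant with its standard form."""
    for standard, variants in ACRONYM_VARIANTS.items():
        pattern = r"(?<!\w)(?:" + "|".join(variants) + r")(?!\w)"
        text = re.sub(pattern, standard, text, flags=re.IGNORECASE)
    return text


def standardize_phones(text):
    """c) Formats: rewrite recognized phone numbers as +1-NNN-NNN-NNNN."""
    return PHONE_RE.sub(r"+1-\1-\2-\3", text)


if __name__ == "__main__":
    print(correct_name("Microsft"))
    print(standardize_acronyms("I.B.M and ibm and International Business Machines"))
    print(standardize_phones("647.285.2630 or +1-640-285-2630"))
```

Fuzzy matching against a known list is only one way to handle spelling; it works when the set of valid product or vendor names is small and curated, and the cutoff would need tuning against real content.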

Verification – This comes in the form of validating strings and recognizing their semantic meaning. In conjunction with standardization, verification means, for example, recognizing that a string matching “AA######” is a potentially valid Canadian passport number, or identifying a string of 15 digits beginning with 34 or 37 as a potential Amex credit card number. This verification of the semantic meaning of extracted entities enables the business to both standardize and assess its unstructured content assets.
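
As a rough sketch of what such verification might look like, the Python below finds candidate passport and Amex numbers in text. Several things here are assumptions: “AA######” is read as two uppercase letters followed by six digits, Amex numbers are taken to be 15 digits starting with 34 or 37, and a Luhn checksum is added as an extra plausibility test. The function and label names are hypothetical, and a match only means "potentially valid", never confirmed.

```python
import re

# Assumed reading of "AA######": two uppercase letters, then six digits.
PASSPORT_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")
# Amex card numbers: 15 digits beginning with 34 or 37.
AMEX_RE = re.compile(r"\b3[47]\d{13}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_entities(text):
    """Return (label, value) pairs for candidate entities found in the text."""
    found = []
    for m in PASSPORT_RE.finditer(text):
        found.append(("passport_candidate", m.group()))
    for m in AMEX_RE.finditer(text):
        if luhn_valid(m.group()):
            found.append(("amex_candidate", m.group()))
    return found


if __name__ == "__main__":
    # 378282246310005 is a widely published Amex test number, not a real card.
    print(find_entities("Passport AB123456, card 378282246310005."))
```

The checksum step matters because a bare digit-count rule would flag order numbers and tracking codes as card numbers, which undermines the trust the quality process is meant to build.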

Enriching – This means recognizing that a document contains any of the above and enriching the surrounding metadata. It enables more effective search and deeper, richer analytics, and it supports content ETL processes as well. For example, identifying that a document contains a credit card number is of business value, but being able to enrich the document's metadata with a flag indicating that a credit card number exists is critical to effective, functional assessment and empowers a richer, more effective search experience.
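
A small Python sketch of the enrichment idea follows. It assumes a document is a plain dictionary with `text` and `metadata` keys and that detection is done with simple regular expressions; the flag names (`contains_credit_card`, `contains_phone_number`) and the `enrich` function are hypothetical, chosen only for illustration.

```python
import re

# Simplified detectors, assumed for illustration only.
AMEX_RE = re.compile(r"\b3[47]\d{13}\b")
PHONE_RE = re.compile(r"(?<!\d)\d{3}[.\s-]\d{3}[.\s-]\d{4}(?!\d)")

# Map each metadata flag to the pattern that sets it.
DETECTORS = {
    "contains_credit_card": AMEX_RE,
    "contains_phone_number": PHONE_RE,
}


def enrich(document):
    """Return a copy of the document with boolean flags added to its metadata."""
    metadata = dict(document.get("metadata", {}))
    for flag, pattern in DETECTORS.items():
        metadata[flag] = bool(pattern.search(document["text"]))
    return {**document, "metadata": metadata}


if __name__ == "__main__":
    docs = [
        {"text": "Paid with 378282246310005.", "metadata": {"author": "unknown"}},
        {"text": "Call 647.285.2630 for details.", "metadata": {}},
    ]
    enriched = [enrich(d) for d in docs]
    # A search or assessment step can now filter on the flag alone.
    flagged = [d for d in enriched if d["metadata"]["contains_credit_card"]]
    print(len(flagged))
```

Once the flag lives in the metadata, downstream search and ETL steps can filter on it without rescanning the document text.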

Data quality on unstructured content matters. Quality makes search better, content analytics effective and integration usable. Semantic enrichment of quality-free content is akin to “lipstick on a pig.”

It’s a business imperative to make quality a component of your content strategy.

Categories: Unstructured Thoughts Tags: Content Analytics, Content Assessment, Data Quality, Enterprise Search

Comments (3)

Lindsey Niedzielski

February 11, 2011 at 13:49

Great post Paul. Thank you for pointing out how important spelling can be in data management, I think this sometimes gets overlooked. We have a community for IM professionals (www.openmethodology.org) and have bookmarked this post for our users. Look forward to reading your work in the future.
- Paul O'Hagan
  
  February 11, 2011 at 14:08
  
  Thanks Lindsey. The methodology defined on Mike2.0 is very interesting and detailed. I look forward to reading it in more depth and learning more.
Gary MacFadden

April 12, 2011 at 10:37

Paul,

Enjoyed this post. You are one of the very few individuals writing about the importance of content quality and providing substantive examples of what can be done to achieve it. Keep up the good work.

Gary
