
Data Integration – How Text Analytics Can Help

February 8, 2011

According to a Bloor Research report from some time ago, 80% of data integration projects fail.

I’m not sure if that’s been updated recently, but I’d be very surprised to learn the percentage has changed much one way or the other.

The question we should ask ourselves is not “why.” Instead, as odd as it sounds, we should ask “what.”

Not “Why” Integrate

The reasons “why” to integrate data are well known and well understood – and there’s usually a solid ROI. We integrate data to create data warehouses. To create ODSs for near-line operational BI. We integrate to migrate systems (from one vendor to another, from one database to another, from one application to another).

Not “How” to Integrate

An alphabet soup of technologies exists to answer the “how” question. Depending upon the “why,” an organization might opt for any of the usual suspects: ETL, EII, EAI, or CDC. All are excellent options with well-defined usage scenarios to meet whatever the business and IT requirements are.

It’s the “What” to Integrate

Figuring out what data to integrate is the single biggest challenge, and as more systems are implemented and data grows, it’s becoming a harder problem to solve, not easier.

In the dark old days, data architects would sit around tables with printouts of data models and try to map them all together. Today, vendors like Informatica, SAP and IBM have business-user-focused tools for helping to identify the relationships between data elements.

Unfortunately, most of the technologies in place today rely on looking at column definitions (name, datatype, size, referential integrity definitions) and trying to create a logical mapping across systems from that.

When dealing with the many applications in the marketplace that use application-level code to enforce PK/FK relationships, the above modelling tools simply aren’t good enough. If they were, our 80% would be much smaller.
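To make the limitation concrete, here is a minimal, hypothetical sketch of matching purely on column names and datatypes; the column metadata is invented and this is not any vendor’s API. Columns that hold the same data under different names, with the relationship enforced only in application code, are simply invisible to this approach.

```python
# Minimal sketch: matching columns across two systems purely on
# name and datatype metadata (hypothetical metadata, not a vendor API).

crm_columns = [
    {"table": "customers", "column": "cust_name", "type": "varchar"},
    {"table": "customers", "column": "card_no",   "type": "varchar"},
]
erp_columns = [
    {"table": "accounts", "column": "account_name", "type": "varchar"},
    {"table": "payments", "column": "cc_number",    "type": "varchar"},
]

def metadata_matches(source, target):
    """Pair columns whose names and datatypes line up exactly."""
    matches = []
    for s in source:
        for t in target:
            if s["column"] == t["column"] and s["type"] == t["type"]:
                matches.append((s, t))
    return matches

# Finds nothing: 'card_no' and 'cc_number' hold the same data,
# but no column name or PK/FK definition ties them together.
print(metadata_matches(crm_columns, erp_columns))
```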

Text Analytics Can Help

It’s not a panacea.  It’s not a magic wand.  But – text analytics can help data integration projects succeed.

Using text analytics, organizations can index the data across many different systems and infer relationships between columns and data sets. Text analytics can give a company a view into columns that store identical or very similar data (such as company names, vendor names, product names). It can even go so far as to recognize that the data in DB1.table9.column99 is of type “credit card” and provide a report of all the other databases, tables and columns that contain the same data.
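By way of contrast, here is a minimal sketch of that content-driven approach, assuming we can sample the values in each column. The column names, sample values and thresholds are illustrative assumptions, not the behaviour of any particular product.

```python
import re

# Hypothetical value samples drawn from two different databases;
# real profiling would sample thousands of rows per column.
db1_col = {"name": "DB1.table9.column99",
           "values": ["4111111111111111", "5500005555555559"]}
db2_col = {"name": "DB2.billing.pan",
           "values": ["4111111111111111", "340000000000009"]}

CARD_PATTERN = re.compile(r"^\d{15,16}$")  # crude "card number" shape

def looks_like_cards(col, threshold=0.8):
    """Flag a column whose sampled values mostly match the card pattern."""
    hits = sum(1 for v in col["values"] if CARD_PATTERN.match(v))
    return hits / len(col["values"]) >= threshold

def value_overlap(a, b):
    """Jaccard overlap of the sampled values in two columns."""
    sa, sb = set(a["values"]), set(b["values"])
    return len(sa & sb) / len(sa | sb)

for col in (db1_col, db2_col):
    if looks_like_cards(col):
        print(col["name"], "appears to hold credit card data")

print("value overlap:", value_overlap(db1_col, db2_col))
```

In practice the samples would be far larger and the pattern detection more careful (Luhn checks, issuer prefixes, and so on), but the principle is the same: look at the data itself, not just the schema.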

Text analytics is another way to get a view into your structured data assets that can help support successful data integration projects.  With an 80% failure rate – anything that can help turn a challenge into success is of critical importance.


The 3 Waves of Data Warehousing

February 7, 2011

The data warehousing market is fascinating to me. Thanks to my past experience with real-time change-data-capture technology, I’ve been exposed to some remarkable use cases and real-world technologies.

The following is based on my experiences working with (what I call) appliance vendors like Teradata and Netezza (now IBM), as well as more general-purpose OLTP systems like Oracle, SQL Server and DB2.

The Fundamental Flaw:
Businesses are event-based and event-driven. Business is based on interactions, or transactions, between people, processes and systems. Every transaction may matter; some matter more than others, and some may not matter at all. Regardless, there are always transactions, or events, that do matter, and when harnessed effectively they give organizations the ability to sense and respond to changing business dynamics and requirements more effectively than their competition.

What’s required is what’s missing – the ability to capture and store the events that matter in a way that meets the business need for immediacy whilst minimizing business and IT risk. Businesses are looking for an environment that’s capable of acting as a detailed transactional store for operational reporting, as the source of master information, and as the foundation for strategic reporting activities.

A warehouse that can be all things to all business users

The detailed information about business events is more critical than ever before, especially given Master Data Management initiatives, operational intelligence requirements and the expanded use of data warehouses.

Companies that have a dynamic warehouse containing detailed transactional information, fed in real time, will have a competitive advantage. These organizations will have a repository of transactional information for operational intelligence reporting as well as a complete source of information that easily supports changes to MDM definitions and reporting requirements.

First Wave
Data warehouses with highly normalized data schemas replace reporting against production environments and manual tabulation from hand-written documents (including spreadsheets). Data warehouses were created to aid in reporting on the data contained within systems, but they have at least one fundamental, critical flaw from a purely technical perspective: it’s impossible to decompose aggregates. In the past, processing power and disk space were extremely expensive. In order for the IT organization to meet the business requirements (speed and completeness) around reporting, it was important that data from production environments was highly normalized, with aggregates created as data was loaded to shrink the time and effort required to generate reports.

A critical problem with this is the inability to decompose an aggregate. If the answer is “42,” what’s the question? Maybe it’s the sum of 10 + 30 + 2, or maybe it’s 84/2. But if the answer to the question of “average time to ship replacement inventory” is “42 hours,” how can you track the ebbs and flows of business process execution? How can you correlate events across business processes? You need detailed transaction data from every operation of every system.
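As a toy illustration with invented numbers, the two shipment histories below both report an average of 42 hours yet describe very different operational realities; once only the aggregate is stored, the difference is unrecoverable.

```python
from statistics import mean

# Two hypothetical shipment histories, in hours to ship replacement inventory.
steady = [40, 42, 44, 42]     # consistent performance
erratic = [10, 10, 10, 138]   # mostly fast, one disastrous outlier

print(mean(steady), mean(erratic))  # both print 42

# With only "42" in the warehouse, the outlier that drove the average
# can never be recovered; the detail rows are what make it visible.
```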

Second Wave
Data warehouses contain the transactional detail created by the execution of business processes over time. The schema format is fundamentally different from the “traditional” star or snowflake forms found in First Wave data warehouses. The warehouse schema captures timestamps for the execution of every operation on each source system, as well as source system identification metadata. By maintaining this metadata and transactional detail it’s possible to recreate a production system at any point in time in the past (of potential value in heavily regulated companies) and to report on the status and temporal characteristics of business process execution.
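A minimal sketch of what that schema makes possible, assuming a simple change log that records a timestamp, source system, operation type and row image per event (the structure is mine, purely illustrative): replaying the log up to a chosen moment rebuilds the source table as it looked at that point in time.

```python
from datetime import datetime

# Hypothetical change log: one row per operation on the source system.
change_log = [
    {"ts": datetime(2011, 1, 5), "source": "ERP1", "op": "insert",
     "key": 1, "row": {"customer": "Acme", "status": "active"}},
    {"ts": datetime(2011, 1, 20), "source": "ERP1", "op": "update",
     "key": 1, "row": {"customer": "Acme", "status": "suspended"}},
    {"ts": datetime(2011, 2, 1), "source": "ERP1", "op": "delete",
     "key": 1, "row": None},
]

def as_of(log, moment):
    """Replay inserts/updates/deletes up to `moment` to rebuild the table."""
    state = {}
    for entry in sorted(log, key=lambda e: e["ts"]):
        if entry["ts"] > moment:
            break
        if entry["op"] in ("insert", "update"):
            state[entry["key"]] = entry["row"]
        elif entry["op"] == "delete":
            state.pop(entry["key"], None)
    return state

# The customer record as it stood on January 25th: after the update,
# before the delete.
print(as_of(change_log, datetime(2011, 1, 25)))
```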

Businesses gain a better ability to understand events, and IT can more easily adapt to changing business requirements for report generation.

Summarizations may still be generated to speed reporting, but never without the associated detailed data that shows how any and all aggregates can be decomposed.

Having the detailed information in a single repository allows greater value to be derived from MDM projects. As the definition of a “gold master” (e.g. the definition of a customer) changes over time, the past can always be reconstructed because the detailed information still exists.

The detailed data also provides value to quality processes. Over time, as the business changes its reporting requirements and master data definitions, quality processes will necessarily change too. Quality is no more static than the business itself. Detailed historical data allows companies to produce more value and understanding.

Operational empowerment becomes pervasive across organizations. Event propagation means more than just flowing data transactions between systems, people and processes. Companies will combine the historical context available in the data warehouse with event awareness to gain a further competitive advantage. The first fruits will probably be in customer service empowerment, such as proactively alerting pre-paid cellphone customers that they’re running low on available minutes. The critical component is the use of the data warehouse as a place to gain context about business events for effective, and as far as possible automated, reaction.
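To make the pre-paid example concrete, here is a minimal sketch in which all names, numbers and thresholds are mine and purely illustrative: the incoming balance event is combined with the customer’s historical burn rate, pulled from the warehouse, to decide whether to alert before the minutes actually run out.

```python
# Hypothetical: decide whether to alert a pre-paid customer, using the
# customer's historical daily usage (detail rows from the warehouse).

def should_alert(minutes_remaining, daily_usage_history, horizon_days=3):
    """Alert if, at the customer's recent burn rate, the balance
    would be exhausted within `horizon_days`."""
    if not daily_usage_history:
        return False
    recent = daily_usage_history[-7:]        # last week of detail rows
    burn_rate = sum(recent) / len(recent)    # average minutes per day
    return minutes_remaining < burn_rate * horizon_days

# A heavy user with 60 minutes left gets warned; a light user does not.
print(should_alert(60, [35, 40, 30, 45, 38, 42, 36]))   # True
print(should_alert(60, [5, 4, 6, 5, 5, 4, 6]))          # False
```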

Third Wave
Predictive computational analysis through holistic system awareness expands data warehouse usage into systems management, not just business reporting.

Detailed transactional data is stored in conjunction with systems operations information such as transaction volumes, system response time, network traffic, etc. Correlating metadata about transactional operations (inserts, updates or deletes, plus transaction duration) with systems information (active user counts, bandwidth usage) makes it possible to predict system resource requirements.
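A minimal sketch of that correlation, assuming the warehouse already holds hourly transaction counts and active user counts alongside observed CPU utilisation (all numbers invented): a simple least-squares fit with NumPy gives a crude way to project resource needs for an expected peak load.

```python
import numpy as np

# Hypothetical hourly observations pulled from the warehouse:
# (transactions per hour, active users) -> CPU utilisation (%).
transactions = np.array([1000, 2000, 3000, 4000, 5000])
active_users = np.array([50, 90, 140, 180, 230])
cpu_percent  = np.array([22, 38, 55, 70, 88])

# Least-squares fit: cpu ~ a*transactions + b*users + c
X = np.column_stack([transactions, active_users, np.ones_like(transactions)])
coeffs, *_ = np.linalg.lstsq(X, cpu_percent, rcond=None)

# Project the CPU needed for an expected peak of 6000 tx/hour and 270 users.
predicted = np.array([6000, 270, 1]) @ coeffs
print(f"projected CPU utilisation: {predicted:.0f}%")
```

A real deployment would use far more history and a less naive model, but the point is the same: the warehouse already holds the data needed to forecast infrastructure demand.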

IT now uses the data warehouse to help serve business requirements better and to ensure they’re meeting internal and external SLAs.

Today
The technology industry today is between the first and second waves.

Companies I’ve worked with understand the value of operational data in warehouse appliances, but they are struggling to come to terms with the non-OLTP nature of existing appliance solutions and the challenges of loading data quickly and efficiently for effective reporting.

Teradata, Netezza and others in the warehouse appliance space are starting to understand the limitations of their technologies in supporting the changing requirements – limitations seemingly not faced by OLTP vendors like Oracle, Microsoft and IBM (with DB2).