
Reading Between the Lines: Practical Text Analytics

John B. Rollins, Ph.D., P.E., BI Solution Architect, IBM Corporation, rollins@us.ibm.com
Ramasubbu Venkatesh, Ph.D., Data Mining Specialist, IBM Corporation, rvenkat@us.ibm.com
Alexander Lang, Ph.D., Software Engineer, IBM Corporation, alexlang@de.ibm.com
Stefan Abraham, Ph.D., Software Engineer, IBM Corporation, stefana@de.ibm.com

Session Number 2336

Agenda
Business motivation for analyzing unstructured data
Text analysis: from text to structure
Text analysis in InfoSphere Warehouse
Practical examples
InfoSphere Warehouse and IBM Content Analyzer
Trends in unstructured analytics

Information Irony
Water, water, everywhere, Nor any drop to drink.
Samuel Taylor Coleridge, The Rime of the Ancient Mariner

Data is being generated at an unprecedented rate
IDC estimates data will grow from 161 exabytes in 2006 to 988 exabytes in 2010 (1 exabyte = 1 billion gigabytes!)
Much of it (80% by some estimates) is unstructured
Information rich
Extracting value is difficult
Our focus is unstructured TEXTUAL INFORMATION!

Information Warehouse Growth Trend


TDWI Survey, 2007
Rapid Growth in Unstructured Data

Respondents expect huge increase in unstructured data as warehouse sources


Collaborative content (email, IM, wikis)
Content management
Voice transcriptions
Claims records

Chart & Survey: P. Russom, BI Search and Text Analytics, TDWI Best Practices Report, 2007

Business Scenarios for Unstructured Data


Example: improve product innovation and quality
Use information from customer service records, repair notes, online reviews, and other unstructured sources
Reduce reliance on static, predefined problem codes by extracting detailed problem descriptions
Understand faster why a product has problems
Identify gaps in the product portfolio: new functionality to drive product innovation

Business Scenarios for Unstructured Data


Product Innovation and Quality
Static Problem Codes (which problems occurred)

Text Analysis (why problems occurred)



Business Scenarios for Unstructured Data


Example: reduce customer churn
Identify unhappy customers as early as possible
Analyze text to identify emerging problems, e.g., call center complaints about dropped calls
By the time a problem appears in structured data, it may be too late to take action, e.g., a declining number of calls over time for a given cell phone service provider

Analyze customer email and call center logs to detect negative sentiment
Is the customer angry?
Is the customer mentioning competitors' pricing offers?
Is the customer complaining about a particular problem with a service or product?

Text Analysis: From Text to Structure

Text to Structure
Create new structured variables from text
Example: analysis of vehicle complaints report
Extract accident attributes from text field
Create new variables to represent accident attributes
CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.
CMPLID   INJURY   ACCIDENT   ABS      AIRBAG_DRV   AIRBAG_PASS
17869    Severe   Yes        Failed   -            No Deploy
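The text-to-structure step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions and the function names `extract_attributes` and `airbag_status` are hypothetical, not the rules used in InfoSphere Warehouse, and they are tuned to the single sample complaint shown on this slide.

```python
import re

def airbag_status(text, seat):
    """Return 'No Deploy' if the text says the given seat's air bag did not deploy."""
    # seat is 'DRIVER' or 'PASSENGER'; the pattern tolerates "PASSENGERS" and "PASSENGER'S".
    pattern = rf"\b{seat}'?S?\s+AIR\s?BAG\s+(?:DIDN'T|DID NOT|FAILED TO)\s+DEPLOY\b"
    return "No Deploy" if re.search(pattern, text) else None

def extract_attributes(cmplid, text):
    """Turn one free-text complaint into a row of structured variables."""
    text = text.upper()
    # SEVERE?LY also matches the misspelling "SEVERLY" found in the raw data.
    if re.search(r"\bSEVERE?LY\s+INJURED\b", text):
        injury = "Severe"
    elif re.search(r"\bINJUR", text):
        injury = "Yes"
    else:
        injury = None
    return {
        "CMPLID": cmplid,
        "INJURY": injury,
        "ACCIDENT": "Yes" if re.search(r"\b(?:ACCIDENT|CRASH|COLLISION)\b", text) else "No",
        # ABS and a form of FAIL must occur in the same sentence (no period in between).
        "ABS": "Failed" if re.search(r"\bABS\b[^.]*\bFAIL", text) else None,
        "AIRBAG_DRV": airbag_status(text, "DRIVER"),
        "AIRBAG_PASS": airbag_status(text, "PASSENGER"),
    }

complaint = ("CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE "
             "FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.")
row = extract_attributes(17869, complaint)
print(row)
```

Attributes that the text does not mention come back as `None`, which corresponds to the empty AIRBAG_DRV cell in the table.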

Types of Information Extraction


Named entity recognition
Extract person or place names, monetary expressions, etc.

Co-reference resolution
Identify expressions that refer to the same entity
Alex Lang is our co-author. He is not at IOD this year.

Relationship detection
Extract entities (e.g., products and problems) and use data mining to find relationships among them
More robust than elaborate, hand-crafted rules
Associations, clustering, predictive modeling

Which Extraction Technique to Use?


Depends on the concepts to be extracted
Concept is a fixed list of instances: use dictionaries
Product names from database
List of employee names from LDAP

Concept follows a simple pattern: use regular expressions


Phone numbers, product codes, etc.

Concept follows complex pattern: use advanced analysis components


Relationships among concepts/entities
Additional tools/capabilities
Text exploration (e.g., OmniFind Analytics / Content Analyzer)
Customized annotators (e.g., sentiments)
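The first two options on this slide, dictionaries for fixed lists and regular expressions for simple patterns, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for the example: the product names, the phone and product-code formats, and the helper names `build_dictionary_matcher` and `extract_concepts` do not come from any IBM product.

```python
import re

# Hypothetical dictionary; in practice it could be loaded from a product database or LDAP.
PRODUCT_DICTIONARY = ["AcmeCam", "AcmeCam 100", "AcmeCam 200"]

# Hypothetical simple patterns: NNN-NNN-NNNN phone numbers and codes such as AB-1234.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
PRODUCT_CODE = re.compile(r"\b[A-Z]{2}-\d{4}\b")

def build_dictionary_matcher(terms):
    """Compile a case-insensitive matcher that prefers the longest dictionary entry."""
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

PRODUCT = build_dictionary_matcher(PRODUCT_DICTIONARY)

def extract_concepts(text):
    """Return every dictionary hit and pattern hit found in the text, by concept."""
    return {
        "product": [m.group(0) for m in PRODUCT.finditer(text)],
        "phone": PHONE.findall(text),
        "product_code": PRODUCT_CODE.findall(text),
    }

sample = "Customer called 555-123-4567 about the AcmeCam 200; replacement part AB-1234 shipped."
print(extract_concepts(sample))
```

Sorting the dictionary by length before building the alternation keeps a short entry such as "AcmeCam" from shadowing a longer one such as "AcmeCam 200".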

Text Analysis in InfoSphere Warehouse

Text Analytics in InfoSphere Warehouse


Data understanding: view text columns in database, text statistics, frequent terms analysis
Run advanced UIMA engines (IBM OmniFind, partners, IBM Research)

List of terms to extract
- Extract frequent term patterns
- Create entity dictionary from terms
Regular expression-based extraction
Enable hierarchical grouping of terms

Focused vs. Explorative Approaches


Focused
Create dictionaries, rules, or annotators to extract precisely the relevant concepts
More upstream effort for creation and testing
Allows analysis focused on a specific question
Example: Show the top 10 car parts that occur in repair reports for automobile make X

Explorative
Create dictionaries that contain terms with certain part-of-speech patterns (e.g., adjective-noun)
Use downstream analysis (e.g., association rule mining) to weed out irrelevant terms
Allows detection of unknown events or relationships
Example: Part X + Failure Attribute Y → Vehicle Crash

Explorative Approach: Example


Frequent terms + associations analysis

Frequent Terms Analysis: count the most frequently occurring terms for Auto Make X (relative occurrence may or may not be relevant to understanding the problem)

Associations Mining: discover correlated terms that are relevant to resolving the problem

Cognos report
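The two steps on this slide, counting frequent terms and then mining associations among them, can be sketched as follows. The sample documents are invented, and the functions `frequent_terms` and `pair_rules` are illustrative names; the sketch computes plain support, confidence, and lift for term pairs and is not the associations algorithm shipped with InfoSphere Warehouse.

```python
from collections import Counter
from itertools import combinations

def frequent_terms(docs, top_n):
    """Count in how many documents each term occurs and return the top_n terms."""
    counts = Counter(term for doc in docs for term in set(doc))
    return counts.most_common(top_n)

def pair_rules(docs, min_support):
    """Return (antecedent, consequent, support, confidence, lift) for term pairs."""
    n = len(docs)
    term_counts = Counter(term for doc in docs for term in set(doc))
    pair_counts = Counter(
        pair for doc in docs for pair in combinations(sorted(set(doc)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / term_counts[x]
            lift = confidence / (term_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Hypothetical terms already extracted from six complaint texts.
docs = [
    ["brake", "failure", "crash"],
    ["brake", "failure", "crash"],
    ["brake", "noise"],
    ["engine", "stall"],
    ["engine", "noise"],
    ["tire", "failure", "crash"],
]
print(frequent_terms(docs, 3))
for rule in pair_rules(docs, min_support=0.3):
    print(rule)
```

A term can be frequent without being informative ("brake" here), while a high-lift rule such as failure -> crash points at a relationship worth investigating.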


Using Information Extracted from Text


Enrich database with extracted terms/variables
Information Extraction
Identify Frequent Terms
Create Dictionary
Use Dictionary Lookup to extract terms
Regular Expression Extraction
Other Extraction Techniques

Text Field
Structured Data

Aggregate by Row

Enriched Data

Derived Variables
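The flow in this diagram (extract terms, aggregate them by row, then join the derived variables back to the structured data) can be approximated with a small Python sketch. The column names, the `HAS_` prefix, and the functions `aggregate_by_row` and `enrich` are assumptions made for illustration; in InfoSphere Warehouse this would be done inside a database flow rather than in application code.

```python
def aggregate_by_row(extracted, vocabulary):
    """Pivot (row_id, term) pairs into one 0/1 indicator per vocabulary term."""
    terms_by_row = {}
    for row_id, term in extracted:
        if term in vocabulary:
            terms_by_row.setdefault(row_id, set()).add(term)
    return {
        row_id: {f"HAS_{term.upper()}": int(term in terms) for term in vocabulary}
        for row_id, terms in terms_by_row.items()
    }

def enrich(structured_rows, derived, vocabulary):
    """Join derived variables to the structured rows; rows without hits get zeros."""
    empty = {f"HAS_{term.upper()}": 0 for term in vocabulary}
    return [{**row, **derived.get(row["ID"], empty)} for row in structured_rows]

# Hypothetical structured data and extraction output (one pair per term occurrence).
structured = [
    {"ID": 1, "MAKE": "X", "YEAR": 2005},
    {"ID": 2, "MAKE": "Y", "YEAR": 2006},
]
extracted = [(1, "brake"), (1, "airbag"), (1, "brake")]
vocabulary = ["brake", "airbag", "tire"]

derived = aggregate_by_row(extracted, vocabulary)
for row in enrich(structured, derived, vocabulary):
    print(row)
```

Aggregating to one row per record is what makes the extracted terms usable as ordinary input columns for reporting or mining.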


Using Information Extracted from Text


Use new structured variables for reporting and analytics
Create additional dimensions for OLAP, e.g.:
Extract skills mentioned in job postings
Add skills to OLAP cube to enable reports like "What are the top 10 skills sought by my competitors?"

Improve predictive power and interpretability of data mining models, e.g.:


Extract information on parts and failure attributes from safety complaints
Associations: find correlations between part failures and crashes
Decision tree: add information on parts and failure attributes to gain additional insights and improve predictive power

Practical Example:
Camera Product Review


Scenario
A company has gathered customer comments and product ratings on its digital cameras from an external forum.
Goal: Improve customer satisfaction by understanding the key drivers of customer sentiment
Identify camera features that are correlated with positive/negative reviews
Identify areas needing improvement and/or product differentiation


Live Demo
InfoSphere Warehouse: camera product review


Practical Example:
NHTSA Vehicle Safety Complaints


Scenario
The National Highway Traffic Safety Administration (NHTSA) COMPLAINTS dataset contains all safety-related defect complaints received by NHTSA since January 1, 1995.
Dataset contains structured variables (Make, Model, Year, etc.) and an unstructured text field (consumer complaints description).
Goal: Enrich predictive mining models of vehicle safety by incorporating variables extracted from text
Extract key variables related to vehicle safety
Combine extracted variables with existing ones to develop insights and improve predictive power of mining models of vehicle safety

Live Demo
InfoSphere Warehouse: NHTSA vehicle safety complaints


InfoSphere Warehouse and IBM Content Analyzer

Scenarios Using Advanced Text Analysis


Customer sentiment requires detection of
Negation: "Product X is not a good choice" → negative
Non-facts: "Product X is a good choice" vs. "Product X is perhaps a good choice"

Identify parts that failed without a list of parts


Want to extract "gasket failure", "wiring harness has failed", but not "severe failure", "when it has failed"
Requires rules like:
One or two nouns, followed by one or two arbitrary words, followed by "fail"

Can be addressed by analysis components from other IBM products, e.g., IBM Content Analyzer
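To make the negation and non-fact cases on this slide concrete, here is a deliberately small Python sketch. The word lists and the function `classify` are hypothetical, and the negation scope rule (a negator flips the next sentiment word) is a crude assumption; grammar-aware components such as those in IBM Content Analyzer would handle these cases far more robustly.

```python
import re

# Tiny hypothetical lexicons, just large enough for the slide's examples.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "no", "never"}
HEDGES = {"perhaps", "maybe", "possibly", "might"}

def classify(sentence):
    """Return (sentiment label, is_factual) for one sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    polarity = 0
    negated = False
    for token in tokens:
        if token in NEGATORS:
            negated = True  # flips the next sentiment word that follows
        elif token in POSITIVE or token in NEGATIVE:
            value = 1 if token in POSITIVE else -1
            polarity += -value if negated else value
            negated = False
    if polarity > 0:
        label = "positive"
    elif polarity < 0:
        label = "negative"
    else:
        label = "neutral"
    # A hedge word marks the statement as a non-fact without changing its polarity.
    is_factual = not any(token in HEDGES for token in tokens)
    return label, is_factual

for text in (
    "Product X is not a good choice",
    "Product X is a good choice",
    "Product X is perhaps a good choice",
):
    print(text, "->", classify(text))
```

Keeping polarity and factuality as separate outputs lets downstream analysis decide whether hedged opinions should count at all.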

InfoSphere Warehouse and IBM Content Analyzer


InfoSphere Warehouse 9.5
Extract predefined concepts, based on lists and regular expressions
Text analysis is a key component within ETL flow
Results can be used in data mining and reporting

IBM Content Analyzer


Extract concepts based on grammar and parts of speech
Combine search and text mining to discover insights

Combined approach
Use IBM Content Analyzer to explore documents and identify relevant concepts
Operationalize insights by putting Content Analyzer text analysis into InfoSphere Warehouse ETL flows

Using Content Analyzer In InfoSphere Flow


Use ICA Text Analysis to extract nouns that are followed by "fail" or "leak"

Descriptions of vehicle failures

Result: car parts that failed
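The idea of extracting the noun in front of "fail" or "leak" can be imitated without a part-of-speech tagger, as in the sketch below. This is a rough stand-in, not how ICA works: a hand-made list of non-nouns (`NOT_NOUNS`) replaces real part-of-speech tags, only the single word before the verb is captured, and the function name `failed_parts` is hypothetical.

```python
import re

# Crude substitute for part-of-speech tagging: words that are clearly not part names.
NOT_NOUNS = {"it", "they", "that", "which", "when", "severe", "total", "the", "a", "and"}

# One word, an optional auxiliary verb, then a form of "fail" or "leak".
PATTERN = re.compile(
    r"\b([a-z]+)\s+(?:(?:has|have|had|is|was|were|are)\s+)?"
    r"(?:fail|leak)(?:s|ed|ing|ure|ures|age)?\b"
)

def failed_parts(description):
    """Return candidate part names that directly precede a fail/leak expression."""
    candidates = (match.group(1) for match in PATTERN.finditer(description.lower()))
    return [word for word in candidates if word not in NOT_NOUNS]

text = ("The gasket failed. Wiring harness has failed after a severe failure. "
        "Coolant hose is leaking when it has failed.")
print(failed_parts(text))
```

The filter drops "severe failure" and "it has failed", the two kinds of false hits that the earlier slide warns about, but it would miss multi-word parts such as "wiring harness", which is where grammar-based analysis earns its keep.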


Trends in Unstructured Analytics


Trends in Unstructured Analytics


Speech analytics
Combines traditional text analytics with speech as the source of the text
Gives access to "voice of the customer" for a wide range of interesting insights into customer behavior, e.g.:
Identifying cross-sell and up-sell opportunities
Identifying indicators of high risk of lapsing (e.g., expressing dissatisfaction, mentioning competitors)

Multi-modal image analysis


Improved detection of patterns and anomalies in images
Example: medical imaging to look for evidence of disease or injury

Trends in Unstructured Analytics (cont'd)


Sentiment detection in web sources
Insights on products and companies (blogs, chats)
Sentiments influence product/service directions

Noisy unstructured data analysis


Extract information from highly noisy unstructured text sources such as:
Online chats, text messages, emails, message boards, newsgroups, blogs, wikis, web pages, printed/handwritten text
Text produced by processing speech

Noise includes:
Spelling errors, abbreviations, non-standard words, missing punctuation, missing case information, pauses, verbal fillers
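A first cleaning pass over noisy text of this kind might look like the sketch below. It covers only three of the listed noise types (case, abbreviations, verbal fillers), and both lookup tables and the function `normalize` are invented for the example; spelling correction and handling of speech recognition errors need considerably more than this.

```python
import re

# Hypothetical lookup tables; real ones would be built from the source being analyzed.
FILLERS = {"um", "uh", "er", "hmm"}
ABBREVIATIONS = {"pls": "please", "u": "you", "thx": "thanks", "acct": "account"}

def normalize(text):
    """Lowercase, drop punctuation and verbal fillers, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue  # verbal fillers carry no content
        cleaned.append(ABBREVIATIONS.get(token, token))
    return " ".join(cleaned)

print(normalize("Um, pls fix my acct... uh THX"))
```

Normalizing before dictionary or pattern extraction means the dictionaries do not have to list every spelling variant of a term.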

Summary
Text analysis is becoming increasingly important with the rapid growth in unstructured data.
InfoSphere Warehouse (ISW) provides many capabilities for practical text analysis.
ISW provides an integrated platform for data mining, text analysis, and reporting.
Practical examples illustrate how to perform text analytics and combine it with data mining.

IBM Content Analyzer and UIMA-compliant annotators can extend text analysis capabilities.
Emerging unstructured analytics technologies are extending the value and applications of text analytics in many important fields.
IBM Research is active in many of these areas.

Disclaimer
Copyright IBM Corporation 2008. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

IBM, the IBM logo, ibm.com, InfoSphere Warehouse, Content Analyzer, and OmniFind Analytics Edition are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.

