Practical Text Analytics

Reading Between the Lines: Practical Text Analytics
John B. Rollins, Ph.D., P.E., BI Solution Architect, IBM Corporation, rollins@us.ibm.com
Ramasubbu Venkatesh, Ph.D., Data Mining Specialist, IBM Corporation, rvenkat@us.ibm.com
Alexander Lang, Ph.D., Software Engineer, IBM Corporation, alexlang@de.ibm.com
Stefan Abraham, Ph.D., Software Engineer, IBM Corporation, stefana@de.ibm.com
Session Number 2336
Agenda
Business motivation for analyzing unstructured data
Text analysis: from text to structure
Text analysis in InfoSphere Warehouse
Practical examples
InfoSphere Warehouse and IBM Content Analyzer
Trends in unstructured analytics
Information Irony
Water, water, everywhere, Nor any drop to drink.
Samuel Taylor Coleridge, The Rime of the Ancient Mariner
Data is being generated at an unprecedented rate
IDC estimates data will grow from 161 exabytes in 2006 to 988 exabytes in 2010 (1 exabyte = 1 billion gigabytes!)
Much of it (80% by some estimates) is unstructured
Information rich
Extracting value is difficult
Our focus is unstructured TEXTUAL INFORMATION!
Information Warehouse Growth Trend

TDWI Survey, 2007
Rapid Growth in Unstructured Data
Respondents expect huge increase in unstructured data as warehouse sources

Collaborative content (email, IM, wikis)
Content management
Voice transcriptions
Claims records
Chart and survey: P. Russom, BI Search and Text Analytics, TDWI Best Practices Report, 2007
Business Scenarios for Unstructured Data

Example: improve product innovation and quality
Use information from customer service records, repair notes, online reviews, and other unstructured sources
Reduce reliance on static, predefined problem codes by extracting detailed problem descriptions
Understand faster why a product has problems
Identify gaps in the product portfolio and new functionality to drive product innovation

Product Innovation and Quality
Static Problem Codes (which problems occurred)
Text Analysis (why problems occurred)


Example: reduce customer churn
Identify unhappy customers as early as possible
Analyze text to identify emerging problems, e.g., call center complaints about dropped calls
It may be too late to take action by the time a problem appears in structured data, e.g., a declining number of calls over time for a given cell phone service provider
Analyze customer email and call center logs to detect negative sentiment
Is the customer angry?
Is the customer mentioning competitors' pricing offers?
Is the customer complaining about a particular problem with a service or product?
Text Analysis: From Text to Structure
Text to Structure
Create new structured variables from text
Example: analysis of vehicle complaints report
Extract accident attributes from text field
Create new variables to represent accident attributes

Text field:
CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.

Extracted variables:
CMPLID   INJURY   ACCIDENT   ABS      AIRBAG_DRV   AIRBAG_PASS
17869    Severe   Yes        Failed   -            No Deploy
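
The text-to-structure step for the complaint above can be sketched in a few lines of Python. This is only an illustration, not the extraction logic used in InfoSphere Warehouse: the function name, the regular expressions, and the use of "-" for "not mentioned" are all assumptions made for this example.

```python
import re

COMPLAINT = ("CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE "
             "FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.")

def extract_accident_attributes(cmplid, text):
    """Map a free-text complaint to structured variables (hypothetical rules)."""
    t = text.upper()
    # "-" is an assumed placeholder meaning "not mentioned in the text".
    row = {"CMPLID": cmplid, "INJURY": "-", "ACCIDENT": "No", "ABS": "-",
           "AIRBAG_DRV": "-", "AIRBAG_PASS": "-"}
    # Tolerate the misspelling "SEVERLY" that appears in the raw complaint.
    if re.search(r"\bSEVERE?LY\s+INJURED\b", t):
        row["INJURY"] = "Severe"
    elif re.search(r"\bINJUR(ED|Y|IES)\b", t):
        row["INJURY"] = "Yes"
    if re.search(r"\b(ACCIDENT|CRASH|COLLISION)\b", t):
        row["ACCIDENT"] = "Yes"
    # ABS mentioned and a "fail" word later in the same sentence.
    if re.search(r"\b(ABS|ANTI-LOCK)\b[^.]*?\bFAIL(ED|S|URE)?\b", t):
        row["ABS"] = "Failed"
    no_deploy = r"AIR\s?BAGS?\s+(DIDN'T|DID NOT|FAILED TO)\s+DEPLOY"
    if re.search(r"\bPASSENGER(S|'S)?\s+" + no_deploy, t):
        row["AIRBAG_PASS"] = "No Deploy"
    if re.search(r"\bDRIVER(S|'S)?\s+" + no_deploy, t):
        row["AIRBAG_DRV"] = "No Deploy"
    return row

row = extract_accident_attributes(17869, COMPLAINT)
```

Hand-written patterns like these are brittle; they are shown only to make the idea of deriving variables from a text field concrete.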
Types of Information Extraction

Named entity recognition
Extract person or place names, monetary expressions, etc.
Co-reference resolution
Identify expressions that refer to the same entity
Example: "Alex Lang is our co-author. He is not at IOD this year." ("He" refers to Alex Lang.)
Relationship detection
Extract entities (e.g., products and problems) and use data mining to find relationships among them
More robust than elaborate, hand-crafted rules
Associations, clustering, predictive modeling
Which Extraction Technique to Use?

Depends on the concepts to be extracted
Concept is a fixed list of instances: use dictionaries
Product names from database
List of employee names from LDAP
Concept follows a simple pattern: use regular expressions
Phone numbers, product codes, etc.
Concept follows a complex pattern: use advanced analysis components
Relationships among concepts/entities
Additional tools/capabilities
Text exploration (e.g., OmniFind Analytics / Content Analyzer)
Customized annotators (e.g., sentiments)
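
The first two cases (fixed list, simple pattern) can be sketched as follows. The product names, the phone-number format, and the product-code format are hypothetical, chosen only for illustration. In practice the dictionary would be loaded from a database or LDAP, as noted above.

```python
import re

# Hypothetical dictionary of product names (a real one would come from a database).
PRODUCT_DICTIONARY = ["Alpha 100", "Alpha 200", "Zoom Pro"]

# Assumed formats, for illustration only.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")      # e.g. 555-123-4567
PRODUCT_CODE_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")  # e.g. ZX-2041

def dictionary_lookup(text, terms):
    """Return the dictionary terms found in text (case-insensitive, whole words)."""
    found = []
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.append(term)
    return found

def extract_concepts(text):
    """Use the dictionary for fixed lists and regular expressions for simple patterns."""
    return {
        "products": dictionary_lookup(text, PRODUCT_DICTIONARY),
        "phones": PHONE_PATTERN.findall(text),
        "product_codes": PRODUCT_CODE_PATTERN.findall(text),
    }

note = "Customer called from 555-123-4567 about the alpha 200 (code ZX-2041)."
result = extract_concepts(note)
```

Here `result["products"]` holds the canonical dictionary spelling "Alpha 200" even though the note uses lowercase, which is one reason to prefer dictionaries for fixed lists.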
Text Analysis in InfoSphere Warehouse
Text Analytics in InfoSphere Warehouse

Data understanding: View text columns in database, text statistics, frequent terms analysis
Run advanced UIMA engines (IBM OmniFind, partners, IBM Research)
List of terms to extract
- Extract frequent term patterns
- Create entity dictionary from terms
Regular expression-based extraction
Enable hierarchical grouping of terms
Focused vs. Explorative Approaches

Focused
Create dictionaries, rules, or annotators to extract precisely the relevant concepts
More upstream effort for creation and testing
Allows analysis focused on a specific question
Example: "Show the top 10 car parts that occur in repair reports for automobile make X"
Explorative
Create dictionaries that contain terms with certain part-of-speech patterns (e.g., adjective-noun)
Use downstream analysis (e.g., association rule mining) to weed out irrelevant terms
Allows detection of unknown events or relationships
Example: Part X + Failure Attribute Y -> Vehicle Crash
Explorative Approach: Example

Frequent terms + associations analysis (Cognos report)
Frequent terms analysis: count the most frequently occurring terms for Auto Make X (relative occurrence may or may not be relevant to understanding the problem)
Associations mining: discover correlated terms that are relevant to resolving the problem
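
The two steps above can be sketched on a toy dataset. The term sets, thresholds, and function names below are hypothetical. The pairwise rule search is a simplified stand-in for a full association-mining algorithm and is not the InfoSphere Warehouse implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical term sets, one per complaint, as produced by an earlier extraction step.
records = [
    {"brake", "failure", "crash"},
    {"brake", "failure", "crash"},
    {"brake", "noise"},
    {"airbag", "failure"},
    {"brake", "failure"},
]

def frequent_terms(records, min_count=2):
    """Count in how many records each term occurs; keep the frequent ones."""
    counts = Counter(term for rec in records for term in rec)
    return {t: c for t, c in counts.items() if c >= min_count}

def association_rules(records, min_support=0.3, min_confidence=0.6):
    """Find pairwise rules lhs -> rhs with their support, confidence, and lift."""
    n = len(records)
    term_counts = Counter(t for rec in records for t in rec)
    pair_counts = Counter(p for rec in records for p in combinations(sorted(rec), 2))
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = c / term_counts[lhs]
            lift = confidence / (term_counts[rhs] / n)
            if confidence >= min_confidence:
                rules.append({"lhs": lhs, "rhs": rhs, "support": support,
                              "confidence": confidence, "lift": lift})
    return sorted(rules, key=lambda r: -r["lift"])

top_terms = frequent_terms(records)
rules = association_rules(records)
```

In this toy data "brake" and "failure" are the most frequent terms, but the rules with the highest lift involve "crash". This mirrors the point that raw frequency alone may not be relevant to understanding the problem.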
Using Information Extracted from Text

Enrich database with extracted terms/variables
Information extraction:
Identify frequent terms
Create dictionary
Use dictionary lookup to extract terms
Regular expression extraction
Other extraction techniques
[Diagram labels: Text Field, Structured Data, Aggregate by Row, Derived Variables, Enriched Data]
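
A minimal sketch of the "aggregate by row" idea: extraction typically yields several hits per text field, which are collapsed into one set of derived variables per record and then joined to the structured data. The record layout, concept names, and values below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical structured rows keyed by complaint id (values are made up).
structured = {
    101: {"MAKE": "X", "YEAR": 2004},
    102: {"MAKE": "X", "YEAR": 2005},
    103: {"MAKE": "Y", "YEAR": 2004},
}

# Hypothetical extraction output: one (complaint id, concept, term) triple per hit.
extracted = [
    (101, "PART", "brake"),
    (101, "PART", "airbag"),
    (101, "PART", "brake"),      # repeated mention in the same text
    (101, "FAILURE", "leak"),
    (102, "PART", "tire"),
]

def aggregate_by_row(extracted):
    """Collect the distinct terms per complaint id and concept."""
    derived = defaultdict(lambda: defaultdict(set))
    for rec_id, concept, term in extracted:
        derived[rec_id][concept].add(term)
    return derived

def enrich(structured, extracted, concepts=("PART", "FAILURE")):
    """Add one derived column per concept to every structured row."""
    derived = aggregate_by_row(extracted)
    enriched = {}
    for rec_id, row in structured.items():
        new_row = dict(row)
        for concept in concepts:
            terms = derived[rec_id][concept] if rec_id in derived else set()
            new_row[concept] = ",".join(sorted(terms))
        enriched[rec_id] = new_row
    return enriched

enriched = enrich(structured, extracted)
```

Records with no extracted terms (103 here) keep their structured values and get empty derived columns, so the enriched table has the same number of rows as the original.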
Using Information Extracted from Text

Use new structured variables for reporting and analytics
Create additional dimensions for OLAP, e.g.:
Extract skills mentioned in job postings
Add skills to OLAP cube to enable reports like "What are the top 10 skills sought by my competitors?"
Improve predictive power and interpretability of data mining models, e.g.:
Extract information on parts and failure attributes from safety complaints
Associations: find correlations between part failures and crashes
Decision tree: add information on parts and failure attributes to gain additional insights and improve predictive power
Practical Example: Camera Product Review
Scenario
A company has gathered customer comments and product ratings on its digital cameras from an external forum.
Goal: Improve customer satisfaction by understanding the key drivers of customer sentiment
Identify camera features that are correlated with positive/negative reviews
Identify areas needing improvement and/or product differentiation
Live Demo
InfoSphere Warehouse: camera product review
Practical Example: NHTSA Vehicle Safety Complaints
Scenario
The National Highway Traffic Safety Administration (NHTSA) COMPLAINTS dataset contains all safety-related defect complaints received by NHTSA since January 1, 1995.
The dataset contains structured variables (Make, Model, Year, etc.) and an unstructured text field (consumer complaint description).
Goal: Enrich predictive mining models of vehicle safety by incorporating variables extracted from text
Extract key variables related to vehicle safety
Combine extracted variables with existing ones to develop insights and improve predictive power of mining models of vehicle safety
Live Demo
InfoSphere Warehouse: NHTSA vehicle safety complaints
InfoSphere Warehouse and IBM Content Analyzer
Scenarios Using Advanced Text Analysis

Customer sentiment requires detection of
Negation: "Product X is not a good choice" -> negative
Non-facts: "Product X is a good choice" vs. "Product X is perhaps a good choice"
Identify parts that failed without a list of parts
Want to extract "gasket failure" and "wiring harness has failed", but not "severe failure" or "when it has failed"
Requires rules like:
One or two nouns, followed by one or two arbitrary words, followed by "fail"
Can be addressed by analysis components from other IBM products, e.g., IBM Content Analyzer
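
A rough sketch of such a rule is shown below. It is not how IBM Content Analyzer works. A tiny hand-made noun list stands in for a real part-of-speech tagger, and the rule is relaxed to allow zero to two intervening words so that "gasket failure" also matches. Both are assumptions made for this example.

```python
import re

# Toy stand-in for a part-of-speech tagger: ONLY these words count as nouns here.
TOY_NOUNS = {"gasket", "wiring", "harness", "pump", "fuel", "hose"}

def failed_parts(text, max_gap=2):
    """Find one or two nouns followed by at most max_gap words and a 'fail' word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    parts = []
    for i, tok in enumerate(tokens):
        if not tok.startswith("fail"):          # fail, fails, failed, failure
            continue
        for gap in range(max_gap + 1):
            last = i - gap - 1                  # candidate noun closest to the "fail" word
            if last < 0:
                break
            if tokens[last] in TOY_NOUNS:
                # Extend to a two-noun phrase when the preceding word is also a noun.
                two_nouns = last >= 1 and tokens[last - 1] in TOY_NOUNS
                first = last - 1 if two_nouns else last
                parts.append(" ".join(tokens[first:last + 1]))
                break
    return parts

examples = ["gasket failure", "The wiring harness has failed",
            "severe failure", "when it has failed"]
results = {text: failed_parts(text) for text in examples}
```

With a real tagger the noun test would come from the tagged tokens rather than a fixed list, which is what lets the rule find parts without a predefined list of parts.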
InfoSphere Warehouse and IBM Content Analyzer

InfoSphere Warehouse 9.5
Extract predefined concepts, based on lists and regular expressions
Text analysis is a key component within ETL flow
Results can be used in data mining and reporting
IBM Content Analyzer
Extract concepts based on grammar and parts of speech
Combine search and text mining to discover insights
Combined approach
Use IBM Content Analyzer to explore documents and identify relevant concepts
Operationalize insights by putting Content Analyzer text analysis into InfoSphere Warehouse ETL flows
Using Content Analyzer In InfoSphere Flow

Use IBM Content Analyzer (ICA) text analysis to extract nouns that are followed by "fail" or "leak"
Descriptions of vehicle failures
Result: car parts that failed
Trends in Unstructured Analytics

Speech analytics
Combines traditional text analytics with speech as the source of the text
Gives access to the "voice of the customer" for a wide range of insights into customer behavior, e.g.:
Identifying cross-sell and up-sell opportunities
Identifying indicators of high risk of lapsing (e.g., expressing dissatisfaction, mentioning competitors)
Multi-modal image analysis

Improved detection of patterns and anomalies in images
Example: medical imaging to look for evidence of disease or injury
Trends in Unstructured Analytics (cont'd)

Sentiment detection in web sources
Insights on products and companies (blogs, chats)
Sentiments influence product/service directions
Noisy unstructured data analysis

Extract information from highly noisy unstructured text sources such as:
Online chats, text messages, emails, message boards, newsgroups, blogs, wikis, web pages, printed/handwritten text
Text produced by processing speech
Noise includes:
Spelling errors, abbreviations, non-standard words, missing punctuation, missing case information, pauses, verbal fillers
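
Some of these noise types can be reduced with a simple normalization pass before extraction. The sketch below is illustrative only: the lookup tables are hypothetical and tiny, and real systems would need much larger resources or statistical methods.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {"pls": "please", "u": "you", "thx": "thanks", "svc": "service"}
MISSPELLINGS = {"recieve": "receive", "severly": "severely"}
FILLERS = {"um", "uh", "er"}

def normalize(text):
    """Lowercase, drop punctuation and verbal fillers, expand abbreviations, fix known misspellings."""
    # Lowercasing sidesteps missing case information; the token pattern drops punctuation.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        tok = ABBREVIATIONS.get(tok, tok)
        tok = MISSPELLINGS.get(tok, tok)
        cleaned.append(tok)
    return " ".join(cleaned)

clean = normalize("Um, pls fix my svc... I did not RECIEVE the refund, thx")
```

After this pass, dictionary lookups and regular expressions have a better chance of matching terms that were abbreviated or misspelled in the source.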
Summary
Text analysis is becoming increasingly important with the rapid growth in unstructured data.
InfoSphere Warehouse (ISW) provides many capabilities for practical text analysis.
ISW provides an integrated platform for data mining, text analysis, and reporting.
Practical examples illustrate how to perform text analytics and combine it with data mining.
IBM Content Analyzer and UIMA-compliant annotators can extend text analysis capabilities.
Emerging unstructured analytics technologies are extending the value and applications of text analytics in many important fields.
IBM Research is active in many of these areas.
Disclaimer
Copyright IBM Corporation 2008. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM, the IBM logo, ibm.com, InfoSphere Warehouse, Content Analyzer, and OmniFind Analytics Edition are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.