
Hands on Classification with Learning Based Java

Gourab Kundu
Adapted from a talk by Vivek Srikumar

Goals of this tutorial


At the end of these lectures, you will be able to
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
3. Understand how features can impact classifier performance, and add features to improve your application
4. Build a badge classifier based on character features

A Quick Recap

Given: Examples $(x, f(x))$ of some unknown function $f$
Find: A good approximation of $f$

$x$ provides some representation of the input
    The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
    $x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$

$f(x)$ is the target function (label)
    $f(x) \in \{-1,+1\}$: Binary classification
    $f(x) \in \{1,2,3,\ldots,k-1\}$: Multi-class classification
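To make the representation $x \in \{0,1\}^n$ concrete, here is a minimal Java sketch of feature extraction over a fixed vocabulary. The class name, the tiny vocabulary, and the naive tokenizer are assumptions made for illustration; they are not part of LBJ or of the course code.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: maps a document to a binary vector x in {0,1}^n over a fixed vocabulary. */
class BinaryFeatureExtractionSketch {

    /** Returns x where x[i] = 1 if vocabulary word i occurs in the text, else 0. */
    static int[] extract(String text, List<String> vocabulary) {
        // Naive tokenization (an assumption): lowercase, split on non-alphanumerics.
        List<String> tokens = Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical four-word vocabulary, so n = 4.
        List<String> vocabulary = Arrays.asList("save", "software", "meeting", "enron");
        int[] x = extract("Subject: save over 70 % on name brand software", vocabulary);
        System.out.println(Arrays.toString(x)); // prints [1, 1, 0, 0]
    }
}
```

Each position of the vector answers one yes/no question about the document, which is the sense in which $x$ "represents" the input.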


What is text classification?


A document → A classifier (black box) → Some labels

Several applications fit this framework

Spam detection

Sentiment classification

What else could you do if you had such a black box system that can classify text? Spend 30 seconds brainstorming.

Outline of this session


Getting started with LBJ
Writing our first classifier: Spam/Ham
Playing with features
Looking inside the black box classifier for feature weights

Writing classifiers

LEARNING BASED JAVA

What is Learning Based Java?

A modeling language for learning and inference

Supports

Programming using learned models
High level specification of features and constraints between classifiers
Inference with constraints
Different learning algorithms

The learning operator


Classifiers are functions defined in terms of data
Learning happens at compile time

What does LBJ do for you?

Abstracts away the feature representation, learning and inference
Allows you to write learning based programs
Application developers can reason about the application at hand

Demo

A learning based program

First, we will write an application that assumes the existence of a black box classifier

SPAM DETECTION

Spam detection
Which of these (if any) are email spam? How do you know?

Subject: save over 70 % on name brand software ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

Subject: please keep in touch just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .

What do we need to build a classifier?


1. Annotated documents*
2. A feature representation of the documents
3. A learning algorithm

* Here we are dealing with supervised learning

Our first LBJ program


/** A learned text classifier; its definition comes from data. */
discrete TextClassifier(Document d) <-
learn TextLabel
  using WordFeatures
  from new DocumentReader("data/spam/train")
  with SparseAveragedPerceptron {
    learningRate = 0.1;
    thickness = 3.5;
  }
  5 rounds
  testFrom new DocumentReader("data/spam/test")
end

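Reading the program piece by piece:
discrete TextClassifier: defines a classifier
Document d: the object being classified
learn TextLabel: the function being learned
using WordFeatures: the feature representation
from new DocumentReader("data/spam/train"): the source of the training data
with SparseAveragedPerceptron: the learning algorithm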
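To give a feel for what a learner with a `learningRate` and a `thickness` does with the training data, here is a small plain-Java sketch of a mistake-driven linear classifier with a margin. It is not LBJ's SparseAveragedPerceptron (in particular it does no weight averaging); the class, its methods, and the two toy examples are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a margin perceptron over sparse word features (not LBJ's implementation). */
class ThickPerceptronSketch {
    final Map<String, Double> weights = new HashMap<>();
    final double learningRate;
    final double thickness;

    ThickPerceptronSketch(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    /** Sum of the weights of the active features. */
    double score(List<String> features) {
        double s = 0.0;
        for (String f : features) s += weights.getOrDefault(f, 0.0);
        return s;
    }

    /** label is +1 (spam) or -1 (ham); update unless the example lies beyond the margin. */
    void update(List<String> features, int label) {
        if (label * score(features) <= thickness) {
            for (String f : features) weights.merge(f, learningRate * label, Double::sum);
        }
    }

    String predict(List<String> features) {
        return score(features) > 0 ? "spam" : "ham";
    }

    public static void main(String[] args) {
        // Two toy training examples (hypothetical data), trained for 5 rounds.
        List<String> spam = Arrays.asList("save", "software");
        List<String> ham = Arrays.asList("meeting", "enron");
        ThickPerceptronSketch p = new ThickPerceptronSketch(0.1, 3.5);
        for (int round = 0; round < 5; round++) {
            p.update(spam, +1);
            p.update(ham, -1);
        }
        System.out.println(p.predict(spam)); // prints spam
        System.out.println(p.predict(ham));  // prints ham
    }
}
```

The learned weights map is also what "looking inside the black box" amounts to in this sketch: each feature carries a single number that pushes the decision toward spam or ham.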

Demo

Let's build a spam detector

How to train?
How do different learning algorithms perform? Does this choice matter much?

Features

Our current spam detector uses words as features

Can we do better?

Let's try it out
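One candidate for "doing better" than single words is to add pairs of consecutive words as features. The sketch below is plain Java rather than LBJ, and the class name, the whitespace tokenizer, and the `_` separator are all assumptions for illustration; whether such features actually help is something to check on the test data.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: emits every word plus every pair of consecutive words as features. */
class BigramFeaturesSketch {

    static List<String> wordsAndBigrams(String text) {
        // Naive tokenization (an assumption): lowercase and split on whitespace.
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<String> features = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            features.add(words[i]);
            if (i + 1 < words.length) {
                features.add(words[i] + "_" + words[i + 1]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(wordsAndBigrams("name brand software"));
        // prints [name, name_brand, brand, brand_software, software]
    }
}
```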

MORE TEXT CLASSIFICATION

Sentiment classification
Which of these product reviews is positive? How do you know?

I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep

Classifying news groups


Which mailing list should this message be posted to? How do you know?

I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

Demo

Converting our spam classifier into
    A sentiment classifier
    A newsgroup classifier

Note: How different are these at the implementation level?

Most of the engineering lies in the features


A document → A classifier (black box) → Some labels

Summary

What is LBJ? How do we use it?

Writing a simple spam detector


Playing with features
How much do we need to change to move to a different application?

Assignment before Next Class (Not Graded)

Download the code & data (http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-12/handsonclassification.html) for this class and play with it
Try to solve the Badges game puzzle with LBJ
    Think about what features are needed
    Write a parser for reading the data
    Write a classifier for solving the puzzle

Next Class

We will solve the Badges Game puzzle by Machine Learning

We will look at more text classification examples


We will think about a famous people classifier

Questions

Badge Classifier

Brainstorm the possible Features

Characters in entire name
Two consecutive characters
Character as vowel, character as consonant
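The brainstormed feature types can be sketched as a plain-Java feature extractor over a name. The class name, the feature-string formats (`char=`, `pos…=`, `pair=`), and the choice to record vowel/consonant per position are assumptions for illustration; an LBJ version would express the same idea as a feature classifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: character-level features for a name in the Badges game. */
class BadgeFeaturesSketch {

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static List<String> extract(String name) {
        String s = name.toLowerCase();
        List<String> features = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) continue; // skip spaces and punctuation
            features.add("char=" + c);                                            // character in the name
            features.add("pos" + i + "=" + (isVowel(c) ? "vowel" : "consonant")); // vowel or consonant
            if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
                features.add("pair=" + c + s.charAt(i + 1));                      // two consecutive characters
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("Al"));
        // prints [char=a, pos0=vowel, pair=al, char=l, pos1=consonant]
    }
}
```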

Feature engineering is important (especially if labeled data is small)
What is the baseline? 70 +, 24 -
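Assuming "70 +, 24 -" counts positive and negative examples (an interpretation, not something stated explicitly here), a classifier that always predicts the majority label "+" would be correct on the following fraction of the data. Any learned classifier should beat this number to be worth its features.

```latex
\[
\text{accuracy}_{\text{majority}} = \frac{70}{70 + 24} = \frac{70}{94} \approx 0.745
\]
```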

THE FAMOUS PEOPLE CLASSIFIER

The Famous People Classifier

f([image]) = Politician
f([image]) = Athlete
f([image]) = Corporate Mogul

The NLP version of the fame classifier


Each person is represented by:

All sentences in the news in which the string "Barack Obama" occurs
All sentences in the news in which the string "Roger Federer" occurs
All sentences in the news in which the string "Bill Gates" occurs
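Building this representation amounts to filtering the news text by name. Here is a minimal Java sketch, assuming the text has already been split into sentences; the class name and the toy sentences are hypothetical, and exact substring matching is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: collects the sentences in which a given name string occurs. */
class MentionSentencesSketch {

    static List<String> sentencesMentioning(String name, List<String> sentences) {
        List<String> hits = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentence.contains(name)) { // exact, case-sensitive substring match
                hits.add(sentence);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy sentences invented for this example, not real news data.
        List<String> news = Arrays.asList(
            "Roger Federer won the match in straight sets.",
            "The committee met on Tuesday.",
            "Fans cheered as Roger Federer walked onto the court.");
        System.out.println(sentencesMentioning("Roger Federer", news).size()); // prints 2
    }
}
```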

Our goal

Find famous athletes, corporate moguls and politicians

Athlete: Michael Schumacher, Michael Jordan
Politician: Bill Clinton, George W. Bush
Corporate Mogul: Warren Buffett, Larry Ellison

Let's brainstorm

How do we build a fame classifier?


Remember, we start off with just raw text from a news website

One solution

Let us label entities using features defined on mentions

All sentences in the news in which the string "Barack Obama" occurs

Identify mentions using the named entity recognizer
Define features based on the words, parts of speech and dependency trees
Train a classifier

Summary
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
3. Understand how features can impact classifier performance, and add features to improve your application
4. Build a badge classifier based on character features

Questions
