Given a set of classes, we seek to determine which class(es) a given object belongs to. A class need not be as narrowly focused as the standing query "multicore computer chips". Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification.
In text classification, we are given a description d ∈ X of a document, where X is the document space, and a fixed set of classes C = {c1, c2, …, cJ}. Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:
γ : X → C
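A classifier in this sense is simply a function from the document space X to the class set C. The toy rule-based γ below (a hypothetical two-class example, not from the text) illustrates the type signature; a learning method produces such a function automatically from training data rather than by hand-coding the rule.

```python
# gamma maps a document (here, a raw string) to a class label.
# The keyword rule and the class names are made up for illustration.
def gamma(document: str) -> str:
    return "china" if "china" in document.lower() else "not-china"

print(gamma("China joins the WTO"))       # classified as "china"
print(gamma("Parliament meets in London"))
```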
The Bernoulli model
The multinomial NB model is formally identical to the multinomial unigram language model.
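Because the multinomial NB model is a multinomial unigram language model estimated per class, it can be sketched in a few lines. The sketch below assumes add-one (Laplace) smoothing and maximum-likelihood priors; the training data in the usage example is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, label) pairs.
    Returns class priors and add-one-smoothed estimates of P(t|c)."""
    prior = Counter(label for _tokens, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    cond = {}
    for c in prior:
        total = sum(term_counts[c].values())
        cond[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                   for t in vocab}
    priors = {c: prior[c] / len(docs) for c in prior}
    return priors, cond, vocab

def apply_multinomial_nb(priors, cond, vocab, tokens):
    """Return argmax_c log P(c) + sum_t log P(t|c); unseen terms are skipped."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Invented two-class training set: is the document about China?
docs = [(["Chinese", "Beijing", "Chinese"], "yes"),
        (["Chinese", "Chinese", "Shanghai"], "yes"),
        (["Chinese", "Macao"], "yes"),
        (["Tokyo", "Japan", "Chinese"], "no")]
priors, cond, vocab = train_multinomial_nb(docs)
print(apply_multinomial_nb(priors, cond, vocab,
                           ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]))
```

Working in log space avoids floating-point underflow when multiplying many small conditional probabilities.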
Properties of Naive Bayes
To gain a better understanding of the two models and the assumptions they make, let us go back and examine how we derived their classification rules in Chapters 11 and 12.
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes.
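The simplest selection criterion is raw frequency: keep only the k most common terms in the training set. The sketch below uses that criterion; mutual information and chi-square are common, usually better, alternatives not shown here.

```python
from collections import Counter

def select_features(docs, k):
    """docs: list of (token_list, label) pairs.
    Return the k most frequent terms in the training set; downstream
    classifiers then ignore all other terms."""
    counts = Counter(t for tokens, _label in docs for t in tokens)
    return {t for t, _count in counts.most_common(k)}
```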
Contiguity hypothesis.
Documents in the same class form a contiguous region, and regions of different classes do not overlap. When applying two-class classifiers to problems with more than two classes, we distinguish one-of tasks, in which a document must be assigned to exactly one of several mutually exclusive classes, from any-of tasks, in which a document can be assigned to any number of classes, as we will explain.
Document representations and measures of relatedness in vector spaces
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure.
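For two documents represented as equal-length vectors, Euclidean distance is the square root of the summed squared componentwise differences:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two document vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```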
Rocchio classification
Documents are shown as circles, diamonds and X’s. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries where “good” means high classification accuracy on data unseen during training.
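Rocchio classification defines such boundaries via class centroids: each class is represented by the component-wise mean of its training vectors, and a test document is assigned to the class of the nearest centroid. A minimal sketch (term weighting such as tf-idf is omitted):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(xs) / n for xs in zip(*vectors))

def train_rocchio(labeled_vectors):
    """labeled_vectors: dict mapping class label -> list of document vectors."""
    return {c: centroid(vs) for c, vs in labeled_vectors.items()}

def apply_rocchio(centroids, doc):
    # Assign the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda c: euclidean(centroids[c], doc))
```

The decision boundary between two classes is then the set of points equidistant from the two centroids, a hyperplane.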
k nearest neighbor
Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
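The majority-vote rule above can be sketched directly. This version uses Euclidean distance via `math.dist` and breaks ties by the order `Counter.most_common` returns labels:

```python
import math
from collections import Counter

def knn_classify(train, doc, k):
    """train: list of (vector, label) pairs.
    Assign doc to the majority class among its k nearest training
    documents under Euclidean distance."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], doc))[:k]
    votes = Counter(label for _vec, label in neighbors)
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to assigning the class of the single closest neighbor; odd k avoids ties in the two-class case.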