Given a set of classes, we seek to determine which class(es) a given object belongs to. A class need not be as narrowly focused as the standing query "multicore computer chips". Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification.
In text classification, we are given a description d ∈ X of a document, where X is the document space, and a fixed set of classes C = {c1, c2, …, cJ}. Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:
γ : X → C
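A classifier in this sense is simply a function from the document space X to the class set C. The toy rule-based γ below (a hypothetical two-class example, not from the text) illustrates the type signature; a learning method produces such a function automatically from training data rather than by hand-coding the rule.

```python
# gamma maps a document (here, a raw string) to a class label.
# The keyword rule and the class names are made up for illustration.
def gamma(document: str) -> str:
    return "china" if "china" in document.lower() else "not-china"

print(gamma("China joins the WTO"))       # classified as "china"
print(gamma("Parliament meets in London"))
```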
The Bernoulli model
The multinomial NB model is formally identical to the multinomial unigram language model.
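Because the multinomial NB model is a multinomial unigram language model estimated per class, it can be sketched in a few lines. The sketch below assumes add-one (Laplace) smoothing and maximum-likelihood priors; the training data in the usage example is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token_list, label) pairs.
    Returns class priors and add-one-smoothed estimates of P(t|c)."""
    prior = Counter(label for _tokens, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    cond = {}
    for c in prior:
        total = sum(term_counts[c].values())
        cond[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                   for t in vocab}
    priors = {c: prior[c] / len(docs) for c in prior}
    return priors, cond, vocab

def apply_multinomial_nb(priors, cond, vocab, tokens):
    """Return argmax_c log P(c) + sum_t log P(t|c); unseen terms are skipped."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Invented two-class training set: is the document about China?
docs = [(["Chinese", "Beijing", "Chinese"], "yes"),
        (["Chinese", "Chinese", "Shanghai"], "yes"),
        (["Chinese", "Macao"], "yes"),
        (["Tokyo", "Japan", "Chinese"], "no")]
priors, cond, vocab = train_multinomial_nb(docs)
print(apply_multinomial_nb(priors, cond, vocab,
                           ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]))
```

Working in log space avoids floating-point underflow when multiplying many small conditional probabilities.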
Properties of Naive Bayes
To gain a better understanding of the two models and the assumptions they make, let us go back and examine how we derived their classification rules in Chapters 11 and 12.
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes.
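The simplest selection criterion is raw frequency: keep only the k most common terms in the training set. The sketch below uses that criterion; mutual information and chi-square are common, usually better, alternatives not shown here.

```python
from collections import Counter

def select_features(docs, k):
    """docs: list of (token_list, label) pairs.
    Return the k most frequent terms in the training set; downstream
    classifiers then ignore all other terms."""
    counts = Counter(t for tokens, _label in docs for t in tokens)
    return {t for t, _count in counts.most_common(k)}
```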
Contiguity hypothesis.
Documents in the same class form a contiguous region, and regions of different classes do not overlap. When applying two-class classifiers to problems with more than two classes, we distinguish one-of tasks, in which a document must be assigned to exactly one of several mutually exclusive classes, from any-of tasks, in which a document can be assigned to any number of classes, as we will explain.
Document representations and measures of relatedness in vector spaces
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure.
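For two documents represented as equal-length vectors, Euclidean distance is the square root of the summed squared componentwise differences:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two document vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```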
Rocchio classification
Documents are shown as circles, diamonds and X’s. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries where “good” means high classification accuracy on data unseen during training.
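Rocchio classification defines such boundaries via class centroids: each class is represented by the component-wise mean of its training vectors, and a test document is assigned to the class of the nearest centroid. A minimal sketch (term weighting such as tf-idf is omitted):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(xs) / n for xs in zip(*vectors))

def train_rocchio(labeled_vectors):
    """labeled_vectors: dict mapping class label -> list of document vectors."""
    return {c: centroid(vs) for c, vs in labeled_vectors.items()}

def apply_rocchio(centroids, doc):
    # Assign the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda c: euclidean(centroids[c], doc))
```

The decision boundary between two classes is then the set of points equidistant from the two centroids, a hyperplane.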
k nearest neighbor
Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
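The majority-vote rule above can be sketched directly. This version uses Euclidean distance via `math.dist` and breaks ties by the order `Counter.most_common` returns labels:

```python
import math
from collections import Counter

def knn_classify(train, doc, k):
    """train: list of (vector, label) pairs.
    Assign doc to the majority class among its k nearest training
    documents under Euclidean distance."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], doc))[:k]
    votes = Counter(label for _vec, label in neighbors)
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to assigning the class of the single closest neighbor; odd k avoids ties in the two-class case.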