Sunday, April 13, 2014
Unit 13 Reading Notes
Given a set of classes, we seek to determine which class(es) a given object belongs to. A class need not be as narrowly focused as the standing query "multicore computer chips". Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely machine learning-based text classification.
In text classification, we are given a description d ∈ X of a document, where X is the document space, and a fixed set of classes C = {c1, c2, …, cJ}. Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:
γ : X → C
The Bernoulli model
The multinomial NB model is formally identical to the multinomial unigram language model. The Bernoulli model, in contrast, generates an indicator for each term of the vocabulary: 1 if the term occurs in the document and 0 if it does not.
Properties of Naive Bayes
To gain a better understanding of the two models and the assumptions they make, let us go back and examine how we derived their classification rules in Chapters 11 and 12.
In reality, the conditional independence assumption does not hold for text data. Terms are conditionally dependent on each other. But as we will discuss shortly, NB models perform well despite the conditional independence assumption.
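To check my understanding, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing in Python. The toy documents echo the textbook's China example; the function names and data layout are my own, not the book's.

    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        # docs is a list of (class_label, token_list) pairs.
        class_counts = Counter(label for label, _ in docs)
        term_counts = defaultdict(Counter)
        vocab = set()
        for label, tokens in docs:
            term_counts[label].update(tokens)
            vocab.update(tokens)
        return class_counts, term_counts, vocab

    def classify(tokens, class_counts, term_counts, vocab):
        n_docs = sum(class_counts.values())
        best_label, best_score = None, -math.inf
        for label in class_counts:
            # Log prior plus sum of log conditional probabilities,
            # with add-one (Laplace) smoothing over the vocabulary.
            score = math.log(class_counts[label] / n_docs)
            total_terms = sum(term_counts[label].values())
            for t in tokens:
                score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    training = [("China", ["Chinese", "Beijing", "Chinese"]),
                ("China", ["Chinese", "Chinese", "Shanghai"]),
                ("not-China", ["Tokyo", "Japan", "Chinese"])]
    model = train_nb(training)
    print(classify(["Chinese", "Chinese", "Tokyo"], *model))  # "China" on this toy data

Note that the conditional independence assumption shows up directly in the inner loop: each token contributes its log probability independently of the others.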
Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes: first, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary; second, it often increases classification accuracy by eliminating noise features.
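The chapter discusses several selection criteria (mutual information, chi-square, frequency). As a minimal illustration, here is a frequency-based sketch; the function name and toy data are mine.

    from collections import Counter

    def select_features(training_docs, k):
        # Frequency-based selection, one of the simplest methods: keep the
        # k most common terms of the training set as the feature vocabulary.
        counts = Counter(t for _, tokens in training_docs for t in tokens)
        return {term for term, _ in counts.most_common(k)}

    training = [("China", ["Chinese", "Beijing", "Chinese"]),
                ("UK", ["London", "Parliament", "London"])]
    print(select_features(training, k=2))  # e.g. {'Chinese', 'London'}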
Contiguity hypothesis.
Documents in the same class form a contiguous region, and regions of different classes do not overlap.
When applying two-class classifiers to problems with more than two classes, there are one-of tasks, where a document must be assigned to exactly one of several mutually exclusive classes, and any-of tasks, where a document can be assigned to any number of classes, as we will explain.
Document representations and measures of relatedness in vector spaces
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure.
Rocchio classification
Documents are shown as circles, diamonds and X’s. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries where “good” means high classification accuracy on data unseen during training.
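Rocchio classification makes this concrete by representing each class as the centroid (mean vector) of its training documents and assigning a new document to the class of the nearest centroid. A minimal sketch, with toy 2-term vectors and function names of my own:

    import numpy as np

    def train_centroids(doc_vectors, labels):
        # Rocchio classification: each class is represented by the
        # centroid of its training documents.
        return {c: np.mean([d for d, l in zip(doc_vectors, labels) if l == c], axis=0)
                for c in set(labels)}

    def classify(doc_vector, centroids):
        # Assign the document to the class of the nearest centroid
        # (Euclidean distance, the measure used in this chapter).
        return min(centroids, key=lambda c: np.linalg.norm(np.asarray(doc_vector) - centroids[c]))

    docs = [[1.0, 0.1], [0.9, 0.3], [0.1, 0.9], [0.2, 1.0]]
    labels = ["China", "China", "UK", "UK"]
    centroids = train_centroids(docs, labels)
    print(classify([0.8, 0.2], centroids))  # "China"

The decision boundaries this induces are the lines (hyperplanes) equidistant between pairs of centroids.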
k nearest neighbor
Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
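A minimal kNN sketch to go with this, again with toy vectors and names of my own; Euclidean distance as in the chapter:

    import numpy as np
    from collections import Counter

    def knn_classify(doc_vector, train_vectors, train_labels, k=3):
        # Majority vote among the k nearest training documents.
        dists = [np.linalg.norm(np.asarray(doc_vector) - np.asarray(v)) for v in train_vectors]
        nearest = np.argsort(dists)[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

    train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
    labels = ["China", "China", "UK", "UK"]
    print(knn_classify([0.95, 0.15], train, labels, k=3))  # "China"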
Friday, April 4, 2014
Unit 11 Muddiest Point
Is it a good idea to use English as a bridge language for multilingual queries? For instance, once the system receives a query in a language other than English, it first translates the query into English before continuing.
Friday, March 28, 2014
Unit 11 Reading Notes
Indexing proceeds in three stages:
· Language and character set identification,
· Language-specific processing, and
· Construction of an “inverted index” that allows rapid identification of which documents contain specific terms.
Approaches to embed translation knowledge in the system design:
· Translate each term using the context in which that word appears to help select the right translation,
· Count the terms and then translate the aggregate counts without regard to the context of individual occurrences (a sketch of this approach follows the list), and
· Compute some more sophisticated aggregate “term weight” for each term, and then translate those weights.
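To make the second approach concrete, here is a minimal sketch. The even split of a term's count across its candidate translations is just one simple policy, and the bilingual lexicon is invented for illustration.

    def translate_counts(term_counts, lexicon):
        # Count terms in the source language first, then translate the
        # aggregate counts without using the context of individual
        # occurrences. `lexicon` is a hypothetical one-to-many dictionary.
        translated = {}
        for term, count in term_counts.items():
            candidates = lexicon.get(term, [])
            for target in candidates:
                # Split each term's count evenly across its translations.
                translated[target] = translated.get(target, 0.0) + count / len(candidates)
        return translated

    counts = {"gato": 3, "perro": 1}                        # source-side term counts
    lexicon = {"gato": ["cat"], "perro": ["dog", "hound"]}
    print(translate_counts(counts, lexicon))                # {'cat': 3.0, 'dog': 0.5, 'hound': 0.5}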
To deal with large amounts of data, a search engine requires multiple machines to process documents and queries; we can partition the index or replicate it across several nodes and process queries simultaneously. MapReduce, for instance, can be used to parallelize tasks such as index building and link analysis. It can handle large clusters of processors and huge data sets: Map and Reduce tasks can be divided into smaller units and distributed, which makes it suitable for operating a large-scale search engine. The core concept of this technique is a computation expressed over key-value pairs that can be run in parallel.
However, increasing the number of machines in a search engine also increases the chance that one or more machines will fail, which may cause incomplete search results or even bring down the entire search engine. MapReduce takes over fault management in addition to tasks such as index building.
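To pin down the key-value idea for myself, here is a minimal single-machine sketch of MapReduce applied to inverted index construction. This is only the programming model, not Google's implementation; the function names and toy documents are my own.

    from collections import defaultdict

    def map_phase(doc_id, text):
        # Map step: emit (term, doc_id) key-value pairs.
        for term in text.lower().split():
            yield term, doc_id

    def reduce_phase(term, doc_ids):
        # Reduce step: collapse all values for one key into a postings list.
        return term, sorted(set(doc_ids))

    def mapreduce(docs):
        # Shuffle step: group intermediate pairs by key. In a real
        # deployment the map and reduce calls run in parallel across many
        # machines, and the framework re-runs tasks lost to failures.
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            for term, value in map_phase(doc_id, text):
                groups[term].append(value)
        return dict(reduce_phase(term, ids) for term, ids in groups.items())

    docs = {1: "web search engine", 2: "search index"}
    print(mapreduce(docs))  # {'web': [1], 'search': [1, 2], 'engine': [1], 'index': [2]}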
Friday, March 21, 2014
Unit 10 Reading Notes
Unit 9 Reading Notes
IIR 19 & 21
· The essential feature that led to the explosive growth of the web (decentralized content publishing with essentially no central control of authorship) is the biggest challenge for web search engines in their quest to index and retrieve this content.
· Web search users tend not to know (or care) about the heterogeneity of web content, the syntax of query languages, and the art of phrasing queries.
· Three broad categories into which common web search queries can be grouped: (i) informational, (ii) navigational, and (iii) transactional.
· The sampling approach of random queries is noteworthy for two reasons: it has been successfully built upon for a series of increasingly refined estimates, and conversely it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements.
· Chapter 21 focuses on the use of hyperlinks for ranking web search results.
· Technique for link analysis:
  o PageRank: the PageRank of a node depends on the link structure of the web graph (see the sketch below).
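As a concrete illustration (not from the readings themselves), a minimal power-iteration sketch of PageRank in Python; the toy graph, damping factor of 0.85, and function names are my own assumptions:

    import numpy as np

    def pagerank(adj, damping=0.85, iters=50):
        # adj[i][j] = 1 if page i links to page j. Power iteration on the
        # damped, column-stochastic transition matrix.
        n = len(adj)
        M = np.zeros((n, n))
        for i, row in enumerate(adj):
            out_degree = sum(row)
            for j, link in enumerate(row):
                # Dangling pages (no out-links) jump uniformly at random.
                M[j, i] = link / out_degree if out_degree else 1.0 / n
        rank = np.full(n, 1.0 / n)
        for _ in range(iters):
            rank = (1 - damping) / n + damping * (M @ rank)
        return rank

    # Three pages: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
    print(pagerank([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))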
· Given a query, every web page is assigned two scores: one is called its hub score and the other its authority score.
· Procedure for compiling the subset of the Web for which to compute hub and authority scores (a sketch of the score computation follows this list):
  o Given a query (say, leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.
  o Build the base set of pages to include the root set as well as any page that either links to a page in the root set or is linked to by a page in the root set.
  o Use the base set for computing hub and authority scores.
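A minimal sketch of the iterative hub/authority computation over the base set's link matrix; the toy adjacency matrix and names are mine:

    import numpy as np

    def hits(adj, iters=50):
        # adj[i][j] = 1 if page i (in the base set) links to page j.
        A = np.array(adj, dtype=float)
        hubs = np.ones(A.shape[0])
        auths = np.ones(A.shape[0])
        for _ in range(iters):
            auths = A.T @ hubs   # good authorities are pointed to by good hubs
            hubs = A @ auths     # good hubs point to good authorities
            auths /= np.linalg.norm(auths)  # normalize so scores stay bounded
            hubs /= np.linalg.norm(hubs)
        return hubs, auths

    # Page 0 links to 1 and 2; page 1 links to 2.
    hubs, auths = hits([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
    print(hubs, auths)  # page 0 is the best hub, page 2 the best authority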
Reading: Authoritative Sources in a Hyperlinked Environment
· This paper introduces a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding "authoritative" pages on the topic.
· It defines an algorithm for identifying hubs and authorities in a subgraph of the WWW with respect to a broad search topic, and describes some applications of this algorithm.
Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine
· Google is designed to provide high-quality search results over a rapidly growing World Wide Web.
· The biggest problem facing users of web search engines today is the quality of the results they get back.
· The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high-quality) results. The use of proximity information also helps increase relevance a great deal for many queries.
· Google is efficient in both space and time, and constant factors are very important when dealing with the entire Web. Google is also a research tool.
Friday, February 28, 2014
Unit 8 Reading Notes
MIR Ch10
· This reading discusses user interfaces for communication between human information seekers and information retrieval systems. A well-designed interface can significantly improve the user experience and search performance.
· Principles for the design of user interfaces: provide informative feedback, permit easy reversal of actions, support an internal locus of control, reduce working memory load, and provide alternative interfaces for novice and expert users.
· Information visualization provides visual depictions of very large information spaces. Main techniques: icons, color highlighting, brushing and linking, panning and zooming, focus-plus-context, magic lenses, and animation.
· The article introduces four kinds of starting points which should be provided by search interfaces: lists, overviews, examples, and automated source selection.
· There are five primary human-computer interaction styles: command language, form fill-in, menu selection, direct manipulation, and natural language. Each technique has been used in query specification interfaces, and each has advantages and disadvantages.
· It also explains how to show the relationship of the document set to query terms, collection overviews, descriptive metadata, hyperlink structure, document structure, and other documents within the set (via context), in order to make the document set more understandable.
· Relevance feedback is an effective technique for query reformulation. A standard interface for relevance feedback consists of a list of titles with checkboxes beside the titles that allow the user to mark relevant documents.
Reading: Search User Interfaces
· Web search interfaces tend to remain simple and unchanging for the following reasons:
  o Search is a means towards some other end, rather than a goal in itself.
  o Search is a mentally intensive task.
  o The interface design must be understandable and appealing to a wide variety of users of all ages, cultures, and backgrounds, applied to an enormous variety of information needs.
· An important quality of a user interface (UI) is its usability, which includes five basic components: learnability, efficiency, memorability, errors, and satisfaction.
· Eight design desiderata for search user interfaces generally:
  o Offer informative feedback.
  o Support user control.
  o Reduce short-term memory load.
  o Provide shortcuts for skilled users.
  o Reduce errors; offer simple error handling.
  o Strive for consistency.
  o Permit easy reversal of actions.
  o Design for closure.
Thursday, February 27, 2014
Unit 7 Muddiest Point
On slide 40, the lecture talks about utility: the quality, novelty, importance, credibility, and many other features of the documents relative to the user's need. However, in an IR system, how are documents evaluated based on utility? If utility can only be evaluated manually, is it worth spending much time on this evaluation, considering how much it would actually improve the search results?
Friday, February 21, 2014
Unit 7 Reading Notes
IIR Chapter 9
· Relevance feedback is the idea that the system involves the user's feedback to refine the search results.
· Algorithms for implementing relevance feedback (a sketch of Rocchio follows this list):
  o Rocchio algorithm: incorporates relevance feedback information into the vector space model.
  o Naive Bayes probabilistic model.
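A minimal sketch of the Rocchio update as I understand it: move the query vector toward the centroid of judged-relevant documents and away from the centroid of judged-nonrelevant ones. The weights alpha = 1, beta = 0.75, gamma = 0.15 are commonly cited reasonable values; the toy vectors are mine.

    import numpy as np

    def rocchio_update(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # Weighted combination of the original query and the centroids of
        # the relevant and nonrelevant document vectors.
        q = alpha * np.asarray(query_vec, dtype=float)
        if relevant:
            q = q + beta * np.mean(relevant, axis=0)
        if nonrelevant:
            q = q - gamma * np.mean(nonrelevant, axis=0)
        return np.maximum(q, 0.0)  # negative term weights are typically clipped to zero

    q0 = [1.0, 0.0, 0.0]                           # toy tf-idf query vector
    relevant = [[0.8, 0.6, 0.0], [0.9, 0.4, 0.1]]  # judged-relevant documents
    nonrelevant = [[0.0, 0.1, 0.9]]
    print(rocchio_update(q0, relevant, nonrelevant))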
· Relevance feedback can improve both recall, where it is especially effective, and precision.
· Requirements for effective relevance feedback:
  o The user has to have sufficient knowledge to be able to make an initial query.
  o Relevant documents need to be similar to each other.
· Evaluating the effectiveness of relevance feedback:
  o Start with an initial query q0 and compute a precision-recall graph.
  o Use documents in the residual collection for the second round of evaluation.
· Pseudo relevance feedback automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction.
· Indirect relevance feedback uses indirect sources of evidence.
· Implicit feedback is less reliable than explicit feedback, but is more useful than pseudo relevance feedback.
· Three global methods for expanding a query: by simply aiding the user in doing so, by using a manual thesaurus, and through building a thesaurus automatically.
Reading: Improving the Effectiveness of Information Retrieval with Local Context Analysis
· This paper proposes a new technique for automatic query expansion, called local context analysis, which selects expansion terms based on co-occurrence with the query terms within the top-ranked documents.
· Existing techniques for automatic query expansion can be categorized as either global or local.
· Local context analysis is a local technique, but it employs co-occurrence analysis, a primary tool for global techniques, for query expansion.
· The metrics used by local context analysis for concept selection: a co-occurrence metric, combining the degrees of co-occurrence with all query terms, and differentiating rare and common query terms.
· Experimental results on a number of collections show that local context analysis is more effective than existing techniques.
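A much-simplified sketch of the underlying idea, to make it concrete for myself: rank candidate expansion terms by how often they co-occur with query terms in the top-ranked documents. The actual paper uses a far more elaborate weighting (combining co-occurrence degrees with all query terms and discriminating rare from common terms); everything below is my own toy version.

    from collections import Counter

    def expand_query(query_terms, top_docs, n_expansions=3):
        # Score candidate terms by co-occurrence with any query term
        # inside the top-ranked documents.
        query = set(query_terms)
        scores = Counter()
        for doc in top_docs:
            tokens = set(doc.lower().split())
            if query & tokens:
                for t in tokens - query:
                    scores[t] += 1
        return [term for term, _ in scores.most_common(n_expansions)]

    top_docs = ["jaguar speed car", "jaguar car engine", "jaguar habitat jungle"]
    print(expand_query(["jaguar", "car"], top_docs))  # e.g. ['speed', 'engine', ...]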
Reading: A Study of Methods for Negative Relevance Feedback
· This paper focuses on the analysis of negative relevance feedback. Experimental results on several TREC collections show that language-model-based negative feedback methods are generally more effective than those based on vector space models, and that using multiple negative models is an effective heuristic for negative feedback.
· General strategies with some variations for negative feedback: (1) SingleQuery, a query modification strategy; (2) SingleNeg, score combination with a single negative query model; (3) MultiNeg, score combination with multiple negative query models.
· Two heuristics to increase the robustness of using negative feedback information: Local Neighborhood and Global Neighborhood.