IIR section 1.2, chapters 2 and 3
·
Section 1.2 illustrates the construction of the inverted index. The indexing result is split into a dictionary and postings. It is important to choose a proper data structure for the postings list, because the postings may be extremely large and the storage cost needs to be taken into consideration.
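A minimal sketch of that construction, assuming whitespace tokenization and lowercasing (the document IDs and texts here are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a dictionary -> sorted postings list mapping.

    docs: {doc_id: text}. Postings are stored as sorted lists of
    doc IDs, which keeps intersection linear and compresses well;
    the dictionary stays small while the postings dominate storage.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_inverted_index(docs)
# index["sales"] -> [1, 2]
```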
·
Chapter 2 explains the term vocabulary and postings lists in detail.
·
Different decoders are needed for different documents because of the various encoding schemes, such as ASCII and UTF-8. More importantly, choosing the right granularity is necessary for indexing a large document: for example, split the document into small units, index each unit, and then merge the separate results into a combined result.
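The split-then-merge step can be sketched like this; `index_unit` is a hypothetical helper standing in for whatever per-unit indexer is used:

```python
from collections import defaultdict

def index_unit(unit_id, text):
    """Index one small unit of a large document (hypothetical helper)."""
    partial = defaultdict(set)
    for term in text.lower().split():
        partial[term].add(unit_id)
    return partial

def merge_indexes(partials):
    """Merge per-unit partial indexes into one combined index
    by unioning the postings of each term."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return {term: sorted(ids) for term, ids in merged.items()}

units = [(1, "first block of text"), (2, "second block")]
combined = merge_indexes(index_unit(i, t) for i, t in units)
# combined["block"] -> [1, 2]
```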
·
One interesting issue is how to tokenize the document and normalize the terms. How do we determine the meaningful terms and character sequences? Some special characters and words need further handling, such as apostrophes, stop words, and hyphenation.
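A naive sketch of those handling choices (the tiny stop-word list and the decisions to strip apostrophes and break on hyphens are just one possible policy, not the book's prescription):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list

def tokenize(text):
    """Naive tokenizer: lowercase, strip apostrophes, split on
    hyphens and other non-word characters, and drop stop words."""
    text = text.lower().replace("'", "")   # O'Neill -> oneill
    tokens = re.split(r"[^\w]+", text)     # hyphens become token breaks
    return [t for t in tokens if t and t not in STOP_WORDS]

tokenize("The state-of-the-art of O'Neill's method")
# -> ['state', 'art', 'oneills', 'method']
```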
·
I agree with the idea that "lowercasing everything often remains the most practical solution", because when users type a query they usually use lowercase even when uppercase is correct, such as typing windows instead of Windows.
·
In summary, there are several important issues when preprocessing documents: tokenization, stop words, normalization, stemming, and lemmatization.
·
There are also different ways to improve the efficiency of IR, such as postings list intersection with skip pointers.
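A sketch of that intersection, with skip pointers of length √(list length) simulated by jumping ahead in the sorted lists (a skip is taken only when its target does not overshoot the other list's current entry, so no match can be skipped past):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt-spaced skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1          # take the skip pointer
            else:
                i += 1              # plain advance
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

intersect_with_skips([2, 4, 8, 16, 32, 64, 128],
                     [1, 2, 3, 5, 8, 17, 21, 31, 41, 128])
# -> [2, 8, 128]
```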
·
Chapter 3 focuses on the data structures that support searching for terms in an inverted index, wildcard queries, and tolerant retrieval.
·
This is the first time I have seen the terminology "wildcard query". We can use two trees to handle wildcard queries with a single *: the normal B-tree and the reverse B-tree. There are also two techniques for handling general wildcard queries: permuterm indexes and k-gram indexes.
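A toy sketch of the permuterm idea: every rotation of term+"$" points back to the term, and a single-* query is rotated so the * lands at the end, turning the wildcard lookup into a prefix match (the three-word vocabulary here is made up, and a real index would use a B-tree rather than a flat scan):

```python
def permuterm_keys(term):
    """All rotations of term + '$' (end-of-word marker)."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def wildcard_to_key(query):
    """Rotate a single-* query so the * moves to the end,
    e.g. m*n -> n$m; matches are then found by prefix lookup."""
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix

index = {}
for term in ["man", "moon", "noon"]:
    for key in permuterm_keys(term):
        index[key] = term

key = wildcard_to_key("m*n")          # 'n$m'
matches = {t for k, t in index.items() if k.startswith(key)}
# matches -> {'man', 'moon'}  (both start with m and end with n)
```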
·
In section 3.3.1, when two correctly spelled candidate queries are tied (or nearly tied), the author introduces two heuristics to select the more common one: considering the number of occurrences of the term in the collection, or using the correction that is most common among queries typed in by other users. I prefer the latter because it is closer to most users' information needs.
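The first heuristic can be sketched as follows; the term frequencies are invented, and `difflib`'s similarity ratio stands in for the chapter's edit-distance candidate generation:

```python
from difflib import get_close_matches  # stdlib approximate string matching

# hypothetical collection term frequencies
collection_tf = {"grunt": 5, "grant": 120, "grand": 80}

def correct(query_term, max_candidates=3):
    """Among near-tied spelling candidates, pick the one that
    occurs most often in the collection."""
    candidates = get_close_matches(query_term, list(collection_tf),
                                   n=max_candidates)
    if not candidates:
        return query_term
    return max(candidates, key=lambda t: collection_tf[t])
```

Under the query-log heuristic, `collection_tf` would simply be replaced by counts of corrections chosen by other users.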
I do not understand the figure above (Figure 3.6 in Chapter 3).
