Thursday, January 9, 2014

Week 2 Reading Notes

IIR section 1.2, chapters 2 and 3
·          Section 1.2 illustrates the construction of the inverted index. The indexing result is split into a dictionary and postings. It is important to choose a proper data structure for the postings lists, because the postings can grow extremely large and storage space needs to be taken into consideration.

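The dictionary-plus-postings structure can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and integer document IDs, not the book's full indexing pipeline:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # naive tokenization, for illustration only
            index[term].add(doc_id)
    # Store postings sorted so lists can later be merged or intersected efficiently.
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
print(index["home"])   # [0, 1, 2]
print(index["july"])   # [1, 2]
```

Keeping postings as sorted lists of IDs (rather than sets) is what makes the linear-merge intersection in later chapters possible.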
·          Chapter 2 explains the term vocabulary and postings lists in detail.
·          Different decoders are needed for different documents because of the various encoding schemes, such as ASCII and UTF-8. More importantly, choosing an appropriate granularity is necessary when indexing a large document: for example, split the document into smaller units, index each unit separately, and then merge the partial results into a single index.
·          One interesting issue is how to tokenize the document and normalize the terms. How do we determine which character sequences form meaningful terms? Some special characters and words need further handling, such as apostrophes, stop words, and hyphenation.
·          I agree with the idea that “lowercasing everything often remains the most practical solution”, because users typing queries usually use lowercase even when uppercase is correct, such as typing windows instead of Windows.
·          In summary, there are several important issues in preprocessing documents: tokenization, stop words, normalization, stemming, and lemmatization.
·          There are also ways to improve the efficiency of IR, such as intersecting postings lists with skip pointers.

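The preprocessing steps summarized above (tokenization, stop-word removal, lowercasing, stemming) can be strung together as a tiny pipeline. The stop-word list and the suffix-stripping rule here are toy assumptions of mine, not the Porter stemmer the book describes:

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}  # toy list for illustration

def crude_stem(term):
    """Very crude suffix stripping; a real system would use the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Tokenize on whitespace, lowercase, drop stop words, and stem."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The indexing of Windows documents"))  # ['index', 'window', 'document']
```

Note that the same pipeline must be applied to both the documents at index time and the query at search time, or the normalized terms will not match.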
·          Chapter 3 focuses on the data structures that support searching for terms in an inverted index, wildcard queries, and tolerant retrieval.
·          This is the first time I have seen the term “wildcard query”. We can handle wildcard queries with a single * using two trees: a normal B-tree over the terms and a B-tree over the reversed terms. There are also two techniques for handling general wildcard queries: permuterm indexes and k-gram indexes.
·          In section 3.3.1, when two correctly spelled candidate corrections are tied (or nearly tied), the author introduces two heuristics for selecting the more common one: counting the occurrences of each term in the collection, or choosing the correction that is most common among queries typed by other users. I prefer the latter because it is closer to most users’ information needs.
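The permuterm idea is to index every rotation of term$ (with $ marking the end of the term), so a single-* query can be rotated until the * sits at the end and then answered by a prefix lookup. A minimal sketch, using a linear scan over the rotation dictionary in place of a real B-tree:

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (the end-of-term marker)."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    """Map every rotation back to the original vocabulary term."""
    return {rot: term for term in vocabulary for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Answer a query with a single '*' by rotating it so '*' is trailing."""
    before, after = query.split("*")
    key_prefix = after + "$" + before   # e.g. 'h*o' rotates to the prefix 'o$h'
    # A B-tree would do this prefix search in log time; we just scan for clarity.
    return sorted({term for rot, term in index.items() if rot.startswith(key_prefix)})

index = build_permuterm_index(["hello", "help", "held", "world"])
print(wildcard_lookup(index, "hel*"))   # ['held', 'hello', 'help']
print(wildcard_lookup(index, "h*o"))    # ['hello']
```

The cost of the trick is index size: each term of length n contributes n + 1 rotations, which is why the k-gram index is offered as a more space-efficient alternative.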

I do not understand the figure above (Figure 3.6 in Chapter 3).
