IS2140 Hanying Huang: March 2014

Friday, March 28, 2014

Unit 11 Reading Notes

Indexing proceeds in three stages:
· Language and character set identification,
· Language-specific processing, and
· Construction of an “inverted index” that allows rapid identification of which documents contain specific terms.
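
As a minimal sketch of the third stage, the snippet below builds a tiny inverted index in memory. The `tokenize` helper and the sample documents are hypothetical stand-ins; real language-specific processing (stemming, segmentation, and so on) would be far more involved.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical stand-in for language-specific processing:
    # lowercase the text and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample collection, for illustration only.
docs = {
    1: "web search engines index the web",
    2: "an inverted index maps terms to documents",
    3: "search engines rank documents",
}
index = build_inverted_index(docs)
print(index["index"])   # document IDs whose text contains the term "index"
```

Looking up a term is then a single dictionary access, which is what makes identifying the matching documents fast.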

Approaches to embed translation knowledge in the system design:
· Translate each term using the context in which that word appears to help select the right translation,
· Count the terms and then translate the aggregate counts without regard to the context of individual occurrences, or
· Compute some more sophisticated aggregate “term weight” for each term, and then translate those weights.

Dealing with a large amount of data requires multiple machines to process the documents and queries. We can partition the index, or replicate it across several nodes, and process queries simultaneously. MapReduce, for instance, can be used to parallelize tasks such as index building and link analysis. It is able to handle massive collections of processors and huge data sets, and the map and reduce work can be split into smaller sets. This makes it suitable for running a large-scale search engine. The core idea is that the computation is expressed as functions over key-value pairs, which can be run in parallel.
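
The key-value idea can be sketched in a single process as follows. This is only an illustration of the programming model, not a distributed implementation: the names `map_doc`, `reduce_term`, and `run_mapreduce` are hypothetical, and the grouping step that a real framework performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    # Map: emit one (term, doc_id) pair per distinct term in the document.
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_term(term, doc_ids):
    # Reduce: merge all document IDs seen for one term into a sorted postings list.
    return term, sorted(doc_ids)


def run_mapreduce(docs):
    # Shuffle: group the mapped values by key. In a real framework this
    # grouping happens across machines; here it is one local dictionary.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_doc(doc_id, text):
            groups[key].append(value)
    return dict(reduce_term(key, values) for key, values in groups.items())


# Made-up sample collection, for illustration only.
docs = {1: "map reduce index", 2: "index building", 3: "link analysis index"}
index = run_mapreduce(docs)
print(index["index"])
```

Because each `map_doc` call touches only one document and each `reduce_term` call touches only one key, both phases could in principle run on separate machines without coordination.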

However, increasing the number of machines in a search engine also increases the chance that one or more machines will fail. This may cause incomplete search results or even bring down the entire search engine. The MapReduce framework takes over fault management and similar tasks, in addition to supporting index building.

Unit 10 Muddiest Point

How to compute L^T L and L L^T based on L?
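
One possible reading of this question, sketched under assumptions: take L to be the link (adjacency) matrix with L[i][j] = 1 when page i links to page j, and form the two products of L with its transpose. In the hubs-and-authorities setting of IIR 21, authority scores are updated through the first product and hub scores through the second. The 3-page graph below is made up for illustration.

```python
def transpose(m):
    # Swap rows and columns.
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    # Plain row-by-column matrix product.
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
L = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
LT = transpose(L)
LTL = matmul(LT, L)   # entry (i, j): number of pages that link to both i and j
LLT = matmul(L, LT)   # entry (i, j): number of pages linked to by both i and j
print(LTL)
print(LLT)
```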

Friday, March 21, 2014

Unit 10 Reading Notes

Unit 9 Reading Notes

IIR 19 & 21

· The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – is the biggest challenge for web search engines in their quest to index and retrieve this content.

· Web search users tend to not know (or care) about the heterogeneity of web content, the syntax of query languages, and the art of phrasing queries.

· Three broad categories into which common web search queries can be grouped: (i) informational, (ii) navigational and (iii) transactional.

· Among the sampling approaches for estimating index size, random queries is noteworthy for two reasons: it has been successfully built upon for a series of increasingly refined estimates, and conversely it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements.

· Chapter 21 focuses on the use of hyperlinks for ranking web search results.

· Technique for link analysis:

    · PageRank. The PageRank of a node depends on the link structure of the web graph.
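
To illustrate how PageRank depends on link structure, here is a minimal power-iteration sketch. The teleport probability of 0.15, the iteration count, and the three-node graph are assumptions chosen for illustration; the notes above do not specify them.

```python
def pagerank(links, teleport=0.15, iterations=50):
    """Approximate PageRank by power iteration.

    links maps every node to the list of nodes it links to; each node
    must appear as a key, even if its list is empty.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives an equal share of the teleport probability.
        new_rank = {v: teleport / n for v in nodes}
        for v in nodes:
            out = links[v]
            # A node with no out-links spreads its rank over all nodes.
            targets = out if out else nodes
            share = (1 - teleport) * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph C collects links from both A and B, so it ends up ranked above B, which only receives half of A's outgoing weight.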

· Given a query, every web page is assigned two scores. One is called its hub score and the other its authority score.

· Procedures for compiling the subset of the Web for which to compute hub and authority scores:

    · Given a query (say leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.

    · Build the base set of pages, to include the root set as well as any page that either links to a page in the root set, or is linked to by a page in the root set.

    · Use the base set for computing hub and authority scores.
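
The procedure above can be sketched roughly as follows. The link graph, the page names, and the fixed iteration count are made up for illustration, and the root set is given directly rather than retrieved from a text index; the score updates follow the usual hub/authority iteration with length normalization.

```python
import math


def build_base_set(root_set, out_links):
    """Root set plus any page that links to, or is linked to by, a root page."""
    base = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            base.update(targets)      # pages linked to by a root page
        if any(t in root_set for t in targets):
            base.add(page)            # pages that link into the root set
    return base


def hits(base, out_links, iterations=20):
    hub = {p: 1.0 for p in base}
    auth = {p: 1.0 for p in base}
    for _ in range(iterations):
        # Authority score: sum of hub scores of base-set pages linking to the page.
        auth = {p: 0.0 for p in base}
        for q in base:
            for p in out_links.get(q, []):
                if p in base:
                    auth[p] += hub[q]
        # Hub score: sum of authority scores of base-set pages the page links to.
        hub = {q: sum(auth[p] for p in out_links.get(q, []) if p in base)
               for q in base}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph; "a1" plays the role of the only page matching the query.
out_links = {
    "h1": ["a1", "a2"],
    "h2": ["a1"],
    "a1": ["c"],
    "a2": [],
    "c": [],
    "x": ["y"],
}
root_set = {"a1"}
base = build_base_set(root_set, out_links)
hub, auth = hits(base, out_links)
print(sorted(base))
print(max(auth, key=auth.get))
```

Pages such as "a2" and "x" stay outside the base set because they neither link to nor are linked from a root page, so they receive no scores at all.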

Reading: Authoritative Sources in a Hyperlinked Environment

· This paper introduces a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.

· It defines an algorithm for identifying hubs and authorities in a sub-graph of the WWW with respect to a broad search topic, and describes some applications of this algorithm.

Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine

· Google is designed to provide high quality search results over a rapidly growing World Wide Web.

· The biggest problem facing users of web search engines today is the quality of the results they get back.

· The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

· Google is efficient in both space and time; constant factors are very important when dealing with the entire Web. Google is also a research tool.

Tuesday, March 18, 2014

Unit 8 Muddiest Point

This unit is easy to understand. No muddiest point.