IS2140 Hanying Huang: Unit 11 Reading Notes

Indexing proceeds in three stages:
· Language and character set identification,
· Language-specific processing, and
· Construction of an “inverted index” that allows rapid identification of which documents contain specific terms.

Approaches to embed translation knowledge in the system design:
· Translate each term using the context in which that word appears to help select the right translation,
· Count the terms and then translate the aggregate counts without regard to the context of individual occurrences
· Compute some more sophisticated aggregate “term weight” for each term, and then translate those weights.

To deal with a large amount of data, they requires multiple machines to process those documents and queries, we can partition index or replicate them to be at several nodes and process queries simultaneously. For instance, MapReduce can be used to parallelize other tasks such as index building, link analysis, etc. It is able to handle massive collections of processors and huge data sets. Maps and Reduces can be shared into smaller sets. It can be used to manipulate a large-scale search engine. The concept of this technique is a function of key-value pairs which can be run in parallel.

However, increasing a number of machine in search engine to process increase a chance that there is one or more machine are likely to fail. This may cause incomplete search results or even bring down the entire search engine. In MapReduce, it will take over the fault management and other tasks besides the index building.

IS2140 Hanying Huang

Friday, March 28, 2014

Unit 11 Reading Notes

No comments:

Post a Comment