MapReduce: Difference between revisions

Browse history interactively ← Previous edit Next edit →Content deleted Content addedVisual WikitextInline

Revision as of 17:00, 19 December 2005 editMarudubshinki (talk \| contribs)49,641 edits →Other Implementations ← Previous edit		Revision as of 18:06, 19 December 2005 edit undoMarudubshinki (talk \| contribs)49,641 edits xpandNext edit →
Line 1:		Line 1:
	~~In ]~~ '''MapReduce''' programming ~~model~~, parallel computations over large data sets are ~~implemented~~ by ~~specifying~~ a ''Map'' ~~function~~ ~~that~~ ~~maps~~ ~~key-value~~ ~~pairs~~ to ~~new~~ ~~key-value~~ ~~pairs~~ ~~and~~ a ~~subsequent~~ ~~''Reduce''~~ ~~function~~ ~~that~~ ~~consolidates~~ ~~all~~ ~~mapped~~ ~~key-value~~ ~~pairs~~ ~~sharing~~ ~~the~~ ~~same~~ ~~keys~~ to ~~single~~ ~~key-value pairs.~~		'''MapReduce''' is a programming tool developed by ] in ], in which parallel computations over large (> 1 ]) data sets are performed. The terminology of "Map" and "Reduce", and their general idea, is borrowed from ]s use of the constructs ] and ] in ]. {{ref\|map}}

			The actual software is implemented by specifying a ''Map'' function that maps key-value pairs to new key-value pairs and a subsequent ''Reduce'' function that consolidates all mapped key-value pairs sharing the same keys to single key-value pairs.
⚫	MapReduce is ~~often~~ used in conjunction with ], for greater parallelization.

			==map and reduce==
			In simpler terms, what a map function does is go over a conceptual list of independent elements (for example, a list of test scores) and performs a specified operation on each element (with the previous example, one might have discovered a flaw in the test that gave each student a score too high by one; one could then define a map function of "minus 1"- it would subtract one from each score, correcting them.); the fact that each element is operated on independently, and that the original list is not being modified because a new list is created to hold the answers means that it is very easy to make a map operation highly parallel, and thus useful in high-performance applications and domains like ].

			A reduce operation on the other hand, usually takes a list and combines elements appropriately (Continuing the preceding example, what if one wanted to know the class average? One could define a reduce function which halved the size of the list by adding an entry in the list to its neighbor, recursively continuing until there is only one (large) entry, and dividing the total sum by the original entry of elements to get the average); while since a reduce always ends up with a single answer, it is not as parallelizable as a map function, the large number of fairly independent calculations means that reduce functions are still useful in highly parallel environments.

			==Uses==
			According to Google, they use MapReduce in a wide range of applications, including: "distributed ], distributed sort, web link-graph reversal, term-vector per host, web access log stats inverted index construction, document clustering, ], statistical ]..."

		⚫	MapReduce is used in conjunction with ], for greater parallelization.

	==Other Implementations==		==Other Implementations==
Line 9:		Line 19:
	* Dean, Jeffrey & Ghemawat, Sanjay (2004). . Retrieved Apr. 6, 2005.		* Dean, Jeffrey & Ghemawat, Sanjay (2004). . Retrieved Apr. 6, 2005.

			{{note\|map}} ''"Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages."'' -"MapReduce: Simplified Data Processing on Large Clusters"
	{{Compu-soft-stub}}

			==External link==
			*- a paper on an internal tool at Google, Sawzall, which acts as an interface to MapReduce, intended to make MapReduce much easier to use.
			* on ''Lambda the Ultimate''.

	]		]
			]

	]		]

Revision as of 18:06, 19 December 2005

MapReduce is a programming tool developed by Google in C++, in which parallel computations over large (> 1 terabyte) data sets are performed. The terminology of "Map" and "Reduce", and their general idea, is borrowed from functional programming languages use of the constructs map and reduce in functional programming.

The actual software is implemented by specifying a Map function that maps key-value pairs to new key-value pairs and a subsequent Reduce function that consolidates all mapped key-value pairs sharing the same keys to single key-value pairs.

map and reduce

In simpler terms, what a map function does is go over a conceptual list of independent elements (for example, a list of test scores) and performs a specified operation on each element (with the previous example, one might have discovered a flaw in the test that gave each student a score too high by one; one could then define a map function of "minus 1"- it would subtract one from each score, correcting them.); the fact that each element is operated on independently, and that the original list is not being modified because a new list is created to hold the answers means that it is very easy to make a map operation highly parallel, and thus useful in high-performance applications and domains like parallel programming.

A reduce operation on the other hand, usually takes a list and combines elements appropriately (Continuing the preceding example, what if one wanted to know the class average? One could define a reduce function which halved the size of the list by adding an entry in the list to its neighbor, recursively continuing until there is only one (large) entry, and dividing the total sum by the original entry of elements to get the average); while since a reduce always ends up with a single answer, it is not as parallelizable as a map function, the large number of fairly independent calculations means that reduce functions are still useful in highly parallel environments.

Uses

According to Google, they use MapReduce in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats inverted index construction, document clustering, machine learning, statistical machine translation..."

MapReduce is used in conjunction with Google File System, for greater parallelization.

Other Implementations

The Nutch project has developed an experimental implementation of MapReduce.

References

Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.

"Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters"

External link

Interpreting the Data: Parallel Analysis with Sawzall- a paper on an internal tool at Google, Sawzall, which acts as an interface to MapReduce, intended to make MapReduce much easier to use.
Discussion on Lambda the Ultimate.

Categories: