Big Data & Data Science

Big Data (Hadoop and MapReduce)

What is Hadoop? The simple answer: Hadoop lets you store files larger than what fits on any single node or server, so you can keep very large files, and many files, spread across multiple servers/computers in a distributed fashion.

Advantages of Hadoop include affordability (it runs on industry-standard hardware) and agility (store any data, run any analysis).


Hadoop is an Apache open source project that provides a parallel storage and processing framework. Its primary purpose is to run MapReduce batch programs in parallel on tens to thousands of server nodes.

Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers.

Hadoop comes with the libraries and utilities needed by its other modules. At its core, Hadoop consists of the Hadoop Common package, which provides file system and OS-level abstractions, and a MapReduce engine. Hadoop Common contains the Java files and scripts needed to start Hadoop, and it also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.

HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
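
As a concrete sketch of how an application talks to HDFS, the short Java program below writes a file into the cluster and reads it back through Hadoop's FileSystem API. The NameNode address and the /user/demo path are illustrative placeholders, not values from this article; in a real deployment fs.defaultFS would come from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file. HDFS splits it into blocks and replicates each
        // block across several DataNodes, which is where the fault
        // tolerance and aggregate bandwidth come from.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back. The client pulls each block from whichever
        // DataNode holds a replica.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}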

What is MapReduce?


MapReduce is a framework for processing data. The data is not shipped across the network in the conventional fashion, because that is slow for huge volumes of data. MapReduce takes an approach better suited to big data sets: rather than moving the data to the software, it moves the processing software to the data.

MapReduce is a programming model for large-scale data processing. The name refers to application modules, written by a programmer, that run in two phases: first mapping the data (extracting key/value pairs from each piece of input), then reducing it (aggregating and transforming the values for each key).
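
To make the two phases concrete, here is the canonical WordCount job, essentially the version that ships with the Hadoop documentation: the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word. The input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes holding the input splits and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives every count emitted for one word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the combiner pre-aggregates counts on each mapper's node, so far less data crosses the network during the shuffle; this is the "move the processing to the data" idea in practice.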


One of Hadoop’s greatest benefits is that programmers can write application modules in almost any language and run them in parallel on the same cluster that stores the data. With Hadoop, any programmer can harness the power and capacity of thousands of CPUs and hard drives simultaneously.
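
The plumbing behind that language flexibility is Hadoop Streaming, whose contract is simply that a mapper or reducer is any executable that reads lines on stdin and writes tab-separated key/value lines on stdout. The sketch below shows that contract in Java for consistency with the examples above, but the same program could just as easily be written in Python, Ruby, or shell.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A minimal Streaming-style mapper: read lines from stdin, write
// "key<TAB>value" lines to stdout. Hadoop Streaming handles the rest
// (splitting the input, shuffling, and feeding the reducers).
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}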
