
Hadoop Setup

Prerequisites:
System: Mac OS / Linux / Cygwin on Windows.
Notice: 1. Only Ubuntu will be supported by the TA; you may try other environments as a challenge. 2. Cygwin on Windows is not recommended because of its instability and unforeseen bugs.
Java Runtime Environment: Java 1.6.x is recommended.
ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
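For example, you might verify the prerequisites with something like the following minimal sketch; the package names assume a Debian/Ubuntu system:

# Check the Java runtime (1.6.x recommended)
java -version

# Install ssh/sshd and rsync if they are not already present
# (assumes a Debian/Ubuntu system with apt-get)
sudo apt-get install ssh rsync

# Confirm that sshd is running and accepting connections
ssh localhost exit && echo "sshd is reachable"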

Single Node Setup (usually for debugging)


Untar hadoop-*.*.*.tar.gz into your user path.
About version: the latest stable version, 1.0.1, is recommended.
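A minimal sketch of this step; the install path is an assumption, and it presumes the release tarball has already been downloaded:

# Unpack the release into your home directory
cd ~
tar xzf hadoop-1.0.1.tar.gz
cd hadoop-1.0.1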

Edit the file conf/hadoop-env.sh to define at least JAVA_HOME as the root of your Java installation (a sketch follows the configuration listings below). Then edit the following files to configure the properties:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
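For the conf/hadoop-env.sh step, a minimal sketch; the JDK path below is only an assumption, so point it at your own Java installation:

# conf/hadoop-env.sh: define the root of the Java installation
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk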

Cluster Setup (the only acceptable setup for HW)


Same steps as the single node setup.
Set the dfs.name.dir and dfs.data.dir properties in hdfs-site.xml.
Add the master node name to conf/masters.
Add all the slave node names to conf/slaves.
Edit /etc/hosts on each node: add an "IP node-name" entry for every node. Suppose your master node name is ubuntu1 and its IP is 192.168.0.2; then add the line "192.168.0.2 ubuntu1" to the file (see the sketch after this list).
Copy the Hadoop folder to the same path on all nodes. Notice: JAVA_HOME may not be set the same on each node.
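A minimal sketch of the /etc/hosts, masters/slaves, and copy steps. It assumes two nodes: ubuntu1 (the master, 192.168.0.2, from the example above) and ubuntu2 (a slave whose name and IP are made up purely for illustration):

# /etc/hosts on every node: one "IP  node-name" line per node, e.g.
#   192.168.0.2  ubuntu1
#   192.168.0.3  ubuntu2

# On the master, list the master and slave node names
echo "ubuntu1" > conf/masters
echo "ubuntu2" > conf/slaves

# Copy the Hadoop folder to the same path on every other node
scp -r ~/hadoop-1.0.1 ubuntu2:~/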


Execution
Generate an ssh key with an empty passphrase so that no passphrase is required when starting up:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh
The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
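After starting the daemons you can sanity-check that they came up; a minimal sketch, assuming the default log location above (jps ships with the JDK):

# List the running Hadoop JVMs: NameNode, DataNode, SecondaryNameNode,
# JobTracker and TaskTracker should all appear on a single-node setup
jps

# If something is missing, look at the most recent logs
ls -t logs/ | head
tail -n 50 logs/hadoop-*-namenode-*.log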

Execution (continued)
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Examine the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
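Alternatively, you can copy the output from HDFS to the local filesystem before examining it; a minimal sketch:

# Copy the output directory out of HDFS, then view it locally
bin/hadoop fs -get output output
cat output/*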


Details About Configuration Files


Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration: src/core/core-default.xml, src/hdfs/hdfs-default.xml, src/mapred/mapred-default.xml, conf/mapred-queues.xml.template.
2. Site-specific configuration: conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml, conf/mapred-queues.xml.


Details About Configuration Files (continued)


conf/core-site.xml:
Parameter: fs.default.name
Value: URI of the NameNode.
Notes: e.g. hdfs://hostname/
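For a cluster, fs.default.name points at the master rather than localhost; a minimal sketch of conf/core-site.xml, reusing the assumed master node name ubuntu1 from the earlier example:

# conf/core-site.xml on every node (ubuntu1 is the assumed master name)
cat > conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu1:9000</value>
  </property>
</configuration>
EOF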

conf/hdfs-site.xml:
Parameter: dfs.name.dir
Value: Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
Notes: If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

Parameter: dfs.data.dir
Value: Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
Notes: If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
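A minimal sketch of conf/hdfs-site.xml using these two properties; the directory paths are assumptions, so pick locations on your own disks:

# conf/hdfs-site.xml: persistent NameNode metadata and DataNode block storage
# (the /home/hadoop/... paths below are only examples)
cat > conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
</configuration>
EOF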


Details About Configuration Files (continued)


conf/mapred-site.xml:
Parameter: mapred.job.tracker
Value: Host or IP and port of the JobTracker.
Notes: host:port pair.

Parameter: mapred.system.dir
Value: Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/.
Notes: This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.

Parameter: mapred.local.dir
Value: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
Notes: Multiple paths help spread disk i/o.

Parameter: mapred.tasktracker.{map|reduce}.tasks.maximum
Value: The maximum number of Map/Reduce tasks, which are run simultaneously on a given TaskTracker, individually.
Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.

Parameter: dfs.hosts/dfs.hosts.exclude
Value: List of permitted/excluded DataNodes.
Notes: If necessary, use these files to control the list of allowable DataNodes.

Parameter: mapred.hosts/mapred.hosts.exclude
Value: List of permitted/excluded TaskTrackers.
Notes: If necessary, use these files to control the list of allowable TaskTrackers.

Parameter: mapred.queue.names
Value: Comma-separated list of queues to which jobs can be submitted.
Notes: The Map/Reduce system always supports at least one queue with the name "default". Hence, this parameter's value should always contain the string "default". Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same.
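A minimal sketch of conf/mapred-site.xml for the cluster case, again assuming the master node name ubuntu1; the task-slot counts are illustrative only and should be tuned to your hardware:

# conf/mapred-site.xml: JobTracker location and per-TaskTracker task slots
cat > conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ubuntu1:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
EOF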


You may get detailed information from:


The official site: http://hadoop.apache.org
Course slides & textbooks: http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html

Michael G. Noll's Blog (a good guide): http://www.michael-noll.com/


If you have good materials to share, please send them to the TA.

