Prerequisites:
System: Mac OS / Linux / Cygwin on Windows
Notes:
1. Only Ubuntu will be supported by the TA. You may try other environments as a challenge.
2. Cygwin on Windows is not recommended, due to its instability and unforeseen bugs.
Java Runtime Environment: Java(TM) 1.6.x recommended.
ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.

Hadoop Setup
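The prerequisites above can be sanity-checked with a small script before starting. This is only a sketch; `check_prereq` is a made-up helper name, not part of Hadoop.

```shell
# Sketch of a pre-flight check (the function name is illustrative, not
# part of Hadoop): verify that required tools are on PATH before setup.
check_prereq() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

# On a real setup machine you would run: check_prereq java ssh sshd
check_prereq sh
```

Any `missing:` line means that tool must be installed before continuing with the setup steps below.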
Edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
Edit the following files to configure properties:

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Hadoop Setup
Execution
Generate an ssh key pair. With an empty passphrase, no password will be asked for when starting up:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh
The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
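After `start-all.sh`, you can confirm the daemons came up by inspecting the JVM process list that `jps` prints. The helper below is only an illustration (not a Hadoop tool): it reads `jps`-style output on stdin and reports any expected pseudo-distributed daemon that is absent.

```shell
# Illustrative helper, not part of Hadoop: reads `jps`-style output on
# stdin and prints each expected daemon that does not appear in it.
# Real usage would be:  jps | missing_daemons
missing_daemons() {
    procs=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        case "$procs" in
            *"$d"*) ;;                    # daemon present in the listing
            *) echo "not running: $d" ;;  # daemon absent
        esac
    done
}

# Fabricated jps output for demonstration:
printf '1234 NameNode\n2345 DataNode\n' | missing_daemons
```

If anything is reported as not running, check the corresponding log file under ${HADOOP_LOG_DIR} before proceeding.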
Hadoop Setup
Execution (continued)
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Examine the output files; view them on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
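The grep example writes its results as lines of the form "count<TAB>match", which is what `fs -cat output/*` shows. A tiny post-processing sketch (not a Hadoop command; the sample counts below are fabricated) ranks the matches by count:

```shell
# Sketch: rank "count<TAB>match" lines by count, highest first.
# With a live cluster you would pipe the real output through it:
#   bin/hadoop fs -cat output/* | rank_matches
rank_matches() {
    sort -rn   # numeric sort on the leading count field, descending
}

# Fabricated sample output for demonstration:
printf '1\tdfsadmin\n3\tdfs.replication\n' | rank_matches
```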
Hadoop Setup
conf/hdfs-site.xml:
Parameter: dfs.name.dir
Value: Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.
Notes: If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

Parameter: dfs.data.dir
Value: Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
Notes: If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
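The two parameters above can be sketched as a generated config file. The /disk1 and /disk2 paths are hypothetical mount points chosen for illustration; substitute real directories on your machine.

```shell
# Sketch only: write an hdfs-site.xml using comma-separated directory
# lists for dfs.name.dir and dfs.data.dir (paths are hypothetical).
cat > hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
</configuration>
EOF
```

Spreading dfs.data.dir across devices lets a DataNode balance block I/O over multiple disks, while the dfs.name.dir list gives the NameNode redundant copies of its metadata.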
Hadoop Setup
Parameter: mapred.local.dir
Value: Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
Notes: Multiple paths help spread disk I/O.
Parameter: mapred.tasktracker.{map|reduce}.tasks.maximum
Value: The maximum number of Map/Reduce tasks that are run simultaneously on a given TaskTracker, individually.
Notes: Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.

Parameter: dfs.hosts / dfs.hosts.exclude
Value: List of permitted/excluded DataNodes.
Notes: If necessary, use these files to control the list of allowable DataNodes.

Parameter: mapred.hosts / mapred.hosts.exclude
Value: List of permitted/excluded TaskTrackers.
Notes: If necessary, use these files to control the list of allowable TaskTrackers.
Parameter: mapred.queue.names
Value: Comma-separated list of queues to which jobs can be submitted.
Notes: The Map/Reduce system always supports at least one queue with the name "default"; hence, this parameter's value should always contain the string "default". Some job schedulers supported in Hadoop, such as the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property mapred.job.queue.name in the job configuration. There may be a separate configuration file, managed by the scheduler, for configuring properties of these queues; refer to the scheduler's documentation for details.
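A queue declaration along these lines can be sketched as a generated config fragment. "research" is a made-up second queue for illustration; only "default" is required.

```shell
# Sketch: a mapred-site.xml declaring the queue list. "default" must
# appear (see the notes above); "research" is a hypothetical extra queue.
cat > mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.queue.names</name>
    <value>default,research</value>
  </property>
</configuration>
EOF
```

A job would then target the extra queue by setting mapred.job.queue.name (for example, to "research") in its job configuration.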