1.a) What are the three types of big data that are a big deal for marketers?
Discuss briefly each of them. (Jun/Jul 2015) (10 Marks)
Digital Marketing
Introduction
Database Marketers, Pioneers of Big Data
Big Data & New School of Marketing
Cross-Channel Lifecycle Marketing
Social and Affiliate Marketing
Empowering Marketing with Social Intelligence
Digital marketing encompasses using any sort of online media (profit or non-profit) to drive
people to a website, a mobile app, etc., to retain them, and to interact with them in order to
understand what consumers really want.
Digital marketing is easy when consumers interact with a corporation's primary platform (i.e., the
one the corporation owns), because the corporation gets good information about them. But it
gets very little information once people start interacting with other platforms (e.g., Facebook,
Twitter, Google+).
One of the problems that has to be dealt with in a digital existence is that corporations do not
have access to data for all consumers (i.e., there is very little visibility into people when
they interact on social networking sites). So corporations lose control of their ability to access
the data they need in order to make smart, timely decisions.
Big data on the web will completely transform a company’s ability to understand the
effectiveness of its marketing and the ability to understand how its competitors are behaving.
Rather than a few people making big decisions, organizations will be able to make
hundreds and thousands of smart decisions every day.
Database marketing consists of collecting information about individuals, using that information
to better understand those individuals, and communicating effectively with some of those
individuals to drive business value.
Cross-Channel Lifecycle Marketing really starts with the capture of customer permission, contact
information, and preferences for multiple channels. It also requires marketers to have the right
integrated marketing and customer information systems, so that (1) they can have complete
understanding of customers through stated preferences and observed behavior at any given time;
and (2) they can automate and optimize their programs and processes throughout the customer
lifecycle. Once marketers have that, they need a practical framework for planning marketing
activities. The various loops that guide marketing strategies and tactics in the Cross-Channel
Lifecycle Marketing approach (conversion, repurchase, stickiness, win-back, and re-permission)
are shown in the following figure.
Social & Affiliate Marketing or Pay for Performance Marketing on the Internet
The concept of affiliate marketing, or pay for performance marketing on the internet is often
credited to William J. Tobin, the founder of PC Flowers & Gifts. Amazon.com launched its own
affiliate program in 1996, and middleman affiliate networks like LinkShare and Commission
Junction emerged during the late-1990s Internet boom, providing the tools and technology to allow
any brand to put affiliate marketing practices to use. Today, most of the major brands have a
thriving affiliate program, and industry analysts estimate affiliate marketing to be a $3 billion
industry. It is an industry that largely remains anonymous: unlike email and banner advertising,
affiliate marketing is a behind-the-scenes channel most consumers are unaware of.
In 2012, the emergence of the social web brought these concepts together. What only
professional affiliate marketers could do prior to Facebook, Twitter, and Tumblr, any
consumer with a mouse can now do. Couponmountain.com and other well-known affiliate sites
generate multimillion dollar yearly revenues for driving transactions for the merchants they
promote. The expertise required to build, host, and run a business like Couponmountain.com is
Dept. of CSE, SJBIT Page 3
Managing Big Data 14SCS21
no longer needed when a consumer with zero technical or business background can now publish
the same content simply by clicking "Update Status" or "Tweet." The barriers to entering the
affiliate marketing industry as an affiliate no longer exist.
On the other hand, a file system is a more unstructured data store for storing arbitrary, possibly
unrelated data. The file system is more general, and databases are built on top of the general
storage that file systems provide.
There are also differences in the expected level of service provided by file systems and
databases. While databases must be self-consistent at any instant in time (think about banks
tracking money!) and provide isolated transactions and durable writes, a file system provides much
looser guarantees about consistency, isolation and durability. The database uses sophisticated
algorithms and protocols to implement reliable storage on top of potentially unreliable file
systems. It is these algorithms that make database storage more expensive in terms of processing
and storage costs, and that make general file systems an attractive option for data that does not
require such strong guarantees.
As technology moves forward, though, the lines are blurring, as some file systems pick
up features previously the domain of databases (transactions, advanced queries) and some
databases relax the traditional constraints of consistency, isolation and durability. ZFS and
BTRFS might be considered examples of the former, MongoDB and CouchDB examples of the
latter.
- Credit risk management focuses on reducing risks to acceptable levels that can boost
profitability
- Risk data sources and advanced analytics are instrumental for credit risk management
- Social media and cell phone usage data are opening up new opportunities to uncover
customer behavior that can be used for credit decisioning.
- As Figure illustrates, there are four critical parts of the typical credit risk framework:
planning, customer acquisition, account management, and collections. All four parts are
handled in unique ways through the use of Big Data.
b) How is it possible to detect a brand with Big Data in social network
analysis? (Jun/Jul 2015) (10 Marks)
Empowering Marketing with Social intelligence
As a result of the growing popularity and use of social media around the world and across
nearly every demographic, the amount of user-generated content—or “big data”—created is
immense, and continues growing exponentially. Millions of status updates, blog posts,
photographs, and videos are shared every second. Successful organizations will not only need to
identify the information relevant to their company and products—but also be able to dissect it,
make sense of it, and respond to it—in real time and on a continuous basis, drawing business
intelligence—or insights—that help predict likely future customer behavior. Very intelligent
software is required to parse all that social data to define things like the sentiment of a post.
Marketers now have the opportunity to mine social conversations for purchase intent and brand
lift through Big Data. So, marketers can communicate with consumers regardless of the channel.
Since this data is captured in real-time, Big Data is forcing marketing organizations to quickly
optimize media and message. Since this data provides details on all aspects of consumer
behavior, companies are eliminating silos within the organization to coordinate across channels,
across media, and across the path to purchase.
MODULE 2
NOSQL DATA MANAGEMENT
Polyglot persistence – using more than one database for a specific application – has emerged as
the norm in recent years. From travel giant Orbitz to innovative SEO startup Moz, engineers
have chosen utility over simplicity, and reaped the benefits (and downsides) of running multiple
databases in production.
This choice has been driven by a number of factors, but the central one is that users like
applications with rich experiences. They need to be able to easily search their data, connect with
friends, and find relevant places near them.
Simultaneously, an abundance of new databases have emerged to fill various storage, querying,
and scaling niches making polyglot persistence no longer the domain of just the biggest web
properties.
Though generally credited to Scott Leberknight, Martin Fowler of ThoughtWorks has
popularized the concept of polyglot persistence. He defines it as “using multiple data storage
technologies, chosen based upon the way data is being used by individual applications.” It’s a
pragmatic notion that immediately seems obvious.
Fowler goes on to say that any organization (or application for the matter) will require “a variety
of different data storage technologies for different kinds of data.” Fowler employs the term as a
way to organize the trend associated with the explosion of new databases.
In his explanation of polyglot persistence, Fowler doesn't identify the main driving force
behind the current push to polyglot systems. The growth in users’ expectations of what
applications should do is driving developers to adopt multiple querying and storage technologies.
Any nontrivial multi-platform application ingesting data of varying formats and scopes requires
differentiated querying modes. These modes include key-value lookup, full-text search,
geolocation, graph traversal, and time-ordered events.
This pattern can be seen in many common applications. Take Foursquare as an example. It
integrates social networking (graph traversal) with geolocation and faceted search. It coalesces
check-ins, comments, and reviews based on time and place. The functionality of other common
apps like Yelp, Yammer, and Twitter exposes the same pattern: a variety of data types alongside a
variety of query modes.
Databases have emerged to handle specific query types. The maturation of Lucene has put
sophisticated search into the hands of average developers. The development of BigTable and
Dynamo clones has introduced to the masses KV-style stores with horizontal scaling properties.
Graph stores like Neo4J have introduced a different kind of persistence and querying model to
many developers.
Unfortunately, no individual database can handle all of the query types that drive rich user
experiences.
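The polyglot pattern described above can be sketched in a few lines of Python: one application routes key-value lookups and full-text search to two different stores. The in-memory classes here are hypothetical stand-ins for real systems such as Redis and Lucene/Solr.

```python
# A toy illustration of polyglot persistence: one application routing
# different query modes to different specialized stores. All class and
# method names are illustrative, not a real database API.
from collections import defaultdict

class KeyValueStore:                      # stand-in for e.g. Redis
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class SearchIndex:                        # stand-in for e.g. Lucene/Solr
    def __init__(self):
        self._index = defaultdict(set)
    def index(self, doc_id, text):
        for word in text.lower().split():
            self._index[word].add(doc_id)
    def search(self, word):
        return self._index[word.lower()]

class App:
    def __init__(self):
        self.kv = KeyValueStore()
        self.search = SearchIndex()
    def save_place(self, place_id, doc):
        self.kv.put(place_id, doc)                # fast lookup by id
        self.search.index(place_id, doc["name"])  # full-text search

app = App()
app.save_place("p1", {"name": "Blue Bottle Coffee"})
app.save_place("p2", {"name": "Coffee Collective"})
print(app.search.search("coffee"))   # both ids: {'p1', 'p2'} (set order may vary)
print(app.kv.get("p1")["name"])      # Blue Bottle Coffee
```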
b) How is crowdsourcing a great way to capitalize on the resources that can build
algorithms and predictive models? (Jun/Jul 2015) (5 Marks)
The Cloud and Big Data
Market economics are demanding that capital-intensive infrastructure costs disappear and
business challenges are forcing clients to consider newer models. The cloud-deployment model
satisfies such needs. With a cloud model, payment is on a subscription basis with no capital
expense. Typical 30% maintenance fees are not incurred, and all updates to the platform are
automatically available.
Here is a simple example of the powerful visualization that the Tableau team is referring to. A
company uses an interactive dashboard to track the critical metrics driving their business. Every
day, the CEO and other executives are plugged in real-time to see how their markets are
performing in terms of sales and profit, what the service quality scores look like against
advertising investments, and how products are performing in terms of revenue and profit.
Interactivity is key: a click on any filter lets the executive look into specific markets or products.
She can click on any data point in any one view to show the related data in the other views. She
can look into any unusual pattern or outlier by showing details on demand. Or she can click
through the underlying information in a split-second.
- Business intelligence needs to work the way people’s minds work. Users need to navigate and
interact with data any way they want to—asking and answering questions on their own and in big
groups or teams.
- Qliktech has designed a way for users to leverage direct—and indirect—search. With QlikView
search, users type relevant words or phrases in any order and get instant, associative results. With
a global search bar, users can search across the entire data set. With search boxes on individual
list boxes, users can confine the search to just that field. Users can conduct both direct and
indirect searches. For example, if a user wanted to identify a sales rep but couldn’t remember the
sales rep’s name—just details about the person, such as that he sells fish to customers in the
Nordic region—the user could search on the sales rep list box for "Nordic" and "fish" to
narrow the search results to just the people who meet those criteria.
- Innovation engines for new product innovation, drug discovery and consumer and fashion
trends to predict new product formulations and new purchases.
- Consumer insight engines that integrate a wide variety of consumer-related information
including sentiment, behavior and emotions.
- Optimization engines that optimize complex, interrelated operations and decisions that are too
complex to handle manually.
c) What is data aggregation? What are the different data models used in NoSQL? Give
examples and explain. (Jun/Jul 2015) (12 Marks)
Data aggregation is any process in which information is gathered and expressed in a summary
form, for purposes such as statistical analysis. A common aggregation purpose is to get more
information about particular groups based on specific variables such as age, profession, or
income.
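A minimal sketch of this kind of aggregation, using made-up sample records, might look like:

```python
# Data aggregation sketch: summarizing individual records into
# group-level statistics (here, average income by profession).
# The records are made-up sample data.
from collections import defaultdict

records = [
    {"profession": "teacher",  "income": 40000},
    {"profession": "teacher",  "income": 44000},
    {"profession": "engineer", "income": 70000},
]

totals = defaultdict(lambda: [0, 0])          # profession -> [sum, count]
for r in records:
    totals[r["profession"]][0] += r["income"]
    totals[r["profession"]][1] += 1

summary = {p: s / n for p, (s, n) in totals.items()}
print(summary)   # {'teacher': 42000.0, 'engineer': 70000.0}
```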
Aggregate Data Models
The relational model takes the information that we want to store and divides it into tuples (rows).
A tuple is a limited data structure: It captures a set of values, so you cannot nest one tuple within
another to get nested records, nor can you put a list of values or tuples within another. This
simplicity underpins the relational model—it allows us to think of all operations as operating on
and returning tuples.
Aggregate orientation takes a different approach. It recognizes that often, you want to operate on
data in units that have a more complex structure than a set of tuples. It can be handy to think in
terms of a complex record that allows lists and other record structures to be nested inside it. As
we’ll see, key-value, document, and column-family databases all make use of this more complex
record. However, there is no common term for this complex record; in this book we use the term
“aggregate.”
Aggregate is a term that comes from Domain-Driven Design. In Domain-Driven Design, an
aggregate is a collection of related objects that we wish to treat as a unit. In particular, it is a unit
for data manipulation and management of consistency. Typically, we like to update aggregates
with atomic operations and communicate with our data storage in terms of aggregates. This
definition matches really well with how key-value, document, and column-family databases
work.
Dealing in aggregates makes it much easier for these databases to handle operating on a cluster,
since the aggregate makes a natural unit for replication and sharding. Aggregates are also often
easier for application programmers to work with, since they often manipulate data through
aggregate structures.
Example of Relations and Aggregates
At this point, an example may help explain what we’re talking about. Let’s assume we have to
build an e-commerce website; we are going to be selling items directly to customers over the
web, and we will have to store information about users, our product catalog, orders, shipping
addresses, billing addresses, and payment data. We can use this scenario to model the data using
a relational data store as well as NoSQL data stores and talk about their pros and cons. Now let’s
see how this model might look when we think in more aggregate oriented terms as shown in
Figure 2.1.
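Since the figure itself is not reproduced here, the following is a sketch of how such an order might look in aggregate-oriented terms; the field names and values are illustrative only.

```python
# A sketch of the e-commerce data modeled as an aggregate: the whole
# order, with line items and addresses nested inside, is stored and
# retrieved as one unit. Field names and values are illustrative.
order_aggregate = {
    "id": 99,
    "customer_id": 1,
    "line_items": [
        {"product": "NoSQL Distilled", "quantity": 2, "price": 45.00},
    ],
    "shipping_address": {"city": "Chicago"},
    "payment": {
        "card": "amex",
        "billing_address": {"city": "Chicago"},
    },
}

# Everything needed to display the order comes back in a single lookup,
# instead of joining order, line-item, and address tables.
total = sum(i["quantity"] * i["price"] for i in order_aggregate["line_items"])
print(total)   # 90.0
```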
A key-value store treats the aggregate as an opaque blob of mostly meaningless bits, whereas a
document database is able to see a structure in the aggregate. The advantage of opacity is that we
can store whatever we like in the aggregate. The database may impose some general size limit,
but other than that we have complete freedom.
A document database imposes limits on what we can place in it, defining allowable structures
and types. In return, however, we get more flexibility in access. With a key-value store, we can
only access an aggregate by lookup based on its key. With a document database, we can submit
queries to the database based on the fields in the aggregate, we can retrieve part of the aggregate
rather than the whole thing, and the database can create indexes based on the contents of the
aggregate.
In practice, the line between key-value and document gets a bit blurry. People often put an ID
field in a document database to do a key-value style lookup. Databases classified as key-value
databases may allow you structures for data beyond just an opaque aggregate. For example, Riak
allows you to add metadata to aggregates for indexing and interaggregate links; Redis allows you
to break down the aggregate into lists or sets. You can support querying by integrating search
tools such as Solr.
As an example, Riak includes a search facility that uses Solr-like searching on any aggregates that
are stored as JSON or XML structures. Despite this blurriness, the general distinction still holds.
With key-value databases, we expect to mostly look up aggregates using a key. With document
databases, we mostly expect to submit some form of query based on the internal structure of the
document; this might be a key, but it’s more likely to be something else.
Relationships
Aggregates are useful in that they put together data that is commonly accessed together. But
there are still lots of cases where data that’s related is accessed differently. Consider the
relationship between a customer and all of his orders. Some applications will want to access the
order history whenever they access the customer; this fits in well with combining the customer
with his order history into a single aggregate. Other applications, however, want to process
orders individually and thus model orders as independent aggregates.
In this case, you’ll want separate order and customer aggregates but with some kind of
relationship between them so that any work on an order can look up customer data. The simplest
way to provide such a link is to embed the ID of the customer within the order’s aggregate data.
That way, if you need data from the customer record, you read the order, ferret out the customer
ID, and make another call to the database to read the customer data. This will work, and will be
just fine in many scenarios—but the database will be ignorant of the relationship in the data. This
can be important because there are times when it’s useful for the database to know about these
links.
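The two-step lookup described above can be sketched as follows; the in-memory dictionaries stand in for a key-value database, and all names and values are illustrative.

```python
# A sketch of the simplest aggregate-to-aggregate link: the order embeds
# the customer's ID, and the application makes a second read to fetch
# the customer. The in-memory "database" dictionaries are illustrative.
customers = {1: {"id": 1, "name": "Martin"}}
orders    = {99: {"id": 99, "customer_id": 1, "total": 90.0}}

def customer_for_order(order_id):
    order = orders[order_id]           # first read: the order aggregate
    cust_id = order["customer_id"]     # ferret out the embedded ID
    return customers[cust_id]          # second read: the customer

print(customer_for_order(99)["name"])   # Martin
```

Note that the database itself knows nothing about this relationship; only the application does, which is exactly the limitation the passage above describes.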
As a result, many databases—even key-value stores—provide ways to make these relationships
visible to the database. Document stores make the content of the aggregate available to the
database to form indexes and queries. Riak, a key-value store, allows you to put link information
in metadata, supporting partial retrieval and link-walking capability. An important aspect of
relationships between aggregates is how they handle updates. Aggregate-oriented databases treat
the aggregate as the unit of data retrieval. Consequently, atomicity is only supported within the
contents of a single aggregate. If you update multiple aggregates at once, you have to deal
yourself with a failure partway through. Relational databases help you with this by allowing you
to modify multiple records in a single transaction, providing ACID guarantees while altering
many rows.
All of this means that aggregate-oriented databases become more awkward as you need to
operate across multiple aggregates. There are various ways to deal with this, which we’ll explore
later in this chapter, but the fundamental awkwardness remains. This may imply that if you have
data based on lots of relationships, you should prefer a relational database over a NoSQL store.
While that’s true for aggregate-oriented databases, it’s worth remembering that relational
databases
aren’t all that stellar with complex relationships either. While you can express queries involving
joins in SQL, things quickly get very hairy—both with SQL writing and with the resulting
performance—as the number of joins mounts up. This makes it a good moment to introduce
another category of databases that’s often lumped into the NoSQL pile.
Graph Databases
Graph databases are an odd fish in the NoSQL pond. Most NoSQL databases were inspired by
the need to run on clusters, which led to aggregate-oriented data models of large records with
simple connections. Graph databases are motivated by a different frustration with relational
databases and thus have an opposite model: small records with complex interconnections, as shown
in Figure 2.2.
Materialized Views
When we talked about aggregate-oriented data models, we stressed their advantages. If you want
to access orders, it’s useful to have all the data for an order contained in a single aggregate that
can be stored and accessed as a unit. But aggregate-orientation has a corresponding
disadvantage: What happens if a product manager wants to know how much a particular item has
sold over the last couple of weeks? Now the aggregate-orientation works against you, forcing
you to potentially read every order in the database to answer the question. You can reduce this
burden by building an index on the product, but you’re still working against the aggregate
structure.
Relational databases have an advantage here because their lack of aggregate structure allows
them to support accessing data in different ways. Furthermore, they provide a convenient
mechanism that allows you to look at data differently from the way it’s stored—views. A view is
like a relational table (it is a relation) but it’s defined by computation over the base tables. When
you access a view, the database computes the data in the view—a handy form of encapsulation.
Views provide a mechanism to hide from the client whether data is derived data or base data—
but can’t avoid the fact that some views are expensive to compute. To cope with this,
materialized views were invented, which are views that are computed in advance and cached on
disk. Materialized views are effective for data that is read heavily but can stand being somewhat
stale. Although NoSQL databases don’t have views, they may have precomputed and cached
queries, and they reuse the term “ materialized view” to describe them.
It’s also much more of a central aspect for aggregate-oriented databases than it is for relational
systems, since most applications will have to deal with some queries that don’t fit well with the
aggregate structure. (Often, NoSQL databases create materialized views using a map-reduce
computation.) There are two rough strategies for building a materialized view. The first is
the eager approach where you update the materialized view at the same time you update the base
data for it. In this case, adding an order would also update the purchase history aggregates for
each product. This approach is good when you have more
frequent reads of the materialized view than you have writes and you want the materialized
views to be as fresh as possible. The application database approach is valuable here as it makes it
easier to ensure that any updates to base data also update materialized views.
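A minimal sketch of the eager strategy, assuming a per-product purchase-totals view; the order structure and names are illustrative.

```python
# Eager materialized-view sketch: the view (per-product purchase totals)
# is updated in the same operation that records the order, so reads of
# the view are always fresh. All names here are illustrative.
orders = []
product_totals = {}    # the materialized view

def add_order(order):
    orders.append(order)                       # update the base data
    for item in order["line_items"]:           # ...and the view, eagerly
        p = item["product"]
        product_totals[p] = product_totals.get(p, 0) + item["quantity"]

add_order({"id": 1, "line_items": [{"product": "widget", "quantity": 3}]})
add_order({"id": 2, "line_items": [{"product": "widget", "quantity": 2}]})
print(product_totals["widget"])   # 5
```

The batch alternative described next would instead recompute `product_totals` from `orders` on a schedule, trading freshness for lower write overhead.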
If you don’t want to pay that overhead on each update, you can run batch jobs to update the
materialized views at regular intervals. You’ll need to understand your business requirements to
assess how stale your materialized views can be. You can build materialized views outside of the
database by reading the data, computing the view, and saving it back to the database. More often
databases will support building materialized views themselves. In this case, you provide the
computation that needs to be done, and the database executes the computation when needed
according to some parameters that you configure. This is particularly handy for eager updates of
views with incremental map-reduce (“Incremental Map-Reduce”).
Materialized views can be used within the same aggregate. An order document might include an
order summary element that provides summary information about the order so that a query for an
order summary does not have to transfer the entire order document. Using different column
families for materialized views is a common feature of column-family databases. An advantage
of doing this is that it allows you to update the materialized view within the same atomic
operation.
Distribution Models
The primary driver of interest in NoSQL has been its ability to run databases on a large cluster.
As data volumes increase, it becomes more difficult and expensive to scale up—buy a bigger
server to run the database on. A more appealing option is to scale out—run the database on a
cluster of servers. Aggregate orientation fits well with scaling out because the aggregate is a
natural unit to use for distribution.
Depending on your distribution model, you can get a data store that will give you the ability to
handle larger quantities of data, the ability to process greater read or write traffic, or more
availability in the face of network slowdowns or breakages. These are often important benefits,
but they come at a cost. Running over a cluster introduces complexity—so it’s not something to
do unless the benefits are compelling. Broadly, there are two paths to data distribution:
replication and sharding. Replication takes the same data and copies it over multiple nodes.
Sharding puts different data on different nodes. Replication and sharding are orthogonal
techniques: You can use either or both of them. Replication comes in two forms: master-slave
and peer-to-peer. We will now discuss these techniques, starting with the
simplest and working up to the more complex: first single-server, then master-slave replication,
then sharding, and finally peer-to-peer replication.
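As a toy illustration of sharding, the following sketch assigns aggregates to nodes by hashing the key. Real stores typically use consistent hashing and rebalancing; this fixed modulo scheme and the node names are illustrative only.

```python
# Toy sharding sketch: different aggregates are placed on different
# nodes by hashing the key. The node names are made up, and the fixed
# modulo scheme is only illustrative (real systems rebalance shards).
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    # Simple deterministic hash so the same key always maps to one node.
    h = sum(ord(c) for c in key)
    return NODES[h % len(NODES)]

print(shard_for("customer:1"))                             # some node
print(shard_for("customer:1") == shard_for("customer:1"))  # True (stable)
```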
4.a) What is the main process of operation in master-slave replication? What are the pros
and cons of master-slave replication? (Jun/Jul 2015) (10 Marks)
Initial Deployment
To configure a master-slave deployment, start two mongod instances: one in master mode, and
the other in slave mode.
To start a mongod instance in master mode, invoke mongod as follows:
mongod --master --dbpath /data/masterdb/
With the --master option, the mongod will create a local.oplog.$main collection, which is the
“operation log” that queues operations that the slaves will apply to replicate operations from the
master. The --dbpath option is optional.
To start a mongod instance in slave mode, invoke mongod as follows:
mongod --slave --source <masterhostname><:<port>> --dbpath /data/slavedb/
Specify the hostname and port of the master instance to the --source argument. The --dbpath is
optional.
For slave instances, MongoDB stores data about the source server in the local.sources collection.
Configuration Options for Master-Slave Deployments
As an alternative to specifying the --source run-time option, you can add a document to local.sources
specifying the master instance, as in the following operation in the mongo shell:
use local
db.sources.find()
db.sources.insert( { host: <masterhostname> <,only: <databasename>> } );
In line 1, you switch context to the local database. In line 2, the find() operation should return no
documents, to ensure that there are no documents in the sources collection. Finally, line 3 uses
db.collection.insert() to insert the source document into the local.sources collection. The model
of the local.sources document is as follows:
host
The host field specifies the master mongod instance, and holds a resolvable hostname, i.e.
an IP address, a name from a host file, or (preferably) a fully qualified domain name.
You can append <:port> to the host name if the mongod is not running on the default
27017 port.
only
Optional. Specify a name of a database. When specified, MongoDB will only replicate
the indicated database.
Operational Considerations for Replication with Master Slave Deployments
Master instances store operations in an oplog, which is a capped collection. As a result, if a slave
falls too far behind the state of the master, it cannot “catch up” and must re-sync from scratch.
Slave may become out of sync with a master if:
The slave falls far behind the data updates available from that master.
The slave stops (i.e. shuts down) and restarts later after the master has overwritten the
relevant operations in its oplog.
When slaves are out of sync, replication stops. Administrators must intervene manually to restart
replication. Use the resync command. Alternatively, the --autoresync option allows a slave to restart
replication automatically, after a ten-second pause, when the slave falls out of sync with the
master. With --autoresync specified, the slave will only attempt to re-sync once in a ten-minute
period.
To prevent these situations you should specify a larger oplog when you start the master instance,
by adding the --oplogSize option when starting mongod. If you do not specify --oplogSize,
mongod will allocate 5% of available disk space on startup to the oplog, with a minimum of 1
GB for 64-bit machines and 50 MB for 32-bit machines.
Run-time Master-Slave Configuration
MongoDB provides a number of command line options for mongod instances in master-slave
deployments. See the Master-Slave Replication Command Line Options for options.
Diagnostics
On a master instance, issue the following operation in the mongo shell to return replication status
from the perspective of the master:
rs.printReplicationInfo()
New in version 2.6: rs.printReplicationInfo(). For previous versions, use
db.printReplicationInfo().
On a slave instance, use the following operation in the mongo shell to return the replication
status from the perspective of the slave:
rs.printSlaveReplicationInfo()
New in version 2.6: rs.printSlaveReplicationInfo(). For previous versions, use
db.printSlaveReplicationInfo().
Use serverStatus, as in the following operation, to return the status of the replication:
db.serverStatus( { repl: 1 } )
See server status repl fields for documentation of the relevant section of output.
Security
When running with authorization enabled, in master-slave deployments configure a keyFile so
that slave mongod instances can authenticate and communicate with the master mongod
instance.
To enable authentication and configure the keyFile add the following option to your
configuration file:
keyFile = /srv/mongodb/keyfile
Note
You may choose to set these run-time configuration options using the --keyFile option on the
command line.
Setting keyFile enables authentication and specifies a key file for the mongod instances to use
when authenticating to each other. The content of the key file is arbitrary, but must be the same
on all members of the deployment so that they can connect to each other.
The key file must be less than one kilobyte in size and may only contain characters in the base64 set.
The key file must not have group or “world” permissions on UNIX systems. Use the following
command to use the OpenSSL package to generate “random” content for use in a key file:
openssl rand -base64 741
Master-Slave Replication
Pros
*Analytic applications can read from the slave(s) without impacting the master
*Slaves can be taken offline and sync back to the master without any downtime
Cons
*In the instance of a failure, a slave has to be promoted to master to take over its place. There is
no automatic failover.
*Each additional slave adds some load to the master, since the binary log has to be read and
data copied to each slave.
4.b) What does the CAP theorem convey? Give an example to support the CAP theorem. (Jun/Jul
2015) (10 Marks)
The CAP Theorem states that, in a distributed system (a collection of interconnected nodes that
share data), you can only have two out of the following three guarantees across a write/read pair:
Consistency, Availability, and Partition Tolerance - one of them must be sacrificed. However, as
you will see below, you don't have as many options here as you might think.
Consistency - A read is guaranteed to return the most recent write for a given
client.
Availability - A non-failing node will return a reasonable response within a
reasonable amount of time (no error or timeout).
Partition Tolerance - The system will continue to function when network partitions
occur.
Before moving further, we need to set one thing straight. Object Oriented Programming !=
Network Programming! There are assumptions that we take for granted when building
applications that share memory, which break down as soon as nodes are split across space and
time.
One such fallacy of distributed computing is that networks are reliable. They aren't. Networks
and parts of networks go down frequently and unexpectedly. Network failures happen to your
system and you don't get to choose when they occur.
Given that networks aren't completely reliable, you must tolerate partitions in a distributed
system, period. Fortunately, though, you get to choose what to do when a partition does occur.
According to the CAP theorem, this means we are left with two options: Consistency and
Availability.
CP - Consistency/Partition Tolerance - Wait for a response from the partitioned node,
which could result in a timeout error. The system can also choose to return an error,
depending on the scenario you desire. Choose Consistency over Availability when your
business requirements dictate atomic reads and writes.
AP - Availability/Partition Tolerance - Return the most recent version of the data you
have, which could be stale. This system state will also accept writes that can be processed
later when the partition is resolved. Choose Availability over Consistency when your
business requirements allow for some flexibility around when the data in the system
synchronizes. Availability is also a compelling option when the system needs to continue
to function in spite of external errors (shopping carts, etc.)
The decision between Consistency and Availability is a software trade off. You can choose what
to do in the face of a network partition - the control is in your hands. Network outages, both
temporary and permanent, are a fact of life and occur whether you want them to or not - this
exists outside of your software.
Building distributed systems provides many advantages, but also adds complexity. Understanding
the trade-offs available to you in the face of network errors, and choosing the right path, is vital to
the success of your application. Failing to get this right from the beginning could doom your
application to failure before your first deployment.
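The trade-off can be made concrete with a toy sketch (plain Python, not a real database): one node holds a possibly stale copy of a value while a network partition separates it from the primary. An AP-style read answers with what it has; a CP-style read refuses rather than risk returning stale data. The class and field names here are invented for illustration.

```python
class PartitionedReplica:
    def __init__(self, value, stale):
        self.value = value   # last value this node saw before the partition
        self.stale = stale   # True if newer writes exist on the other side

    def read(self, mode):
        if mode == "AP":
            return self.value               # always answer, possibly stale
        if mode == "CP":
            if self.stale:
                # refuse rather than serve a potentially stale value
                raise TimeoutError("partitioned: cannot guarantee freshness")
            return self.value
        raise ValueError(mode)

replica = PartitionedReplica(value="cart:v1", stale=True)
print(replica.read("AP"))   # -> cart:v1 (available, but possibly stale)
```

A shopping cart would typically pick the AP behavior shown in the last line, since accepting a slightly stale cart beats showing the customer an error.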
MODULE 3
BASICS OF HADOOP
5.a) With a neat diagram, explain the sequence of events that takes place when reading a
file between the client and HDFS. (Jun/Jul 2015)(10 Marks)
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the
namenode and the datanodes, consider Figure 3.2, which shows the main sequence of events
when reading a file.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for
HDFS is an instance of DistributedFileSystem (step 1).
DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks
for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of
the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to
their proximity to the client (according to the topology of the cluster’s network; see “Network
Topology and Hadoop” ). If the client is itself a datanode (in the case of a MapReduce task, for
instance), then it will read from the local datanode, if it hosts a copy of the block.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file
seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the datanode and namenode I/O. The client then calls read() on the stream (step
3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file,
then connects to the first (closest) datanode for the first block in the file. Data is streamed from
the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the
end of the block is reached, DFSInputStream will close the connection to the datanode, then find
the best datanode for the next block (step 5). This happens transparently to the client, which from
its point of view is just reading a continuous stream. Blocks are read in order with the
DFSInputStream opening new connections to datanodes as the client reads through the stream. It
will also call the namenode to retrieve the datanode locations for the next batch of blocks as
needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
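The sequence above can be sketched as a small simulation (the class and method names are illustrative, not the real Hadoop API): a "namenode" maps each block to datanodes sorted by proximity, and the client reads block after block from the closest replica, seeing one continuous stream.

```python
class FakeNamenode:
    def __init__(self, block_locations):
        # block index -> list of datanodes holding a replica, closest first (step 2)
        self.block_locations = block_locations

    def get_block_locations(self, block):
        return self.block_locations[block]

class FakeDFSInputStream:
    def __init__(self, namenode, blocks):
        self.namenode = namenode
        self.blocks = blocks   # block index -> {datanode: bytes}

    def read_all(self):
        data = b""
        for block in sorted(self.blocks):
            # connect to the closest datanode for each block in turn (steps 3-5)
            closest = self.namenode.get_block_locations(block)[0]
            data += self.blocks[block][closest]
        return data            # the client just sees a continuous stream

namenode = FakeNamenode({0: ["dn1", "dn3"], 1: ["dn2", "dn1"]})
blocks = {0: {"dn1": b"hello ", "dn3": b"hello "},
          1: {"dn2": b"world", "dn1": b"world"}}
stream = FakeDFSInputStream(namenode, blocks)
print(stream.read_all())   # -> b'hello world'
```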
In order to understand the concept of Hadoop Combiner effectively, let’s consider the
following use case, which identifies the number of complaints reported in each state. Below is
the code snippet:
package com.evoke.bigdata.mr.complaint;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ComplaintCombiner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        String input, output;
        if (args.length == 2) {
            input = args[0];
            output = args[1];
        } else {
            input = "your-input-dir";
Dept. of CSE, SJBIT Page 25
Managing Big Data 14SCS21
            output = "your-output-dir";
        }

        JobConf conf = new JobConf(getConf(), ComplaintCombiner.class);
        conf.setJobName(this.getClass().getName());

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        conf.setMapperClass(StateMapper.class);
        conf.setReducerClass(SumCombiner.class);

        //conf.setCombinerClass(SumCombiner.class);

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ComplaintCombiner(), args);
        System.exit(exitCode);
    }
}
A combiner still implements the Reducer interface. Combiners can only be used in
specific cases, which are going to be job-dependent.
Difference # 1
One constraint that a Combiner will have, unlike a Reducer, is that the input/output key and
value types must match the output types of your Mapper.
Difference #2
Combiners can only be used for functions that are commutative (a.b = b.a) and associative
{a.(b.c) = (a.b).c}. This also means that combiners may operate only on a subset of your keys
and values, or may not execute at all, and you still want the output of the program to remain the same.
Difference #3
Reducers can get data from multiple Mappers as part of the partitioning process. Combiners can
only get its input from one Mapper.
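Difference #2 is the crucial one, and it can be checked directly: because addition is commutative and associative, running a "combiner" on an arbitrary subset of a mapper's output, or not running it at all, cannot change the final reduced totals. The sketch below is plain Python, not Hadoop code.

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Sum counts per state -- stands in for both the reducer and the combiner."""
    totals = defaultdict(int)
    for state, count in pairs:
        totals[state] += count
    return dict(totals)

mapper_output = [("NY", 1), ("CA", 1), ("NY", 1), ("NY", 1), ("CA", 1)]

# Path 1: the reducer sees the raw mapper output (no combiner ran).
direct = reduce_counts(mapper_output)

# Path 2: a combiner pre-aggregates only part of the output (combiners may
# run on just a subset of the values, or not at all).
combined_part = list(reduce_counts(mapper_output[:3]).items())
via_combiner = reduce_counts(combined_part + mapper_output[3:])

assert direct == via_combiner == {"NY": 3, "CA": 2}
print(direct)   # -> {'NY': 3, 'CA': 2}
```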
Map Reduce:
- Because Hadoop stores the entire dataset in small pieces across a collection of servers,
analytical jobs can be distributed in parallel to each of the servers storing part of the data.
Each server evaluates the question against its local fragment simultaneously and reports its
result back for collation into a comprehensive answer. Map Reduce is the agent that
distributes the work and collects the results.
- Both HDFS and Map Reduce are designed to continue to work even if there are failures.
- HDFS continuously monitors the data stored on the cluster. If a server becomes unavailable,
a disk drive fails or data is damaged due to hardware or software problems, HDFS
automatically restores the data from one of the known good replicas stored elsewhere on the
cluster.
- Map Reduce monitors the progress of each of the servers participating in the job, when an
analysis job is running. If one of them is slow in returning an answer or fails before
completing its work, Map Reduce automatically starts another instance of the task on another
server that has a copy of the data.
- Because of the way that HDFS and Map Reduce work, Hadoop provides scalable, reliable
and fault-tolerant services for data storage and analysis at very low cost.
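The distribute-then-collate pattern described above can be sketched in a few lines of plain Python (a stand-in for the framework, not Hadoop itself): each "server" evaluates the question against its local fragment, and the partial results are collated into one comprehensive answer.

```python
from collections import Counter

def map_fragment(fragment):
    """Each server counts complaints per state in its local data fragment."""
    return Counter(record["state"] for record in fragment)

def collate(partial_results):
    """Collect the per-server partial counts into a comprehensive answer."""
    total = Counter()
    for partial in partial_results:
        total += partial
    return total

# The dataset is stored in small pieces across a collection of servers.
server_fragments = [
    [{"state": "NY"}, {"state": "CA"}],
    [{"state": "NY"}],
    [{"state": "TX"}, {"state": "NY"}],
]

# In Hadoop these map calls would run in parallel, one per server.
partials = [map_fragment(f) for f in server_fragments]
print(collate(partials))   # -> Counter({'NY': 3, 'CA': 1, 'TX': 1})
```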
MODULE 4
MAPREDUCE APPLICATIONS
7 a) List the different failures that need to be considered for MapReduce programs running
on YARN. What is the effect of each failure? (Jun/Jul 2015)(10 Marks)
YARN separates these two roles into two independent daemons: a resource manager to
manage the use of resources across the cluster, and an application master to manage the lifecycle
of applications running on the cluster. The idea is that an application master negotiates with the
resource manager for cluster resources (described in terms of a number of containers, each with
a certain memory limit) and then runs application-specific processes in those containers. The
containers are overseen by node managers running on cluster nodes, which ensure that the
application does not use more resources than it has been allocated.
Figure 4.3. How status updates are propagated through the MapReduce 1 system.
In contrast to the jobtracker, each instance of an application—here a MapReduce job—has a
dedicated application master, which runs for the duration of the application.This model is
actually closer to the original Google MapReduce paper, which describes how a master process
is started to coordinate map and reduce tasks running on a set of workers.
As described, YARN is more general than MapReduce, and in fact MapReduce is just one type
of YARN application. There are a few other YARN applications, such as a distributed shell that
can run a script on a set of nodes in the cluster, and others are actively being worked on (some
are listed at http://wiki.apache.org/hadoop/PoweredByYarn). The beauty of YARN's design is
that different YARN applications can co-exist on the same cluster, so a MapReduce application
can run at the same time as an MPI application, for example, which brings great benefits for
manageability and cluster utilization. Furthermore, it is even possible for users to run different
versions of MapReduce on the same YARN cluster, which makes the process of upgrading
MapReduce more manageable. (Note that some parts of MapReduce, like the job history server
and the shuffle handler, as well as YARN itself, still need to be upgraded across the cluster.)
MapReduce on YARN involves more entities than classic MapReduce. They are:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the
cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are scheduled by the
resource manager, and managed by the node managers.
• The distributed filesystem (normally HDFS, ), which is used for sharing job files between the
other entities.The process of running a job is shown in Figure 4.4, and described in the following
sections.
Job Submission:
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step 1).
MapReduce 2 has an implementation of ClientProtocol that is activated when
mapreduce.framework.name is set to yarn. The submission process is very similar to the classic
implementation. The new job ID is retrieved from the resource manager (rather than the
jobtracker), although in the nomenclature of YARN it is an application ID (step 2). The job client
checks the output specification of the job; computes input splits (although there is an option to
generate them on the cluster, yarn.app.mapreduce.am.compute-splits-in-cluster, which can be
beneficial for jobs with many splits); and copies job resources (including the job JAR,
configuration, and split information) to HDFS (step 3). Finally, the job is submitted by calling
submitApplication() on the resource manager (step 4).
Job Initialization:
The application master creates a map task object for each split, and a number of reduce task
objects determined by the mapreduce.job.reduces property. The next thing the application master
does is decide how to run the tasks that make up the MapReduce job. If the job is small, the
application master may choose to run the tasks in the same JVM as itself, since it judges the
overhead of allocating new containers and running tasks in them as outweighing the gain to be
had in running them in parallel, compared to running them sequentially on one node. (This is
different to MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is
said to be uberized, or run as an uber task. What qualifies as a small job? By default, one that has
fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS
block. (These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) It is also possible
to disable uber tasks entirely (by setting mapreduce.job.ubertask.enable to false). Before any
tasks can be run, the job setup method is called (for the job's OutputCommitter) to create the
job's output directory. In contrast to MapReduce 1, where it is called in a special task that is run
by the tasktracker, in the YARN implementation the method is called directly by the application
master.
Task Assignment:
If the job does not qualify for running as an uber task, then the application master requests
containers for all the map and reduce tasks in the job from the resource manager (step 8). Each
request, which is piggybacked on a heartbeat call, includes information about each map task's
data locality, in particular the hosts and corresponding racks that the input split resides on. The
scheduler uses this information to make scheduling decisions (just like a jobtracker's scheduler
does): it attempts to place tasks on data-local nodes in the ideal case, but if this is not possible the
scheduler prefers rack-local placement to non-local placement. Requests also specify memory
requirements for tasks. By default, both map and reduce tasks are allocated 1024 MB of memory,
but this is configurable by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
The way memory is allocated is different to MapReduce 1, where tasktrackers have a fixed
number of "slots", set at cluster configuration time, and each task runs in a single slot. Slots have
a maximum memory allowance, which again is fixed for a cluster, and which leads both to
problems of underutilization when tasks use less memory (since other waiting tasks are not able
to take advantage of the unused memory) and problems of job failure when a task can't complete
since it can't get enough memory to run correctly. In YARN, resources are more fine-grained, so
both these problems can be avoided. In particular, applications may request a memory capability
that is anywhere between the minimum allocation and a maximum allocation, and which must be
a multiple of the minimum allocation. Default memory allocations are scheduler-specific, and for
the capacity scheduler the default minimum is 1024 MB (set by
yarn.scheduler.capacity.minimum-allocation-mb) and the default maximum is 10240 MB (set by
yarn.scheduler.capacity.maximum-allocation-mb). Thus, tasks can request any memory
allocation between 1 and 10 GB (inclusive), in multiples of 1 GB (the scheduler will round to the
nearest multiple if needed), by setting mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb appropriately.
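The normalization rule above can be sketched as a small function (illustrative only, not YARN's actual code; this sketch rounds requests up to the next multiple of the minimum, a common scheduler behavior, and the numbers match the capacity-scheduler defaults quoted in the text).

```python
import math

def normalize_request(requested_mb, min_mb=1024, max_mb=10240):
    """Round a memory request to a multiple of the minimum allocation,
    clamped to the [min, max] range allowed by the scheduler."""
    rounded = math.ceil(requested_mb / min_mb) * min_mb
    return max(min_mb, min(rounded, max_mb))

print(normalize_request(1024))    # -> 1024 (exactly the minimum)
print(normalize_request(1500))    # -> 2048 (rounded up to the next 1 GB multiple)
print(normalize_request(20000))   # -> 10240 (capped at the maximum)
```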
Task Execution:
Once a task has been assigned a container by the resource manager's scheduler, the application
master starts the container by contacting the node manager (steps 9a and 9b).
Figure 4.5. How status updates are propagated through the MapReduce 2 system.
The task is executed by a Java application whose main class is YarnChild. Before it can run
the task it localizes the resources that the task needs, including the job configuration and JAR
file, and any files from the distributed cache (step 10). Finally, it runs the map or reduce task
(step 11). The YarnChild runs in a dedicated JVM, for the same reason that tasktrackers spawn
new JVMs for tasks in MapReduce 1: to isolate user code from long-running system daemons.
Unlike MapReduce 1, however, YARN does not support JVM reuse, so each task runs in a new
JVM. Streaming and Pipes programs work in the same way as in MapReduce 1. The YarnChild
launches the Streaming or Pipes process and communicates with it using standard input/output or
a socket (respectively), as shown in Figure 4.2 (except the child and subprocesses run on node
managers, not tasktrackers).
Progress and Status Updates:
When running under YARN, the task reports its progress and status (including counters) back to
its application master every three seconds (over the umbilical interface), which has an aggregate
view of the job. The process is illustrated in Figure 4.5. Contrast this to MapReduce 1, where
progress updates flow from the child through the tasktracker to the jobtracker for aggregation.
The client polls the application master every second (set via
mapreduce.client.progressmonitor.pollinterval) to receive progress updates, which are usually
displayed to the user.
Job Completion:
As well as polling the application master for progress, every five seconds the client checks
whether the job has completed (when using the waitForCompletion() method on Job). The
polling interval can be set via the mapreduce.client.completion.pollinterval configuration
property. Notification of job completion via an HTTP callback is also supported, as in
MapReduce 1; in MapReduce 2 the application master initiates the callback. On job completion,
the application master and the task containers clean up their working state, and the
OutputCommitter's job cleanup method is called. Job information is archived by the job history
server to enable later interrogation by users if desired.
Failures:
In the real world, user code is buggy, processes crash, and machines fail. One of the major
benefits of using Hadoop is its ability to handle such failures and allow your job to complete.
Consider first the case of the child task failing. The most common way that this happens is when
user code in the map or reduce task throws a runtime exception. If this happens, the child JVM
reports the error back to its parent tasktracker before it exits. The error ultimately makes it into
the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another
task. For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as
failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the default is
true). Another failure mode is the sudden exit of the child JVM, perhaps because of a JVM bug
that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user
code. In this case, the tasktracker notices that the process has exited and marks the attempt as
failed. Hanging tasks are dealt with differently. The tasktracker notices that it hasn't received a
progress update for a while and proceeds to mark the task as failed. The child JVM process will
be automatically killed after this period. The timeout period after which tasks are considered
failed is normally 10 minutes and can be configured on a per-job basis (or a cluster basis) by
setting the mapred.task.timeout property to a value in milliseconds. Setting the timeout to a value
of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a
hanging task will never free up its slot, and over time there may be cluster slowdown as a result.
This approach should therefore be avoided; making sure that a task is reporting progress
periodically will suffice ("What Constitutes Progress in MapReduce?"). When the jobtracker is
notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule
execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker
where it has previously failed. Furthermore, if a task fails four times (or more), it will not be
retried further. This value is configurable: the maximum number of attempts to run a task is
controlled by the mapred.map.max.attempts property for map tasks and
mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or
whatever the maximum number of attempts is configured to), the whole job fails. For some
applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the
results of the job despite some failures. In this case, the maximum percentage of tasks that are
allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks
are controlled independently, using the mapred.max.map.failures.percent and
mapred.max.reduce.failures.percent properties. A task attempt may also be killed, which is
different from it failing. A task attempt may be killed because it is a speculative duplicate (for
more, see "Speculative Execution"), or because the tasktracker it was running on failed, and the
jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count
against the number of attempts to run the task (as set by mapred.map.max.attempts and
mapred.reduce.max.attempts), since it wasn't the task's fault that an attempt was killed. Users
may also kill or fail task attempts using the web UI or the command line (type hadoop job to see
the options). Jobs may also be killed by the same mechanisms.
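The attempt-accounting rule above can be sketched as a small function (illustrative only, not Hadoop's code): failed attempts count toward the mapred.map.max.attempts limit, killed attempts (e.g. speculative duplicates) do not, and the task fails for good once the limit is reached.

```python
def task_fails(attempt_outcomes, max_attempts=4):
    """attempt_outcomes: sequence of 'failed' or 'killed' attempt results.

    Returns True if the task has exhausted its allowed attempts.
    Killed attempts don't count, since they weren't the task's fault.
    """
    failures = sum(1 for outcome in attempt_outcomes if outcome == "failed")
    return failures >= max_attempts

# Three failures plus a killed speculative duplicate: the task may still retry.
print(task_fails(["failed", "killed", "failed", "failed"]))            # -> False
# Four genuine failures: the task (and by default the whole job) fails.
print(task_fails(["failed", "failed", "killed", "failed", "failed"]))  # -> True
```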
Tasktracker Failure:
Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or running very
slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The
jobtracker will notice a tasktracker that has stopped sending heartbeats(if it hasn’t received one
for 10 minutes, configured via the mapred.task tracker.expiry.interval property, in milliseconds)
and remove it from its pool of tasktrackers to schedule tasks on. The jobtracker arranges for map
tasks that were run and completed successfully on that tasktracker to be rerun if they belong to
incomplete jobs, since their intermediate output residing on the failed tasktracker’s local
filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. If
more than four tasks from the same job fail on a particular tasktracker (set by
mapred.max.tracker.failures), then the jobtracker records this as a fault. A tasktracker is
blacklisted if the number of faults is over some minimum threshold (four, set by
mapred.max.tracker.blacklists) and is significantly higher than the average number of faults for
tasktrackers in the cluster. Blacklisted tasktrackers are not assigned tasks, but they
continue to communicate with the jobtracker. Faults expire over time (at the rate of one per day),
so tasktrackers get the chance to run jobs again simply by leaving them running. Alternatively, if
there is an underlying fault that can be fixed (by replacing hardware, for example), the
tasktracker will be removed from the jobtracker's blacklist after it restarts and rejoins the cluster.
Jobtracker Failure:
Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for dealing
with failure of the jobtracker—it is a single point of failure—so in this case the job fails.
However, this failure mode has a low chance of occurring, since the chance of a particular
machine failing is low. The good news is that the situation is improved in YARN, since one of its
design goals is to eliminate single points of failure in MapReduce. After restarting a jobtracker,
any jobs that were running at the time it was stopped will need to be re-submitted. In YARN, if
the application master fails, the client will experience a timeout when it issues a status update, at
which point the client will go back to the resource manager to ask for the new application
master's address.
Node Manager Failure:
If a node manager fails, it will stop sending heartbeats to the resource manager, and the node
manager will be removed from the resource manager's pool of available nodes. The property
yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000 (10
minutes), determines the minimum time the resource manager waits before considering a node
manager that has sent no heartbeat in that time as failed. Any task or application master running
on the failed node manager will be recovered using the mechanisms described in the previous
two sections. Node managers may be blacklisted if the number of failures for the application is
high. Blacklisting is done by the application master, and for MapReduce the application master
will try to reschedule tasks on different nodes if more than three tasks fail on a node manager.
The threshold may be set with mapreduce.job.maxtaskfailures.per.tracker.
Resource Manager Failure:
Failure of the resource manager is serious, since without it neither jobs nor task containers can be
launched. The resource manager was designed from the outset to be able to recover from crashes,
by using a checkpointing mechanism to save its state to persistent storage, although at the time of
writing the latest release did not have a complete implementation. After a crash, a new resource
manager instance is brought up (by an administrator) and it recovers from the saved state. The
state consists of the node managers in the system as well as the running applications. (Note that
tasks are not part of the resource manager's state, since they are managed by the application
master. Thus the amount of state to be stored is much more manageable than that of the
jobtracker.) The storage used by the resource manager is configurable via the
yarn.resourcemanager.store.class property. The default is
org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore, which keeps the store in
memory and is therefore not highly available. However, there is a ZooKeeper-based store in the
works that will support reliable recovery from resource manager failures in the future.
7 b) What are the different job schedulers available in later versions of Hadoop? Explain
the merits and demerits of each. (Jun/Jul 2015)(10 Marks)
Job Scheduling:
Early versions of Hadoop had a very simple approach to scheduling users' jobs: they ran in order
of submission, using a FIFO scheduler. Typically, each job would use the whole cluster, so jobs
had to wait their turn. Although a shared cluster offers great potential for offering large resources
to many users, the problem of sharing resources fairly between users requires a better scheduler.
Production jobs need to complete in a timely manner, while allowing users who are making
smaller ad hoc queries to get results back in a reasonable time. Later on, the ability to set a job's
priority was added, via the mapred.job.priority property or the setJobPriority() method on
JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW,
VERY_LOW). When the job scheduler is choosing the next job to run, it selects the one with the
highest priority. However, with the FIFO scheduler, priorities do not support preemption, so a
high-priority job can still be blocked by a long-running, low-priority job that started before the
high-priority job was scheduled. MapReduce in Hadoop comes with a choice of schedulers. The
default in MapReduce 1 is the original FIFO queue-based scheduler, and there are also multiuser
schedulers called the Fair Scheduler and the Capacity Scheduler. MapReduce 2 comes with the
Capacity Scheduler (the default) and the FIFO scheduler.
The Fair Scheduler:
The Fair Scheduler aims to give every user a fair share of the cluster capacity over time. If a
single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are
given to the jobs in such a way as to give each user a fair share of the cluster. A short job
belonging to one user will complete in a reasonable time even while another user's long job is
running, and the long job will still make progress. Jobs are placed in pools, and by default each
user gets their own pool. A user who submits more jobs than a second user will not get any more
cluster resources than the second, on average. It is also possible to define custom pools with
guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to
set weightings for each pool. The Fair Scheduler supports preemption, so if a pool has not
received its fair share for a certain period of time, then the scheduler will kill tasks in pools
running over capacity in order to give the slots to the pool running under capacity. The Fair
Scheduler is a "contrib" module. To enable it, place its JAR file on Hadoop's classpath, by
copying it from Hadoop's contrib/fairscheduler directory to the lib directory. Then set the
mapred.jobtracker.taskScheduler property to org.apache.hadoop.mapred.FairScheduler.
spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of
data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This
saves some time merging later on. Note that any map outputs that were compressed (by the map
task) have to be decompressed in memory in order to perform a merge on them. When all the
map outputs have been copied, the reduce task moves into the sort phase (which should properly
be called the merge phase, as the sorting was carried out on the map side), which merges the map
outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50
map outputs, and the merge factor was 10 (the default, controlled by the io.sort.factor property,
just like in the map's merge), then there would be 5 rounds. Each round would merge 10 files
into one, so at the end there would be five intermediate files. Rather than have a final round that
merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding
the reduce function in what is the last phase: the reduce phase. This final merge can come from a
mixture of in-memory and on-disk segments. The number of files merged in each round is
actually more subtle than this example suggests. The goal is to merge the minimum number of
files to get to the merge factor for the final round. So if there were 40 files, the merge would not
merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only
4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files and the
6 (as yet unmerged) files make a total of 10 files for the final round. The process is illustrated in
Figure 4.7. Note that this does not change the number of rounds; it's just an optimization to
minimize the amount of data that is written to disk, since the final round always merges directly
into the reduce.
Figure 4.7. Efficiently merging 40 file segments with a merge factor of 10.
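The first-round size in the 40-file example follows from a simple observation: each merge of f files reduces the file count by f-1, so to land on exactly f files for the final round, the first round merges ((n-1) mod (f-1)) + 1 files whenever that remainder is non-zero. The sketch below is an illustration of that arithmetic, not Hadoop's actual merge code.

```python
def first_round_merge_size(num_files, merge_factor):
    """Number of files merged in the first round so that the final round
    lands on exactly merge_factor files (each merge removes factor-1 files)."""
    remainder = (num_files - 1) % (merge_factor - 1)
    return remainder + 1 if remainder > 0 else merge_factor

print(first_round_merge_size(40, 10))   # -> 4, as in the example above
print(first_round_merge_size(28, 10))   # -> 10 (28 already divides evenly into rounds)
```

For 40 files: merging 4 leaves 37, and three full merges of 10 then leave exactly 10 files (the 4-way merge result plus 6 unmerged files) for the final round.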
During the reduce phase, the reduce function is invoked for each key in the sorted output. The
output of this phase is written directly to the output filesystem, typically HDFS. In the case of
HDFS, since the tasktracker node (or node manager) is also running a datanode, the first block
replica will be written to the local disk.
MODULE 5
checksum doesn't exactly match the original. This technique doesn't offer any way to fix the data, merely error detection. (And this is a reason for not using low-end hardware; in particular, be sure to use ECC memory.) Note that it is possible that it's the checksum that is corrupt, not the data, but this is very unlikely, since the checksum is much smaller than the data.

A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.
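To make the detection idea concrete, here is a minimal sketch using Python's standard zlib.crc32 (a different CRC-32 implementation than Hadoop's, but the same principle): flip a single bit in the stored data and the recomputed checksum no longer matches.

```python
import zlib

data = bytearray(b"some block of data stored on disk")
original_checksum = zlib.crc32(data)  # 32-bit integer checksum

# Simulate "bit rot": flip a single bit in the stored data.
data[5] ^= 0x01
assert zlib.crc32(data) != original_checksum  # corruption is detected

# Detection only: the checksum cannot tell us how to repair the data.
```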
Data Integrity in HDFS:
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.

Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum.
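The per-chunk scheme and its overhead can be sketched as follows. The 512-byte chunk size mirrors the io.bytes.per.checksum default; the helper name checksum_chunks is invented for illustration.

```python
import zlib

CHUNK = 512  # io.bytes.per.checksum default

def checksum_chunks(data: bytes):
    """Hypothetical helper: one CRC-32 per CHUNK-byte slice of the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

data = bytes(10 * 1024)           # 10 KB of data
sums = checksum_chunks(data)
print(len(sums))                  # 20 chunks
print(4 * len(sums) / len(data))  # 4 bytes per chunk: ~0.78% overhead, < 1%
```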
If it detects an error, the client receives a ChecksumException, a subclass of IOException, which it should handle in an application-specific manner, for example by retrying the operation.

When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.

Aside from block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to "bit rot" in the physical storage media. See "Datanode block scanner" for details on how to access the scanner reports.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica. The way this works is that if a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it doesn't direct clients to it or try to copy this replica to another datanode. It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted. It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using its open() method to read a file.
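The healing flow above can be illustrated with a toy single-process simulation. All names here (Namenode, read_with_healing, the replica map) are invented for the sketch; real HDFS performs these steps across processes and the network.

```python
import zlib

class Namenode:
    """Toy stand-in: tracks which datanodes hold a replica of one block."""
    def __init__(self, replicas):
        self.replicas = replicas          # datanode name -> block bytes
        self.corrupt = set()

    def report_corrupt(self, datanode):
        # Mark the replica corrupt and re-replicate from a good copy,
        # then delete the corrupt replica.
        self.corrupt.add(datanode)
        good = next(d for d in self.replicas if d not in self.corrupt)
        self.replicas["dn-new"] = self.replicas[good]
        del self.replicas[datanode]

def read_with_healing(namenode, expected_crc):
    for datanode, block in list(namenode.replicas.items()):
        if zlib.crc32(block) == expected_crc:
            return block                   # checksum verified: good read
        namenode.report_corrupt(datanode)  # report, then try another replica
    raise IOError("all replicas corrupt")

block = b"block contents"
crc = zlib.crc32(block)
nn = Namenode({"dn1": b"bit-rotted!!!", "dn2": block, "dn3": block})
assert read_with_healing(nn, crc) == block   # read succeeds from a good replica
assert "dn1" not in nn.replicas              # corrupt replica was deleted
assert "dn-new" in nn.replicas               # replication factor restored
```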
c) What are the main features of Pig Latin script? (Jun/Jul 2015) (5 Marks)
Pig Latin:
This section gives an informal description of the syntax and semantics of the Pig Latin programming language. It is not meant to offer a complete reference to the language, but there should be enough here for you to get a good understanding of Pig Latin's constructs.
Structure:
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:

grouped_records = GROUP records BY year;

The command to list the files in a Hadoop filesystem is another example of a statement:

ls /

(Pig Latin the language should not be confused with Pig Latin the language game, in which English words are transformed by moving the initial consonant sound to the end of the word and adding an "ay" sound: "pig" becomes "ig-pay," and "Hadoop" becomes "Adoop-hay." Pig Latin does not have a formal language definition as such, but there is a comprehensive guide to the language linked from the Pig website at http://pig.apache.org/.)

Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. The ls command, on the other hand, does not have to be terminated with a semicolon. As a general guideline, statements or commands for interactive use in Grunt do not need the terminating semicolon. This group includes the interactive Hadoop commands, as well as the diagnostic operators like DESCRIBE. It is never an error to add a terminating semicolon, so if in doubt, it is simplest to add one.

Statements that have to be terminated with a semicolon can be split across multiple lines for readability:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
Pig Latin has two forms of comments. Double hyphens are single-line comments. Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:

-- My program
DUMP A; -- What's in A?

C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
/*