Sunteți pe pagina 1din 29

По СУБД – „Обработка на данните в NoSQL DB Kasandra”

https://academy.datastax.com/planet-cassandra/what-is-nosql

What is NoSQL?
What is a NoSQL (Not Only SQL) Database?
A NoSQL database environment is, simply put, a non-relational and largely
distributed database system that enables rapid, ad-hoc organization and analysis of
extremely high-volume, disparate data types. NoSQL databases are sometimes
referred to as cloud databases, non-relational databases, Big Data databases and a
myriad of other terms and were developed in response to the sheer volume of data
being generated, stored and analyzed by modern users (user-generated data) and
their applications (machine-generated data).
In general, NoSQL databases have become the first alternative to relational
databases, with scalability, availability, and fault tolerance being key deciding
factors. They go well beyond the more widely understood legacy, relational
databases (such as Oracle, SQL Server and DB2 databases) in satisfying the needs
of today’s modern business applications. A very flexible and schema-less data
model, horizontal scalability, distributed architectures, and the use of languages and
interfaces that are “not only” SQL typically characterize this technology.
From a business standpoint, considering a NoSQL or ‘Big Data’ environment has
been shown to provide a clear competitive advantage in numerous industries. In the
‘age of data’, this is compelling information as a great saying about the importance of
data is summed up with the following “if your data isn’t growing then neither is your
business”.
Types of NoSQL Databases
There are four general types of NoSQL databases, each with their own specific
attributes:

 Graph database – Based on graph theory, these databases are designed for
data whose relations are well represented as a graph and has elements which
are interconnected, with an undetermined number of relations between them.
Examples include: Neo4j and Titan.
 Key-Value store – we start with this type of database because these are some
of the least complex NoSQL options. These databases are designed for
storing data in a schema-less way. In a key-value store, all of the data within
consists of an indexed key and a value, hence the name. Examples of this
type of database include:Cassandra, DyanmoDB, Azure Table Storage (ATS),
Riak, BerkeleyDB.
 Column store – (also known as wide-column stores) instead of storing data in
rows, these databases are designed for storing data tables as sections of
columns of data, rather than as rows of data. While this simple
description sounds like the inverse of a standard database, wide-column
stores offer very high performance and a highly scalable
architecture. Examples include: HBase, BigTable and HyperTable.
 Document database – expands on the basic idea of key-value stores where
“documents” contain more complex in that they contain data and each
document is assigned a unique key, which is used to retrieve the document.
These are designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data. Examples include:
MongoDB and CouchDB.

The following table lays out some of the key attributes that should be considered
when evaluating NoSQL databases.

Datamodel Performance Scalability Flexibility

Key-value store High High High

Column Store High High Moderate

Document Store High Variable (High) High

Graph Database Variable Variable High


Why NoSQL?
Advantages of NoSQL Databases over Relational Databases
The reasons for businesses to adopt a NoSQL database environment over a
relational database have almost everything to do with the following market drivers
and technical requirements.
When making the switch, consider checking out this roadmap relational database to
NoSQL database for a walkthrough of NoSQL education, migration and success.
The Growth of Big Data
Big Data is one of the key forces driving the growth and popularity of NoSQL for
business. The almost limitless array of data collection technologies ranging from
simple online actions to point of sale systems to GPS tools to smartphones and
tablets to sophisticated sensors – and many more – act as force multipliers for data
growth.
In fact, one of the first reasons to use NoSQL is because you have a Big Data
project to tackle. A Big Data project is normally typified by:

1. High data velocity – lots of data coming in very quickly, possibly from different
locations.
2. Data variety – storage of data that is structured, semi-structured and
unstructured.
3. Data volume – data that involves many terabytes or petabytes in size.
4. Data complexity – data that is stored and managed in different locations or
data centers.

Continuous Data Availability


In today’s marketplace, where the competition is just a click away, downtime can be
deadly to a company’s bottom line and reputation. Hardware failures can and will
occur, fortunately NoSQL database environments are built with a distributed
architecture so there are no single points of failure and there is built-in redundancy of
both function and data. If one or more database servers, or ‘nodes’ goes down, the
other nodes in the system are able to continue with operations without data loss,
thereby showing true fault tolerance. In this way, NoSQL database environments
are able to provide continuous availability whether in single locations, across data
centers and in the cloud. When deployed appropriately, NoSQL databases can
supply high performance at massive scale, which never go down. This is
immensely beneficial as any system updates or modifications can be made
without having to take the database offline. This fact alone draws the attention of
businesses that are serving customers who expect availability of applications and
where downtime equates to real dollars lost.
Real Location Independence
The term “location independence” means the ability to read and write to a database
regardless of where that I/O operation physically occurs and to have any write
functionality propagated out from that location, so that it’s available to users and
machines at other sites. Such functionality is very difficult to architect for
relational databases. Some techniques can be employed such as
master/slave architectures and database sharding can sometimes meet the need
for location independent read operations, but writing data everywhere is a different
matter, especially when those data volumes are high. Other scenarios where
location independence is an advantage are many and include servicing customers in
many different geographies and needing to keep data local at those sites for fast
access.
Modern Transactional Capabilities
The concept of transactions appears to be changing in the Internet age, and it’s been
demonstrated that ACID transactions are no longer a requirement in database driven
systems. At first blush, this assertion sounds extreme, as transactional integrity is a
characteristic of most every data system – especially those with information
requirements that demand accuracy and safety. However, what this refers to is not
the jeopardizing of data, but rather the new way modern applications ensure
transactional consistency across widely distributed systems. The “C” in ACID refers
to data Consistency in relational database management systems which is enforced
via foreign keys/referential integrity constraints. This type of consistency is not
utilized in progressive data management systems such as NoSQL databases
because there are no JOIN operations, as this would require more rigid enforcement
of consistency. Instead, the “Consistency” that concerns NoSQL databases is found
in the CAP theorem, which signifies the immediate or eventual consistency of data
across all nodes that participate in a distributed database. The data is still safe and
meets the AID portion of the RDBMS ACID definition, but its consistency is
maintained differently given the nature and architecture of the system.
Flexible Data Models
One of the major reasons businesses move to a NoSQL database system from a
relational database management system (RDBMS) is the more flexible data
model that’s found in most NoSQL databases. The relational data model is based on
defined relationships between tables, which themselves are defined by a determined
column structure, all of which are explicitly organized in a database schema – all
very strict and uniform. Problems begin to arise with the relational model
around scalability and performance when trying to manage the large data volumes
that are becoming a fact of life in a modern IT and business environment. A NoSQL
data model – often referred to as schema-less – can support many of these use
cases and others that don’t fit well into a RDBMS. A NoSQL database is able to
accept all types of data – structured, semi-structured, and unstructured – much more
easily than a relational database which rely on a predefined schema.
This characteristic of a relational database can be a hindrance on flexibility because
a predefined schema rigidly determines how the database and database data are
organized. Many of today’s business applications actually have the ability to enforce
rules on data usage themselves making a schema-less database platform a viable
option.
Finally, performance factors come into play with an RDBMS’ data model, especially
where “wide rows” are involved and update actions are many, which can have real
implications on performance. However, a NoSQL data model easily handles such
situations and delivers very fast performance for both read and write operations.
Better Architecture
Another reason to use a NoSQL database is because you need a more suitable
architecture for a particular application. It’s critical that organizations adopt a NoSQL
platform that allows them to keep their very high volume data in the context of their
applications. Some, but not all, NoSQL solutions provide modern architectures that
can tackle the type of applications that require high degrees of scale,
data distribution, and continuous availability. Data center support, and as is more
common, multiple data center support, should be a use case with which a NoSQL
environment complies. It’s not just what your big data needs look like today but also
out to greater time horizons that decisions should be made.
Analytics and Business Intelligence
A key strategic driver of implementing a NoSQL database environment is the ability
to mine the data that is being collected so as to derive insights that puts your
business at a competitive advantage. Extracting meaningful business intelligence
from very high volumes of data is a very difficult task to achieve with traditional
relational database systems. Modern NoSQL database systems not only provide
storage and management of business application data but also deliver
integrated data analytics that deliver instant understanding of complex data sets and
facilitate flexible decision-making.

Which NoSQL database should you use?


Choosing the Right NoSQL Database
To this point we have been able to detail many reasons for a business to adopt a
NoSQL, or non-relational database platform, and also covered different types of
NoSQL database options. Determining which NoSQL database to adopt is an
exercise that requires a business to a long hard look at itself and its applications –
which is always good. Understand clearly what your business goals are, what your
application(s) needs – and challenges – are today, and what those needs may be on
the horizon. Also, make sure that both the business and the technology goals are
aligned so that the platform decision made delivers for all key stakeholders. Then
work back to the NoSQL platform that will satisfy these needs and provide the most
solid foundation for you to build on.
Key considerations when choosing your NoSQL platform include:

 Workload diversity – Big Data comes in all shapes, colors and sizes. Rigid
schemas have no place here; instead you need a more flexible design. You
want your technology to fit your data, not the other way around. And you want
to be able to do more with all of that data – perform transactions in real-time,
run
analytics just as fast and find anything you want in an instant from oceans of
data, no matter what from that data may take.
 Scalability – With big data you want to be able to scale very rapidly and
elastically, whenever and wherever you want. This applies to all situations,
whether
scaling across multiple data centers and even to the cloud if needed.
 Performance – As has already been discussed, in an online world where
nanosecond delays can cost you sales, Big Data must move at extremely high
velocities no matter how much you scale or what workloads your database
must perform. Performance of your environment, namely your applications,
should be high on the list of requirements for deploying a NoSQL platform.
 Continuous Availability – Building off of the performance consideration, when
you rely on big data to feed your essential, revenue-generating 24/7
business applications, even high availability is not high enough. Your data
can never go down, therefore there should be no single point of failure in your
NoSQL environment, thus ensuring applications are always available.
 Manageability – Operational complexity of a NoSQL platform should be kept
at a minimum. Make sure that the administration and development required to
both maintain and maximize the benefits of moving to a NoSQL environment
are achievable.
 Cost – This is certainly a glaring reason for making the move to a NoSQL
platform as meeting even one of the considerations presented here with
relational database technology can cost become prohibitively expensive.
Deploying NoSQL properly allows for all of the benefits above while also
lowering operational costs.
 Strong Community – This is perhaps one of the more important factors to
keep in mind as you move to a NoSQL platform. Make sure there is a solid
and capable community around the technology, as this will provide an
invaluable resource for the individuals and teams that will be managing
the environment. Involvement on the part of the vendor should not
only include strong support and technical resource availability, but
also consistent outreach to the user base. Good local user groups
and meetups will provide many opportunities for communicating with
other individuals and teams that will provide great insight into how to
work best with the platform of choice.
https://towardsdatascience.com/top-5-reasons-to-use-the-apache-cassandra-database-
d541c6448557

Top 5 reasons to use the


Apache Cassandra
Database 1

Vladimir FedakFollow

Mar 2, 2018

Cassandra is an incredibly popular database that underpins


heavy-load applications like Facebook. Today we will list even
more reasons for adding Cassandra to your toolkit.

Aside from being a backbone for Facebook and Netflix,


Cassandra is a very scalable and resilient database that is easy to
master and simple to configure, providing neat solutions for
1
https://academy.datastax.com/planet-cassandra/what-is-nosql
quite complex problems. Event logging, metrics collection and
evaluation, monitoring the historical data — all of these tasks are
quite hard to accomplish correctly, given the variety of OS’s,
platforms, browsers and devices both startup products and
enterprise systems face in their daily operations.

We would like to highlight 5 important advantages of using


Cassandra:

1. Helps solve complicated tasks with ease


2. Has a short learning curve
3. Lowers admin overhead and costs for a DevOps engineer
4. Rapid writing and lightning-fast reading

This is what we mean by the benefits above.

Cassandra helps solve complicated


tasks with ease
Event logging, metrics collection, performing queries against
the historical data — all of these tasks can seem dull, yet they are
of utmost importance both for Big Data and DevOps workflows.
Configuring the centralized storage for logs can be quite a
daunting task, given the variety of the data and a plentitude of
its sources, which we mentioned above.

Building a centralized storage for logs and metrics and


retrieving historical information from this storage is a task
Cassandra deals with utmost ease. Once the table structure is
chosen and designed, the database works like a charm, easily
scaling at your demand.
Cassandra has a short learning curve
Cassandra operates CQL — Cassandra Query Language. It is
basically SQL, but stripped of the more advanced features.
While this is a downside of sorts, this is also a huge advantage,
as the tool is really able to perform exceptionally well using
quite a limited list of variables, commands and functions. Due
to this simplicity a Big Data Engineer can master Cassandra in
around 30 days, thus obviously shortening the time to market
drastically for your product.

Cassandra lowers admin overhead


and costs DevOps engineer
As we described earlier, event logging, metrics collection and
working with historical data are the obvious uses for Cassandra.
However, as your team is able to use the tool the fullest extent
(and this time will come quite soon, mind you) — they will
definitely find much more tasks that can be delegated to
Cassandra and performed excellently. Low admin costs make
this tool incredibly useful, as your team is able to concentrate
more on their core tasks like improving the product and its
features, instead of constant firefighting like deciphering the
logs and dealing with the issues.

Cassandra offers rapid writing and


lightning-fast reading
As this database was developed for Facebook, where millions of
reads and writes happen at each given second, it provides
incredible levels of performance. What is even more important,
these values are linear and scale nearly effortlessly. This means
after measuring the read/write performance values on one
server, you can simply calculate how many more servers you
should add to the cluster to reach the required performance
levels, and scale easily. In addition, slicing over partition rows
(done with so-called partition keys) ensures lightning-fast
response to queries.
Cassandra provides extreme
resilience and fault tolerance
As Cassandra is a masterless cluster, there is no Single Point of
Failure. The data hashes are being constantly replicated
throughout the cluster to ensure 100% service uptime regardless
of the temporary unavailability of up to 1/2 of the servers. This
is especially useful when performing rolling updates or some
maintenance on the clusters. Regardless if the server, the full
server rack or the whole data center goes out of connection, your
customers will not face the service downtime. Your app will
always be out there, just like Facebook and Netflix.

Final thoughts
We hope you now agree Cassandra database can be extremely
useful for a multitude of tasks. Even if the project requirements
do not allow using this tool as the main database, mastering it
will be hugely beneficial for any Big Data Engineer, who aims to
become a valuable specialist.
https://dzone.com/articles/nosql-cassandra-in-plain-english

NoSQL and Cassandra in Plain


English
Cassandra has been deployed at scale by companies like Netflix,
eBay, Spotify, and Apple. Come learn why it's been so successful
and also look at its drawbacks.

by

John Ryan

Jan. 31, 18 · Database Zone · Tutorial

Like (8)

Comment (0)

Save

Tweet
3,767 Views

Join the DZone community and get the full member experience.
JOIN FOR FREE

RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading
NoSQL Document Databases on 9 features to find out which is the best solution for
your next project.
Cassandra is a highly scalable distributed database originally built at
Facebook. It's now a top-level Apache open-source project with
widespread adoption and is deployed at scale by Netflix, eBay, Spotify,
and Apple.
Its main strengths are:

 Ease of use: Accessed via a subset of SQL (CQL), it supports select,


insert, update, and delete operations against a table and column
structure familiar to many database developers.
 Built for volume and velocity: A Netflix benchmark
test demonstrated around 2.8 billion operations per hour using a
mixed read/write workload. It managed an average write latency
of just 1.75 milliseconds and an average throughput of 200,000
writes per second against a total of 70 terabytes of data.
 Built for scalability on-premises or the cloud: While you can
build your own in-house cluster, Cassandra is well supported by
the Amazon Elastic Compute platform. The entire Netflix
benchmark ran on 285 nodes on the EC2 platform, costing around
$400 per hour — then had no cost once the operation completed.
 Transparent node balancing: Data is automatically sharded
(distributed) across nodes in the cluster for resilience. As demand
grows and new nodes are added, the system transparently
distributes the data to the new servers. On a cloud-based
deployment, this could potentially be automated, providing elastic
scalability.
 Highly fault tolerant: Many NoSQL databases (i.e. HBase) use a
"master" node for write operations with an optional fail-over
capability for recovery. Cassandra works in a "ring" topology with
no single point of failure. Writes are automatically replicated to
three (or more) nodes in a cluster, and data can be replicated to
another data center in real-time, providing built-in disaster
recovery.
 Hadoop integration: Cassandra works well with Hadoop and
HDFS and is accessible via Apache Spark, with support for multiple
languages including Java, Python, Ruby, and C++.
The diagram above illustrates a globally distributed system whereby
users gain millisecond performance on a cluster of machines at each
location, while results are automatically replicated globally. Using
Amazon EC2, if the Singapore-based system crashed, the workload could
automatically resume using London and New York-based servers at a
reduced speed.
This benefit was illustrated in April 2011 when a major US-East outage
on Amazon left many websites down, but Netflix operations running
Cassandra were unaffected because of cross-regional replication in
place.

Why Avoid Cassandra?


Like any specialized, highly tuned tool, Cassandra is built to solve a
specific problem. It has high volume and low latency read/write lookups
from a massive user base. It does, however, have significant limitations:

 No transaction support: An RDBMS like Oracle provides full read


consistency and transaction isolation, which is absent from nearly
all NoSQL databases. On Cassandra, every write operation is
immediately visible to all users, and this has massive implications
for many applications. Caveat emptor: buyer beware.
 Joins not supported: Cassandra is not a relational database, and
join operations are not supported. It does support simple parent-
child relationships using compound keys, but database design
must be focused on query access paths, using extensive data
denormalization to maximize performance.
 Not suitable for OLAP: Designed for fast primary key lookups, it's
not suitable for data warehouse-type queries, which aggregate
data over millions of rows. While it supports secondary indexes,
they're not recommended for high cardinality (unique) values, as
performance degrades rapidly. This compares poorly to a
traditional RDBMS, which often handles a mixed workload with
ease.
 Limited sort operations: Most read operations focus on a single
row, and there's no option to sort results based upon an arbitrary
column. It's possible to store and retrieve data in a predefined key
sequence, but again, this is inflexible compared to a typical RDBMS.

Cassandra and Eventual Consistency


An Oracle database supports immediate consistency, as there's only one
copy of the data. However, on a distributed database, the data is
replicated to multiple independently running nodes, and to maximize
throughput, we may relax the consistency rule and settle for eventual
consistency.
The diagram below illustrates a situation where an update is applied to
one node and eventually written to the two other replicas. In the interim
(maybe a few microseconds), a reader can potentially read the old
(stale) copy of the data. Eventually, all three replicas will have consistent
results — hence the name.
While not the best solution in every case, the Netflix
benchmark demonstrates the benefits of relaxing the consistency rule, as
read throughput increased from 600,000 to nearly one million reads per
second. This clearly illustrates the trade-off of performance against
consistency and resilience. If you don't need the guaranteed absolute
latest entry, it's a significant performance gain.
Cassandra is also highly flexible and supports 11 levels of consistency.
The most commonly used are:

 ONE — eventual consistency: For maximum performance on


reads, this returns the first available version from the three (or
more) replicas. On writes, the operation is complete once written
to one node, which provides write throughput at the expense of
resilience.
 QUORUM — mid-ground: A balance of performance and
resilience, this marks a write operation complete as soon as a
majority (two of the three) replicas are written and combines data
from multiple replicas on read, taking the latest data available.
 ALL — maximum resilience: The slowest option, this favors
resilience over performance, as a write must be written to all
replicas before it's marked complete.

Note: If an operation is written using ALL consistency, it can be safely


read using ONE, which maximizes read throughput overwrites. This
provides huge flexibility (and an opportunity for hard-to-identify bugs),
but it's a powerful option to have available.

Warning: Transactions Not Supported


Although Cassandra has flexible support for consistency, this must not
be confused with "read consistency" or support for transactions — a
feature often taken for granted on traditional database platforms.
Effectively. on Cassandra, every write operation is implicitly committed
and immediately visible to all readers without delay.
Take a simple example: an application that transfers money from a
savings account to a current account would typically involve two
operations committed in a single transaction (one to deduct from the
savings account and another to credit the current account). In Cassandra,
another user could read the balance of both accounts midway through
with potentially catastrophic results.
Cassandra does support a lightweight test and set operation to set a
value depending upon a value, and the ability to batch load multiple
inserts, but this provides, at most, a rather basic level of lightweight
transaction support.
In conclusion, if you have a web or sensor streaming-based application
with a massive or highly unpredictable workload needing very low
latency read/write operations using mainly unique keys, Cassandra is an
excellent option. If you can deploy a cloud-based solution and need
potentially massive scalability, all the better.
If however, you have a data warehouse workload summarizing billions
of rows, a database like Vertica might be a better option. Likewise, if you
need full transactional support, you should look elsewhere — perhaps at
VoltDB.
There are over 150 NoSQL databases available, many of which support
massive throughput and scalability. However, where Cassandra is
unique is in providing tuneable consistency along with a high level of
resilience and native replication across a potentially globally distributed
system.
Provided you can work with the lack of transaction support and eventual
consistency, it's a potentially powerful tool to deploy.
https://n2ws.com/blog/backup-and-recovery/nosql-database-mongodb-cassandra

Which NoSQL
Database: MongoDB
or Cassandra?
 September 4
 AWS Cloud, Backup and Recovery

These days, NoSQL databases are


used in some of the most data-heavy applications available, such as cloud-
based solutions, e-commerce websites, and airlines’ management
systems. There are plenty of NoSQL database options available in the
market, including Cassandra, MongoDB, Hbase, CouchDB, Riak, Redis,
and others. However, Cassandra and MongoDB are the most popular and
are ranked among the top 10 databases according to the DB engine
database popularity metric.

The company then known as 10gen began developing MongoDB in 2007.


The final version of MongoDB was released in 2009. Cassandra, however,
was developed by Avinash Lakshman and Prashant Malik, two developers
who work at Facebook. They originally developed this NoSQL database for
Facebook as an inbox search feature. In July 2008, Facebook released
Cassandra as an open-source project.

Feature-set Comparison
Although both Cassandra and MongoDB are types of NoSQL databases,
they vary in terms of implementation and feature set. If you are looking for
a NoSQL database for your application, you should consider the below
comparison between the two.

Data Model
MongoDB uses a document-oriented data model. It stores data
in BSON (Binary JSON) format documents which provides the flexibility to
combine and insert multi-structured data without declaring the schema.
These BSON documents are equivalent to a record in the RDBMS, where a
table can be equated with the collection of documents.

Cassandra, on the other hand, is a columnar NoSQL database, storing


data in columns instead of rows. A column in a Cassandra database
contains three fields: the name of the column or key, the value against the
key, and a time stamp.

Figure 1 – Structure of a Column in Cassandra Database

Data Replication and High Availability


MongoDB works on the master-slave model. Read and write operations are
performed on the primary replica. In case the master is unavailable, one of
the slaves will be auto-elected as the new master. The auto-election
process takes approximately 10-40 seconds to go into effect. During the
election process, write operations are not possible on the replica sets.
Unlike MongoDB, Cassandra avoids using the master or primary replica
concept. Failure of a single node does not cause any downtime because, in
this architecture, the request is handled by the remaining nodes. As a
result, Cassandra provides 100% uptime.

Support for Queries


As MongoDB is based on JSON-like documents, it offers a variety of
choices for effective querying and data fetching. Options include the
existing Mongo shell, Compass, Python, Java, Node.js, PHP, Perl, and
Ruby. One needs to have knowledge of any of these platforms to query
data objects. Cassandra uses Cassandra Query Language (CQL) for
queries and data fetching. Because this language is very similar to
Structured Query Language, it’s easy for SQL administers to learn.

Scalable Write Requests


In MongoDB, all the write operations are performed on the master node.
The primary node records all the modifications in the data sets in its
operational logs. These are used to update the secondary data sets. As we
can have only one master at a time, the write scalability is greatly
restricted. It can, however, be enhanced by introducing sharding
techniques.

Cassandra’s write scalability is more efficient because there is no single


master. The more the number of nodes in the cluster, the more scalability,
and write-handling capacity of the database.

Use of Secondary Indexes


A secondary index is an efficient and quick way to access records in a
database using information other than the primary key. MongoDB supports
secondary indexes to enhance the search speed of the documents
requested in the query. Cassandra, on the other hand, does not completely
support secondary indexes. Queries will only work if you are using primary
keys to search the data.

Licensing Models
Both Cassandra and MongoDB databases offer free open-source versions.
MongoDB provides various versions, such as MongoDB Enterprise
Advanced, MongoDB Enterprise for OEM, and MongoDB Atlas, a cloud-
based solution with “free,” “essential,” and “professional tiers.”

In the Cassandra space, DataStax is the leader, enhancing the open-


source version of Cassandra by making it enterprise-
ready. DataStax provides a subscription-based model and offers
Cassandra at three subscription levels: “Basic,” “Enterprise,” and “DataStax
Managed Cloud.” Additionally, both MongoDB and Cassandra are found in
the AWS and Azure marketplaces and can be hosted on public clouds.

Summary
Making the right NoSQL database choice will depend on the features your
application requires. From the perspective of availability and scalability,
Cassandra has the upper hand. However, MongoDB offers secondary
indexes for quick searches. MongoDB also is ideal for efficiently storing
unstructured data and running the high-speed logging and caching
operations required by real-time analytics. Cassandra, on the other hand, is
easy to set up and maintain while requiring no downtime in the event of a
node failure. Additionally, Cassandra can be used for high-growth database
applications. Whichever database you choose, NoSQL databases
guarantee faster responses to your big data set queries.
https://en.wikipedia.org/wiki/Apache_Cassandra

Apache Cassandra
From Wikipedia, the free encyclopedia

Jump to navigationJump to search

Apache Cassandra

Original author(s) Avinash Lakshman, Prashant Malik

Developer(s) Apache Software Foundation

Initial release July 2008; 10 years ago

Stable release 3.11.3 / August 1, 2018

 git.apache.org/cassandra.git
Repository

Written in Java

Operating system Cross-platform

Available in English

Type NoSQL Database, data store

License Apache License 2.0

Website cassandra.apache.org
Apache Cassandra is a free and open-source, distributed, wide column
store, NoSQL database management system designed to handle large amounts of data across
many commodity servers, providing high availability with no single point of failure. Cassandra
offers robust support for clusters spanning multiple datacenters,[1] with asynchronous masterless
replication allowing low latency operations for all clients.

Contents

 1History
o 1.1Releases
 2Main features
o 2.1Cassandra Query Language
o 2.2Known issues
o 2.3Data model
 3Management and monitoring
 4Notable applications
 5See also
 6References
 7Bibliography
 8External links

History[edit]
Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik initially
developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook
released Cassandra as an open-source project on Google code in July 2008.[2] In March 2009 it
became an Apache Incubator project.[3] On February 17, 2010 it graduated to a top-level project.[4]
Facebook developers named their database after the Trojan mythological prophet Cassandra,
with classical allusions to a curse on an oracle.[5]
Releases[edit]
Releases after graduation include

 0.6, released Apr 12 2010, added support for integrated caching, and Apache
Hadoop MapReduce[6]
 0.7, released Jan 08 2011, added secondary indexes and online schema changes[7]
 0.8, released Jun 2 2011, added the Cassandra Query Language (CQL), self-tuning
memtables, and support for zero-downtime upgrades[8]
 1.0, released Oct 17 2011, added integrated compression, leveled compaction, and
improved read-performance[9]
 1.1, released Apr 23 2012, added self-tuning caches, row-level isolation, and support for
mixed ssd/spinning disk deployments[10]
 1.2, released Jan 2 2013, added clustering across virtual nodes, inter-node communication,
atomic batches, and request tracing[11]
 2.0, released September 4, 2013; 5 years ago, added lightweight transactions (based on
the Paxos consensus protocol), triggers, improved compactions
 2.1 released Sep 10 2014 [12]
 2.2 released July 20, 2015
 3.0 released November 11, 2015
 3.1 through 3.10 releases were monthly releases using a tick-tock-like release model, with
even-numbered releases providing both new features and bug fixes while odd-numbered
releases will include bug fixes only.[13]
 3.11 released June 23, 2017 as a stable 3.11 release series and bug fix from the last tick-
tock feature release.

Version Original release date Latest version Release date Status[14]

0.6 2010-04-12 0.6.13 2011-04-18 No longer supported

0.7 2011-01-10 0.7.10 2011-10-31 No longer supported

0.8 2011-06-03 0.8.10 2012-02-13 No longer supported

1.0 2011-10-18 1.0.12 2012-10-04 No longer supported

1.1 2012-04-24 1.1.12 2013-05-27 No longer supported

1.2 2013-01-02 1.2.19 2014-09-18 No longer supported

2.0 2013-09-03 2.0.17 2015-09-21 No longer supported

2.1 2014-09-16 2.1.20 2018-02-16 Still supported, critical fixes only

2.2 2015-07-20 2.2.13 2018-08-01 Still supported

3.0 2015-11-09 3.0.17 2018-08-01 Still supported

3.11 2017-06-23 3.11.3 2018-08-01 Latest release

Legend:

Old version

Older version, still supported

Latest version

Latest preview version


Main features[edit]
Distributed
Every node in the cluster has the same role. There is no single point of failure. Data is
distributed across the cluster (so each node contains different data), but there is no
master as every node can service any request.
Supports replication and multi data center replication
Replication strategies are configurable.[15] Cassandra is designed as a distributed system,
for deployment of large numbers of nodes across multiple data centers. Key features of
Cassandra’s distributed architecture are specifically tailored for multiple-data center
deployment, for redundancy, for failover and disaster recovery.
Scalability
Designed to have read and write throughput both increase linearly as new machines are
added, with the aim of no downtime or interruption to applications.
Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across
multiple data centers is supported. Failed nodes can be replaced with no downtime.
Tunable consistency
Cassandra is typically classified as an AP system, meaning that availability and partition
tolerance are generally considered to be more important than consistency in
Cassandra[16], Writes and reads offer a tunable level of consistency, all the way from
"writes never fail" to "block for all replicas to be readable", with the quorum level in the
middle.[17]
MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also
for Apache Pig and Apache Hive.[18]
Query language
Cassandra introduced the Cassandra Query Language (CQL). CQL is a simple interface
for accessing Cassandra, as an alternative to the traditional Structured Query
Language (SQL).
Cassandra Query Language[edit]
Cassandra introduced the Cassandra Query Language (CQL). CQL
is a simple interface for accessing Cassandra, as an alternative to
the traditional Structured Query Language (SQL). CQL adds an
abstraction layer that hides implementation details of this structure
and provides native syntaxes for collections and other common
encodings. Language drivers are available for Java (JDBC), Python
(DBAPI2), Node.JS (Helenus), Go (gocql) and C++.[19]
Below an example of keyspace creation, including a column family
in CQL 3.0:[20]

CREATE KEYSPACE MyKeySpace


WITH REPLICATION = { 'class' : 'SimpleStrategy',
'replication_factor' : 3 };

USE MyKeySpace;

CREATE COLUMNFAMILY MyColumns (id text, Last text,


First text, PRIMARY KEY(id));
INSERT INTO MyColumns (id, Last, First) VALUES ('1',
'Doe', 'John');

SELECT * FROM MyColumns;

Which gives:

id | first | last
----+-------+------
1 | John | Doe

(1 rows)

Known issues[edit]
Up to Cassandra 1.0, Cassandra was not row level
consistent,[21] meaning that inserts and updates into the table that
affect the same row that are processed at approximately the same
time may affect the non-key columns in inconsistent ways. One
update may affect one column while another affects the other,
resulting in sets of values within the row that were never specified
or intended. Cassandra 1.1 solved this issue by introducing row-
level isolation.[22]
Data model[edit]
Cassandra is wide column store, and, as such, essentially a hybrid
between a key-value and a tabular database management system.
Its data model is a partitioned row store with tunable
consistency.[17] Rows are organized into tables; the first component
of a table's primary key is the partition key; within a partition, rows
are clustered by the remaining columns of the key.[23] Other columns
may be indexed separately from the primary key.[24]
Tables may be created, dropped, and altered at run-time without
blocking updates and queries.[25]
Cassandra cannot do joins or subqueries. Rather, Cassandra
emphasizes denormalization through features like collections.[26]
A column family (called "table" since CQL 3) resembles a table in an
RDBMS. Column families contain rows and columns. Each row is
uniquely identified by a row key. Each row has multiple columns,
each of which has a name, value, and a timestamp. Unlike a table
in an RDBMS, different rows in the same column family do not have
to share the same set of columns, and a column may be added to
one or multiple rows at any time.[27]
Each key in Cassandra corresponds to a value which is an object.
Each key has values as columns, and columns are grouped
together into sets called column families. Thus, each key identifies a
row of a variable number of elements. These column families could
be considered then as tables. A table in Cassandra is a distributed
multi dimensional map indexed by a key. Furthermore, applications
can specify the sort order of columns within a Super Column or
Simple Column family.

Management and monitoring[edit]


Cassandra is a Java-based system that can be managed and
monitored via Java Management Extensions (JMX). The JMX-
compliant nodetool utility, for instance, can be used to manage a
Cassandra cluster (adding nodes to a ring, draining nodes,
decommissioning nodes, and so on).[28] Nodetool also offers a
number of commands to return Cassandra metrics pertaining to disk
usage, latency, compaction, garbage collection, and more.[29]
Since Cassandra 2.0.2 in 2013, measures of several metrics are
produced via the Dropwizard metrics framework[30], and may be
queried via JMX using tools such as JConsole or passed to external
monitoring systems via Dropwizard-compatible reporter plugins[31].

S-ar putea să vă placă și