
Data processing can involve several processes, including:

Validation - Ensuring that the supplied data is "clean, correct and useful."

Sorting - "Arranging items in some sequence and/or in different sets."

Summarization - Reducing detailed data to its main points.

Aggregation - Combining multiple pieces of data.

Analysis - The "collection, organization, analysis, interpretation and presentation of data."

Reporting - Listing detail or summary data or computed information.

Classification - Separating data into various categories.
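The processes listed above can be illustrated with a small sketch in plain Python. The sample records, the field names (region, amount) and the 100.0 threshold are assumptions made up for this illustration; they do not come from any particular system.

```python
from collections import defaultdict

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": None},   # invalid: missing amount
    {"region": "south", "amount": 40.0},
    {"region": "east", "amount": 300.0},
]

def validate(rows):
    """Validation: keep only rows that carry a numeric amount."""
    return [r for r in rows if isinstance(r["amount"], (int, float))]

def sort_rows(rows):
    """Sorting: arrange rows in ascending order of amount."""
    return sorted(rows, key=lambda r: r["amount"])

def aggregate(rows):
    """Aggregation: combine the amounts of all rows sharing a region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def summarize(rows):
    """Summarization: reduce the detailed rows to a few headline figures."""
    amounts = [r["amount"] for r in rows]
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": sum(amounts) / len(amounts),
    }

def classify(rows, threshold=100.0):
    """Classification: separate rows into categories by amount."""
    classes = {"large": [], "small": []}
    for r in rows:
        label = "large" if r["amount"] >= threshold else "small"
        classes[label].append(r)
    return classes

clean = validate(records)
ordered = sort_rows(clean)
totals = aggregate(clean)
summary = summarize(clean)
classes = classify(clean)

# Reporting: print a short summary listing.
print(summary)
print(totals)
```

Validation runs first here so that the later steps never see the record with the missing amount; a real pipeline might order the steps differently.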

PIG
Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces
sequences of Map-Reduce programs, for which large-scale parallel implementations already
exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual
language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple,
"embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple
interrelated data transformations are explicitly encoded as data flow sequences, making
them easy to write, understand, and maintain.

Optimization opportunities. The way in which tasks are encoded permits the
system to optimize their execution automatically, allowing the user to focus on semantics
rather than efficiency.

Extensibility. Users can create their own functions to do special-purpose processing.
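Since Pig compiles programs into sequences of Map-Reduce jobs, it may help to see the shape of one such job. The following is a minimal sketch in plain Python, not Pig Latin and not the Hadoop API; the function names (map_phase, shuffle, reduce_phase) and the input lines are assumptions chosen for illustration. It counts words as a data flow of three steps.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input; in a real job this would be read from a large data set.
lines = [
    "pig latin is a data flow language",
    "pig runs on hadoop",
    "hadoop runs map reduce",
]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)

# Each line can be mapped independently of the others, and each key can be
# reduced independently, which is what makes the job easy to parallelize.
mapped = chain.from_iterable(map_phase(line) for line in lines)
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```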

http://pig.apache.org/

ECL
https://hpccsystems.com/download/documentation/learning-ecl

ECL (Enterprise Control Language) is a programming language designed and used with HPCC
Systems. It is specifically designed for data management and query processing. ECL code is
written using the ECL IDE programming development tool.
ECL is a transparent and implicitly parallel programming language which is both powerful and
flexible. It is optimized for data-intensive operations, declarative, non-procedural and
dataflow oriented. ECL uses intuitive syntax which is modular, reusable, extensible and
highly productive. It combines data representation and algorithm implementation.

A separate, earlier ECL programming language and system was an extensible high-level
programming language and development environment developed at Harvard University in the 1970s. The
name 'ECL' stood for 'Extensible Computer Language' or 'EClectic Language'. Some
publications used the name 'ECL' for the entire system and 'EL/1' (Extensible Language) for
the language itself.
ECL was an interactive system where programs were represented within the system; there was
a compatible compiler and interpreter. It had an ALGOL-like syntax and an extensible data
type system, with data types as first-class citizens. Data objects were values, not references,
and the calling conventions gave a choice between call by value and call by reference for each
argument.
ECL was primarily used for research and teaching in programming language
design, programming methodology (in particular programming by transformational refinement),
and programming environments at Harvard, though it was said to be used at some government
agencies as well. It was first implemented on the PDP-10, with a later (interpreted-only)
implementation on the PDP-11 written in BLISS-11 and cross-compiled on the PDP-10.

HPCC
http://www.ianux.com/Solucion%20HPCC%20(Cluster%20de%20Alto%20Rendimiento)
This type of technology allows a set of computers to work in parallel, dividing the work
into several smaller tasks that can be carried out in parallel.
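The idea of dividing one job into smaller tasks that run in parallel can be sketched in plain Python. This is only an illustration under stated assumptions: worker threads on one machine stand in for cluster nodes, and the helper names (split, partial_sum_of_squares) are hypothetical. A real cluster application would more likely use a library such as MPI.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Divide the work into at most n_chunks pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum_of_squares(chunk):
    """The small task each worker runs on its own piece of the data."""
    return sum(x * x for x in chunk)

data = list(range(1, 101))
chunks = split(data, 4)

# Each chunk is handed to a separate worker; the partial results are
# then combined into the final answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, chunks))

total = sum(partials)
print(total)
```

The combined result matches what a single sequential pass over the data would produce, because the partial sums are independent of one another.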
The HPCC Solution from IANUX Soluciones consists of:

Linux operating system (various distributions)
The distribution is chosen according to compatibility and/or certification with different
hardware or software components, as well as the customer's own needs.

Development utilities
Software for development in different programming languages is included, together with
debuggers and standard programming tools.

Parallel development libraries
Different MPI and PVM libraries are included to make it easier to create applications
optimized for execution on the cluster.

Administration and monitoring
Tools that make administration easier. They collect usage statistics and display them
dynamically in real time through a web interface.

Queue management software
Distributes the job load according to the customer's needs.

Node reinstallation system
An essential tool for easing administration in clusters made up of a large number of nodes.

Why choose this solution?

High performance. This solution lets you get the most out of your jobs through parallel
processing. It is independent of the hardware used and can adapt to any type of servers,
network devices, configurations, etc.

Easy scalability. You can increase the number of compute nodes whenever you want, without
having to pay for additional software licenses.

Great flexibility. The solution is based on free software, which lets you modify it or add
new functionality as needed.

SQOOP AND FLUME

http://www.guru99.com/introduction-to-flume-and-sqoop.html
