Provision and Manage a TensorFlow Cluster with OpenStack
[Figure: overlapping domains of Data Science, Fast Data, SW Systems, and Hardware]
* By 6/28/2016
A more detailed summary and comparison: https://github.com/zer0n/deepframeworks
9 of 38 Copyright 2016 Dell Inc.
Tensorflow
Flexible: construct the compute graph, define custom operators, and use multiple language bindings
Auto-differentiation for complex algorithms
Portable: runs on a PC or in the cloud, on different hardware such as CPUs, GPUs, or other accelerator cards
Connects research and production by providing a training -> serving model
Distributed training and serving
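The auto-differentiation bullet can be illustrated with a minimal forward-mode sketch in plain Python. This is not TensorFlow's actual mechanism (TensorFlow uses reverse-mode differentiation over a compute graph); it only shows the core idea that derivatives of composite expressions can be computed automatically from the rules for each operator:

```python
# Minimal forward-mode auto-differentiation sketch (illustrative only;
# TensorFlow itself uses reverse-mode differentiation over a compute graph).

class Dual:
    """A number paired with its derivative w.r.t. one input variable."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def grad(f, x):
    """Derivative of f at x, computed by seeding dx/dx = 1."""
    return f(Dual(x, 1.0)).deriv


# d/dx (x^2 + 3x) = 2x + 3, so at x = 2 the gradient is 7.
print(grad(lambda x: x * x + 3 * x, 2.0))  # -> 7.0
```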
[Figure: training -> serving workflow: Cluster 1 learns the model and continues training; the updated model is exported to Cluster 2 for online serving, where clients consume it]
Distributed (since v0.8)
[Figure: distributed TensorFlow architecture: a cluster spec defines the cluster; worker tasks and parameter server tasks exchange parameters over gRPC; the master (session) drives the worker services]
* Image from: Large Scale Distributed Deep Networks
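A cluster spec is just a mapping from job names to ordered task addresses. A minimal sketch as a plain dict (the hostnames and ports are illustrative placeholders; the real API is TensorFlow's cluster spec, not reproduced here so the example stays self-contained):

```python
# Sketch of a TensorFlow-style cluster spec: job name -> ordered task addresses.
# Hostnames and ports below are illustrative placeholders, not real endpoints.
cluster_spec = {
    "ps": [            # parameter server tasks hold the shared model parameters
        "ps0.example.com:2222",
    ],
    "worker": [        # worker tasks compute gradients and talk to ps over gRPC
        "worker0.example.com:2222",
        "worker1.example.com:2222",
    ],
}

def task_address(spec, job, index):
    """Resolve a (job, task index) pair, e.g. ('worker', 1), to its address."""
    return spec[job][index]

print(task_address(cluster_spec, "worker", 1))  # -> worker1.example.com:2222
```

Each task is identified by its job name plus index, which is how the master addresses individual workers and parameter servers.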
Deep Learning & Cloud
But in a production environment there are thousands of servers that need to be coordinated, in complex environments.
We need:
Scalability?
Can it support heterogeneous environments?
Is it flexible enough to extend with new features?
Can it hide the plumbing from system engineers and data scientists?
tf_worker:v1
tf_ps:v1
tf_inception:v1
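One way to run the images above on Kubernetes is one pod per cluster task. A minimal sketch that builds pod manifests as plain dicts (the image names come from the slide; the pod names and port 2222 are illustrative assumptions, and a real deployment would use a controller and a service per task rather than bare pods):

```python
# Build a Kubernetes pod manifest (as a plain dict) per TensorFlow cluster task.
# Image names (tf_worker:v1, tf_ps:v1) come from the slide; pod names and
# port 2222 are illustrative assumptions.

def tf_pod_manifest(job, index, image, port=2222):
    name = f"{job}-{index}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"job": job, "task": str(index)}},
        "spec": {
            "containers": [{
                "name": "tensorflow",
                "image": image,
                "ports": [{"containerPort": port}],
            }],
        },
    }

# Two workers and one parameter server, mirroring a small cluster spec.
pods = (
    [tf_pod_manifest("worker", i, "tf_worker:v1") for i in range(2)]
    + [tf_pod_manifest("ps", 0, "tf_ps:v1")]
)
print([p["metadata"]["name"] for p in pods])  # -> ['worker-0', 'worker-1', 'ps-0']
```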
Pros:
TensorFlow has good support on Kubernetes, which is well provisioned by Magnum
Ready to use: no need to implement extra plugins or drivers
Benefits from Magnum features such as tenant management, volumes, scaling, and bare-metal support, which fully leverage the OpenStack ecosystem
Extra benefit: integration with Mesos, which can provide more intelligent scheduling features

Cons:
OpenStack doesn't have direct control of or integration with TensorFlow, which is managed by the upper-level container platform
No one-step deployment
TensorFlow deep learning is not made a first-class OpenStack citizen
Sample: simply apply the updated spec. Verify: check the progress.
Sample: increase the workload. Verify: Heapster (InfluxDB) collects the metrics.
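Applying an updated spec amounts to raising a desired replica count and letting the platform converge on it. A toy sketch of that reconcile pattern (all names here are illustrative; on Kubernetes the deployment controller performs this loop after the spec is applied):

```python
# Toy reconcile loop: move actual replicas toward the desired count in a spec.
# Purely illustrative; on Kubernetes the deployment controller does this work.

def apply_spec(state, updated_spec):
    """Record the new desired replica count from an updated spec."""
    state["desired"] = updated_spec["replicas"]
    return state

def reconcile(state):
    """One convergence pass: add or remove replicas until actual == desired."""
    while state["actual"] < state["desired"]:
        state["actual"] += 1   # start a new replica
    while state["actual"] > state["desired"]:
        state["actual"] -= 1   # tear one down
    return state

state = {"desired": 2, "actual": 2}
apply_spec(state, {"replicas": 5})      # scale the worker job up
reconcile(state)
print(state["actual"])  # -> 5
```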
Auth. components
Data Access Layer (DAL)
Secure Storage Access Layer
Provisioning Engine
Vendor plugins
Elastic Data Processing (EDP)
REST API
Python Sahara Client
Sahara pages
Magnum Docker
But:
Traditional OpenStack deployment is based on virtualization;
Not all hardware functions support VT, even when some claim they do.
Key components:
ironic-api
ironic-conductor
ironic-python-agent
nova driver (Nova's Ironic virt driver)
Magnum and Sahara are able to work in a hybrid environment containing bare metal and virtual machines:
* https://www.openstack.org/summit/tokyo-2015/videos/presentation/delivering-hybrid-bare-metal-and-virtual-infrastructure-using-ironic-and-openstack
Notified to scale
* Ref: http://docs.openstack.org/developer/nova/filter_scheduler.html
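Nova's filter scheduler referenced above picks a host in two passes: filters remove hosts that cannot fit the request, then weighers rank the survivors. A minimal sketch of that pattern (the host data and the single RAM filter/weigher are illustrative assumptions, not real Nova internals):

```python
# Sketch of Nova's filter-scheduler pattern: filter, then weigh, then pick.
# Hosts and the RAM-only criteria are illustrative, not real Nova internals.

HOSTS = [
    {"name": "node1", "free_ram_mb": 2048},
    {"name": "node2", "free_ram_mb": 8192},
    {"name": "node3", "free_ram_mb": 4096},
]

def ram_filter(host, request):
    """Pass only hosts with enough free RAM for the requested instance."""
    return host["free_ram_mb"] >= request["ram_mb"]

def ram_weigher(host):
    """Prefer the host with the most free RAM (spreading behavior)."""
    return host["free_ram_mb"]

def schedule(hosts, request):
    candidates = [h for h in hosts if ram_filter(h, request)]
    if not candidates:
        raise RuntimeError("No valid host was found")
    return max(candidates, key=ram_weigher)

print(schedule(HOSTS, {"ram_mb": 3000})["name"])  # -> node2
```

When a scale notification arrives, the same filter-and-weigh pass decides where each new instance lands.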