Sunday 8 June 2014

Big Data Architecture

Big data means an amount of data so immense that it is difficult to collect, store, manage, and analyze with general-purpose database software. The meaning of “immense amount of data” is usually broken down into three dimensions:

Volume: There is too much data to store and to process. Storage and processing (e.g. semantic analysis) are the two elements that we need to understand.

Velocity: The speed at which data must be stored and processed.

Variety: The demand for unstructured data, such as text and images, is increasing alongside structured data that can be standardized and defined in advance, like RDBMS table records.

Big Data Reference Architecture

Data Processing Flow 




Data Transformation Flow 



Big Data Platform


 

Apache Hadoop 

The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of commodity hardware. It is designed for flexibility and scalability, with an architecture that scales to thousands of servers and petabytes of data. The library detects and handles failures at the application layer, delivering a highly available service on commodity hardware.


Hadoop

Hadoop is a platform that enables you to store and analyze large volumes of data. It is batch oriented (high throughput, but also high latency) and strongly consistent (reads always reflect the most recent write).

Hadoop is best utilized for:

Large scale batch analytics
Unstructured or semi-structured data
Flat files

Hadoop consists of two major subsystems:
  • HDFS (File System)
  • MapReduce



HDFS 


HDFS is a file system that supports very large files.
Files are broken into blocks (64 MB or larger) that are normally triple-replicated.
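To make the block mechanics concrete, here is a small Python sketch (not Hadoop code; the 64 MB block size and triple replication are the defaults described above, and the DataNode names and round-robin placement are invented for illustration, since real HDFS placement is rack-aware):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default
REPLICATION = 3                # each block is stored on 3 DataNodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= blocks[-1]
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Naive round-robin replica placement; real HDFS is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 200 MB file occupies four blocks: 64 + 64 + 64 + 8 MB.
blocks = split_into_blocks(200 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])   # [64, 64, 64, 8]
```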

NameNode

The NameNode is essentially the master metadata server.
It persists only metadata; it does not persist which DataNodes host each block (block locations are rebuilt from DataNode block reports). The metadata is stored persistently on a local disk in the form of two files:
Namespace image file (FSImage)
Edit log

Secondary NameNode

Fetches the FSImage and the edit log and merges them into a single file, preventing the edit log from growing too large.
Runs on a separate machine from the primary NameNode.
Maintains a slightly out-of-date copy of the merged namespace image, which can be used if the primary NameNode fails.

DataNode

The purpose of the DataNode is to retrieve blocks of data when told to do so by clients or the NameNode.
It stores the raw blocks that make up the files being stored, and periodically reports back to the NameNode with a list of the blocks it is storing.
File Reads (Process)

DataNodes are sorted by proximity to the client (the application making the read request).
Clients contact DataNodes directly to retrieve data; the NameNode simply guides the client to the best DataNode for each block being read.
If an error occurs while reading a block from a DataNode, the client tries the next closest DataNode holding a replica of that block. DataNodes that fail are reported back to the NameNode.

File Writes (Creating a New File)



Map Reduce 

MapReduce is a software framework for writing applications that process very large datasets (multi-terabyte) in parallel on large clusters of machines, essentially enabling the user to run analytics across large blocks of data.

The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.


The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.

Job Tracker

Coordinates all jobs run on the system, scheduling tasks to run on TaskTrackers. If a task fails, the JobTracker can reschedule it on another TaskTracker.
Stores in-memory information about every running MapReduce job.
Assigns tasks to machines in the cluster.
When assigning a task, the JobTracker prioritizes machines with data locality.

Task Tracker

Runs tasks and sends progress reports to the JobTracker.
Has a local directory for a localized cache and localized job files.
Code is essentially moved to the data (MapReduce JARs) instead of vice versa: it is more efficient to move small JAR files around than to move data. MapReduce JARs are sent to TaskTrackers to run locally, i.e. on machines where the data is local to the task.

MapReduce Example:

Input a raw weather data file that is comma delimited and determine the maximum temperature in the dataset.  

MAPPER

Assume the ‘keys’ of the input file are the line offsets of each row of data.
Our user-defined map function simply extracts the ‘year’ and ‘temperature’ from each row of input data.
The output of the map function is sorted before being sent to the reduce function, so the intermediate (year, temperature) pairs arrive at the reducer grouped by ‘year’.

REDUCER

The reduce function takes the sorted map outputs, iterates through the list of temperatures for each year, and selects the maximum temperature for that year.
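The example can be sketched in Python in the style of a Hadoop Streaming job. This is an illustrative simulation, not actual Hadoop code; the comma-delimited `year,temperature` record layout is an assumption based on the description above:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: extract (year, temperature) from one comma-delimited row."""
    year, temp = line.strip().split(",")
    return year, int(temp)

def reducer(year, temps):
    """Reduce phase: select the maximum temperature observed for a year."""
    return year, max(temps)

def run_job(lines):
    # The framework sorts/groups map output by key before the reduce phase.
    pairs = sorted(mapper(line) for line in lines)
    return [reducer(year, (t for _, t in group))
            for year, group in groupby(pairs, key=itemgetter(0))]

raw = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]
print(run_job(raw))  # [('1949', 111), ('1950', 22)]
```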



 HBase


HBase is a distributed key/value store built on top of Hadoop and tightly integrated with the Hadoop MapReduce framework. HBase is an open source, distributed, column-oriented database modeled after Google’s BigTable.

HBase shines with large amounts of data, and read/write concurrency.

Automatic partitioning: as your tables grow, they are automatically split into regions and distributed across all available nodes.

Has no secondary indexes. Rows are stored in sorted order by row key, as are the columns written to each row.

HBase makes Hadoop usable for real-time, random read/write workloads, which the Hadoop file system cannot handle by itself.

HIVE

Utilized by individuals with strong SQL skills and limited programming ability.
Compatible with existing business intelligence tools that use SQL and ODBC.
Metastore: the central repository of Hive metadata.
It is comprised of a service and a backing store for the data.
Usually a standalone database such as MySQL is used for the standalone metastore.
Partial support for the SQL-92 specification.




OOZIE

Oozie is a workflow scheduler. It manages data processing jobs for Hadoop (e.g. loading data, storing data, analyzing data, cleaning data, running MapReduce jobs).
Supports all types of Hadoop jobs and is integrated with the Hadoop stack.
Supports data and time triggers: users can specify an execution frequency, and a workflow can wait for data arrival before triggering an action.

ZooKeeper


ZooKeeper is a stripped-down filesystem that exposes a few simple operations and abstractions that enable you to build distributed queues, a configuration service, distributed locks, and leader election among a group of peers.

Configuration service: stores configuration files and allows applications to retrieve or update them.
Distributed lock: a mechanism for providing mutual exclusion between a collection of processes. At any one time, only a single process may hold the lock. Locks can be used for leader election, where the leader is the process that holds the lock at any point in time.
ZooKeeper is highly available, running across a collection of machines.
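The distributed-lock/leader-election recipe can be illustrated without a running ZooKeeper ensemble. The class below is a toy in-memory stand-in, not the real ZooKeeper API: each candidate creates a sequential znode, the process owning the lowest-numbered znode holds the lock (is the leader), and when that session expires leadership passes to the next-lowest node.

```python
class FakeZooKeeper:
    """Tiny in-memory stand-in for a ZooKeeper lock directory (illustrative only)."""
    def __init__(self):
        self.seq = 0
        self.znodes = {}   # znode path -> owning process

    def create_sequential(self, owner):
        # Real ZooKeeper appends a monotonically increasing suffix to the path.
        path = f"/lock/node-{self.seq:010d}"
        self.seq += 1
        self.znodes[path] = owner
        return path

    def leader(self):
        # The process owning the lowest-numbered znode holds the lock.
        return self.znodes[min(self.znodes)] if self.znodes else None

    def session_expired(self, path):
        # Ephemeral znodes vanish when their session dies,
        # so leadership automatically passes to the next-lowest node.
        self.znodes.pop(path, None)

zk = FakeZooKeeper()
p1 = zk.create_sequential("process-A")
p2 = zk.create_sequential("process-B")
print(zk.leader())        # process-A
zk.session_expired(p1)
print(zk.leader())        # process-B
```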

PIG

Pig and Hive were written to insulate users from the complexities of writing MapReduce Programs. 

MapReduce requires users to write mappers and reducers, compile the code, submit jobs, and retrieve the results, which is a complex and time-consuming process.
A Pig program is made up of a series of operations, or transformations, that are applied to the input data to produce a desired output. The operations describe a data flow, which Pig converts into a series of MapReduce programs.

Pig is designed for batch processing of data. It is not suited to queries over small amounts of data, since it scans the entire dataset.
Pig consists of two components:
The language (Pig Latin) used to express data flows.
The execution environment that runs Pig Latin programs.
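To see the “series of transformations” idea, here is the weather example expressed as a plain Python data flow rather than Pig Latin. The LOAD/FILTER/GROUP steps mirror the operations a Pig script would declare; the 9999 missing-value sentinel is an assumption for illustration:

```python
from collections import defaultdict

def load(lines):
    """LOAD: parse comma-delimited rows into (year, temp) tuples."""
    return [(y, int(t)) for y, t in (line.split(",") for line in lines)]

def filter_valid(records, missing=9999):
    """FILTER: drop rows whose temperature is the missing-value sentinel."""
    return [(y, t) for y, t in records if t != missing]

def group_and_max(records):
    """GROUP BY year, then FOREACH group GENERATE MAX(temperature)."""
    groups = defaultdict(list)
    for year, temp in records:
        groups[year].append(temp)
    return {year: max(temps) for year, temps in groups.items()}

rows = ["1950,0", "1950,22", "1950,9999", "1949,111"]
result = group_and_max(filter_valid(load(rows)))
print(result)   # {'1950': 22, '1949': 111}
```

Each function is one step in the data flow; Pig would compile the equivalent script into one or more MapReduce jobs behind the scenes.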

Big Data & Analytics Reference Architecture

IBM Big Data Platform



Oracle 



SAP HANA


Big Data Security and Privacy Framework



The world's information doubles every two years. Over the next 10 years:


The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of “files” enterprise data centers handle will grow by 75x.

Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB

Big data platform adoption is necessary for all corporate customers.

Saturday 7 June 2014

Comparison of Private Cloud Platforms


To create a private cloud project strategy, a company first needs to identify which of its business practices can be made more efficient, and which repetitive manual tasks can be automated, through the successful launch of a cloud computing project.


Building a private cloud is necessary for many companies. The best advice for how to build one is to start small and then continue to grow the cloud computing project over time.

Private cloud projects can also be connected to public clouds to create hybrid clouds.


Comparison of Private Cloud Platforms: HP, IBM, and VCE




Private Cloud Providers 


  • Abiquo Private Cloud Solutions
  • Amazon Virtual Private Cloud (Amazon VPC)
  • BlueLock Virtual Private Clouds
  • BMC Cloud Lifecycle Management
  • CA Technologies Cloud Solutions
  • Cisco Private Cloud solutions
  • Citrix CloudPlatform (Open Source)
  • CloudStack (open source software)
  • Dell Cloud Solutions
  • Enomaly Elastic Computing Platform (Acquired by Virtustream)
  • Eucalyptus Cloud Platform
  • GoGrid cloud hosting platform
  • IBM SmartCloud Foundation
  • Microsoft Private Cloud
  • Nimbula
  • Novell Cloud Manager
  • OpenNebula (Open Source Project)
  • OpenStack (Open Source Software)
  • Piston Cloud Computing (Enterprise OpenStack)
  • Rackspace Private Cloud (Powered by OpenStack)
  • Red Hat Cloud
  • Savvis Symphony Virtual Private Solutions
  • SUSE Cloud (Powered by OpenStack)
  • TierraCloud
  • VMware Private Cloud Computing


Many providers build their solutions on VMware technology and allow customers to upload their own images to the cloud.

This comparison is far from authoritative. It is based on information publicly available on the providers' websites, so contracts for specific customers may vary.

Cloud Architecting - Best Practices

As a cloud architect, it is important to understand the benefits of cloud computing.

Business Benefits of Cloud Computing 

Almost zero upfront infrastructure investment
Just-in-time Infrastructure
More efficient resource utilization
Usage-based costing
Reduced time to market

Technical Benefits of Cloud Computing

Scriptable infrastructure
Auto-scaling
Proactive Scaling
More Efficient Development lifecycle
Improved Testability
Disaster Recovery and Business Continuity


Cloud Best Practices 

Design for failure 

  • Failover gracefully using Elastic IPs: Elastic IP is a static IP that is dynamically re-mappable. You can quickly remap and failover to another set of servers so that your traffic is routed to the new servers. It works great when you want to upgrade from old to new versions or in case of hardware failures 
  • Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple Availability Zones, you can ensure high availability. Utilize Amazon RDS Multi-AZ deployment functionality to automatically replicate database updates across multiple Availability Zones. 
  • Maintain an Amazon Machine Image so that you can restore and clone environments very easily in a different Availability Zone; Maintain multiple Database slaves across Availability Zones and setup hot replication. 
  • Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Setup an Auto scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances by new ones. 
  • Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances. 
  • Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups


Decouple your components 


  • Use Amazon SQS to isolate components 
  • Use Amazon SQS as buffers between components 
  • Design every component such that it exposes a service interface, is responsible for its own scalability in all appropriate dimensions, and interacts with other components asynchronously 
  • Bundle the logical construct of a component into an Amazon Machine Image so that it can be deployed more often 
  • Make your applications as stateless as possible. Store session state outside of component (in Amazon SimpleDB, if appropriate)
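The buffering idea behind these points can be sketched with Python's standard `queue` module standing in for Amazon SQS. The producer and consumer below are hypothetical components; a real system would poll SQS over the network instead of sharing a process:

```python
import queue
import threading

work_queue = queue.Queue(maxsize=100)   # stands in for an SQS queue

def producer(messages):
    # The front-end component only knows how to enqueue work;
    # it never talks to the back-end directly.
    for m in messages:
        work_queue.put(m)

def consumer(results):
    # The back-end drains at its own pace; a burst upstream queues up
    # in the buffer instead of overwhelming this component.
    while True:
        m = work_queue.get()
        if m is None:        # sentinel: shut down
            break
        results.append(m.upper())   # placeholder for real processing
        work_queue.task_done()

results = []
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
producer(["order-1", "order-2", "order-3"])
work_queue.put(None)
worker.join()
print(results)   # ['ORDER-1', 'ORDER-2', 'ORDER-3']
```

Because the two components share only the queue, either side can be scaled, replaced, or restarted independently, which is exactly the decoupling the bullets describe.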

Implement Elasticity 

 Automate and bootstrap your instances 

  • Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
  • Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications. 
  • Store and retrieve machine configuration information dynamically: utilize Amazon SimpleDB to fetch configuration data during boot time of an instance (e.g. database connection strings).
  • SimpleDB may also be used to store information about an instance, such as its IP address, machine name, and role.
  • Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of the application from that bucket during system startup. 
  • Invest in building resource management tools (Automated scripts, pre-configured images) or Use smart open source configuration management tools like Chef, Puppet, CFEngine or Genome. 
  • Bundle Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data and instance metadata after launch.
  • Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance.
  • Create snapshots of common volumes and share snapshots among accounts wherever appropriate. 
  • Application components should not assume health or location of hardware it is running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically failover and start a new clone in case of a failure.

Parallel

  • Multi-thread your Amazon S3 requests 
  • Multi-thread your Amazon SimpleDB GET and BATCHPUT requests  
  • Create a JobFlow using the Amazon Elastic MapReduce Service for each of your daily batch processes (indexing, log analysis etc.) which will compute the job in parallel and save time.
  • Use the Elastic Load Balancing service and spread your load across multiple web app servers dynamically
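A minimal sketch of the multi-threading advice, with a dummy `fetch` standing in for an Amazon S3 GET (real code would call an S3 client such as boto; the key names are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    """Stand-in for an Amazon S3 GET; a real version would call an S3 client."""
    return f"contents-of-{key}"

keys = [f"logs/2014/06/part-{i:04d}" for i in range(8)]

# Issue the 8 GETs concurrently instead of one after another.
# For I/O-bound requests, this cuts wall-clock time roughly by the pool size.
with ThreadPoolExecutor(max_workers=4) as pool:
    objects = list(pool.map(fetch, keys))

print(len(objects))  # 8
```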

Data Placement: keep dynamic data close to the compute and static data closer to the end user

  • Ship your data drives to Amazon using the Import/Export service. 
  • It may be cheaper and faster to move large amounts of data using a sneakernet than to upload it over the Internet.
  • Utilize the same Availability Zone to launch a cluster of machines 
  • Create a distribution for your Amazon S3 bucket and let Amazon CloudFront cache the content in that bucket across all the edge locations around the world

Security best practices

  • Protect your data in transit
  • Protect your data at rest
  • Protect your AWS credentials
  • Manage multiple Users and their permissions with IAM
  • Secure your Application


Friday 6 June 2014

CloudStack


Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. 
CloudStack currently supports the most popular hypervisors: VMware, KVM, XenServer, Xen Cloud Platform (XCP) and Hyper-V.
Users can manage their cloud with an easy to use Web interface, command line tools, and/or a full-featured RESTful API. In addition, CloudStack provides an API that's compatible with AWS EC2 and S3 for organizations that wish to deploy hybrid clouds.

CloudStack Architecture




Deployment Architecture

The hypervisor is the basic unit of scale.
A cluster consists of one or more hosts running the same hypervisor.
All hosts in a cluster have access to shared (primary) storage.
A pod is one or more clusters, usually behind L2 switches.
An availability zone has one or more pods and access to secondary storage.
One or more zones make up the cloud.
Management server cluster: the management server (MS) is stateless and can be deployed as a physical server or a VM.
A single MS node can manage up to 10K hosts; multiple nodes can be deployed for scale or redundancy.

The Pieces of CloudStack


  • Hosts-  Servers onto which services will be provisioned  
  • Primary Storage -  VM disk storage 
  • Cluster -  A grouping of hosts and their associated storage 
  • Pod -  Collection of clusters in the same failure boundary 
  • Network -Logical network associated with service offerings 
  • Secondary Storage -Template, snapshot and ISO storage 
  • Zone- Collection of pods, network offerings and secondary storage 
  • Management Server Farm -   Management and provisioning tasks

Storage

Primary Storage 

Stores disk volumes for VMs in a cluster
Configured at the cluster level
Close to hosts for better performance
Each cluster has at least one primary storage
Requires high IOPS (can be expensive)

Secondary Storage

Stores all Templates, ISOs and Snapshots 
Configured at Zone-level 
A zone can have one or more secondary storage servers 
High capacity, low cost commodity storage 


Networking: Giving Control Brings Complexity

Networks need to be defined before we begin the deployment

CloudStack Networks

  • Management  - used by management nodes
  • Storage - used by secondary storage node

VM Instance Networks 

  • Public - network used for VMs and the Internet (used only in isolated mode)
  • Guest - network used for internal VM communication

Sample Host Network Layout



Sample system VM Network Layout 



Sample CloudStack Multi-tier Network Diagram 



Extend the CloudStack Capability Using Plugins

Plugins
A plugin implements clearly defined interfaces and compiles only against the Plugin API module

Anatomy of a Plugin
A plugin can consist of two JARs: a server component deployed on the management server, and an optional ServerResource component deployed co-located with the resource.
The server component can implement multiple Plugin APIs to add its features.
A plugin can expose its own API through a pluggable service so administrators can configure it.
As an example, the OVS plugin actually implements both NetworkGuru and NetworkElement.


Plugin Interfaces Available


NetworkGuru – implements various network isolation and IP address technologies
NetworkElement – facilitates network services on network elements to support a VM (e.g. DNS, DHCP, LB, VPN, port forwarding)
DeploymentPlanner – different algorithms to place a VM and its volumes
Investigator – ways to find out whether a host or a VM is down
Fencer – ways to fence off a VM if its state is unknown
UserAuthenticator – methods of authenticating a user
SecurityChecker – ACL access
HostAllocator – provides different ways to allocate hosts
StoragePoolAllocator – provides different ways to allocate volumes

Apache CloudStack projects currently in progress to extend the CloudStack capability 




Thursday 5 June 2014

OpenStack -Cloud software

OpenStack is a global collaboration of developers and cloud computing technologists seeking to produce a ubiquitous Infrastructure as a Service (IaaS) open source cloud computing platform for public and private clouds. OpenStack was founded jointly by Rackspace Hosting and NASA in July 2010.

The Pieces of OpenStack
  • OpenStack Compute (Nova)
  • OpenStack Object Storage (Swift)
  • OpenStack Image Service (Glance)
  • Dashboard
  • Identity Management
  • Networking
  • Load balancers
  • Database
  • Queueing






OpenStack Core Components 

OpenStack Compute (core - NOVA)
    Provision and manage large networks of virtual machines
OpenStack Object Store (core - SWIFT)
Create petabytes of secure, reliable storage using standard hardware
OpenStack Image Service (core - GLANCE)
Catalog and manage massive libraries of server images
OpenStack Identity (core - KEYSTONE)
Unified authentication across all OpenStack projects; integrates with existing authentication systems.
OpenStack Dashboard (core - HORIZON)
Enables administrators and users to access and provision cloud-based resources through a self-service portal.


OpenStack Compute - Nova








OpenStack Storage - Swift


Swift storage is fully distributed and is not a file system. Its system components are given below:


The Ring: Mapping of names to entities (accounts, containers, objects) on disk.
  • Stores data based on zones, devices, partitions, and replicas
  • Weights can be used to balance the distribution of partitions
  • Used by the Proxy Server and several background processes
Proxy Server: Request routing, exposes the public API
Account Server: Handles listing of containers, stores as SQLite DB
Container Server: Handles listing of objects, stores as SQLite DB
Object Server: Blob storage server, metadata kept in xattrs, data in binary format
Replication: Keeps the system consistent, handles failures
Updaters: Process failed or queued updates
Auditors: Verify integrity of objects, containers, and accounts
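The ring's name-to-partition mapping is essentially consistent hashing. The sketch below is a simplification, not Swift's actual implementation: Swift does hash the object path with MD5 against a power-of-two partition count, but the device list and round-robin replica choice here are invented (the real ring balances by weight and spreads replicas across zones):

```python
import hashlib

PART_POWER = 4                   # 2**4 = 16 partitions; real rings use far more
PART_COUNT = 2 ** PART_POWER
DEVICES = ["z1-disk0", "z1-disk1", "z2-disk0", "z2-disk1"]  # illustrative names

def partition_for(name):
    """Map an object name (account/container/object) to a partition."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % PART_COUNT

def devices_for(partition, replicas=3):
    """Pick `replicas` distinct devices for a partition (round-robin here;
    the real ring also balances by device weight and spreads across zones)."""
    return [DEVICES[(partition + r) % len(DEVICES)] for r in range(replicas)]

part = partition_for("AUTH_acct/photos/cat.jpg")
print(0 <= part < PART_COUNT)              # True
print(len(set(devices_for(part))) == 3)    # True
```

The key property is that the mapping is deterministic: every proxy server computes the same partition and replica devices for a name without consulting a central directory.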

OpenStack Networking - Neutron




Networking is a standalone component in the OpenStack modular architecture. It's positioned alongside OpenStack components such as Compute, Image Service, Identity, or the Dashboard. Like those components, a deployment of Networking often involves deploying several services to a variety of hosts.

Networking is entirely standalone and can be deployed to a dedicated host. Depending on your configuration, Networking can also include the following agents:


Networking agents:
  • plug-in agent (neutron-*-agent): Runs on each hypervisor to perform local vSwitch configuration. The agent that runs depends on the plug-in that you use; certain plug-ins do not require an agent.
  • dhcp agent (neutron-dhcp-agent): Provides DHCP services to tenant networks. Required by certain plug-ins.
  • l3 agent (neutron-l3-agent): Provides L3/NAT forwarding to give VMs on tenant networks external network access. Required by certain plug-ins.
  • metering agent (neutron-metering-agent): Provides L3 traffic metering for tenant networks.
These agents interact with the main neutron process through RPC (for example, RabbitMQ or Qpid) or through the standard Networking API. In addition, Networking integrates with OpenStack components in a number of ways:
  • Networking relies on the Identity service (keystone) for the authentication and authorization of all API requests.
  • Compute (nova) interacts with Networking through calls to its standard API. As part of creating a VM, the nova-compute  service communicates with the Networking API to plug each virtual NIC on the VM into a particular network.
  • The dashboard (horizon) integrates with the Networking API, enabling administrators and tenant users to create and manage network services through a web-based GUI.

Orchestration - Heat

Heat orchestrates composite cloud applications (stacks).
Provides HA (restarts failed resources) and “auto-scaling”.

Configuration Management - Puppet

  • Puppet can be used; it is widely used, has a large community, and scales 
  • Simplifies the configuration of OpenStack itself

Accounting - Ceilometer

  • Provides accounting for OpenStack by project
  • Collects statistics from each compute node
  • Uses the common OpenStack message bus

OpenStack’s community is remarkably vibrant. It is no longer led by any single vendor.

Prominent Adopters
  • Private Cloud Solutions (Dell, Nebula, Piston)
  • Large public clouds & hosting companies (Rackspace, ATT, NTT, Dreamhost, HP, Deutsche Telecom)
  • Web & SaaS Providers (eBay, Wikimedia)
  • Government (NASA)
  • Major Linux Distributions (Ubuntu, Suse, RedHat)
  • Hardware Vendors (Dell, HP, IBM, Cisco)

Substantial Contributors

Dev: Rackspace, HP, RedHat, Citrix, Nebula, Cisco, Canonical, Piston … 
Ops: Dell has led here with Opscode.

Leading hardware vendors and virtualization players have integrated their technologies with OpenStack to provide cloud solutions 


1) VMware Technologies and OpenStack






2) Dell OpenStack  Cloud Solution





3) Rackspace OpenStack Private Cloud Reference Architecture