Sunday 8 June 2014

Big Data Architecture

Big data means an amount of data so immense that it is difficult to collect, store, manage, and analyze with general-purpose database software. The meaning of “immense amount of data” is usually broken down into three dimensions:

Volume: There is too much data to store and to process. Storage and processing (e.g. semantic analysis) are the two elements that we need to understand.

Velocity: The speed at which data must be stored and processed.

Variety: The demand for unstructured data, such as text and images, is increasing alongside structured data that can be standardized and defined in advance, like RDBMS table records.

Big Data Reference Architecture

Data Processing Flow 




Data Transformation Flow 



Big Data Platform


 

Apache Hadoop 

The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of commodity hardware. It is designed for flexibility and scalability, with an architecture that scales to thousands of servers and petabytes of data. The library detects and handles failures at the application layer, delivering a highly available service on commodity hardware.


Hadoop

Hadoop is a platform that enables you to store and analyze large volumes of data. It is batch oriented (high throughput, but also high latency) and strongly consistent (reads always reflect the most recent write).

Hadoop is best utilized for:

Large scale batch analytics
Unstructured or semi-structured data
Flat files

Hadoop consists of two major subsystems:
  • HDFS (File System)
  • MapReduce



HDFS 


HDFS is a file system that supports very large files.
Files are broken into blocks (64 MB or larger) that are normally triple-replicated.
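To make the block mechanics concrete, here is a small Python sketch (not Hadoop code; the 64 MB block size and triple replication are the defaults described above, and the DataNode names and round-robin placement are invented for illustration, since real HDFS placement is rack-aware):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default
REPLICATION = 3                # each block is stored on 3 DataNodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= blocks[-1]
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Naive round-robin replica placement; real HDFS is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 200 MB file occupies four blocks: 64 + 64 + 64 + 8 MB.
blocks = split_into_blocks(200 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])   # [64, 64, 64, 8]
```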

NameNode

The NameNode is essentially the master metadata server.
It persists only metadata; it does not persist which DataNodes host each block (block locations are rebuilt from DataNode block reports). The metadata is stored persistently on a local disk in the form of two files:
Namespace image file (FSImage)
Edit log

Secondary NameNode

Fetches the FSImage and the edit log and merges them into a single file, preventing the edit log from growing too large.
Runs on a separate machine from the primary NameNode.
Maintains a slightly out-of-date copy of the merged namespace image, which can be used if the primary NameNode fails.

DataNode

The purpose of the DataNode is to retrieve blocks of data when told to do so by clients or the NameNode.
It stores the raw blocks that make up the files being stored, and periodically reports back to the NameNode with a list of the blocks it is storing.
File Reads (Process)

DataNodes are sorted by proximity to the client (the application making the read request).
Clients contact DataNodes directly to retrieve data; the NameNode simply guides the client to the best DataNode for each block being read.
If an error occurs while reading a block from a DataNode, the client tries the next closest DataNode holding a replica of that block. DataNodes that fail are reported back to the NameNode.

File Writes (Creating a New File)



Map Reduce 

MapReduce is a software framework for writing applications that process very large datasets (multi-terabyte) in parallel on large clusters of machines, essentially enabling the user to run analytics across large blocks of data.

The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.


The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.

Job Tracker

Coordinates all jobs run on the system, scheduling tasks to run on TaskTrackers. If a task fails, the JobTracker can reschedule it on another TaskTracker.
Stores in-memory information about every running MapReduce job.
Assigns tasks to machines in the cluster.
When assigning a task, the JobTracker prioritizes machines with data locality.

Task Tracker

Runs tasks and sends progress reports to the JobTracker.
Has a local directory for a localized cache and localized job files.
Code is essentially moved to the data (MapReduce JARs) instead of vice versa: it is more efficient to move small JAR files around than to move data. MapReduce JARs are sent to TaskTrackers to run locally, i.e. on machines where the data is local to the task.

MapReduce Example:

Input a raw weather data file that is comma delimited and determine the maximum temperature in the dataset.  

MAPPER

Assume the ‘keys’ of the input file are the line offsets of each row of data.
Our user-defined map function simply extracts the ‘year’ and ‘temperature’ from each row of input data.
The output of the map function is sorted before being sent to the reduce function, so the intermediate (year, temperature) pairs arrive at the reducer grouped by ‘year’.

REDUCER

The reduce function takes the sorted map outputs, iterates through the list of temperatures for each year, and selects the maximum temperature for that year.
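The example can be sketched in Python in the style of a Hadoop Streaming job. This is an illustrative simulation, not actual Hadoop code; the comma-delimited `year,temperature` record layout is an assumption based on the description above:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: extract (year, temperature) from one comma-delimited row."""
    year, temp = line.strip().split(",")
    return year, int(temp)

def reducer(year, temps):
    """Reduce phase: select the maximum temperature observed for a year."""
    return year, max(temps)

def run_job(lines):
    # The framework sorts/groups map output by key before the reduce phase.
    pairs = sorted(mapper(line) for line in lines)
    return [reducer(year, (t for _, t in group))
            for year, group in groupby(pairs, key=itemgetter(0))]

raw = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]
print(run_job(raw))  # [('1949', 111), ('1950', 22)]
```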



 HBase


HBase is a distributed key/value store built on top of Hadoop and tightly integrated with the Hadoop MapReduce framework. HBase is an open source, distributed, column-oriented database modeled after Google’s BigTable.

HBase shines with large amounts of data, and read/write concurrency.

Automatic partitioning: as your tables grow, they are automatically split into regions and distributed across all available nodes.

Has no secondary indexes. Rows are stored in sorted order by row key, as are the columns written to each row.

HBase makes Hadoop usable for real-time, random read/write workloads, which the Hadoop file system cannot handle by itself.

HIVE

Utilized by individuals with strong SQL skills and limited programming ability.
Compatible with existing business intelligence tools that use SQL and ODBC.
Metastore: the central repository of Hive metadata.
It is comprised of a service and a backing store for the data.
Usually a standalone database such as MySQL is used for the standalone metastore.
Partial support for the SQL-92 specification.




OOZIE

Oozie is a workflow scheduler. It manages data processing jobs for Hadoop (e.g. loading data, storing data, analyzing data, cleaning data, running MapReduce jobs).
Supports all types of Hadoop jobs and is integrated with the Hadoop stack.
Supports data and time triggers: users can specify an execution frequency, and a workflow can wait for data arrival before triggering an action.

ZooKeeper


ZooKeeper is a stripped-down filesystem that exposes a few simple operations and abstractions that enable you to build distributed queues, a configuration service, distributed locks, and leader election among a group of peers.

Configuration service: stores configuration files and allows applications to retrieve or update them.
Distributed lock: a mechanism for providing mutual exclusion between a collection of processes. At any one time, only a single process may hold the lock. Locks can be used for leader election, where the leader is the process that holds the lock at any point in time.
ZooKeeper is highly available, running across a collection of machines.
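The distributed-lock/leader-election recipe can be illustrated without a running ZooKeeper ensemble. The class below is a toy in-memory stand-in, not the real ZooKeeper API: each candidate creates a sequential znode, the process owning the lowest-numbered znode holds the lock (is the leader), and when that session expires leadership passes to the next-lowest node.

```python
class FakeZooKeeper:
    """Tiny in-memory stand-in for a ZooKeeper lock directory (illustrative only)."""
    def __init__(self):
        self.seq = 0
        self.znodes = {}   # znode path -> owning process

    def create_sequential(self, owner):
        # Real ZooKeeper appends a monotonically increasing suffix to the path.
        path = f"/lock/node-{self.seq:010d}"
        self.seq += 1
        self.znodes[path] = owner
        return path

    def leader(self):
        # The process owning the lowest-numbered znode holds the lock.
        return self.znodes[min(self.znodes)] if self.znodes else None

    def session_expired(self, path):
        # Ephemeral znodes vanish when their session dies,
        # so leadership automatically passes to the next-lowest node.
        self.znodes.pop(path, None)

zk = FakeZooKeeper()
p1 = zk.create_sequential("process-A")
p2 = zk.create_sequential("process-B")
print(zk.leader())        # process-A
zk.session_expired(p1)
print(zk.leader())        # process-B
```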

PIG

Pig and Hive were written to insulate users from the complexities of writing MapReduce Programs. 

MapReduce requires users to write mappers and reducers, compile the code, submit jobs, and retrieve the results, which is a complex and time-consuming process.
A Pig program is made up of a series of operations, or transformations, that are applied to the input data to produce a desired output. The operations describe a data flow, which Pig converts into a series of MapReduce programs.

Pig is designed for batch processing of data. It is not suited to queries over small amounts of data, since it scans the entire dataset.
Pig consists of two components:
The language (Pig Latin) used to express data flows.
The execution environment that runs Pig Latin programs.
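To see the “series of transformations” idea, here is the weather example expressed as a plain Python data flow rather than Pig Latin. The LOAD/FILTER/GROUP steps mirror the operations a Pig script would declare; the 9999 missing-value sentinel is an assumption for illustration:

```python
from collections import defaultdict

def load(lines):
    """LOAD: parse comma-delimited rows into (year, temp) tuples."""
    return [(y, int(t)) for y, t in (line.split(",") for line in lines)]

def filter_valid(records, missing=9999):
    """FILTER: drop rows whose temperature is the missing-value sentinel."""
    return [(y, t) for y, t in records if t != missing]

def group_and_max(records):
    """GROUP BY year, then FOREACH group GENERATE MAX(temperature)."""
    groups = defaultdict(list)
    for year, temp in records:
        groups[year].append(temp)
    return {year: max(temps) for year, temps in groups.items()}

rows = ["1950,0", "1950,22", "1950,9999", "1949,111"]
result = group_and_max(filter_valid(load(rows)))
print(result)   # {'1950': 22, '1949': 111}
```

Each function is one step in the data flow; Pig would compile the equivalent script into one or more MapReduce jobs behind the scenes.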

Big Data & Analytics Reference Architecture

IBM Big Data Platform



Oracle 



SAP HANA


Big Data Security and Privacy Framework



The world's information doubles every two years. Over the next 10 years:


The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of “files” enterprise data centers handle will grow by 75x.

Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB

Big data platform adoption is necessary for all corporate customers.

Saturday 7 June 2014

Comparison of Private Cloud Platforms


To create a private cloud project strategy, a company first needs to identify which of its business practices can be made more efficient, and which repetitive manual tasks can be automated, through the successful launch of a cloud computing project.


Building a private cloud is necessary for many companies. The best advice for how to build one is to start small and then continue to grow the cloud computing project over time.

Private cloud projects can also be connected to public clouds to create hybrid clouds.


Comparison of Private Cloud Platforms: HP, IBM, and VCE




Private Cloud Providers 


  • Abiquo Private Cloud Solutions
  • Amazon Virtual Private Cloud (Amazon VPC)
  • BlueLock Virtual Private Clouds
  • BMC Cloud Lifecycle Management
  • CA Technologies Cloud Solutions
  • Cisco Private Cloud solutions
  • Citrix CloudPlatform (Open Source)
  • CloudStack (open source software)
  • Dell Cloud Solutions
  • Enomaly Elastic Computing Platform (Acquired by Virtustream)
  • Eucalyptus Cloud Platform
  • GoGrid cloud hosting platform
  • IBM SmartCloud Foundation
  • Microsoft Private Cloud
  • Nimbula
  • Novell Cloud Manager
  • OpenNebula (Open Source Project)
  • OpenStack (Open Source Software)
  • Piston Cloud Computing (Enterprise OpenStack)
  • Rackspace Private Cloud (Powered by OpenStack)
  • Red Hat Cloud
  • Savvis Symphony Virtual Private Solutions
  • SUSE Cloud (Powered by OpenStack)
  • TierraCloud
  • VMware Private Cloud Computing


Many providers build their solutions on VMware technology and allow customers to upload their own images to the cloud.

This comparison is far from authoritative. It is based on information publicly available on the providers' websites, so contracts for specific customers may vary.

Cloud Architecting - Best Practices

As a cloud architect, it is important to understand the benefits of cloud computing.

Business Benefits of Cloud Computing 

Almost zero upfront infrastructure investment
Just-in-time Infrastructure
More efficient resource utilization
Usage-based costing
Reduced time to market

Technical Benefits of Cloud Computing

Scriptable infrastructure
Auto-scaling
Proactive Scaling
More Efficient Development lifecycle
Improved Testability
Disaster Recovery and Business Continuity


Cloud Best Practices 

Design for failure 

  • Failover gracefully using Elastic IPs: Elastic IP is a static IP that is dynamically re-mappable. You can quickly remap and failover to another set of servers so that your traffic is routed to the new servers. It works great when you want to upgrade from old to new versions or in case of hardware failures 
  • Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple Availability Zones, you can ensure high availability. Utilize Amazon RDS Multi-AZ deployment functionality to automatically replicate database updates across multiple Availability Zones. 
  • Maintain an Amazon Machine Image so that you can restore and clone environments very easily in a different Availability Zone; Maintain multiple Database slaves across Availability Zones and setup hot replication. 
  • Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Setup an Auto scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances by new ones. 
  • Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances. 
  • Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups


Decouple your components 


  • Use Amazon SQS to isolate components 
  • Use Amazon SQS as buffers between components 
  • Design every component such that it exposes a service interface, is responsible for its own scalability in all appropriate dimensions, and interacts with other components asynchronously 
  • Bundle the logical construct of a component into an Amazon Machine Image so that it can be deployed more often 
  • Make your applications as stateless as possible. Store session state outside of component (in Amazon SimpleDB, if appropriate)
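The buffering idea behind these points can be sketched with Python's standard `queue` module standing in for Amazon SQS. The producer and consumer below are hypothetical components; a real system would poll SQS over the network instead of sharing a process:

```python
import queue
import threading

work_queue = queue.Queue(maxsize=100)   # stands in for an SQS queue

def producer(messages):
    # The front-end component only knows how to enqueue work;
    # it never talks to the back-end directly.
    for m in messages:
        work_queue.put(m)

def consumer(results):
    # The back-end drains at its own pace; a burst upstream queues up
    # in the buffer instead of overwhelming this component.
    while True:
        m = work_queue.get()
        if m is None:        # sentinel: shut down
            break
        results.append(m.upper())   # placeholder for real processing
        work_queue.task_done()

results = []
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
producer(["order-1", "order-2", "order-3"])
work_queue.put(None)
worker.join()
print(results)   # ['ORDER-1', 'ORDER-2', 'ORDER-3']
```

Because the two components share only the queue, either side can be scaled, replaced, or restarted independently, which is exactly the decoupling the bullets describe.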

Implement Elasticity 

 Automate and bootstrap your instances 

  • Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
  • Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications. 
  • Store and retrieve machine configuration information dynamically: utilize Amazon SimpleDB to fetch configuration data during boot time of an instance (e.g. database connection strings).
  • SimpleDB may also be used to store information about an instance, such as its IP address, machine name, and role.
  • Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of the application from that bucket during system startup. 
  • Invest in building resource management tools (Automated scripts, pre-configured images) or Use smart open source configuration management tools like Chef, Puppet, CFEngine or Genome. 
  • Bundle Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data and instance metadata after launch.
  • Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance.
  • Create snapshots of common volumes and share snapshots among accounts wherever appropriate. 
  • Application components should not assume health or location of hardware it is running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically failover and start a new clone in case of a failure.

Parallel

  • Multi-thread your Amazon S3 requests 
  • Multi-thread your Amazon SimpleDB GET and BATCHPUT requests  
  • Create a JobFlow using the Amazon Elastic MapReduce Service for each of your daily batch processes (indexing, log analysis etc.) which will compute the job in parallel and save time.
  • Use the Elastic Load Balancing service and spread your load across multiple web app servers dynamically
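A minimal sketch of the multi-threading advice, with a dummy `fetch` standing in for an Amazon S3 GET (real code would call an S3 client such as boto; the key names are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    """Stand-in for an Amazon S3 GET; a real version would call an S3 client."""
    return f"contents-of-{key}"

keys = [f"logs/2014/06/part-{i:04d}" for i in range(8)]

# Issue the 8 GETs concurrently instead of one after another.
# For I/O-bound requests, this cuts wall-clock time roughly by the pool size.
with ThreadPoolExecutor(max_workers=4) as pool:
    objects = list(pool.map(fetch, keys))

print(len(objects))  # 8
```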

Data Placement: keep dynamic data close to the compute and static data closer to the end user

  • Ship your data drives to Amazon using the Import/Export service. 
  • It may be cheaper and faster to move large amounts of data using a sneakernet than to upload it over the Internet.
  • Utilize the same Availability Zone to launch a cluster of machines 
  • Create a distribution for your Amazon S3 bucket and let Amazon CloudFront cache the content in that bucket across all the edge locations around the world

Security best practices

  • Protect your data in transit
  • Protect your data at rest
  • Protect your AWS credentials
  • Manage multiple Users and their permissions with IAM
  • Secure your Application


Friday 6 June 2014

CloudStack


Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. 
CloudStack currently supports the most popular hypervisors: VMware, KVM, XenServer, Xen Cloud Platform (XCP) and Hyper-V.
Users can manage their cloud with an easy to use Web interface, command line tools, and/or a full-featured RESTful API. In addition, CloudStack provides an API that's compatible with AWS EC2 and S3 for organizations that wish to deploy hybrid clouds.

CloudStack Architecture




Deployment Architecture

The hypervisor is the basic unit of scale.
A cluster consists of one or more hosts running the same hypervisor.
All hosts in a cluster have access to shared (primary) storage.
A pod is one or more clusters, usually behind L2 switches.
An availability zone has one or more pods and access to secondary storage.
One or more zones make up the cloud.
Management server cluster: the management server (MS) is stateless and can be deployed as a physical server or a VM.
A single MS node can manage up to 10K hosts; multiple nodes can be deployed for scale or redundancy.

The Pieces of CloudStack


  • Hosts-  Servers onto which services will be provisioned  
  • Primary Storage -  VM disk storage 
  • Cluster -  A grouping of hosts and their associated storage 
  • Pod -  Collection of clusters in the same failure boundary 
  • Network -Logical network associated with service offerings 
  • Secondary Storage -Template, snapshot and ISO storage 
  • Zone- Collection of pods, network offerings and secondary storage 
  • Management Server Farm -   Management and provisioning tasks

Storage

Primary Storage 

Stores disk volumes for VMs in a cluster
Configured at the cluster level
Close to hosts for better performance
Each cluster has at least one primary storage
Requires high IOPS (can be expensive)

Secondary Storage

Stores all Templates, ISOs and Snapshots 
Configured at Zone-level 
A zone can have one or more secondary storage servers 
High capacity, low cost commodity storage 


Networking: Giving Control Brings Complexity

Networks need to be defined before we begin the deployment

CloudStack Networks

  • Management  - used by management nodes
  • Storage - used by secondary storage node

VM Instance Networks 

  • Public - network used for VMs and the Internet (used only in isolated mode)
  • Guest - network used for internal VM communication

Sample Host Network Layout



Sample system VM Network Layout 



Sample CloudStack Multi-tier Network Diagram 



Extend the CloudStack Capability Using Plugins

Plugins
A plugin implements clearly defined interfaces and compiles only against the Plugin API module

Anatomy of a Plugin
A plugin can consist of two JARs: a server component deployed on the management server, and an optional ServerResource component deployed co-located with the resource.
The server component can implement multiple Plugin APIs to add its features.
A plugin can expose its own API through a pluggable service so administrators can configure it.
As an example, the OVS plugin actually implements both NetworkGuru and NetworkElement.


Plugin Interfaces Available


NetworkGuru – implements various network isolation and IP address technologies
NetworkElement – facilitates network services on network elements to support a VM (e.g. DNS, DHCP, LB, VPN, port forwarding)
DeploymentPlanner – different algorithms to place a VM and its volumes
Investigator – ways to find out whether a host or a VM is down
Fencer – ways to fence off a VM if its state is unknown
UserAuthenticator – methods of authenticating a user
SecurityChecker – ACL access
HostAllocator – provides different ways to allocate hosts
StoragePoolAllocator – provides different ways to allocate volumes

Apache CloudStack projects currently in progress to extend the CloudStack capability 




Thursday 5 June 2014

OpenStack -Cloud software

OpenStack is a global collaboration of developers and cloud computing technologists seeking to produce a ubiquitous Infrastructure as a Service (IaaS) open source cloud computing platform for public and private clouds. OpenStack was founded jointly by Rackspace Hosting and NASA in July 2010.

The Pieces of OpenStack
  • OpenStack Compute (Nova)
  • OpenStack Object Storage (Swift)
  • OpenStack Image Service (Glance)
  • Dashboard
  • Identity Management
  • Networking
  • Load balancers
  • Database
  • Queueing






OpenStack Core Components 

OpenStack Compute (core - NOVA)
    Provision and manage large networks of virtual machines
OpenStack Object Store (core - SWIFT)
Create petabytes of secure, reliable storage using standard hardware
OpenStack Image Service (core - GLANCE)
Catalog and manage massive libraries of server images
OpenStack Identity (core - KEYSTONE)
Unified authentication across all OpenStack projects; integrates with existing authentication systems.
OpenStack Dashboard (core - HORIZON)
Enables administrators and users to access and provision cloud-based resources through a self-service portal.


OpenStack Compute - Nova








OpenStack Storage - Swift


Swift storage is fully distributed and is not a file system. Its system components are given below:


The Ring: Mapping of names to entities (accounts, containers, objects) on disk.
  • Stores data based on zones, devices, partitions, and replicas
  • Weights can be used to balance the distribution of partitions
  • Used by the Proxy Server and several background processes
Proxy Server: Request routing, exposes the public API
Account Server: Handles listing of containers, stores as SQLite DB
Container Server: Handles listing of objects, stores as SQLite DB
Object Server: Blob storage server, metadata kept in xattrs, data in binary format
Replication: Keeps the system consistent, handles failures
Updaters: Process failed or queued updates
Auditors: Verify integrity of objects, containers, and accounts
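The ring's name-to-partition mapping is essentially consistent hashing. The sketch below is a simplification, not Swift's actual implementation: Swift does hash the object path with MD5 against a power-of-two partition count, but the device list and round-robin replica choice here are invented (the real ring balances by weight and spreads replicas across zones):

```python
import hashlib

PART_POWER = 4                   # 2**4 = 16 partitions; real rings use far more
PART_COUNT = 2 ** PART_POWER
DEVICES = ["z1-disk0", "z1-disk1", "z2-disk0", "z2-disk1"]  # illustrative names

def partition_for(name):
    """Map an object name (account/container/object) to a partition."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % PART_COUNT

def devices_for(partition, replicas=3):
    """Pick `replicas` distinct devices for a partition (round-robin here;
    the real ring also balances by device weight and spreads across zones)."""
    return [DEVICES[(partition + r) % len(DEVICES)] for r in range(replicas)]

part = partition_for("AUTH_acct/photos/cat.jpg")
print(0 <= part < PART_COUNT)              # True
print(len(set(devices_for(part))) == 3)    # True
```

The key property is that the mapping is deterministic: every proxy server computes the same partition and replica devices for a name without consulting a central directory.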

OpenStack Networking - Neutron




Networking is a standalone component in the OpenStack modular architecture. It's positioned alongside OpenStack components such as Compute, Image Service, Identity, or the Dashboard. Like those components, a deployment of Networking often involves deploying several services to a variety of hosts.

Networking is entirely standalone and can be deployed to a dedicated host. Depending on your configuration, Networking can also include the following agents:


Networking agents:
  • plug-in agent (neutron-*-agent): Runs on each hypervisor to perform local vSwitch configuration. The agent that runs depends on the plug-in that you use; certain plug-ins do not require an agent.
  • dhcp agent (neutron-dhcp-agent): Provides DHCP services to tenant networks. Required by certain plug-ins.
  • l3 agent (neutron-l3-agent): Provides L3/NAT forwarding to give VMs on tenant networks external network access. Required by certain plug-ins.
  • metering agent (neutron-metering-agent): Provides L3 traffic metering for tenant networks.
These agents interact with the main neutron process through RPC (for example, RabbitMQ or Qpid) or through the standard Networking API. In addition, Networking integrates with OpenStack components in a number of ways:
  • Networking relies on the Identity service (keystone) for the authentication and authorization of all API requests.
  • Compute (nova) interacts with Networking through calls to its standard API. As part of creating a VM, the nova-compute  service communicates with the Networking API to plug each virtual NIC on the VM into a particular network.
  • The dashboard (horizon) integrates with the Networking API, enabling administrators and tenant users to create and manage network services through a web-based GUI.

Orchestration - Heat

Heat orchestrates composite cloud applications (stacks).
Provides HA (restarts failed resources) and “auto-scaling”.

Configuration Management - Puppet

  • Puppet can be used; it is widely used, has a large community, and scales 
  • Simplifies the configuration of OpenStack itself

Accounting - Ceilometer

  • Provides accounting for OpenStack by project
  • Collects statistics from each compute node
  • Uses the common OpenStack message bus

OpenStack’s community is remarkably vibrant. It is no longer led by any single vendor.

Prominent Adopters
  • Private Cloud Solutions (Dell, Nebula, Piston)
  • Large public clouds & hosting companies (Rackspace, ATT, NTT, Dreamhost, HP, Deutsche Telecom)
  • Web & SaaS Providers (eBay, Wikimedia)
  • Government (NASA)
  • Major Linux Distributions (Ubuntu, Suse, RedHat)
  • Hardware Vendors (Dell, HP, IBM, Cisco)

Substantial Contributors

Dev: Rackspace, HP, RedHat, Citrix, Nebula, Cisco, Canonical, Piston … 
Ops: Dell has led here with Opscode.

Leading hardware vendors and virtualization players have integrated their technologies with OpenStack to provide cloud solutions 


1) VMware Technologies and OpenStack






2) Dell OpenStack  Cloud Solution





3) Rackspace OpenStack Private Cloud Reference Architecture