Big data means immense amount of data, so much so that it is difficult to collect, store, manage, and analyze via general database software. In general, the meaning of “immense amount of data” is classified into three types as follows:
Volume: There is too much data to be stored and require too many processes—semantic analysis/data processing. These are the two elements that we need to understand.
Velocity: It means storage and processing speed.
Variety: The demand for unstructured data, such as text and images is increasing as well as refined-type data that can be standardized and previously defined like the RDBMS table record.
Volume: There is too much data to be stored and require too many processes—semantic analysis/data processing. These are the two elements that we need to understand.
Velocity: It means storage and processing speed.
Variety: The demand for unstructured data, such as text and images is increasing as well as refined-type data that can be standardized and previously defined like the RDBMS table record.
Bigdata Reference Architecture
Data Processing Flow
Data Transformation Flow
Bigdata Platform
Apache Hadoop
The Apache Hadoop software library framework allows for distributed processing of large datasets across clusters of computers on commodity hardware. This solution is designed for flexibility and scalability, with an architecture that scales to thousands of servers and petabytes of data. The library detects and handles failures at the application layer, delivering a high-availability service on commodity hardware.
Hadoop
Hadoop is a Platform which enables you to store and analyze large volumes of data. Batch oriented (high throughput and low latency) and strongly consistent (data is always available).
Hadoop is best utilized for:
Large scale batch analytics
Unstructured or semi-structured data
Flat files
Hadoop is comprised of two major subsystems
Hadoop is best utilized for:
Large scale batch analytics
Unstructured or semi-structured data
Flat files
Hadoop is comprised of two major subsystems
- HDFS (File System)
- Map Reduce
HDFS
Is a file system that supports large files
Files are broken into 64MB+ Blocks that are normally triple replicated.
NameNode
Is essentially the master meta data server.
The NameNode only persists metadata. It does not persist the location of each data node that hosts a block. The metadata is stored persistently on a local disk in the form of two files:
Name Space Image File (FS Image)
Edit Log
Secondary NameNode
Fetches the FS Image and the Edit Log and merges them together into a single file preventing the Edit Log from becoming too large.
Runs on a separate machine then the Primary NameNode
Maintains an out of date image of the merged Name Node image, which could be utilized if the Primary Name Node fails.
DataNode
The purpose of the DataNode is to retrieve blocks of data when it is told to do so by either the Clients and/or NameNode.
Stores all the raw blocks that represent the files being stored. Periodically reports back to the NameNode with lists of blocks it is storing.
File Reads (Process)
Data Nodes are sorted by proximity to the Client (Application making the READ request)
Clients contact DataNodes directly to retrieve data and the NameNode simply guides the Client to the best datanode for each block of data being read.
If an error occurs while reading a block from a DataNode, then the NameNode will try the next closest DataNode to the Client in order to retrieve the block of data. DataNodes that fail are reported back to the NameNode
File Writes (Creating a New File)
Map Reduce
Is a software framework for writing applications which process very large datasets (multi-terabyte data sets) in parallel on large clusters of machines. Essentially enabling the user to run analytics across large blocks of data.
The MapReduce Framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The Map Reduce Framework consists of a single master JobTracker and one slave Task Tracker per cluster node.
Job Tracker
Coordinates all the jobs run on the system scheduling tasks to run on Task Trackers. If a Job fails then the Job Tracker can re-schedule it to another Task Tracker.
Stores in-memory information about every running MapReduce Job
Assigns Tasks to machines in the cluster.
When a Job Tracker assigns a task to a machine, It will prioritize the task to machines with Data Locality.
Task Tracker
Runs Tasks and sends progress reports to the Job Tracker
Has a local directory to create a localized cache and localized job
Code is essentially moved to the data (Map Reduce Jars) instead of visa versa. It is more efficient to move around small jar files then moving around data. Map Reduce Jars are sent to Task Trackers to run locally (i.e. machines where the data is local to the task).
MapReduce Example:
Input a raw weather data file that is comma delimited and determine the maximum temperature in the dataset.
MAPPER
Assume the ‘Keys’ of the input file are line offsets between each row of data
Our user defined Mapper Function simply extracts the ‘Year’ and ‘Temperature’ from each row of input data.
The Output of our Map Function is sorted before sending it to the Reduce function. Therefore, each key / value in the intermediate output (year, temperature) is grouped by ‘Year’ and sorted by ‘Temperature’ within that year.
REDUCER
The Reducer function takes the sorted Map(s) inputs and simply iterates through the list of temperatures per year and selects a maximum temperature for each year.
HBase
HBase is a distributed Key / Value store built on top of Hadoop and is tightly integrated with the Hadoop MapReduce framework. HBase is an open source, distributed, column oriented database modeled after Google’s BigTable.
HBase shines with large amounts of data, and read/write concurrency.
Automatic Partitioning – as your table grows, they will automatically be split into regions and distributed across all available nodes.
Does not have indexes. Rows are stored sequentially, as are the columns written to each row.
HBase makes Hadoop useable for real time streaming workloads which the Hadoop File System cannot handle itself.
HIVE
Utilized by individuals with strong SQL Skills and limited programming ability.
Compatible with existing Business Intelligence tools that utilize SQL and ODBC.
Metastore – central depository of Hive Metadata.
It is comprised of a service and a backup store for the data.
Usually a standalone Database such as MySQL is utilized for the standalone Metastore.
Partial support of SQL-92 specification
OOZIE
Is a workflow scheduler. It manages data processing jobs (e.g. load data, storing data, analyze data, cleaning data, running map reduce jobs, etc.) for Hadoop.
Supports all types of Hadoop jobs and is integrated with the Hadoop stack.
Supports data and time triggers, users can specify execution frequency and can wait for data arrival to trigger an action in the workflow.
ZooKeeper
Zookeeper is a stripped down filesystem that exposes a few simple operations and abstractions that enable you to build distributed queues, configuration service, distributed locks, and leader election among a group of peers.
Configuration Service – store and allows applications to retrieve or update configuration files
Distributed Lock – is a mechanism for providing mutual exclusion between a collection of processes. At any one time, only a single process may hold the lock. They can be utilized for leader election, where the leader is the process the holds the lock at any point of time.
Zookeeper is highly available running across a collection of machines.
PIG
Pig and Hive were written to insulate users from the complexities of writing MapReduce Programs.
MapReduce requires users to write mappers and reducers, compilation of the code, submitting jobs and retrieving the results of the jobs. This is a very complex and time consuming process.
A Pig program is made up up of a series of operations, or transformations that are applied to the input data to produce a desired output. The operations describe a data flow, which is converted into a series of MapReduce Programs.
PIG is designed for batch processing of data. Pig is not designed to handle a small amount of data, since it has to scan the entire dataset.
PIG is comprised of two components:
The Language (Pig Latin) utilized to express data flows
The Execution Environment to run Pig Latin Programs.
Big Data & Analytics Reference Architecture
IBM Big Data Platform
Oracle
SAP HANA
Bigdata Security and privacy framework
The world's information doubles every two years,Over the next 10 years
The number of servers worldwide will grow by 10x
Amount of information managed by enterprise data centers will grow by 50x
Number of “files” enterprise data center handle will grow by 75
Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB
Bigdata platform adaption is mandatory and necessary for all corporate customers
The number of servers worldwide will grow by 10x
Amount of information managed by enterprise data centers will grow by 50x
Number of “files” enterprise data center handle will grow by 75
Estimated Global Data Volume:
2011: 1.8 ZB
2015: 7.9 ZB
Bigdata platform adaption is mandatory and necessary for all corporate customers