education plus: 2015

Thursday, December 31, 2015

Pig Execution Modes

Pig can run in two execution modes. The modes depends on where the pig script is going to run and where the data is residing. The data can be stored in a single machine, i.e. local file system or it can be stored in a distributed environment like typical Hadoop Cluster environment. We can run pig programs in three different modes.

First one is non interactive shell also known as script mode. In this we have to create a file, load the code in the file and execute the script. Second one is grunt shell, it is an interactive shell for running pig commands. Third one is embedded mode, in this we use JDBC to run SQL programs from Java.

Pig Local mode

In this mode, the pig runs on single JVM and accesses the local file system. This mode is best suitable for dealing with the smaller data sets. In this, parallel mapper execution is not possible because the earlier versions of the Hadoop versions are not thread safe.

By providing –x local, we can get in to pig local mode of execution. In this mode, pig always looks for the local file system path when data is loaded. $pig –x local implies that it’s in local mode.

Pig Map reduce mode

In this mode, we could have proper Hadoop cluster setup and Hadoop installations on it. By default, the pig runs on MR mode. Pig translates the submitted queries into Map reduce jobs and runs them on top of Hadoop cluster. We can say this mode as a Map reduce mode on a fully distributed cluster.

Pig Architecture

Pig Architecture consists of Pig Latin Interpreter and it will be executed on client Machine. It uses Pig Latin scripts and it converts the script into a series of MR jobs. Then It will execute MR jobs and saves the output result into HDFS. In between, it performs different operations such as Parse, Compile, optimize and plan the Execution on data that comes into the system.

Job Execution Flow

When a pig programmer develops scripts, they are stored in the local file system in the form of user defined functions. When we submit Pig Script, it comes in contact with pig Latin Compiler which splits the task and run a series of MR jobs, meanwhile Pig Compiler fetches data from HDFS (i.e. input file present). After running MR jobs, the output file is stored in HDFS.

Tuesday, December 29, 2015

Hadoop Pig

Pig is developed on top of Hadoop. It provides the data flowing environment for processing large sets of data. The pig provides a high-level language. It is an alternative abstraction on top of Map Reduce (MR). Pig program supports parallelization mechanism. For scripting of the Pig, it provides Pig Latin language.

Pig takes Pig Latin scripts and turns into a series of MR jobs. Pig scripting having its own advantages of running the applications on Hadoop from the client side. The nature of programming is easy, compared to low level languages such as Java. It provides simple parallel execution, the user can write and use its own custom defined functions to perform unique processing.

Pig Latin provides several operators like LOAD, FILTER, SORT, JOIN, GROUP, FOREACH and STORE for performing relational data operations. The operators implement data transformation tasks with simple lines of codes. Compared to MR code, the Pig Latin codes are very less in lines and gives better flexibility in some IT industrial use cases.

CS6611 MOBILE APPLICATION DEVELOPMENT LABORATORY

http://gr-solution.blogspot.in/2015/11/develop-application-that-uses-gui_16.html

Monday, December 28, 2015

Hive Data Modeling

In Hive data modeling - Tables, Partitions and Buckets come in to picture.

Coming to Tables, it’s just like the way that we create a table in Traditional relational databases. The functionalities such as filtering, joins can be performed on the tables. Hive deals with two types of table structures - Internal and External, depends on the design of schema and how the data is getting loaded in to Hive.

Internal Table is tightly coupled with nature. At first, we have to create tables and load the data. We can call this one as data on the schema. By dropping this table, both data and schema will be removed. The stored location of this table will be at /user/hive/warehouse.

External Table is loosely coupled with nature. Data will be available in HDFS; the table is going to get created with HDFS data. We can say that its creating schema of data. At the time of dropping the table, it dropped only schema, data will be available in HDFS as before. External tables provide an option to create multiple schemas for the data stored in HDFS instead of deleting the data every time whenever schema updates.

Partitions

Partitions come into place when table is having one or more Partition keys which is the basis for determining how the data is stored. For Example: - “Client has Some E–commerce data which belong to India operations in which each state (29 states) operations mentioned in as a whole. If we take the state as partition key and perform partitions on that India data as a whole, we will be able to get a Number of partitions (29 partitions) which is equal to the number of states (29) present in India. Each state data can be viewed separately in the partition tables.”

Hive Vs Relational Databases

By using Hive, we can perform some peculiar functionalities that can't be achieved by Relational Databases. For huge amounts of data that is in beta bytes, querying it and getting results in seconds is important. In this scenario, the hive will achieve fast querying and produce results in a second time.

Some key differences between hive and relational databases are the following

Relational databases are of “Schema on READ and Schema on Write”. First creating a table and then inserting the data into the particular table. Insertions, Updates, Modifications can be performed on this relational database table.
Hive is “Schema on READ only”. Update, modifications won't work on this because the hive query in typical cluster is set to run on multiple Data Nodes. So it is not possible to update and modify data across multiple nodes. Hive provides READ Many WRITE Once.

Job Execution Inside Hive

HIVESERVER is an API that allows the clients (JDBC) to execute the queries on hive data warehouse and get the desired results. Under hive services driver, compiler and execution engine interact with each other and process the query.

The client submits the query via a GUI. The driver receives the queries in the first instance from GUI and it will define session handlers which will fetch required APIs that is designed with different interfaces like JDBC or ODBC. The compiler creates the plan for the job to be executed. Compiler in turn is in contact with matter and its gets metadata from Meta Store.

Hive Architecture

There are 3 major components in Hive as shown in the architecture diagram. They are hive clients, hive services and Meta Store. Under hive client, we can have different ways to connect to HIVE SERVER in hive services.

These are Thrift client, ODBC driver and JDBC driver. Coming to thrift client, it provides an easy environment to execute the hive commands from a vast range of programming languages. Thrift client bindings for Hive are available for C++, Java, PHP scripts, python scripts and Ruby. Similarly, JDBC and ODBC drivers can be used for communication between hive client and hive servers for compatible options.

Tuesday, December 22, 2015

Hadoop Hive

Hive is developed on top of Hadoop as its data warehouse framework for querying and analysis of data that is stored in HDFS. Hive is an open source-software that lets programmers analyze large data sets on Hadoop. Hive makes the job easy for performing operations like data encapsulation, ad-hoc queries, & analysis of huge datasets.

The hive’s design reflects its targeted use as a system for managing and querying structured data. While coming to structured data in general, Map Reduce doesn’t have optimization and usability features, but Hive framework provides those features. Hive’s SQL-inspired language separates the user from the complexity of Map Reduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, to ease learning.

Hadoop programming works on flat files. The hive can use directory structures to “partition“ data to improve performance on certain queries. To support these enhanced features, a new and important component of Hive i.e. metastore is used for storing schema information. This Metastore typically resides in a relational database.

We can interact with hive using several methods; those are Web GUI and Java Database Connectivity (JDBC) interface. Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to write hive queries using Hive Query Language(HQL). Generally HQL syntax is similar to the SQL syntax that most data analysts are familiar with.

Job Execution and its Work flow

Let’s say for performing some computational mechanism using Java, we submitted some Job and data of size 1000 MB. Once a client submits the job, it will contact NN for the resources which are readily available for the job to execute. NN will provide data node information to the JT for further proceeding. Depends on the availability of resources, the JT splits and assigns tasks to Task Tracker.

Suppose, the Job is of 1000MB(1GB), assume JT splits the work into 10 tasks and allocates 100 MB to each process. Here the capacity to handle task tracker depends on input split like block size (64 MB or 128 MB).

Coming to open source apache Hadoop Limitations, the Hadoop 1.0 has scalability limited to 5000 nodes in cluster and maximum 40000 concurrent tasks can be available for this and it means each node is going to provide 8 concurrent tasks. Hadoop 2.2 overcomes the limitations of Hadoop 1. Hadoop enterprise editions will provide features like to distribute storage space in the cloud, Extensibility to upgrade nodes.

Functionalities of Hadoop daemons

Name Node, Secondary Name Node and Job Tracker are known as Master daemons. Task Tracker and Data Nodes are known as slave daemons. Whenever a client submits a Job, the actual processing is going to take place in Task Tracker.

NameNode consists of all Meta data information such as racks, blocks (r1, r2, b1, b2). NN sends its image file to secondary NN every 3sec (Heartbeat Mechanism). Name node is having special feature “Rack Awareness”, it will give information about nodes present in different racks of Hadoop cluster environment.

Secondary Name Node (Snn) - If NN goes down, Snn will take the image instantaneously and acts like NN and performs the NN functionalities.
Job Tracker (JT) - it is used for job scheduling and maintenance. It assigns different tasks to Task Tracker and monitors it.
Task Tracker (TT) - Actual computational processing is going to happen here. It gets in contact with JT and has some heart beat mechanism between them.

For Single Node Hadoop Setup it includes single master node and multiple working nodes. In this case, master consists of Job Tracker, Task Tracker, Name Node (NN) and Data Node (DN). A slave or working nodes act as both the Data Node and Task Tracker. For high level application developments Single Node Hadoop will give limited options (like memory, capacity) only, so it’s not that suitable.

Hadoop Architecture - Read & Write Operations

Wednesday, December 16, 2015

MapReduce

MapReduce is mainly used for parallel processing of large sets of data. Initially, it is a hypothesis specially designed by Google to provide parallelism, data distribution and fault-tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping element between two linked data items - key and its value.

The key(K) acts as an identifier to the value. An example of a key-value (KV) pair is a pair where the key is the node Id and the value is its properties including neighbor nodes, predecessor node etc. MR API provides the following features like batch processing, parallel processing of huge amounts of data and high availability.

For effective scheduling of work, Hadoop provides specific features at the architecture level. They are Fault tolerance, Rack Awareness and Replication Factor. As compared to two native UNIX/LINUX (8 to 16 KB) environment, the Block size in Hadoop by default is 64 MB. There is a provision to change to 128 MB. The Replication Factor by default is 3.

But it depends on the business requirement. We can increase/ decrease the replication factor. Compared to disk blocks, HDFS blocks are larger in size, so it will decrease the costs of six.

High Level Architecture of Hadoop

HDFS

HDFS is a Hadoop distributed file system. As it's name implies, it provides distributed environment for the storage and its file system is designed in a way to run on commodity hardware. There are some distributed file systems in the market like Cloudera, Horton works, etc. which provides distributed storage in HDFS and it has its own special advantages.

HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. By adding nodes to cluster using cheap commodity hardware, Apache Hadoop can process large data sets which will give the client better results as compared to existing distributed systems. In this, data stored in each block replicate into 3 nodes in such a way that even if one node goes down, there will be no loss of data since we have a proper back up recovery mechanism.

So HDFS provides concepts like Replication Factor, High memory block size and it can scale out up to several 1000 nodes.

Monday, December 14, 2015

Hadoop Architecture

Hadoop Distributed Framework is designed to handle large data sets. It can scale out to several thousands of nodes and process enormous amount of data in Parallel Distributed Approach. Apache Hadoop consists of two components. First one is HDFS (Hadoop Distributed File System) and the second component is Map Reduce (MR). Hadoop is write once and read many times.

HDFS is a scalable distributed storage file system and MapReduce is designed for parallel processing of data. If we look at the High Level Architecture of Hadoop, HDFS and Map Reduce components present inside each layer. The Map Reduce layer consists of job tracker and task tracker. HDFS layer consists of Name Node and Data Nodes.

Sunday, December 13, 2015

Typical Industrial Use Case: - Machine and Sensor Data Analytics

“This use case mainly focuses on the Real time streaming data that are coming from the sensors installed inside mines. Machines with high configurations (like memory storages, RAM capacity) have sensors installed on it. Every machine is installed with 2000-3000 sensors and each sensor will produce 1 to 2 MB of data per Sec. The entire mining area has more than 1000 machines.

All the systems are monitored by centralizing administrative network. So the size of data that is going to generate per machine is 20 Terabytes/hour. This is going to be a big challenge to process and do analytics on this huge real time geographical data (Temperature, Humidity, Seismic activity, Radiation activity). In this use case, two types of analytics involved are predictive and prescriptive.”

Processing Peta bytes of streaming data and predicting the weather patterns based on Statistical algorithms in a fraction of seconds may give the solution for identifying the potential Risk Analysis Patterns. Here the weather prediction and disaster risk analysis engine provides the required solution for the use case. This is the best solution provided by the BDA to the Mining Industry for Disaster Risk Management.

An Enterprise distribution of Hadoop provides enterprise class architecture for BDA and for its applications. These are Cloudera distributed Hadoop (CDH), Horton works, IBM Big Insights and MapR. For Real Time Online analytics, IBM had come up with IBM Info sphere streams. The enterprise editions provide scalable, secure, distributed environment for real time and offline data analytics.

Type of Analytics With BDA

Using BDA, we can perform different type of Analytics, namely Predictive, Prescriptive, Descriptive and Cognitive. Among all, Predictive is important to get the useful insights from the data. Predictive analytics are mainly performed by applications which needs to predict patterns in Real Time Analytics scenario. Predictive analytics improve fraud detection and speed up claims processing.

Cognitive Analytics working on cognitive computing systems can help business organizations, in terms of fast, accurate results by using Neural and Robotics like environments and setup. These systems learn and interact naturally with people to extend what either humans or machine could do on their own.

Typical live example is IBM Watson computer won the Jeopardy game in 2011 which is a purely super computer which won against former two champions of the Jeopardy game. The IBM Watson computer, purely works on cognitive system and it works on NLP (Natural Language Processing).

Industries in which big data had made Big Difference

The Industries such as Telecom, Retail, Finance, Healthcare, Banking, Energy and Automobile Sectors widely use Big data and its analytics and getting benefit in the current global market. Data volumes and data generation speed are significantly higher than it was before. All this new kind of data require a new set of technology to store, process and to make sense of data.

Predictive analytics improve fraud detection and speeds up claims processing. As a result of these, analytics gives more effective marketing, better customer service and new revenue generating opportunities in different industrial domains.

Comparing to traditional methods and approaches, the Big Data is adding so many features to present market trend in terms of sales and revenues. The Typical applications of these Big Data Analytics include weather predictions, Geospatial pattern recognition, Disaster management and Space Technology. One of the main functions of an ETL tool is to transform structured data. The transformation step is the most vital stage of building a structured data warehouse.

Big Data Analytics

Big Data Analytics (BDA) comes into the picture when we are dealing with the enormous amount of data that is being generated from the past 10 years with the advancement of the science and technology in different fields. To process this large amount of data and getting valuable meaning from it in a short span of time is a really challenging Task. Especially when four V’s that comes into the picture, when we discussing about BDA i.e. Volume, Velocity, Variety and Veracity of data.

Why and When to go for Big Data Analytics

Big data is a Revolutionary term that describes the Very large amount (Volume) of unstructured (text, images, videos), structured (tabular data) and semi-structured (json, xml) data that have the potential to be mined for information.

Volume (data at scale)

Volume is about large amount of data that is being generated daily from different type of sources, i.e. Namely, we can say like Social media data (Facebook, Twitter, Google), Satellite images, mining and sensor data, Different Type of Network logs generated from servers.

Integrating and processing these huge volumes of data, stored across a scalable and distributed environment poses a business huge challenge to analysts. Big IT Giants like Yahoo, Google generates Peta Bytes of data in less span of time. IT industry, the increase in data volume is in exponential terms compared to the past.

DROOLS Facts

Facts are nothing but plain old java objects (POJO). This article provides an overview on "How to create DROOLS Facts" in eclipse with the help of an example. A sample fact about a "Room" is as given below. Room have 4 properties. They are Room Name, Height , Width and Alarm Status. Value of alarm status will be true when alarm occurs due to fire or short circuit. When in normal condition, value of alarm status to be False.

public class Room 
{
private String roomName;
private String ROOM_WIDTH;
private String ROOM_HEIGHT;
private boolean alarmStatus;

public String getRoomName() {
return roomName;
}

public void setRoomName(String roomName) {
this.roomName = roomName;
}

public String getROOM_WIDTH() {
return ROOM_WIDTH;
}

public void setROOM_WIDTH(String rOOM_WIDTH) {
ROOM_WIDTH = rOOM_WIDTH;
}

public String getROOM_HEIGHT() {
return ROOM_HEIGHT;
}

public void setROOM_HEIGHT(String rOOM_HEIGHT) {
ROOM_HEIGHT = rOOM_HEIGHT;
}

public boolean isAlarmStatus() {
return alarmStatus;
}

public void setAlarmStatus(boolean alarmStatus) {
this.alarmStatus = alarmStatus;
}

}

If you look at the above fact you can see that , variables "roomName", "alarmStatus" starts with a small letter. Variables ROOM_WIDTH, ROOM_HEIGHT are capital. Reason behind it is that, drools wont allow variables starting with capital letter preceded by small letter. Fully capitalized words are accepted by DROOLS.

Creating Drools Knowledgebase, Knowledgesession and Facts

KnowledgeBase is an interface which manages a collection of knowledge definitions like rules, Internal Types and process. Knowledge packages are created by grouping knowledge definition. Since the creation of Knowledge packages is very expensive, KnowledgeBase stores and reuse them. Knowledge Sessions are created from KnowledgeBase, which consumes less memory. KnowledgeBase keeps track of each Knowledge Session, so that updates to knowledge base can be applied to all the session at run time. Another advantage of KnowledgeBase is that, it is completely serializable. We can restore serializable object of KnowledgeBase rather creating a new one.

There are two types of Knowledge Sessions

StatelessKnowledgeSession
StatefulKnowledgeSession

Both session implements BatchExecutor interface. In both sessions, we can insert, update and retract facts. Difference between the two is that stateless session automatically executes "FireAllRulesCommand" when facts are inserted/updated. In StatefulKnowledgeSession, after inserting the facts we had to explicitly call "Fire All Rules" Command.

Procedure to create DROOLS Project in Eclipse

Step by step procedure to create a drool project is given below. Open eclipse. Go to File → New → Other. A pop up appears as shown below.

Select Drools project from DROOLS folder. After selection click on next button. A dialog appears as shown below

Enter a project name and click on next button. A new window appears as shown below. If you wish to have a sample rule file, select first two options, else uncheck the options.

business rule management system (BRMS)

Fact

Facts are the knowledge of DROOLS. Facts are plain old java objects (POJO) where data for rules are stored.

Working Memory

Facts are placed in working memory which will be used for pattern matching at later stage. A user can retrieve, insert or modify facts.

Rule

Rules are a piece code which contain business logic's

Rule Engine

A business rules engine is a software system. Rule engine executes multiple business rules in a production environment. Rule engine have a Inference Engine,production and working memory. Businesses rule are put into production memory. Facts (Data) is placed in working memory. Inference engine matches facts against each rule in the production memory. Inference Engine uses Linear, Rete, Treat or Leaps algorithm for pattern matching. DROOLS uses extended Rete Algorithm, ReteOO for pattern matching. When many rules becomes true for same fact, it creates conflict. Agenda uses Conflict Resolution strategy to manage the execution. Rule engine architecture is as shown below

Basic Drools Concepts

DROOLS is a business rule management system (BRMS) . It have a forward chaining inference based rules engine known as a production rule system, using enhanced Rete algorithm. DROOLS run only on Java platform. This article provides an insight into various DROOLS features and its use. Major components of DROOLS are

Drool Expert (Rule Engine)
Drool Guvnor (BRMS/BPMS)
Drool Flow (Workflow)
Drool Fusion (Reasoning)

Steps to install DROOLS plugin and create DROOLS Project in Eclipse

JBOSS DROOLS is a business rule management system (BRMS). This document explain in detail step by step procedure to setup DROOLS in Eclipse IDE for work flow configuration and rule development. Eclipse IDE version used in this article is Eclipse Java EE IDE for Web Developers, Version: Indigo Service Release 2.

To start with , we need to install DROOL software in Eclipse. To do that

Go To Help → Install new software . A pop up appears as shown below . In Eclipse 3.4.1, go to Help → Software Updates.

Monday, November 30, 2015

Clustered Index vs Non Clustered Index

What is Indexing

Index is one of the database object and indexing can be done in one or more columns of the database table. The purpose of doing indexing is simple i.e, to speed up the retrieval process of rows in the table. Indexing can be applied to tables or views to speed up the retrieval process. Index can be considered as “copy” of the data of table and will be sorted based on some way. For example if we have a table called “Category” as shown below -

Category Table

CategoryID INT - PK
CategoryName VARCHAR(50)
CategoryAdded DATETIME
CategoryDeleted DATETIME

As shown above in the table structure “Category ID” is a primary key and will default have the clustered index. So all primary keys will default have the clustered index associated with it. Now consider another scenario where in we search the records based on “Category Name” and say the table has more than 1 million records in that table it will surely give the performance issues while filtering records based on “Category Name” as it does “full table scan” internally.

So to avoid the performance issues we can set “Category Name” as Non clustered Index. So now it sorts the data based on “Category Name” and stores it in the disk of the machine and hugely increases the performance as well.

Clustered Index

Let’s add some data into the table –“Category” and let’s see how the table search happens through query and let’s see how we can minimize number of comparisons using clustered index –

Category ID	Category Name	Category Added	Category Deleted
1	Category1	NULL	NULL
2	Category2	NULL	NULL
3	Category3	NULL	NULL
4	Category4	NULL	NULL
5	Category5	NULL	NULL

Below is the sample image of how the table scan is happened without clustered index -

Below is the sample image of how the table data is structured after clustered index -

As shown in the first figure above, table – Category have primary key – “Category ID” which is by default have a clustered index. Primary key as we know does not allow null and does not allow duplicate values.

Below are the list of points to be kept in mind while creating a Clustered Index -

There could be only one clustered index allowed in a table.
It is good to add the clustered index to the column which is used mostly in “where” clause the most. Because it greatly affects the performance of the table.
Whenever index is changed, reordering of data is happened in the table.
When clustered index is created it forms a binary tree and binary tree info stored in the disc separately. Sample binary tree diagram shown above.
During Update, Delete, Insert statements reordering will happen in the table.

Now let’s see how the data compared in binary tree in a query. Suppose I have a query to find the category having CategoryID = 5 –

Select * from Category where Category ID = 5

As shown in the above binary tree figure -

5 is matched with 3,
5 is matched with 4
5 is matched with 5 so only 3 comparisons.

So here total comparisons are 3 which is less compared to comparisons without clustered index 5.

Difference between Inner join and Left join

SQL Join is used for combining the rows between two or more tables using common fields/columns between these tables in a relational database. The goal of joining table is to extract the meaningful list of data. Joins will be obtained from the term called predicate. “Join” is a term used to combine fields of two or more tables by using common values between the tables.

There are different types of joins in SQL and they are –

INNER JOIN
CROSS JOIN
SELF JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN

Difference between Inner Join vs Left join

For demonstration purpose let’s create tables and let’s call it as – “tClass” and “tStudent”

tClass table structure

tStudent table structure

Inner join

Inner join matches the common records in two tables joined. Here in the above example we have “ClassID” in tclass table and “ClassID” in tstudent table. Below query is used to fetch the list of students which are in Class “10th”


SELECT s.StudentID, s.StudentName
FROM tClass c 

INNER JOIN tstudent s ON c.ClassID = s.ClassID

WHERE c.ClassID = 10

So above query is used to fetch the list of students which are studying in Class – “10th”. So here in “tstudent” table all students are stored irrespective of their classes and similarly in “tclass” table all classes are stored. So as the result of the above query, matched records between tstudent and tclass tables are searched and displayed. In the introduction part there was an introduction to the term called “predicate”, in the above query it is – “ON c.ClassID = s.ClassID”, which is one of the essential part of the join.

Note : There are multiple keys associated to each table like in tStudent table have “StudentID” as a primary key field and “ClassID” as a foreign key which inturn refer to the table – “tClass”.

In Inner join each record of table A Is matched with each record of Table B and the matched records are then be displayed in the resultant table.
JOIN word can be used instead of INNER JOIN, both meant the same.
INNER JOIN is the intersection of data between table A and table B.

The above query can be rewritten without using inner join like below but the performance will be impacted compared to inner join –


SELECT s.StudentID, s.StudentName
FROM tClass c , tstudent s

WHERE c.ClassID = s.ClassID AND c.ClassID = 10

Left Join

Left join will return all the rows from the left table and matching rows from right table.
In case of no match with right side table it will return NULL value.
“Left Join” and “Left Outer Join” are used interchangeably because records which are returned will be the same with either of these.

Difference between having and where clause

Where clause can be applied to a single row, where as Having clause is applied to the whole group. But, it does not mean that we cannot have both Where and Having clause in a single query. If both where and having clauses are used together in single a query, where clause will apply filter condition to the list of records and it will filter the group of records from the table and Having clause will be applied to this resultant group and groups which meets the condition given in having clause will be filtered in the resultant table.

Having clause is used, when we want to use select query to filter data from a table based on a aggregate condition. Aggregate functions used along with the having clause are Sum, Count, Max and Min. So, Having clause should always be used along with group by clause in SQL. If it is not used with Group By clause it throws an error saying – “Column 'Your column' is invalid in the HAVING clause because it is not contained in either an aggregate function or the GROUP BYclause“.

Example for Where Clause and Having Clause

Having clause is been added in SQL because WHERE clause could not be used like an aggregate function. Let’s create a multiple tables for demonstrating this with an example –

Table Name - Customer Table

Customer ID	Customer Name	Customer DOB	Customer Age
1	John	10/10/90	24
2	Martin	05/12/65	49
3	Darren	07/26/86	28
4	Derek	01/20/90	25

Table Name - Order Table

Order ID	Customer ID	Order Date
100	1	03/20/15
101	1	03/22/15
102	2	01/29/14
103	3	06/22/14
104	3	08/24/14
105	3	01/19/15

Now let’s filter the list of orders of the customer – “John” done –

SELECT o. Order ID, o. Order Date FROM Customer as c INNER JOIN Order as o ON c.Customer ID = o.Customer ID WHERE c.Customer Name = ‘John’

Above is the straight forward query to filter all the orders done by the customer – “John” and displaying the order date and order ID of the filtered records. So here in the above query we can use Group by clause and having clause also along with the existing where clause as shown below –

So below query is used to filter the orders of all the customers and group all the orders of the customers and filter the customers who has more than one order using having clause.

SELECT o.CustomerID FROM Customer as c INNER JOIN Order as o ON c.Custome rID = o.Customer ID GROUP BY o.Customer ID HAVING COUNT(o.Customer ID) > 1

So as we can see in the above query I am trying to check number of occurrence of Customer ID records in Order table, which means the number of orders done by the customers and then I am trying to fetch the Customer ID which is matching the filter criteria in having clause. So in case you want to get the name of the customer instead of customer ID then below query can be used –


SELECT c1.Customer Name

FROM Customer as c1 INNER JOIN
(

 SELECT o.Customer ID as Customer ID
FROM Customer as c

 INNER JOIN Order as o ON c.Customer ID = o.Customer ID
GROUP BY o.Customer ID
 HAVING COUNT(o.Customer ID) > 1
)cinner

ON cinner.Customer ID = c1.Customer ID

Looks straight forward query. Now my first query has been moved to inner query and now new join has been introduced to match the Customer ID from the filtered data and Customer table’s Customer ID. So the selected customer ID‘s Customer Name has been fetched. This is not only the way to fetch the Customer Name but this is one way of the way.

We can use Temporary table as well for the same purpose. Group by clause allows to us to use functions like max(), min(), sum() etc. which are very useful where as “where” clause does not work with these functions. So having clause can be used with aggregate functions whereas where clause will not work with aggregate functions.

SQL Group By and Having Command

Learning SQL won't be complete without knowing the usage of "Group By" and "Having" clause. These commands are one of the most commonly used one is SQL queries and knowing when and how to use them make your life easy.

SQL Group By Clause

SQL Group by Clause is used to find count, sum, min and max values belonging to a particular category. For example, to get total salary of employees from the Salary table given below, following SQL query is used.

Employee_Name	Year	Month	Salary
John	2015	April	30000
John	2015	May	31000
John	2015	June	32000
Alex	2015	April	40000
Alex	2015	May	41000
Alex	2015	June	42000

select employee_name,Sum(salary) Salary from salary group by employee_name

The result of the above query is given below. The group by function helps us to add up the salaries with respect to the column specified after "Group by" function. In this case, it is employee name.

Employee_Name	Salary
John	93000
Alex	123000

Pages

Thursday, December 31, 2015

Pig Local mode

Pig Map reduce mode

Wednesday, December 30, 2015

Job Execution Flow

Tuesday, December 29, 2015

Monday, December 28, 2015

Sunday, December 27, 2015

Thursday, December 24, 2015

Wednesday, December 23, 2015

Tuesday, December 22, 2015

Monday, December 21, 2015

Sunday, December 20, 2015

Thursday, December 17, 2015

Wednesday, December 16, 2015

Tuesday, December 15, 2015

HDFS

Monday, December 14, 2015

Sunday, December 13, 2015

Thursday, December 10, 2015

Wednesday, December 9, 2015

Tuesday, December 8, 2015

Why and When to go for Big Data Analytics

Volume (data at scale)

Monday, December 7, 2015

Sunday, December 6, 2015

Friday, December 4, 2015