Pig can run in two execution modes. Which mode to use depends on where the Pig script will run and where the data resides: the data may live on a single machine (the local file system) or in a distributed environment such as a typical Hadoop cluster.
Independently of the execution mode, Pig programs can be run in three different ways. The first is the non-interactive shell, also known as script mode, where we write the Pig Latin statements to a file and execute that file as a script. The second is the Grunt shell, an interactive shell for running Pig commands one at a time. The third is embedded mode, where Pig programs are embedded in a Java program, much as JDBC is used to run SQL programs from Java.
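As a minimal sketch of script mode, the statements below could be saved to a file (say wordcount.pig, a hypothetical name) and executed with the pig command; the same statements can also be typed one at a time at the Grunt prompt. The input file words.txt and its layout are assumptions made for illustration.

-- wordcount.pig: count how often each word occurs (hypothetical example)
lines   = LOAD 'words.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';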
Pig Local mode
In this mode, Pig runs in a single JVM and accesses the local file system. It is best suited to smaller data sets. Parallel mapper execution is not possible in local mode because earlier versions of Hadoop are not thread-safe.
By providing -x local, we get into Pig's local execution mode. In this mode, Pig always looks for a local file system path when data is loaded; running $ pig -x local starts Pig in local mode.
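For example, with -x local a LOAD statement resolves its path against the local file system. The path, delimiter, and schema below are assumptions made for illustration; the session would be started with pig -x local.

-- run with: pig -x local   (paths resolve against the local file system)
sales = LOAD '/home/user/data/sales.csv' USING PigStorage(',')
        AS (id:int, item:chararray, amount:double);
big   = FILTER sales BY amount > 100.0;
DUMP big;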
Pig MapReduce mode
In this mode, we need a proper Hadoop cluster setup with Hadoop installed on it. By default, Pig runs in MapReduce mode. Pig translates the submitted queries into MapReduce jobs and runs them on top of the Hadoop cluster, so we can think of this as MapReduce mode on a fully distributed cluster.
Pig Latin statements such as LOAD and STORE are used to read data from the HDFS file system and to write output back to it; these statements, together with the other operators, are used to process the data.
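A small sketch of the same pattern in MapReduce mode: the paths now refer to HDFS and the statements are compiled into MapReduce jobs. The HDFS paths, delimiter, and field names are assumptions made for illustration.

-- run with: pig   (or pig -x mapreduce); paths resolve against HDFS
logs    = LOAD '/user/hadoop/input/access_logs' USING PigStorage('\t')
          AS (ip:chararray, url:chararray, bytes:long);
by_ip   = GROUP logs BY ip PARALLEL 10;   -- PARALLEL sets the number of reducers
traffic = FOREACH by_ip GENERATE group AS ip, SUM(logs.bytes) AS total_bytes;
STORE traffic INTO '/user/hadoop/output/traffic' USING PigStorage('\t');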
Storing Results
During the processing and execution of the MapReduce jobs, intermediate data is generated. Pig stores this data in a temporary location on HDFS, so that temporary location has to exist (and be writable) inside HDFS.
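In many Pig versions this temporary location is controlled by the pig.temp.dir property (which defaults to /tmp on HDFS); the property name and the path below are assumptions that depend on your Pig version and cluster layout, so treat this as a sketch rather than a fixed recipe.

-- point Pig's intermediate data at a writable HDFS directory (hypothetical path)
SET pig.temp.dir '/user/hadoop/pig_tmp';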
By using DUMP, we can have the final results displayed on the screen. In a production environment, the output results are stored using the STORE operator.
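A brief sketch of the difference, using the hypothetical traffic relation and output path from the earlier example: DUMP prints the relation to the console (handy while developing), while STORE writes it to HDFS.

-- development: inspect results on screen
DUMP traffic;
-- production: persist results to HDFS (hypothetical output path)
STORE traffic INTO '/user/hadoop/output/traffic_report' USING PigStorage(',');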