Apache HBase is the Hadoop database, a distributed, scalable, big data store. HBase is an open-source, distributed, versioned, non-relational database modeled after the design described in "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis. Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL. HiveQL also lets traditional MapReduce programmers plug in their own custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL itself.
To start the interactive Hive shell, run the hive command. Once the startup messages finish, HiveQL statements can be entered at the hive> prompt:
$ hive
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/09/10 15:14:16 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/apps/hathi/apache-hive-0.14.0-bin/lib/hive-common-0.14.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/gphd/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apps/hathi/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>
hive> CREATE TABLE pokes_user (foo INT, bar STRING);
OK
Time taken: 1.56 seconds
hive> CREATE TABLE invites_user (foo INT, bar STRING) PARTITIONED BY (ds STRING);
OK
Time taken: 0.125 seconds
hive> SHOW TABLES;
OK
invites_user
pokes_user
Time taken: 0.114 seconds, Fetched: 2 row(s)
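The same kind of HiveQL statements can also be run non-interactively from a script file. The following is only a sketch: it assumes the hive client is on your PATH and that the two tables created in the session above exist. The temporary script file and the guard around the hive call are illustrative choices and not part of the original walkthrough.

```shell
# Write a few read-only HiveQL statements to a temporary script file.
script=$(mktemp)
cat > "$script" <<'EOF'
DESCRIBE pokes_user;
DESCRIBE invites_user;
SHOW PARTITIONS invites_user;
EOF

# Run the script only where the hive client is installed (assumption:
# it is on PATH on the cluster front-end).
if command -v hive >/dev/null 2>&1; then
    hive -f "$script"
else
    echo "hive not found; statements left in $script" >&2
fi
```

DESCRIBE lists each table's columns, including the ds partition column of invites_user. SHOW PARTITIONS returns no rows until data has been loaded into a partition.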
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The following steps walk through running Pig in interactive mode. Before you start, copy the /etc/passwd file from your local system to hdfs:///user/yourusername/. The Pig Latin statements used here extract all user IDs from that copy of the passwd file.
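The copy into HDFS could be done along the following lines. This is a sketch that assumes the hdfs client is on your PATH and that your HDFS home directory is /user/yourusername; the variable names and the guard are illustrative only.

```shell
# Local source file and HDFS destination. A relative HDFS path such as
# "passwd" resolves to your HDFS home directory (/user/yourusername/).
src=/etc/passwd
dest=passwd

if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -put "$src" "$dest"   # copy the local file into HDFS
    hdfs dfs -ls "$dest"           # confirm that the copy is there
else
    echo "hdfs not found; run this on a Hadoop client node" >&2
fi
```

Using the relative name passwd keeps the destination consistent with a Pig load statement that refers to 'passwd' without a leading path.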
$ pig -x mapreduce
15/10/06 16:32:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/06 16:32:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/06 16:32:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-06 16:32:01,856 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-06 16:32:01,857 [main] INFO org.apache.pig.Main - Logging error messages to: /path/to/pig/pig_1444163521852.log
2015-10-06 16:32:01,908 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /path/to/.pigbootup not found
2015-10-06 16:32:02,808 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-06 16:32:02,808 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-06 16:32:02,809 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://hathi-adm.rcac.purdue.edu:8020
2015-10-06 16:32:02,816 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2015-10-06 16:32:04,362 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
Then, enter the Pig Latin statements interactively at the grunt prompt:
/* id.pig */
A = load 'passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
store B into 'id.out'; -- write the results to an output directory named id.out
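As a rough local cross-check of what these statements compute, the same extraction (first field of a colon-delimited file) can be sketched with standard shell tools. The function name extract_ids and the two sample lines are hypothetical and stand in for the real passwd file.

```shell
# Local stand-in for the Pig script: print field 1 of a colon-delimited file.
extract_ids() {
    cut -d: -f1 "$1"
}

# Two made-up passwd-style lines used as sample input.
sample=$(mktemp)
printf 'root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n' > "$sample"

ids=$(extract_ids "$sample")
echo "$ids"    # prints "root" and "daemon" on separate lines
rm -f "$sample"
```

On the cluster, the Pig job's own output can typically be inspected with hdfs dfs -cat 'id.out/part-*', assuming the job completed and wrote its part files under id.out in your HDFS home directory.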