SQL-on-Hadoop

SQL-on-Hadoop is a class of analytical application tools that combine established SQL-style querying with newer Hadoop data framework elements.

By supporting familiar SQL queries, SQL-on-Hadoop lets a wider group of enterprise developers and business analysts work with Hadoop on commodity computing clusters. Because SQL was originally developed for relational databases, it has to be adapted for the Hadoop 1 model, which uses the Hadoop Distributed File System (HDFS) and MapReduce, or for the Hadoop 2 model, which can work without either HDFS or MapReduce.

The approaches to executing SQL in Hadoop environments fall into three groups: (1) connectors that translate SQL into a MapReduce format; (2) “push down” systems that forgo batch-oriented MapReduce and execute SQL within Hadoop clusters; and (3) systems that apportion SQL work between MapReduce-HDFS clusters and raw HDFS clusters, depending on the workload.

One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data warehouse, which featured HiveQL, a SQL-like query language whose queries are translated into MapReduce jobs. Other tools that help support SQL-on-Hadoop include BigSQL, Drill, Hadapt, HAWQ, H-SQL, Impala, JethroData, PolyBase, Presto, Shark (Hive on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).
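
To make the idea of translating a SQL-like query into a MapReduce job more concrete, the sketch below expresses the query SELECT dept, COUNT(*) FROM employees GROUP BY dept as map, shuffle, and reduce steps in plain Python. It is an illustration of the general technique only, not how Hive or any other tool listed here actually plans or runs queries. The employees table, its rows, and all function names are hypothetical and chosen for this example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an "employees" table.
ROWS = [
    {"name": "Ana", "dept": "sales"},
    {"name": "Bo", "dept": "eng"},
    {"name": "Cy", "dept": "sales"},
    {"name": "Di", "dept": "eng"},
    {"name": "Ed", "dept": "ops"},
]


def map_phase(row):
    """Map step: emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    yield row["dept"], 1


def shuffle(pairs):
    """Shuffle step: collect all values that share a key.

    In a real cluster the framework does this across machines; here it is
    an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: COUNT(*) becomes a sum over the emitted 1s."""
    return key, sum(values)


def count_by_dept(rows):
    """Run the three steps in sequence and return {dept: count}."""
    pairs = (pair for row in rows for pair in map_phase(row))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

The GROUP BY column becomes the map output key and the aggregate function becomes the reduce logic. That correspondence is the core of the connector-style approach, and the batch nature of these steps is what the “push down” systems aim to avoid.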