Apache, Hadoop, Hive, SQL
SQL (Structured Query Language) serves as both a Data Definition Language (DDL) and a Data Manipulation Language (DML), and it has been the interface to relational database systems since the late 1970s. Between 1979 and 1982, Oracle and IBM released commercial SQL-based relational databases, and by 1986 SQL had become the dominant DDL/DML, or query language, for relational databases.
In 1986, the American National Standards Institute (ANSI) standardized SQL. The standard was revised in 1989, 1992, 1999, 2003, 2006 and 2008. Standard SQL, also called ANSI SQL, is supported by all major relational database vendors. Each vendor, however, may add its own proprietary language extensions, which together make up that vendor's SQL dialect.
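The DDL/DML distinction described above can be sketched in a few lines. This is only an illustration, assuming SQLite (bundled with Python) as a stand-in engine; it is not one of the projects covered in this course, and the `employees` table and its columns are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for "a relational database" here.
# It is not one of the SQL-on-Hadoop engines discussed in this course.
conn = sqlite3.connect(":memory:")

# DDL: define the structure of a (hypothetical) table.
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)

# DML: insert, update and query the rows held in that structure.
conn.executemany(
    "INSERT INTO employees (id, name, dept) VALUES (?, ?, ?)",
    [(1, "Ada", "eng"), (2, "Grace", "eng"), (3, "Edgar", "research")],
)
conn.execute("UPDATE employees SET dept = 'research' WHERE name = 'Grace'")

rows = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('eng', 1), ('research', 2)]
```

The statements used here are plain standard SQL, so the same CREATE TABLE, INSERT, UPDATE and SELECT text should be accepted by most engines with little or no change. The `?` parameter placeholder belongs to the Python driver, not to the SQL standard.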
The Apache Software Foundation (ASF) Hive project gained popularity quickly because it allowed data users to view the distributed, poly-structured data in HDFS as SQL datasets (tables) and to transform those views with SQL queries. Hive has undergone significant rewrites to satisfy data users who are familiar with SQL and who expect the capabilities and performance of a relational database, even though the data in HDFS is stored as distributed, poly-structured, possibly replicated pieces (blocks) across a network.
In recent years Hive has come under scrutiny for its performance and for the pace at which it has added capabilities familiar to relational database users. In the meantime, other ASF projects have emerged that also provide SQL interfaces but execute SQL queries with different implementation approaches and processing engines. These projects give the user the choice of using SQL on batch, micro-batch or truly streaming data. With some SQL-supporting projects, the data set can be treated as either finite or infinite in scope.
This course presents the current ASF projects that provide SQL interfaces. As students progress through the course, they will be able to determine why one SQL-supporting project might be more appropriate for a specific use case than another project that also provides a SQL interface.
The student will be introduced to Hive with LLAP (Hive 2), Phoenix on HBase, SQL on Spark, Kafka’s KSQL and Ignite’s SQL interface. Flink will be mentioned but is covered extensively in a separate DFHz course, ASF Flink.
Development experience with Java, Hadoop and SQL is a prerequisite. Students new to Hadoop are advised to take the DFHz course “Advanced Hadoop” before this course.
This course is intended for individuals such as administrators, data engineers and data analysts who need to understand the requirements and capabilities of the ASF projects that offer a SQL interface.
This is a 5-day class when taught on-site as instructor-led training (ILT) or via WebEx as virtual instructor-led training (VILT). It is also offered on a per-module basis for online self-enablement via our learning management system (LMS), Brane.
Day 1: Hive with LLAP (Hive 2)
Day 2: Phoenix on HBase
Day 3: SQL on Spark (with Hive and temporary tables)
Day 4: Kafka and KSQL
Day 5: SQL on Ignite and a brief overview of Flink