Sunday, January 07, 2018

HPL/SQL: Procedural SQL on Hadoop, NoSQL and RDBMS

I admit it - I have a lot to learn when it comes to the internals of big data technologies. Even though I've worked on many data science projects at Goibibo, my knowledge of the underlying principles is limited. So when a project came along where we needed to build data pipelines for business needs, I found the perfect opportunity to explore the Hadoop and Hive ecosystem. This blog post is about a new tool I came across, HPL/SQL.


Years ago, the database world was ruled by relational databases, chief among them Oracle. And Oracle had a great language for writing queries - PL/SQL. From the Oracle website:
PL/SQL is a procedural language designed specifically to embrace SQL statements within its syntax. PL/SQL program units are compiled by the Oracle Database server and stored inside the database. And at run-time, both PL/SQL and SQL run within the same server process, bringing optimal efficiency. PL/SQL automatically inherits the robustness, security, and portability of the Oracle Database.
However, the explosion of information technology led to humongous amounts of data becoming available for analysis. This gave rise to two other classes of databases:

Big data systems
Big data is data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them.
NoSQL databases
NoSQL (originally referring to "non SQL") databases provide a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
So, depending on the problem at hand, engineers now had a host of RDBMS (Oracle, MySQL, Aurora etc), NoSQL (Cassandra, MongoDB, Redis etc), and big data systems (Hadoop) to choose from.

This gave rise to a new challenge. Earlier, most data analysis could be powered by a single RDBMS - for example, businesses could build intelligence using stored procedures and triggers written in PL/SQL on Oracle. In a world of multiple databases, engineers now have to learn multiple tools for data collection and analysis - Spark, MapReduce, Hive and so on.

Different ecosystems and data analysis tools have come up with their own approaches to this problem - connectors to multiple databases, custom SQL dialects compatible with a limited set of databases, and so on. However, coverage has remained limited so far.

To address this problem, Hive ships with a language/tool called HPL/SQL. The promise of HPL/SQL is simple - it is a hybrid, heterogeneous language that understands the syntax of most procedural SQL dialects and can be used with almost any database: you can run existing Oracle PL/SQL code on Apache Hive and Microsoft SQL Server, or run Transact-SQL on Oracle, Cloudera Impala or Amazon Redshift. So you can use it as a language to write new procedural SQL with Hive, and as a tool to execute existing PL/SQL logic that you have written for other databases.

Compared to writing code in Python, Java or Scala, HPL/SQL is much more friendly for BI/SQL experts, and allows them to hit the ground running faster. It offers features like:
  • Control Flow Statements
  • Built-in Functions
  • Stored Procedures, Functions and Packages
  • Exception and Condition Handling
  • On-the-fly SQL Conversion
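As an illustration, here is a small sketch of what such a procedural block can look like in HPL/SQL - a loop, a conditional, and an exception handler together. (This is a minimal example written for this post, not taken from the official docs.)

```sql
-- Sketch of HPL/SQL control flow and exception handling
DECLARE total INT DEFAULT 0;
BEGIN
  FOR i IN 1..3 LOOP
    total := total + i;       -- simple accumulation
  END LOOP;
  IF total > 0 THEN
    PRINT 'Total: ' || total; -- prints to the console
  END IF;
EXCEPTION WHEN OTHERS THEN
  PRINT 'Something went wrong';
END;
```

Anyone comfortable with PL/SQL or T-SQL should find this immediately familiar, which is exactly the point.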
So, if you are exploring an efficient way to implement ETL processes in Hadoop, give HPL/SQL a try. Since Hive 2.0, HPL/SQL comes built in with Hive, so all you need is the hplsql command-line tool.

To execute a statement directly:

hplsql -e "SELECT * FROM src LIMIT 1"

To run a script from a file:

hplsql -f script.sql
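For example, script.sql could define and call a stored procedure along these lines - note that the table and column names here are made up purely for illustration:

```sql
-- script.sql: a hypothetical example (the bookings table is invented)
CREATE PROCEDURE daily_count(IN dt STRING)
BEGIN
  DECLARE cnt INT;
  SELECT COUNT(*) INTO cnt FROM bookings WHERE booking_date = dt;
  PRINT 'Rows for ' || dt || ': ' || cnt;
END;

CALL daily_count('2018-01-07');
```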
You can go through the documentation to get more details if required.

4 comments:

  1. I want to know if it supports inserting data by partition:
    insert into tablename partition(partition_name) from select * from table;
    but it gives an error:
    Function: not found partition

  2. I don't know why my name is unknown.
    This is my email: 10372756845aa@gmail.com
