Apache Spark is renowned for its prowess as a distributed data processing engine. A lesser-known facet of Spark, however, is its ability to function as a database. This article delves into how Apache Spark, with the help of Hive, can be used as a database. We'll explore creating tables within Spark and then querying them.
Spark’s Database Capabilities
Spark can act as a database in its own right. Within Spark, users can create databases, and once a database exists, tables and views can be created inside it.
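To make this concrete, here is a minimal sketch of creating a database and a view with Spark SQL, assuming an existing SparkSession named spark (the database, table, and view names below are illustrative, not part of the example later in this article):

// Create a database, then a view on top of a table inside it
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE OR REPLACE VIEW demo_db.user_names AS SELECT name FROM demo_db.user_info")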
Table Components in Spark
A table in Spark comprises two main components:
- Table Data: This is stored as data files within your distributed storage system.
- Table Metadata: This includes information such as the schema, table name, database name, column names, partitions, and the physical location of the actual data. By default, Spark keeps this metadata in an in-memory catalog that lives only for the duration of the session. For persistence across sessions, Spark relies on the Apache Hive metastore (a small example of inspecting the catalog follows this list).
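As a quick illustration, the metadata held in the catalog can be inspected through Spark's Catalog API. This sketch assumes a SparkSession named spark and the MY_DB.user_info table created later in this article:

// List the databases and tables known to the catalog, then the columns of one table
spark.catalog.listDatabases().show()
spark.catalog.listTables("MY_DB").show()
spark.catalog.listColumns("MY_DB", "user_info").show()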
Types of Spark Tables
Spark offers two distinct types of tables:
- Managed Tables: Here, Spark manages both the table data and the metadata. It records the metadata in the metastore and writes the data to a predefined directory, the Spark SQL warehouse directory (configured via spark.sql.warehouse.dir). Dropping a managed table removes both its metadata and its table data.
- Unmanaged or External Tables: These tables are handled like managed tables as far as metadata is concerned, but they differ in where the data lives. For unmanaged tables, Spark creates only the metadata entry in the metastore; the data directory location must be specified when the table is created. Dropping an unmanaged table removes only its metadata and leaves the table data untouched (a short sketch contrasting the two follows this list).
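The sketch below contrasts the two kinds of tables, reusing the userInfoDF DataFrame from the example later in this article; the external path and table names here are illustrative:

// Managed table: Spark writes the data files under the Spark SQL warehouse directory
userInfoDF.write.mode(SaveMode.Overwrite).saveAsTable("MY_DB.managed_users")
// Unmanaged (external) table: only the metadata goes to the metastore,
// the data files stay at the location you specify
userInfoDF.write.mode(SaveMode.Overwrite).option("path", "/data/external/users").saveAsTable("MY_DB.external_users")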
Why Opt for Spark Tables?
When ingesting data into Spark, two primary options are available:
- Store the data as a file in formats like Parquet or Avro.
- Save the data within a Table.
Choosing the first option means going back through the DataFrameReader API every time you need to re-access the data. Creating a managed table in Spark, on the other hand, makes your data accessible to a wide range of SQL-compliant tools. Spark database tables can be queried with SQL expressions over JDBC/ODBC connectors (for example, through the Spark Thrift Server), which makes them usable from third-party tools like Tableau and Power BI.
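To illustrate the difference, re-reading a plain file means going back through the DataFrameReader with its path, whereas a table can simply be referenced by name; the file path below is illustrative:

// Option 1: file-based storage, re-read via the DataFrameReader API
val fromFileDF = spark.read.parquet("/data/users.parquet")
// Option 2: a managed table, referenced by name just as an external SQL tool would
val fromTableDF = spark.sql("SELECT * FROM MY_DB.user_info")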
Crafting a Spark Managed Table
To demonstrate, let's embark on creating a Spark managed table and querying it using Spark SQL.
import org.apache.log4j.Logger
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkSQLTableDemo extends Serializable {
  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    // Enable Hive support so that table metadata is persisted in the Hive metastore
    val spark = SparkSession.builder()
      .appName("Spark SQL Table Demo")
      .master("local[3]")
      .enableHiveSupport()
      .getOrCreate()

    // Read the source CSV files into a DataFrame, inferring the schema from the header
    val userInfoDF = spark.read
      .format("csv")
      .option("path", "dataSource/")
      .option("delimiter", ";")
      .option("header", "true")
      .option("inferSchema", "true")
      .load()

    import spark.sql

    // Create the database if it does not already exist and make it the current database
    sql("CREATE DATABASE IF NOT EXISTS MY_DB")
    sql("USE MY_DB")

    // Save the DataFrame as a managed table: Spark writes the data files to the
    // warehouse directory and records the metadata in the metastore
    userInfoDF.write
      .mode(SaveMode.Overwrite)
      .format("csv")
      .saveAsTable("MY_DB.user_info")

    logger.info("Now you can query whatever you want from the table...!")
    sql("SELECT * FROM MY_DB.user_info").show()

    spark.stop()
  }
}
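Because the metadata is persisted in the Hive metastore, the table can also be read back by name in a completely separate application, without knowing where the data files live. A small sketch, mirroring the session setup above:

val newSession = SparkSession.builder()
  .appName("Read Table Back Demo")
  .master("local[3]")
  .enableHiveSupport()
  .getOrCreate()
// The table name is resolved through the metastore; no file path is needed
newSession.read.table("MY_DB.user_info").show()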
Conclusion
Apache Spark's ability to function as a database offers real advantages: tables registered in the Hive metastore persist across sessions and can be reached from standard SQL tools, while Spark continues to handle the heavy distributed processing. Understanding and harnessing these capabilities makes both data processing and data management considerably simpler.
FAQs
- What are the two main components of a table in Spark?
- Table Data and Table Metadata.
- What are the two types of tables in Spark?
- Managed Tables and Unmanaged or External Tables.
- Why should one opt for Spark tables?
- Spark tables make data accessible to a wide range of SQL-compliant tools and can be queried with SQL expressions over JDBC/ODBC connectors.
- Can Spark function as a database?
- Yes, Apache Spark can function as a database and allows the creation of tables and databases within it.