Kryo Serialization in Apache Spark

Apache Spark, a powerful distributed computing framework, offers two primary serialization libraries: Java serialization and Kryo serialization. While Java serialization is the default, Kryo serialization is often recommended for its efficiency, especially in network-intensive applications.

Why Isn’t Kryo the Default in Spark?

Kryo isn't set as the default serialization method in Spark primarily due to its requirement for custom registration. Although Kryo excels in RDD caching and shuffling, it isn't natively supported for disk serialization. Specifically, the methods saveAsObjectFile on RDD and objectFile method on SparkContext are tailored for Java serialization.

However, for those seeking enhanced performance and reduced memory consumption, Kryo is an excellent choice. Operations like joins and groupings, which often involve data shuffling, benefit significantly from efficient serialization. The same applies to caching, especially when data spills from memory to disk.

The Importance of Registering Classes in Kryo

To utilize Kryo, one must register the classes. This can be done using the registerKryoClasses method. For instance:

Scala
.registerKryoClasses(
   Array(classOf[Person], classOf[Furniture])
)

What Happens if Classes Aren’t Registered?

If a class isn't registered, Kryo selects a serializer from its list of default serializers. While Kryo boasts over 50 default serializers for various JRE classes, relying on them can lead to:

  1. Security Concerns: Unregistered classes can lead to the creation of any class instance, posing potential security threats.
  2. Increased Serialization Size: Instead of using a concise class ID, the full class name is written the first time an unregistered class appears, leading to larger serialization sizes.

To ensure all classes are registered, one can set the following Spark configuration:

Scala
.set("spark.kryo.registrationRequired", "true")

Practical Example: Kryo vs. Java Serialization

Let's delve into a practical example to discern the difference between Kryo and Java serialization.

Scala
// Define the class to be registered
case class Person(name: String, age: Int)

val conf = new SparkConf()
    .setAppName("kyroExample")
    .setMaster("local[*]")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .registerKryoClasses(
      Array(classOf[Person],classOf[Array[Person]],
Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"))
    )

val sparkContext = new SparkContext(conf)

// Create an array of Person and convert it to an RDD
val personList: Array[Person] = (1 to 9999999)
                     .map(value => Person("p"+value, value)).toArray

val rddPerson: RDD[Person] = sparkContext.parallelize(personList,5)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)

// Persist the RDD in memory
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)

Upon execution and inspection via Spark's UI, one can observe that Kryo uses significantly less memory compared to Java serialization. For instance, in a test dataset, Kryo consumed 20.1 MB, whereas Java used 13.3 MB. This translates to a 30-40% reduction in memory usage with Kryo.

Conclusion

Kryo serialization offers a more efficient alternative to Java serialization in Spark, especially for network-intensive tasks. By ensuring proper class registration and understanding its nuances, developers can harness the power of Kryo to optimize their Spark applications.

For a hands-on example, refer to this GitHub repository.

FAQs

  • What is Kryo serialization in Spark?
    • Kryo is an efficient serialization library in Apache Spark, often recommended for network-intensive applications.
  • Why isn't Kryo the default serialization method in Spark?
    • Kryo requires custom registration, and while it's great for RDD caching and shuffling, it's not natively supported for disk serialization.
  • What happens if I don't register classes in Kryo?
    • Kryo will use a default serializer, which can lead to security concerns and increased serialization sizes.
  • How does Kryo compare to Java serialization in terms of memory usage?
    • Kryo typically uses 30-40% less memory compared to Java serialization, offering significant optimizations for Spark applications.

Author