How to Use dropDuplicates in Spark Scala - Removing Duplicates
Removing duplicate rows is a common data-cleaning operation. In Apache Spark, the dropDuplicates method eliminates duplicate rows from a DataFrame. This tutorial walks through the method with practical Scala examples. We will work with the following sample data, which contains one fully duplicated row:
| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 |
| Fruit | Apple | 10 | 1.5 |
| Vegetable | Carrot | 15 | 0.7 |
| Vegetable | Potato | 25 | 0.3 |
Before we can use dropDuplicates, we need to import the necessary classes:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
```
Next, create a SparkSession, define the schema, and build the DataFrame:

```scala
// Create the SparkSession
val sparkSession = SparkSession
  .builder()
  .appName("remove duplicates from a Spark DataFrame")
  .master("local")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("category", StringType, true),
  StructField("item", StringType, true),
  StructField("quantity", IntegerType, true),
  StructField("price", DoubleType, true)
))

// Create the data
val data = Seq(
  Row("Fruit", "Apple", 10, 1.5),
  Row("Fruit", "Apple", 10, 1.5),
  Row("Vegetable", "Carrot", 15, 0.7),
  Row("Vegetable", "Potato", 25, 0.3)
)

// Create the DataFrame
val rdd = sparkSession.sparkContext.parallelize(data)
val df = sparkSession.createDataFrame(rdd, schema)
```
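If you do not need an explicit schema, a more concise alternative is to build the DataFrame from a Seq of tuples and let Spark infer the column types. This is a sketch, assuming a SparkSession is available and its implicits are imported:

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("toDF example").master("local").getOrCreate()
import sparkSession.implicits._  // enables .toDF on local Seq collections

// Column names are passed to toDF; column types are inferred from the tuples
val df = Seq(
  ("Fruit", "Apple", 10, 1.5),
  ("Fruit", "Apple", 10, 1.5),
  ("Vegetable", "Carrot", 15, 0.7),
  ("Vegetable", "Potato", 25, 0.3)
).toDF("category", "item", "quantity", "price")

df.show()
```

The StructType approach above gives you explicit control over types and nullability, while toDF trades that control for brevity.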
To remove rows that are duplicated across every column, call dropDuplicates with no arguments:

```scala
val uniqueDF = df.dropDuplicates()
uniqueDF.show()
```
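The distinct method is an alias for the no-argument dropDuplicates, so the following (continuing with the df built above) produces the same set of unique rows:

```scala
// distinct() considers all columns, exactly like dropDuplicates() with no arguments
val distinctDF = df.distinct()
distinctDF.show()
```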
You can also specify which columns to consider when identifying duplicates. For example, to remove rows that share the same "item" and "quantity" values:

```scala
val uniqueSpecificDF = df.dropDuplicates("item", "quantity")
uniqueSpecificDF.show()
```
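dropDuplicates also has an overload that takes a Seq[String], which is handy when the key columns are computed at runtime. Note that when deduplicating on a subset of columns, Spark keeps an arbitrary row from each group of duplicates, so the surviving values in the non-key columns are not guaranteed. A sketch, continuing with the df above:

```scala
// The Seq[String] overload is equivalent to the varargs form
val keyCols = Seq("item", "quantity")
val dedupedDF = df.dropDuplicates(keyCols)
dedupedDF.show()
```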
You can combine dropDuplicates with other transformations for more complex data processing. Here is the complete program:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object RemoveDuplicates {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("remove duplicates from a Spark DataFrame")
      .master("local")
      .getOrCreate()

    val schema = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("quantity", IntegerType, true),
      StructField("price", DoubleType, true)
    ))

    // Create the data
    val data = Seq(
      Row("Fruit", "Apple", 10, 1.5),
      Row("Fruit", "Apple", 10, 1.5),
      Row("Vegetable", "Carrot", 15, 0.7),
      Row("Vegetable", "Potato", 25, 0.3)
    )

    // Create the DataFrame
    val rdd = sparkSession.sparkContext.parallelize(data)
    val rawDF = sparkSession.createDataFrame(rdd, schema)

    // Remove fully duplicated rows
    val uniqueDF = rawDF.dropDuplicates()
    uniqueDF.show()

    sparkSession.stop()
  }
}
```
In this tutorial, we have seen how to use the dropDuplicates function in Spark with Scala to remove duplicate rows from a DataFrame, both across all columns and on a chosen subset of columns. This function is very useful for data cleaning and preparation tasks, and for more advanced pipelines it combines naturally with other DataFrame transformations.
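One caveat worth knowing: dropDuplicates gives no control over which of the duplicate rows is retained. When that matters (for example, keeping the cheapest row per item), a window function is a common alternative. The sketch below is not part of the original example and assumes the df defined earlier:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each item by ascending price, then keep only the first
val byItemCheapestFirst = Window.partitionBy("item").orderBy(col("price").asc)

val cheapestPerItem = df
  .withColumn("rn", row_number().over(byItemCheapestFirst))
  .filter(col("rn") === 1)
  .drop("rn")

cheapestPerItem.show()
```

This keeps exactly one row per "item", chosen deterministically by the orderBy clause rather than arbitrarily.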