How to Use dropDuplicates in Spark Scala - Removing Duplicates
Removing duplicate rows is a common data-cleaning operation. In Apache Spark, the dropDuplicates method eliminates duplicate rows from a DataFrame. This tutorial walks through the method with practical Scala examples. We will work with the following sample data, which contains one fully duplicated row:
| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 |
| Fruit | Apple | 10 | 1.5 |
| Vegetable | Carrot | 15 | 0.7 |
| Vegetable | Potato | 25 | 0.3 |
Before we can use dropDuplicates, we need to import the necessary classes:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
```
Next, create a SparkSession, define the schema, and build the DataFrame:

```scala
// Create the SparkSession
val sparkSession = SparkSession
  .builder()
  .appName("remove duplicates from a Spark DataFrame")
  .master("local")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("category", StringType, true),
  StructField("item", StringType, true),
  StructField("quantity", IntegerType, true),
  StructField("price", DoubleType, true)
))

// Create the data
val data = Seq(
  Row("Fruit", "Apple", 10, 1.5),
  Row("Fruit", "Apple", 10, 1.5),
  Row("Vegetable", "Carrot", 15, 0.7),
  Row("Vegetable", "Potato", 25, 0.3)
)

// Create the DataFrame
val rdd = sparkSession.sparkContext.parallelize(data)
val df = sparkSession.createDataFrame(rdd, schema)
```
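If you do not need an explicit schema, a more concise alternative is to build the DataFrame from a Seq of tuples and let Spark infer the column types. This is a sketch, assuming a SparkSession is available and its implicits are imported:

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("toDF example").master("local").getOrCreate()
import sparkSession.implicits._  // enables .toDF on local Seq collections

// Column names are passed to toDF; column types are inferred from the tuples
val df = Seq(
  ("Fruit", "Apple", 10, 1.5),
  ("Fruit", "Apple", 10, 1.5),
  ("Vegetable", "Carrot", 15, 0.7),
  ("Vegetable", "Potato", 25, 0.3)
).toDF("category", "item", "quantity", "price")

df.show()
```

The StructType approach above gives you explicit control over types and nullability, while toDF trades that control for brevity.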
To remove rows that are duplicated across every column, call dropDuplicates with no arguments:

```scala
val uniqueDF = df.dropDuplicates()
uniqueDF.show()
```
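The distinct method is an alias for the no-argument dropDuplicates, so the following (continuing with the df built above) produces the same set of unique rows:

```scala
// distinct() considers all columns, exactly like dropDuplicates() with no arguments
val distinctDF = df.distinct()
distinctDF.show()
```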
You can also specify which columns to consider when identifying duplicates. For example, to remove rows that share the same "item" and "quantity" values:

```scala
val uniqueSpecificDF = df.dropDuplicates("item", "quantity")
uniqueSpecificDF.show()
```
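dropDuplicates also has an overload that takes a Seq[String], which is handy when the key columns are computed at runtime. Note that when deduplicating on a subset of columns, Spark keeps an arbitrary row from each group of duplicates, so the surviving values in the non-key columns are not guaranteed. A sketch, continuing with the df above:

```scala
// The Seq[String] overload is equivalent to the varargs form
val keyCols = Seq("item", "quantity")
val dedupedDF = df.dropDuplicates(keyCols)
dedupedDF.show()
```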
You can combine dropDuplicates with other transformations for more complex data processing. Here is the complete program:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object RemoveDuplicates {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("remove duplicates from a Spark DataFrame")
      .master("local")
      .getOrCreate()

    val schema = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("quantity", IntegerType, true),
      StructField("price", DoubleType, true)
    ))

    // Create the data
    val data = Seq(
      Row("Fruit", "Apple", 10, 1.5),
      Row("Fruit", "Apple", 10, 1.5),
      Row("Vegetable", "Carrot", 15, 0.7),
      Row("Vegetable", "Potato", 25, 0.3)
    )

    // Create the DataFrame
    val rdd = sparkSession.sparkContext.parallelize(data)
    val rawDF = sparkSession.createDataFrame(rdd, schema)

    // Remove fully duplicated rows
    val uniqueDF = rawDF.dropDuplicates()
    uniqueDF.show()

    sparkSession.stop()
  }
}
```
In this tutorial, we have seen how to use the dropDuplicates function in Spark with Scala to remove duplicate rows from a DataFrame, both across all columns and on a chosen subset of columns. This function is very useful for data cleaning and preparation tasks, and for more advanced pipelines it combines naturally with other DataFrame transformations.
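One caveat worth knowing: dropDuplicates gives no control over which of the duplicate rows is retained. When that matters (for example, keeping the cheapest row per item), a window function is a common alternative. The sketch below is not part of the original example and assumes the df defined earlier:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each item by ascending price, then keep only the first
val byItemCheapestFirst = Window.partitionBy("item").orderBy(col("price").asc)

val cheapestPerItem = df
  .withColumn("rn", row_number().over(byItemCheapestFirst))
  .filter(col("rn") === 1)
  .drop("rn")

cheapestPerItem.show()
```

This keeps exactly one row per "item", chosen deterministically by the orderBy clause rather than arbitrarily.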