How to Use dropDuplicates in PySpark - Removing Duplicates
Removing duplicate rows is a crucial step in data processing to maintain data integrity. In PySpark, the Python API for Apache Spark, the dropDuplicates function provides a straightforward way to eliminate duplicate entries from a DataFrame. This tutorial walks through dropDuplicates with practical examples, showing how to apply it to entire DataFrames or to specific columns so that your data stays clean and reliable. Throughout, we will work with the following sample data, in which the first two rows are exact duplicates:
| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 |
| Fruit | Apple | 10 | 1.5 |
| Vegetable | Carrot | 15 | 0.7 |
| Vegetable | Potato | 25 | 0.3 |
Before we can use dropDuplicates, we need to import the necessary classes and build a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType
from pyspark.sql import Row
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Remove duplicates from a PySpark DataFrame") \
    .master("local") \
    .getOrCreate()
# Define the schema
schema = StructType([
StructField("category", StringType(), True),
StructField("item", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("price", DoubleType(), True)
])
# Create the data
data = [
Row("Fruit", "Apple", 10, 1.5),
Row("Fruit", "Apple", 10, 1.5),
Row("Vegetable", "Carrot", 15, 0.7),
Row("Vegetable", "Potato", 25, 0.3)
]
# Create the DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
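As an aside, the intermediate RDD is optional here: spark.createDataFrame also accepts the list of Row objects directly, so the last two lines could be collapsed into one:
# Equivalent one-step construction without an explicit RDD
df = spark.createDataFrame(data, schema)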
To remove duplicate rows from a DataFrame, use the dropDuplicates function. Called without arguments, it drops rows that are identical across all columns (equivalent to distinct()):
unique_df = df.dropDuplicates()
unique_df.show()
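With the duplicated "Apple" row removed, the result should contain three rows, along these lines (row order is not guaranteed, since dropDuplicates involves a shuffle):
+---------+------+--------+-----+
| category|  item|quantity|price|
+---------+------+--------+-----+
|    Fruit| Apple|      10|  1.5|
|Vegetable|Carrot|      15|  0.7|
|Vegetable|Potato|      25|  0.3|
+---------+------+--------+-----+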
You can also restrict the duplicate check to a subset of columns by passing a list of column names. For example, to remove rows that have the same "item" and "quantity" values:
unique_df = df.dropDuplicates(["item", "quantity"])
unique_df.show()
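Note that when rows match on the subset columns but differ elsewhere, dropDuplicates keeps an arbitrary one of them. If you need deterministic control over which row survives, a window function is a common alternative. The sketch below assumes, purely for illustration, that you want to keep the lowest-priced row within each (item, quantity) group:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Rank rows within each (item, quantity) group by ascending price
w = Window.partitionBy("item", "quantity").orderBy(F.col("price").asc())
# Keep only the first-ranked row per group, then drop the helper column
deduped_df = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
deduped_df.show()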
You can combine dropDuplicates with other transformations for more complex data processing. Here is the complete example, from building the DataFrame to removing duplicates:
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType
from pyspark.sql import Row
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Remove duplicates from a PySpark DataFrame") \
    .master("local") \
    .getOrCreate()
# Define the schema
schema = StructType([
StructField("category", StringType(), True),
StructField("item", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("price", DoubleType(), True)
])
# Create the data
data = [
Row("Fruit", "Apple", 10, 1.5),
Row("Fruit", "Apple", 10, 1.5),
Row("Vegetable", "Carrot", 15, 0.7),
Row("Vegetable", "Potato", 25, 0.3)
]
# Create the DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
df.show()
# Remove duplicates
unique_df = df.dropDuplicates()
# Show the result
unique_df.show()
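To actually chain dropDuplicates with other transformations, you can place it anywhere in a transformation pipeline. The sketch below (using the category and price columns from the sample schema above) filters to fruit, deduplicates, and sorts by price:
# Filter, deduplicate, and sort in a single pipeline
result_df = (
    df.filter(df.category == "Fruit")
    .dropDuplicates()
    .orderBy("price")
)
result_df.show()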
In this tutorial, we have seen how to use the dropDuplicates function in PySpark to remove duplicate rows from a DataFrame. It is a very useful tool for data cleaning and preparation tasks; for more advanced operations, consider combining it with other DataFrame transformations.