How to Use union in PySpark - Combining DataFrames
Combining DataFrames is a common operation in data processing. In Apache PySpark, you can use the union function to merge two DataFrames that share the same schema. This tutorial walks through the process with a practical example and explanations. Suppose we start with the following two DataFrames.
The first DataFrame, `df1`:

| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 |
| Fruit | Banana | 20 | 0.5 |
| Vegetable | Carrot | 15 | 0.7 |
The second DataFrame, `df2`:

| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Orange | 30 | 0.8 |
| Fruit | Pear | 10 | 1.0 |
| Vegetable | Potato | 25 | 0.3 |
Before we can use union, we need to import the necessary libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType
from pyspark.sql import Row
```
Once both DataFrames are created (the full construction code is shown below), combining them takes a single call to the union function:

```python
combined_df = df1.union(df2)
```
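union stacks rows purely by column position, so both inputs need compatible schemas. If you want a quick sanity check before combining, something like the following can help (a minimal sketch; it assumes df1 and df2 have already been built as in the full example below):

```python
# Inspect the two schemas side by side
df1.printSchema()
df2.printSchema()

# Optionally fail fast if the column structure differs
assert df1.schema == df2.schema, "DataFrames must share the same schema for union"
```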
Putting it all together, the complete script looks like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType
from pyspark.sql import Row

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Use union in PySpark") \
    .master("local") \
    .getOrCreate()

# Define the schema shared by both DataFrames
schema = StructType([
    StructField("category", StringType(), True),
    StructField("item", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True)
])

# Create the data for the first DataFrame
data1 = [
    Row("Fruit", "Apple", 10, 1.5),
    Row("Fruit", "Banana", 20, 0.5),
    Row("Vegetable", "Carrot", 15, 0.7)
]

# Create the data for the second DataFrame
data2 = [
    Row("Fruit", "Orange", 30, 0.8),
    Row("Fruit", "Pear", 10, 1.0),
    Row("Vegetable", "Potato", 25, 0.3)
]

# Create the DataFrames
rdd1 = spark.sparkContext.parallelize(data1)
df1 = spark.createDataFrame(rdd1, schema)
rdd2 = spark.sparkContext.parallelize(data2)
df2 = spark.createDataFrame(rdd2, schema)

# Perform the union operation
combined_df = df1.union(df2)

# Show the result
combined_df.show()
```
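Running the script prints the combined rows from both DataFrames. The output should look roughly like this (with this small local example, the rows of df1 appear first, followed by the rows of df2):

```
+---------+------+--------+-----+
| category|  item|quantity|price|
+---------+------+--------+-----+
|    Fruit| Apple|      10|  1.5|
|    Fruit|Banana|      20|  0.5|
|Vegetable|Carrot|      15|  0.7|
|    Fruit|Orange|      30|  0.8|
|    Fruit|  Pear|      10|  1.0|
|Vegetable|Potato|      25|  0.3|
+---------+------+--------+-----+
```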
In this tutorial, we demonstrated how to use the union function in PySpark to combine two DataFrames with the same schema. Keep in mind that union matches columns by position rather than by name and, like SQL's UNION ALL, it does not remove duplicate rows. With those caveats, it is a handy tool for data integration and processing tasks.
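If you need different semantics, two standard DataFrame methods cover the common cases; this short sketch is not part of the walkthrough above, but it reuses the same df1 and df2:

```python
# Drop duplicate rows after the union (UNION semantics instead of UNION ALL)
deduplicated_df = df1.union(df2).distinct()

# Combine by matching column names instead of positions
combined_by_name_df = df1.unionByName(df2)
```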