How to Use Right Join in PySpark
The join function in PySpark is a fundamental tool for combining two DataFrames based on common columns or keys. A right join keeps every row from the right DataFrame and fills the left DataFrame's columns with nulls wherever no matching key is found. This operation is vital for data integration and analysis. In this tutorial, we'll explore how to perform a right join in PySpark, complete with practical examples to guide you through the process.
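Before turning to PySpark, it helps to see what a right join does conceptually. The sketch below is plain Python, not PySpark, and uses the same sample data as the tables that follow:

```python
# Conceptual sketch of a right join in plain Python (not PySpark):
# every row of the right table is kept; left-side columns become None
# when no matching key exists on the left.
employees = {1: ("John", 28), 2: ("Alice", 34), 4: ("Bob", 23)}
salaries = [(1, 50000), (2, 60000)]

right_join = []
for emp_id, salary in salaries:        # iterate the RIGHT side
    name_age = employees.get(emp_id)   # look up the left side
    if name_age:
        right_join.append((emp_id, name_age[0], name_age[1], salary))
    else:
        right_join.append((emp_id, None, None, salary))

print(right_join)
# Both salary rows have a matching employee here, so no None values appear:
# [(1, 'John', 28, 50000), (2, 'Alice', 34, 60000)]
```

Note that employee id 4 is absent from the result: rows that exist only on the left side are dropped by a right join.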
Suppose we have an employee DataFrame:

| id | name | age |
|---|---|---|
| 1 | John | 28 |
| 2 | Alice | 34 |
| 4 | Bob | 23 |
And a salary DataFrame:

| id | salary |
|---|---|
| 1 | 50000 |
| 2 | 60000 |
# Create employee data
employee_data = [
(1, "John", 28),
(2, "Alice", 34),
(4, "Bob", 23)
]
# Create salary data
salary_data = [
(1, 50000),
(2, 60000)
]
# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])
Before we can perform a right join, we need to import the necessary classes and functions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
Now that we have our DataFrames, we can combine them using the join function:
# Style 1: Right join using the common column name
right_join_df1 = employee_df.join(salary_df, ["id"], "right")
right_join_df1.show()
Passing the join key as a list (["id"]) merges the common column, so the result contains a single id column rather than one from each DataFrame. This is particularly useful when the two DataFrames share several common columns.
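The same merging idea extends to joins on several keys. Here is a plain-Python sketch; the composite (id, dept) key and the data are hypothetical, for illustration only:

```python
# Right join on a composite (id, dept) key, sketched in plain Python.
# Hypothetical data for illustration only.
left = {(1, "eng"): "John", (2, "hr"): "Alice"}
right = [((1, "eng"), 50000), ((3, "hr"), 45000)]

rows = []
for key, salary in right:
    name = left.get(key)   # None when the left side has no match
    rows.append((*key, name, salary))

print(rows)
# [(1, 'eng', 'John', 50000), (3, 'hr', None, 45000)]
```

In PySpark, the equivalent is passing both key names in the list, e.g. `["id", "dept"]`.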
# Style 2: Right join using aliases (handy when the DataFrames share
# column names beyond the join key)
right_join_df2 = employee_df.alias("emp").join(
salary_df.alias("sal"),
col("emp.id") == col("sal.id"),
"right"
).select("sal.id", "emp.name", "emp.age", "sal.salary")  # sal.id: emp.id is null for unmatched rows
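One pitfall with the alias style: in a right join, rows from the right DataFrame that have no left-side match carry nulls in all left columns, including the left join key, so prefer the right-side key (or coalesce the two) when selecting. A plain-Python sketch of the effect, with hypothetical data where salary id 3 has no employee record:

```python
# After a right join, the LEFT key is None for unmatched right-side rows,
# so selecting the left key would silently drop the key for those rows.
employees = {1: "John", 2: "Alice"}
salary_rows = [(1, 50000), (3, 70000)]   # id 3 has no employee record

joined = [(emp_id if emp_id in employees else None,  # emp.id
           emp_id,                                   # sal.id
           employees.get(emp_id),
           salary)
          for emp_id, salary in salary_rows]

print(joined)
# [(1, 1, 'John', 50000), (None, 3, None, 70000)]
```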
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize SparkSession
spark = SparkSession.builder \
.appName("Use right join in PySpark") \
.master("local") \
.getOrCreate()
# Create employee data
employee_data = [
(1, "John", 28),
(2, "Alice", 34),
(4, "Bob", 23)
]
# Create salary data
salary_data = [
(1, 50000),
(2, 60000)
]
# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])
# Style 1: Right join using the common column name
right_join_df1 = employee_df.join(salary_df, ["id"], "right")
right_join_df1.show()
# Style 2: Right join using aliases (handy when the DataFrames share
# column names beyond the join key)
right_join_df2 = employee_df.alias("emp").join(
salary_df.alias("sal"),
col("emp.id") == col("sal.id"),
"right"
).select("sal.id", "emp.name", "emp.age", "sal.salary")  # sal.id: emp.id is null for unmatched rows
right_join_df2.show()
# Stop the SparkSession
spark.stop()