Guide to Apache Spark DataFrame distinct() Method
The distinct() method in Apache Spark returns a new DataFrame containing only the unique rows of the original, comparing values across all columns. Here are five key points about distinct():

- It returns a new DataFrame; the original DataFrame is left unchanged.
- Uniqueness is evaluated over all columns; to deduplicate on a subset of columns, use dropDuplicates instead.
- It is a transformation, so it is evaluated lazily; nothing runs until an action such as show() or count() is called.
- It requires a shuffle to compare rows across partitions, which can be expensive on large datasets.
- Calling distinct() is equivalent to calling dropDuplicates() with no arguments.

Consider the following sample data, in which the last row is a duplicate:
| roll | first_name | age | last_name |
|---|---|---|---|
| 1 | rahul | 30 | yadav |
| 2 | sanjay | 20 | gupta |
| 3 | ranjan | 67 | kumar |
| 3 | ranjan | 67 | kumar |
First, import the classes needed to define the schema and build the DataFrame:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
For demonstration purposes, let's create a sample DataFrame:
val schema = StructType(Array(
  StructField("roll", IntegerType, true),
  StructField("first_name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("last_name", StringType, true)
))
val data = Seq(
  Row(1, "rahul", 30, "yadav"),
  Row(2, "sanjay", 20, "gupta"),
  Row(3, "ranjan", 67, "kumar"),
  Row(3, "ranjan", 67, "kumar") // duplicate of the previous row
)
// This snippet assumes an active SparkSession named sparkSession
// (created in the complete program below).
val rdd = sparkSession.sparkContext.parallelize(data)
val testDF = sparkSession.createDataFrame(rdd, schema)
val transformedDF = testDF.distinct()
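Note that distinct() always compares entire rows. If you need uniqueness over only a subset of columns, the Dataset API provides dropDuplicates instead; here is a minimal sketch (the column choice is purely illustrative):

// Keep one row per (first_name, last_name) pair;
// which of the duplicate rows survives is arbitrary.
val dedupedByName = testDF.dropDuplicates("first_name", "last_name")

Putting everything together, here is the complete, runnable program: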
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object DistinctInSpark {
  def main(args: Array[String]): Unit = {
    // Create a local SparkSession for the example
    val sparkSession = SparkSession
      .builder()
      .appName("select distinct rows in spark scala")
      .master("local")
      .getOrCreate()

    // Define the schema for the sample data
    val schema = StructType(Array(
      StructField("roll", IntegerType, true),
      StructField("first_name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("last_name", StringType, true)
    ))

    // Sample rows; the last row duplicates the one before it
    val data = Seq(
      Row(1, "rahul", 30, "yadav"),
      Row(2, "sanjay", 20, "gupta"),
      Row(3, "ranjan", 67, "kumar"),
      Row(3, "ranjan", 67, "kumar")
    )

    val rdd = sparkSession.sparkContext.parallelize(data)
    val testDF = sparkSession.createDataFrame(rdd, schema)

    // distinct() removes the duplicate row
    val transformedDF = testDF.distinct()
    transformedDF.show()

    sparkSession.stop()
  }
}
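Assuming the sample data above, show() should print the three unique rows. Because distinct() shuffles the data, the order of the rows in your output may differ:

+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   1|     rahul| 30|    yadav|
|   2|    sanjay| 20|    gupta|
|   3|    ranjan| 67|    kumar|
+----+----------+---+---------+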
That's it! You've successfully removed duplicate rows from a Spark DataFrame in Scala using the distinct() method.