Below is an example of how to use partitioning in Apache Spark:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f
import random

# AQE is disabled so that Spark does not adjust partition counts at runtime
spark = SparkSession.builder \
    .appName('app') \
    .config("spark.sql.adaptive.enabled", False) \
    .getOrCreate()


sc = spark.sparkContext
# default number of partitions, typically the number of cores available to Spark
sc.defaultParallelism
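
If you do not pass an explicit number of slices, parallelize falls back to this default. A quick sanity check (a minimal sketch; the exact value depends on your machine):

sc.parallelize(range(100)).getNumPartitions()  # == sc.defaultParallelism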

You can influence the level of parallelism when creating an RDD by passing the desired number of partitions as the second argument to parallelize:

# distribute 6 key/value pairs across 10 partitions
rdd = sc.parallelize([("A", 2), ("A", 7), ("A", 12), ("B", 8), ("B", 5), ("B", 10)], 10)
rdd.getNumPartitions()

# glom() turns each partition into a list, making the data layout visible
rdd.glom().collect()

[[],
 [('A', 2)],
 [],
 [('A', 7)],
 [('A', 12)],
 [],
 [('B', 8)],
 [],
 [('B', 5)],
 [('B', 10)]]

Counting the elements per partition confirms that four of the ten partitions are empty, because parallelize splits the input sequence into contiguous slices:

rdd.glom().map(len).collect()

[0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
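
The empty partitions are wasteful: each one still costs a task when the RDD is processed. To get rid of them, or to co-locate all records that share a key, the pair RDD can be repartitioned explicitly. A minimal sketch (the target count of 2 is an arbitrary choice for this toy dataset):

# coalesce merges existing partitions without a full shuffle
rdd.coalesce(2).glom().map(len).collect()

# partitionBy hash-partitions by key, so equal keys end up in the same partition
rdd.partitionBy(2).glom().collect()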
