pyspark.sql.DataFrame.groupBy
DataFrame.groupBy(*cols)
Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.

groupby() is an alias for groupBy().

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
----------
cols : list, str, int or Column
    The columns to group by. Each element can be a column name (str), a Column expression, or a 1-based column ordinal (int).
Returns
-------
GroupedData
    A GroupedData object representing the data grouped by the specified columns.
 
Notes
-----
A column ordinal starts from 1, which is different from the 0-based __getitem__().

Examples
--------
>>> df = spark.createDataFrame([
...     ("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)], schema=["name", "age"])

Example 1: Empty grouping columns triggers a global aggregation.

>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+

Example 2: Group-by ‘name’, and specify a dictionary to calculate the summation of ‘age’.

>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show()
+-----+--------+
| name|sum(age)|
+-----+--------+
|Alice|       2|
|  Bob|       9|
+-----+--------+

Example 3: Group-by ‘name’, and calculate maximum values.

>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+

Example 4: Also group-by ‘name’, but using the column ordinal.

>>> df.groupBy(1).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+

Example 5: Group-by ‘name’ and ‘age’, and calculate the number of rows in each group.

>>> df.groupBy(["name", df.age]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+

Example 6: Also group-by ‘name’ and ‘age’, but using the column ordinal.

>>> df.groupBy([df.name, 2]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+