GroupBy Function in Pandas Python


Introduction To GroupBy Function in pandas

Grouping with the GroupBy function and aggregation are among the most frequently used operations in data analysis with pandas, especially during exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.

For example, suppose you have data on cities and want to analyse the total population of each state, the average population of its cities, or the population of each city within a state. GroupBy and aggregation functions let you calculate such values over the rows that share a common state or city.

Grouping analysis can be thought of as having three parts:

1. Splitting the data into groups (e.g. groups of customer segments, product categories, etc.)

2. Applying a function to each group (e.g. mean or total sales of each customer segment)

3. Combining the results into a data structure showing the summary statistics
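The three parts above can be sketched by hand and compared with a single groupby call. This is only an illustration: the `sales` table, its column names and its values are made up for the example.

```python
import pandas as pd

# Small made-up table (illustrative values only)
sales = pd.DataFrame({
    "segment": ["retail", "retail", "wholesale", "wholesale", "online"],
    "amount": [100, 150, 400, 600, 50],
})

# 1. Split: one sub-frame per segment, selected with a boolean mask
pieces = {name: sales[sales["segment"] == name]
          for name in sales["segment"].unique()}

# 2. Apply: compute the mean amount of each piece
means = {name: part["amount"].mean() for name, part in pieces.items()}

# 3. Combine: collect the results into a single Series
by_hand = pd.Series(means, name="amount").sort_index()

# groupby performs the same three steps in one call
with_groupby = sales.groupby("segment")["amount"].mean()

print(by_hand)
print(with_groupby)
```

Both results hold the same mean per segment; the groupby version is shorter and is the form used in the rest of this article.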

Applying GroupBy Function to groups in pandas

  • Aggregation
  • Transformation
  • Filtration
  • Applying our own function

Methods of GroupBy Function in pandas

The following data frame is used to demonstrate the groupby and aggregation methods.

Code:-

import numpy as np
from pandas import DataFrame

population = DataFrame({'State':['Maharashtra','Maharashtra','Maharashtra',
                            'Uttar Pradesh','Uttar Pradesh',
                            'Madhya Pradesh','Madhya Pradesh','Madhya Pradesh',
                            'Tamil Nadu','Tamil Nadu'],
                 'Cities':['Nagpur','Nagpur','Mumbai',
                           'Lucknow','Kanpur',
                           'Bhopal','Indore','Indore',
                           'Chennai','Chennai'],
                 'Female Population': np.random.randint(100000,500000,10),   
                 'Male Population': np.random.randint(100000,500000,10),
                 'Total Population':np.random.randint(200000,700000,10),
                 'literacy_rate_total':np.abs(np.random.randn(10)*40)})

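# np.random.randint() generates random integers in the given range
# np.random.randn() draws random numbers from the standard normal distribution
# The values are random, so they change on every run and will not match the outputs shown here exactly

population  # Display the data frame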

Output:-

            State   Cities  Female Population  Male Population  Total Population  literacy_rate_total
0     Maharashtra   Nagpur             330731           238582            645743            67.694631
1     Maharashtra   Nagpur             339232           418329            572761             2.930283
2     Maharashtra   Mumbai             296622           318827            290637            11.248097
3   Uttar Pradesh  Lucknow             260631           312123            347929             0.849086
4   Uttar Pradesh   Kanpur             415753           438568            530431            40.991017
5  Madhya Pradesh   Bhopal             192435           145430            201615            62.159132
6  Madhya Pradesh   Indore             439355           368932            347257            35.704109
7  Madhya Pradesh   Indore             220944           476333            201672            14.285272
8      Tamil Nadu  Chennai             297248           309440            548959            20.368355
9      Tamil Nadu  Chennai             261947           174803            217022            22.798189

Groupby on the basis of single categorical column 

Example:- Use the Above population data and create a group of states.

# This only creates the groups by State; it does not display any data yet
population.groupby('State')

Output:-

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000211A6521CF8>

# To see each group and the row indices it contains, use the .groups attribute
population.groupby('State').groups

Output:- 

{'Madhya Pradesh': Int64Index([5, 6, 7], dtype='int64'),
 'Maharashtra': Int64Index([0, 1, 2], dtype='int64'),
 'Tamil Nadu': Int64Index([8, 9], dtype='int64'),
 'Uttar Pradesh': Int64Index([3, 4], dtype='int64')}

# Apply an aggregation function to calculate the total population of each state
# There are a number of aggregation functions (sum, mean, count, max, min, etc.)
# numeric_only=True leaves out text columns such as Cities
population.groupby('State').sum(numeric_only=True)

Output:-

                Female Population  Male Population  Total Population  literacy_rate_total
State
Madhya Pradesh             778700           884884           1255651            46.009551
Maharashtra                783701          1154190           1551671           105.044499
Tamil Nadu                 776934           355226            998136            45.691927
Uttar Pradesh              666343           865744            876842           137.275400

The output shows the sum of every numeric column for each state.
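The other aggregation functions mentioned above (mean, count, max, min) are called the same way as sum. A minimal sketch on a small made-up table, selecting a single column before aggregating (the population values here are illustrative):

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Madhya Pradesh", "Madhya Pradesh", "Madhya Pradesh",
              "Maharashtra", "Maharashtra"],
    "Cities": ["Bhopal", "Indore", "Indore", "Nagpur", "Mumbai"],
    "Total Population": [300000, 400000, 500000, 500000, 700000],
})

# Select one column of the grouped data, then aggregate it
by_state = pop.groupby("State")["Total Population"]

print(by_state.mean())   # average per state
print(by_state.count())  # number of rows per state
print(by_state.max())    # largest value per state
print(by_state.min())    # smallest value per state
```

Selecting the column first keeps the result to the one statistic of interest instead of aggregating every column.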

Groupby on the basis of two categorical columns

Example:

# This only creates the groups by State and Cities; it does not display any data yet
population.groupby(['State','Cities'])

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000211A6521CF8>
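
# To see each group and the row indices it contains, use the .groups attribute
population.groupby(['State','Cities']).groups

Output:-

{('Madhya Pradesh', 'Bhopal'): Int64Index([5], dtype='int64'),
 ('Madhya Pradesh', 'Indore'): Int64Index([6, 7], dtype='int64'),
 ('Maharashtra', 'Mumbai'): Int64Index([2], dtype='int64'),
 ('Maharashtra', 'Nagpur'): Int64Index([0, 1], dtype='int64'),
 ('Tamil Nadu', 'Chennai'): Int64Index([8, 9], dtype='int64'),
 ('Uttar Pradesh', 'Kanpur'): Int64Index([4], dtype='int64'),
 ('Uttar Pradesh', 'Lucknow'): Int64Index([3], dtype='int64')}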
 
# Apply an aggregation function to calculate the mean for each city within each state
 
mean_pop = population.groupby(['State','Cities']).mean()
mean_pop
 
# The output shows each city's average female population, male population, total population and literacy_rate_total
Output:-
 
 
                        Female Population  Male Population  Total Population  literacy_rate_total
State          Cities
Madhya Pradesh Bhopal            190726.0         290946.0          321898.0             6.824701
               Indore            293987.0         296969.0          466876.5            19.592425
Maharashtra    Mumbai            186423.0         325910.0          680144.0            36.866375
               Nagpur            298639.0         414140.0          435763.5            34.089062
Tamil Nadu     Chennai           388467.0         177613.0          499068.0            22.845964
Uttar Pradesh  Kanpur            290262.0         425102.0          214901.0            35.262222
               Lucknow           376081.0         440642.0          661941.0           102.013178
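Grouping by two columns produces a result with a two-level index, as the output above shows. The sketch below, on a small made-up table with illustrative values, shows one way to read a single cell from such a result, and how the `as_index=False` option keeps the grouping columns as ordinary columns instead.

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Maharashtra", "Tamil Nadu", "Tamil Nadu"],
    "Cities": ["Nagpur", "Nagpur", "Mumbai", "Chennai", "Chennai"],
    "Total Population": [500000, 300000, 700000, 400000, 600000],
})

# Default: the grouping columns become a two-level (State, Cities) index
nested = pop.groupby(["State", "Cities"]).mean()

# Read one cell by giving the (State, Cities) pair and the column name
nagpur_mean = nested.loc[("Maharashtra", "Nagpur"), "Total Population"]
print(nagpur_mean)

# as_index=False keeps State and Cities as ordinary columns
flat = pop.groupby(["State", "Cities"], as_index=False).mean()
print(flat)
```

The flat form is often easier to export or merge with other tables, while the indexed form is convenient for lookups.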

Loop over GroupBy groups

This part iterates over the groups of a GroupBy object and prints each group's name and rows.

Example:-

# Iterate over the groups and print the name and rows of each one

# Create groups according to State
grp = population.groupby('State')
for name, group in grp: 
    print(name) 
    print(group) 
    print() 
Output:-
Madhya Pradesh
            State  Cities  Female Population  Male Population  \
5  Madhya Pradesh  Bhopal             190726           290946   
6  Madhya Pradesh  Indore             186505           381920   
7  Madhya Pradesh  Indore             401469           212018   
 
   Total Population  literacy_rate_total  
5            321898             6.824701  
6            628995            26.194098  
7            304758            12.990752  
 
Maharashtra
         State  Cities  Female Population  Male Population  Total Population  \
0  Maharashtra  Nagpur             334934           357959            508852   
1  Maharashtra  Nagpur             262344           470321            362675   
2  Maharashtra  Mumbai             186423           325910            680144   
 
   literacy_rate_total  
0            33.795318  
1            34.382807  
2            36.866375  
 
Tamil Nadu
        State   Cities  Female Population  Male Population  Total Population  \
8  Tamil Nadu  Chennai             394035           109944            515960   
9  Tamil Nadu  Chennai             382899           245282            482176   
 
   literacy_rate_total  
8            17.705373  
9            27.986555  
 
Uttar Pradesh
           State   Cities  Female Population  Male Population  \
3  Uttar Pradesh  Lucknow             376081           440642   
4  Uttar Pradesh   Kanpur             290262           425102   
 
   Total Population  literacy_rate_total  
3            661941           102.013178  
4            214901            35.262222

Example:-

# Iterate over the groups and print the name and rows of each one

# Create groups according to State and its Cities

grp = population.groupby(['State','Cities']) 
for name, group in grp: 
    print(name) 
    print(group) 
    print() 
Output:-
('Madhya Pradesh', 'Bhopal')
            State  Cities  Female Population  Male Population  \
5  Madhya Pradesh  Bhopal             190726           290946   
 
   Total Population  literacy_rate_total  
5            321898             6.824701  
 
('Madhya Pradesh', 'Indore')
            State  Cities  Female Population  Male Population  \
6  Madhya Pradesh  Indore             186505           381920   
7  Madhya Pradesh  Indore             401469           212018   
 
   Total Population  literacy_rate_total  
6            628995            26.194098  
7            304758            12.990752  
 
('Maharashtra', 'Mumbai')
         State  Cities  Female Population  Male Population  Total Population  \
2  Maharashtra  Mumbai             186423           325910            680144   
 
   literacy_rate_total  
2            36.866375  
 
('Maharashtra', 'Nagpur')
         State  Cities  Female Population  Male Population  Total Population  \
0  Maharashtra  Nagpur             334934           357959            508852   
1  Maharashtra  Nagpur             262344           470321            362675   
 
   literacy_rate_total  
0            33.795318  
1            34.382807  
 
('Tamil Nadu', 'Chennai')
        State   Cities  Female Population  Male Population  Total Population  \
8  Tamil Nadu  Chennai             394035           109944            515960   
9  Tamil Nadu  Chennai             382899           245282            482176   
 
   literacy_rate_total  
8            17.705373  
9            27.986555  
 
('Uttar Pradesh', 'Kanpur')
           State  Cities  Female Population  Male Population  \
4  Uttar Pradesh  Kanpur             290262           425102   
 
   Total Population  literacy_rate_total  
4            214901            35.262222  
 
('Uttar Pradesh', 'Lucknow')
           State   Cities  Female Population  Male Population  \
3  Uttar Pradesh  Lucknow             376081           440642   
 
   Total Population  literacy_rate_total  
3            661941           102.013178
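
Looping is useful when each group needs some custom handling rather than just printing. As a minimal sketch on a small made-up table (the values are illustrative), the loop below records the most populous city of each state:

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Uttar Pradesh", "Uttar Pradesh"],
    "Cities": ["Nagpur", "Mumbai", "Lucknow", "Kanpur"],
    "Total Population": [500000, 700000, 650000, 200000],
})

largest_city = {}
for state, group in pop.groupby("State"):
    # idxmax returns the row label of the largest value within this group
    top_label = group["Total Population"].idxmax()
    largest_city[state] = group.loc[top_label, "Cities"]

print(largest_city)
```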
 

Selecting groups

To select a particular group from a GroupBy object, use the get_group() function.

Example:- Select the group for Maharashtra


Code:-

# selecting a single group 
  
grp = population.groupby('State') 
grp.get_group('Maharashtra') 
Output:-
         State  Cities  Female Population  Male Population  Total Population  literacy_rate_total
0  Maharashtra  Nagpur             334934           357959            508852            33.795318
1  Maharashtra  Nagpur             262344           470321            362675            34.382807
2  Maharashtra  Mumbai             186423           325910            680144            36.866375

Example:- Select the group for the city Lucknow in Uttar Pradesh

Code:-

# selecting a single group using a (State, Cities) key

grp = population.groupby(['State','Cities'])
grp.get_group(('Uttar Pradesh', 'Lucknow'))

Output:-

           State   Cities  Female Population  Male Population  Total Population  literacy_rate_total
3  Uttar Pradesh  Lucknow             376081           440642            661941           102.013178

Applying Functions to Groups

  • Aggregation: calculates a summary statistic for each group, for example the sum, average or minimum value.
  • Transformation: performs a group-specific computation and returns an object indexed like the original. Example: filling null values in each group with a value calculated from that group.
  • Filtration: keeps or discards whole groups according to a group-wise computation that evaluates to a Boolean. Example: filtering the data according to the sum or mean of each group.
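
Besides these three, a function of our own can also be applied to each group. The sketch below assumes a hypothetical helper, `value_range`, and a small made-up table with illustrative values; it applies the helper to one column of each group.

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Maharashtra", "Tamil Nadu", "Tamil Nadu"],
    "Total Population": [500000, 300000, 700000, 400000, 600000],
})

def value_range(values):
    """Difference between the largest and smallest value in one group."""
    return values.max() - values.min()

# apply calls value_range once per state, passing that state's values
spread = pop.groupby("State")["Total Population"].apply(value_range)
print(spread)
```

Because the helper returns a single number per group, the result has one row per state, just like a built-in aggregation.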

Aggregation

Example:- Calculate mean, sum and minimum value of Female population of each state

Code:-

grp = population.groupby('State')

# Select a particular column and calculate several aggregations on it
grp['Female Population'].agg(['sum', 'mean', 'min'])

Output:-

                   sum           mean     min
State
Madhya Pradesh  778700  259566.666667  186505
Maharashtra     783701  261233.666667  186423
Tamil Nadu      776934  388467.000000  382899
Uttar Pradesh   666343  333171.500000  290262
 

Example:- Apply different aggregation functions to different columns of the data frame

Code:-

# Apply different functions by passing a dictionary
# that maps each column to its aggregation function
grp = population.groupby('State')
grp.agg({'Female Population': 'sum', 'Male Population': 'sum', 'literacy_rate_total': 'min'})

Output:-

                Female Population  Male Population  literacy_rate_total
State
Madhya Pradesh             778700           884884             6.824701
Maharashtra                783701          1154190            33.795318
Tamil Nadu                 776934           355226            17.705373
Uttar Pradesh              666343           865744            35.262222
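
When different functions are applied to different columns, it can help to give the result columns descriptive names. One option is the keyword form of agg (named aggregation, available in pandas 0.25 and later). The output names below (`female_total`, `male_total`, `lowest_literacy`) and the table values are made up for this sketch.

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Tamil Nadu", "Tamil Nadu"],
    "Female Population": [100, 200, 300, 400],
    "Male Population": [110, 210, 310, 410],
    "literacy_rate_total": [60.0, 70.0, 80.0, 75.0],
})

# Each keyword is a new column name; each value is (source column, function)
summary = pop.groupby("State").agg(
    female_total=("Female Population", "sum"),
    male_total=("Male Population", "sum"),
    lowest_literacy=("literacy_rate_total", "min"),
)
print(summary)
```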

Transformation

The transform method returns an object that is indexed the same way (and has the same size) as the data being grouped.

Example:- Perform some group specific computation
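
A minimal sketch of such a computation, assuming a small made-up table with some missing literacy rates (the values are illustrative): each missing rate is filled with the mean of its own state.

```python
import numpy as np
import pandas as pd

# Small made-up table with missing values (illustrative only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Maharashtra", "Tamil Nadu", "Tamil Nadu"],
    "literacy_rate_total": [80.0, np.nan, 90.0, 70.0, np.nan],
})

# transform returns one value per original row, aligned to pop's index
state_mean = pop.groupby("State")["literacy_rate_total"].transform("mean")

# Fill each missing rate with the mean of its own state
pop["literacy_filled"] = pop["literacy_rate_total"].fillna(state_mean)
print(pop)
```

Unlike an aggregation, which would return one row per state, `state_mean` has as many rows as `pop`, so it can be used directly in row-wise operations such as fillna.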

Filtration

Example:- Keep only the cities that occur two or more times
grp = population.groupby('Cities') 
grp.filter(lambda x: len(x) >= 2)
Output:-
 
State	Cities	Female Population	Male Population	Total Population	literacy_rate_total
0	Maharashtra	Nagpur	334934	357959	508852	33.795318
1	Maharashtra	Nagpur	262344	470321	362675	34.382807
6	Madhya Pradesh	Indore	186505	381920	628995	26.194098
7	Madhya Pradesh	Indore	401469	212018	304758	12.990752
8	Tamil Nadu	Chennai	394035	109944	515960	17.705373
9	Tamil Nadu	Chennai	382899	245282	482176	27.986555
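
The filter condition can also be based on an aggregate of each group, such as its sum or mean. A minimal sketch on a small made-up table (the values and the one-million threshold are illustrative assumptions):

```python
import pandas as pd

# Small made-up table (illustrative values only)
pop = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra", "Maharashtra", "Tamil Nadu", "Tamil Nadu"],
    "Cities": ["Nagpur", "Nagpur", "Mumbai", "Chennai", "Chennai"],
    "Total Population": [500000, 300000, 700000, 400000, 500000],
})

# Keep all rows of every state whose combined population exceeds one million
big_states = pop.groupby("State").filter(
    lambda group: group["Total Population"].sum() > 1_000_000
)
print(big_states)
```

The function is evaluated once per group, and filter keeps or drops all rows of that group together.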

Conclusion:-

In this blog you learned how to group categorical data, how to work with the resulting groups, and how to apply aggregation functions such as sum to draw inferences from them.
