dplyr vs Pandas

Introduction

Over the years, the debate has always been: which one is better for data science, R or Python? I'd say it depends on where you are coming from, and as long as you get the job done, either works. RStudio is a great IDE that supports many languages, such as SQL, R, and Python. In this tutorial I will show you how to use Python in RStudio and compare the syntax of the two languages, so you can leverage the power of both. Make sure you have installed Anaconda from https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Windows-x86_64.exe

Set up

# enable python in RMarkdown
library(tidyverse)
library(reticulate)
use_condaenv("base")
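
If the setup chunk runs cleanly, reticulate is wired up. Two quick sanity checks (assuming the base conda environment from the Anaconda installer above):

# confirm which Python interpreter reticulate is bound to
py_config()

# confirm pandas is importable from that environment
py_module_available("pandas")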

Reading a CSV

Python

import pandas as pd

df = pd.read_csv('diabetes.csv')
df
#>      Pregnancies  Glucose  ...  Age  Outcome
#> 0              6      148  ...   50        1
#> 1              1       85  ...   31        0
#> 2              8      183  ...   32        1
#> 3              1       89  ...   21        0
#> 4              0      137  ...   33        1
#> ..           ...      ...  ...  ...      ...
#> 763           10      101  ...   63        0
#> 764            2      122  ...   27        0
#> 765            5      121  ...   30        0
#> 766            1      126  ...   47        1
#> 767            1       93  ...   23        0
#> 
#> [768 rows x 9 columns]
# verify whether the object df is a dataframe
type(df)
#> <class 'pandas.core.frame.DataFrame'>

R


library(tidyverse)

df <- read_csv('diabetes.csv')

# verify whether the object df is a dataframe
class(df)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Storing a DataFrame to a CSV

Python

import pandas as pd

# access the R main module via the 'r' object
df = r.df

# Storing a DataFrame to a CSV, do not include the index
df.to_csv('diabetes_new.csv', index=False)

# verify the dataframe is indeed saved there (review Method #1)
diab = pd.read_csv('diabetes_new.csv')
type(diab)
#> <class 'pandas.core.frame.DataFrame'>

R


library(tidyverse)

# access the python main module via the 'py' object
df <- py$diab

# Storing a DataFrame to a CSV
write_csv(df, 'diabetes_clean.csv')

# verify the dataframe is indeed saved there (review Method #1)
diab_clean = read_csv('diabetes_clean.csv')
class(diab_clean)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Dimensions

Python

The shape attribute returns the dimensionality (number_of_rows, number_of_columns) of a dataframe.

import pandas as pd

df = pd.read_csv('diabetes.csv')

df.shape
#> (768, 9)
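
Since shape is a plain tuple, each dimension can be indexed out:

df.shape[0] # number of rows
#> 768
len(df)     # len() on a dataframe also returns the row count
#> 768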

R

Similarly, dim() in base R returns the dimensionality of a dataframe.

library(readr) # a package in tidyverse that allows us to use read_csv

df <- read_csv('diabetes.csv')

dim(df)
#> [1] 768   9
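
nrow() and ncol() return the individual dimensions:

nrow(df)
#> [1] 768
ncol(df)
#> [1] 9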

First N Rows

Python

Use head() to show the top N rows.

import pandas as pd

df = pd.read_csv('diabetes.csv')

# print the top 3 rows in the dataframe
print(df.head(3))
#>    Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome
#> 0            6      148             72  ...                     0.627   50        1
#> 1            1       85             66  ...                     0.351   31        0
#> 2            8      183             64  ...                     0.672   32        1
#> 
#> [3 rows x 9 columns]

We can also sort the dataframe first then show the top N. For more on sorting a dataframe, please refer to method 13.

# to sort the dataframe first then check out the top N
df.sort_values('Age')\
  .head(5)
#>      Pregnancies  Glucose  ...  Age  Outcome
#> 255            1      113  ...   21        1
#> 60             2       84  ...   21        0
#> 102            0      125  ...   21        0
#> 182            1        0  ...   21        0
#> 623            0       94  ...   21        0
#> 
#> [5 rows x 9 columns]
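
pandas also offers nsmallest() and nlargest() as a shortcut for this sort-then-head pattern (row order among ties may differ slightly from sort_values()):

# shortcut: the 5 rows with the smallest Age
df.nsmallest(5, 'Age')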

R

Doing these tasks in R is pretty straightforward.


df <- py$df # get the dataframe from python

# print the top 3 rows in the dataframe
print(df |>  head(3))
#>   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
#> 1           6     148            72            35       0 33.6
#> 2           1      85            66            29       0 26.6
#> 3           8     183            64             0       0 23.3
#>   DiabetesPedigreeFunction Age Outcome
#> 1                    0.627  50       1
#> 2                    0.351  31       0
#> 3                    0.672  32       1

To sort first, we can use the arrange() function in the dplyr package.


library(dplyr) 

# to sort the dataframe first then check out the top N
df |> 
  arrange(Age) |> 
  relocate(Age) |> 
  head(5)
#>   Age Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
#> 1  21           1      89            66            23      94 28.1
#> 2  21           1      73            50            10       0 23.0
#> 3  21           2      84             0             0       0  0.0
#> 4  21           1      80            55             0       0 19.1
#> 5  21           2     142            82            18      64 24.7
#>   DiabetesPedigreeFunction Outcome
#> 1                    0.167       0
#> 2                    0.248       0
#> 3                    0.304       0
#> 4                    0.258       0
#> 5                    0.761       0
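
dplyr has a similar shortcut, slice_min() (and slice_max()), which replaces the explicit arrange() plus head():

# shortcut: the 5 rows with the smallest Age
df |> 
  slice_min(Age, n = 5, with_ties = FALSE) # with_ties = FALSE caps the result at 5 rows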

Printing the Datatype of Each Column

Python

# .dtypes returns the data type of every column
df.dtypes
#> Pregnancies                   int64
#> Glucose                       int64
#> BloodPressure                 int64
#> SkinThickness                 int64
#> Insulin                       int64
#> BMI                         float64
#> DiabetesPedigreeFunction    float64
#> Age                           int64
#> Outcome                       int64
#> dtype: object

R

The str() function in base R returns the data type of every column; by default it also shows the dimensionality of the dataframe, the length and head of each column, and any attributes as sub-structures.

df |> str(give.attr = FALSE)
#> 'data.frame':	768 obs. of  9 variables:
#>  $ Pregnancies             : num  6 1 8 1 0 5 3 10 2 8 ...
#>  $ Glucose                 : num  148 85 183 89 137 116 78 115 197 125 ...
#>  $ BloodPressure           : num  72 66 64 66 40 74 50 0 70 96 ...
#>  $ SkinThickness           : num  35 29 0 23 35 0 32 0 45 0 ...
#>  $ Insulin                 : num  0 0 0 94 168 0 88 0 543 0 ...
#>  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#>  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
#>  $ Age                     : num  50 31 32 21 33 30 26 29 53 54 ...
#>  $ Outcome                 : num  1 0 1 0 1 0 1 0 1 1 ...

Another option is the glimpse() function in the dplyr package, which returns the dimensionality of the dataframe, the data type of every column, and the head of each column as well.


library(dplyr)

df |> glimpse()
#> Rows: 768
#> Columns: 9
#> $ Pregnancies              <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
#> $ Glucose                  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
#> $ BloodPressure            <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
#> $ SkinThickness            <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
#> $ Insulin                  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
#> $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
#> $ Age                      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
#> $ Outcome                  <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
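
If you only need the class of each column, without the preview, a base R one-liner works too (every column here is numeric):

sapply(df, class)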

Modifying the Datatype of a Column

Python

import pandas as pd

data = {'id': [1,2,3,4,5],
        'name': ['Bongani', 'blaize', 'Ncube', 'Ropae', 'James']}
      
df1 = pd.DataFrame(data)

df1.dtypes
#> id       int64
#> name    object
#> dtype: object

The id field is of integer type. Use the astype() method to change it to string/character type as follows.

df2 = df1.copy()
# change the id field to string/character type
df2['id'] = df2['id'].astype(str)
df2.dtypes
#> id      object
#> name    object
#> dtype: object
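
To go the other way (mirroring as.numeric() in the R section below), pd.to_numeric() converts the strings back:

# change the id field back to numeric type
df2['id'] = pd.to_numeric(df2['id'])
df2.dtypes
#> id       int64
#> name    object
#> dtype: object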

R

# get the dataframe from python
df1 = py$df1

library(dplyr)

df1 |>  glimpse()
#> Rows: 5
#> Columns: 2
#> $ id   <dbl> 1, 2, 3, 4, 5
#> $ name <chr> "Bongani", "blaize", "Ncube", "Ropae", "James"

Use the as.character() function to change the id field to string/character type, and the as.numeric() function to change it back to numeric.

# change the id field to string/character type
df1 = df1 |> mutate(id = as.character(id))
df1 |>  glimpse()
#> Rows: 5
#> Columns: 2
#> $ id   <chr> "1", "2", "3", "4", "5"
#> $ name <chr> "Bongani", "blaize", "Ncube", "Ropae", "James"

# change the id field back to numeric type
df1 = df1 |> mutate(id = as.numeric(id))
df1 |> glimpse()
#> Rows: 5
#> Columns: 2
#> $ id   <dbl> 1, 2, 3, 4, 5
#> $ name <chr> "Bongani", "blaize", "Ncube", "Ropae", "James"

Printing Descriptive Info about the DataFrame

Python

The info() method prints the missing-value stats and the datatypes. Recall that we can also get datatype info using .dtypes.

df.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 768 entries, 0 to 767
#> Data columns (total 9 columns):
#>  #   Column                    Non-Null Count  Dtype  
#> ---  ------                    --------------  -----  
#>  0   Pregnancies               768 non-null    int64  
#>  1   Glucose                   768 non-null    int64  
#>  2   BloodPressure             768 non-null    int64  
#>  3   SkinThickness             768 non-null    int64  
#>  4   Insulin                   768 non-null    int64  
#>  5   BMI                       768 non-null    float64
#>  6   DiabetesPedigreeFunction  768 non-null    float64
#>  7   Age                       768 non-null    int64  
#>  8   Outcome                   768 non-null    int64  
#> dtypes: float64(2), int64(7)
#> memory usage: 54.1 KB
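
info() reports the non-null count per column. To count missing values directly, a common pandas idiom is isnull().sum(); here it should be all zeros, since every column shows 768 non-null entries:

df.isnull().sum()
#> Pregnancies                 0
#> Glucose                     0
#> BloodPressure               0
#> SkinThickness               0
#> Insulin                     0
#> BMI                         0
#> DiabetesPedigreeFunction    0
#> Age                         0
#> Outcome                     0
#> dtype: int64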

R

summary() in base R returns the number of missing values as well as some summary statistics.


df |>  summary()
#>   Pregnancies        Glucose      BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
#>  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
#>  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
#>  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>     Outcome     
#>  Min.   :0.000  
#>  1st Qu.:0.000  
#>  Median :0.000  
#>  Mean   :0.349  
#>  3rd Qu.:1.000  
#>  Max.   :1.000

Printing Descriptive Info about the DataFrame (Method 2)

Python

describe() returns standard statistics such as the mean, standard deviation, and maximum of every numeric-valued column.

df.describe()
#>        Pregnancies     Glucose  ...         Age     Outcome
#> count   768.000000  768.000000  ...  768.000000  768.000000
#> mean      3.845052  120.894531  ...   33.240885    0.348958
#> std       3.369578   31.972618  ...   11.760232    0.476951
#> min       0.000000    0.000000  ...   21.000000    0.000000
#> 25%       1.000000   99.000000  ...   24.000000    0.000000
#> 50%       3.000000  117.000000  ...   29.000000    0.000000
#> 75%       6.000000  140.250000  ...   41.000000    1.000000
#> max      17.000000  199.000000  ...   81.000000    1.000000
#> 
#> [8 rows x 9 columns]

describe() can also offer some info on categorical columns (the number of unique values, the most frequent value and its frequency) if we add the include = 'all' argument. Here the output is unchanged, because every column in df is numeric.

df.describe(include = 'all')
#>        Pregnancies     Glucose  ...         Age     Outcome
#> count   768.000000  768.000000  ...  768.000000  768.000000
#> mean      3.845052  120.894531  ...   33.240885    0.348958
#> std       3.369578   31.972618  ...   11.760232    0.476951
#> min       0.000000    0.000000  ...   21.000000    0.000000
#> 25%       1.000000   99.000000  ...   24.000000    0.000000
#> 50%       3.000000  117.000000  ...   29.000000    0.000000
#> 75%       6.000000  140.250000  ...   41.000000    1.000000
#> max      17.000000  199.000000  ...   81.000000    1.000000
#> 
#> [8 rows x 9 columns]

R

As discussed in Method 9, summary() in base R returns the number of missing values as well as some summary statistics for all numerical and factor columns.


df = py$df

df |>  summary()
#>   Pregnancies        Glucose      BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
#>  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
#>  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
#>  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>     Outcome     
#>  Min.   :0.000  
#>  1st Qu.:0.000  
#>  Median :0.000  
#>  Mean   :0.349  
#>  3rd Qu.:1.000  
#>  Max.   :1.000

For categorical variables, like Species in the iris dataset, summary() can give us frequency counts if we turn the column into factor type in R. Here we treat Outcome as a categorical variable.


library(dplyr) #for mutate

df |> 
  mutate(Outcome = as.factor(Outcome)) |> # see method17 for more on mutate()   
  summary()
#>   Pregnancies        Glucose      BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
#>  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
#>  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
#>  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>  Outcome
#>  0:500  
#>  1:268  
#>         
#>         
#>         
#> 

In Python, we can use value_counts() to get frequency counts. See method 27 for more.

df['Outcome'].value_counts()
#> 0    500
#> 1    268
#> Name: Outcome, dtype: int64

Filling NaN Values

Python

We can use the df.fillna() method to replace missing values with a specific value. Let’s start by creating a dataframe with missing values.

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1,2],
                   'col2': [3,np.nan],
                   'col3': ['A',np.nan]})
print(df)
#>    col1  col2 col3
#> 0     1   3.0    A
#> 1     2   NaN  NaN

To replace all missing values with 0, just call .fillna(0).

# replace all NA values with 0; inplace = True means df itself will be modified
df.fillna(0, inplace = True)
print(df)
#>    col1  col2 col3
#> 0     1   3.0    A
#> 1     2   0.0    0

Or, if you only want to replace missing values in one particular column, simply select it first:

df2 = df.copy()
df2['col2'].fillna(0, inplace = True) 
print(df2)
#>    col1  col2 col3
#> 0     1   3.0    A
#> 1     2   0.0    0
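
One caveat: recent pandas releases warn about this chained inplace pattern, since df2['col2'] may be a copy of the column. The assignment form is the safer idiom:

# assignment form avoids the chained-assignment warning on newer pandas
df2['col2'] = df2['col2'].fillna(0)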

R

We can use either replace() in base R to replace all missing values in a dataframe with a specific value, or replace_na() in tidyr to provide a tailored replacement for each specific column.


df = data.frame(col1 = c(1,2),
                col2 = c(3, NA),
                col3 = c('A',NA))

print(df)
#>   col1 col2 col3
#> 1    1    3    A
#> 2    2   NA <NA>

# use replace() in base R to replace all missing values with 0
replace(df, is.na(df), 0)
#>   col1 col2 col3
#> 1    1    3    A
#> 2    2    0    0

library(tidyr)
# or use replace_na() in tidyverse to offer tailored replacement for each column. 
df |>  replace_na(list(col2=0,col3="Unknown"))
#>   col1 col2    col3
#> 1    1    3       A
#> 2    2    0 Unknown

Grouping a DataFrame

Python

We can use the groupby() method in pandas to group a dataframe and then perform aggregations with agg(). We can put both methods on one line, or wrap the chain of methods in parentheses and put each on its own line. The latter enhances readability when multiple methods are chained together.

import pandas as pd

df = pd.DataFrame([[1, 2,  "A"], 
                   [5, 8,  "B"], 
                   [3, 10, "B"]], 
                  columns = ["col1", "col2", "col3"])

# put both methods in one line 
df.groupby('col3').agg({'col1':sum, 'col2':max})
# alternatively, show each method in separate lines
(df
 .groupby("col3")
 .agg({"col1":sum, "col2":max}) # get sum for col1 and max for col2
 )

Above we specify a different aggregate for each column, but the code can be simplified if the same aggregates are needed for all columns.

(df
 .groupby("col3")
 .agg(['min','max','median']) # get these three aggregates for all
)

R

Similarly, in the tidyverse we use group_by() to do the grouping, then summarize() for the aggregation.

library(dplyr)

df <- py$df

df |> 
  group_by(col3) |> 
  summarise(col1_sum = sum(col1),
            col2_max = max(col2))
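
The Python section above also applied the same set of aggregates to every column; the dplyr counterpart is across(). A minimal sketch:

# same three aggregates for both numeric columns, as in .agg(['min','max','median'])
df |> 
  group_by(col3) |> 
  summarise(across(c(col1, col2),
                   list(min = min, max = max, median = median)))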

Complex Aggregation

R

What if we want to do a slightly more complex aggregation that is not available as a default function/method? Let’s say we want to add a column to represent a percentage within each group. For example, below we have the sales of three types of fruit over two months. We would like to add a column pct_month representing each fruit’s share of the sales within each month.

df <- data.frame(
  month = c(rep('Jan',3), rep('Feb',3)),
  fruit = c('Apple', 'Kiwi','banana', 'Apple', 'Kiwi','banana'),
  sale = c(20,30, 30,30,20,15)
)

df
#>   month  fruit sale
#> 1   Jan  Apple   20
#> 2   Jan   Kiwi   30
#> 3   Jan banana   30
#> 4   Feb  Apple   30
#> 5   Feb   Kiwi   20
#> 6   Feb banana   15

Notice that here we need one value for each row, rather than one value for each group. Therefore, instead of using summarize() as we did above, this time the mutate() function is our friend. We can also easily add a round() function to round the percentage.

func <- function(x) {
  ratio <- x / sum(x)
  return(ratio)
}
df |>  
  group_by(month) |>  
  mutate(pct_month = func(sale) |> round(3) * 100)
#> # A tibble: 6 x 4
#> # Groups:   month [2]
#>   month fruit   sale pct_month
#>   <chr> <chr>  <dbl>     <dbl>
#> 1 Jan   Apple     20      25  
#> 2 Jan   Kiwi      30      37.5
#> 3 Jan   banana    30      37.5
#> 4 Feb   Apple     30      46.2
#> 5 Feb   Kiwi      20      30.8
#> 6 Feb   banana    15      23.1

Python

Now let’s see how to do this in Python. We can create a function first, then call it with the transform() method.

df = r.df 

def pct_total(s):
  return s/sum(s)

df['pct_month'] = (df
                    .groupby('month')['sale']
                    .transform(pct_total).round(3) * 100
                    )
df
#>   month   fruit  sale  pct_month
#> 0   Jan   Apple  20.0       25.0
#> 1   Jan    Kiwi  30.0       37.5
#> 2   Jan  banana  30.0       37.5
#> 3   Feb   Apple  30.0       46.2
#> 4   Feb    Kiwi  20.0       30.8
#> 5   Feb  banana  15.0       23.1
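
If the helper is a one-off, the same transform can be written inline with a lambda:

# inline equivalent of pct_total
df['pct_month'] = (df
                    .groupby('month')['sale']
                    .transform(lambda s: s / s.sum())
                    .round(3) * 100
                    )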

Create a dataframe in Python

import pandas as pd

df = pd.DataFrame({'col1': [1,5,3],
                   'col2': [8,4,10],
                   'col3': ['A','B','B']})
df
#>    col1  col2 col3
#> 0     1     8    A
#> 1     5     4    B
#> 2     3    10    B

Load the dataframe into R

df <- py$df #load the df object created in Python above

df
#>   col1 col2 col3
#> 1    1    8    A
#> 2    5    4    B
#> 3    3   10    B

Filtering a DataFrame: Boolean Filtering

Filter with a value

A row is selected when the specified condition evaluates to True for it. For example:

Python

df[df['col2']>5]
#>    col1  col2 col3
#> 0     1     8    A
#> 2     3    10    B
# or to improve readability, we could do it in two steps
col2_larger_than_5 = df['col2'] > 5
df[col2_larger_than_5]
#>    col1  col2 col3
#> 0     1     8    A
#> 2     3    10    B
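
Multiple conditions combine with & (and) and | (or); note that each condition needs its own parentheses:

df[(df['col2'] > 5) & (df['col3'] == 'B')]
#>    col1  col2 col3
#> 2     3    10    B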

R


library(dplyr)

df |> 
  filter(col2 > 5)
#>   col1 col2 col3
#> 1    1    8    A
#> 2    3   10    B
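
filter() combines comma-separated conditions with a logical AND, so the pandas example above translates directly:

df |> 
  filter(col2 > 5, col3 == "B")
#>   col1 col2 col3
#> 1    3   10    B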