Relational Data

This notebook will go through merge function in pandas. We all know there are joins in SQL query. In pandas, you can achieve the same results.

This notebook is just for personal review for class 15688 Practical Data science. For the class notes, you can go to Relational data Note

First, let’s create some fake datasets.

import pandas as pd 
Person = pd.DataFrame([(1, 'Kolter', 'Zico'), 
                   (2, 'Xi', 'Edgar'), 
                   (3, 'Lee', 'Mark'), 
                   (4, 'Mani', 'Shouvik'), 
                   (5, 'Gates', 'Bill'), 
                   (6, 'Musk', 'Elon')], 
                  columns=["ID", "Last Name", "First Name"])
Person.set_index("ID", inplace=True)

Person
Last Name First Name
ID
1 Kolter Zico
2 Xi Edgar
3 Lee Mark
4 Mani Shouvik
5 Gates Bill
6 Musk Elon
Grade = pd.DataFrame([(5,'HW1',80),
                    (6,'HW1',90),
                     (6,'HW2',0),
                     (100,'HW1',100),
                     (1,'HW1',90),
                     (1,'HW2',100)],
                     columns=['PersonId','HW','Grade']
                    )
Grade.set_index("PersonId", inplace=True)

Grade
HW Grade
PersonId
5 HW1 80
6 HW1 90
6 HW2 0
100 HW1 100
1 HW1 90
1 HW2 100

Description

Person dataset restores each student’s informtion. ID is its primary id in the SQL context. In pandas, it is index.

Grade dataset restores each student’s grade. It can be multiple inputs for each person for they can have multiple grades.

Frist, we try to inner join to select all the entries both Person and Grade have.

Person.reset_index()
ID Last Name First Name
0 1 Kolter Zico
1 2 Xi Edgar
2 3 Lee Mark
3 4 Mani Shouvik
4 5 Gates Bill
5 6 Musk Elon

Method 1: merge two dataframes by their columns

pd.merge(Person.reset_index(),Grade.reset_index(),left_on = "ID", right_on="PersonId")
ID Last Name First Name PersonId HW Grade
0 1 Kolter Zico 1 HW1 90
1 1 Kolter Zico 1 HW2 100
2 5 Gates Bill 5 HW1 80
3 6 Musk Elon 6 HW1 90
4 6 Musk Elon 6 HW2 0

Method 2: merge two dataframes by their indexes

pd.merge(Person, Grade, left_index=True, right_index=True)
Last Name First Name HW Grade
1 Kolter Zico HW1 90
1 Kolter Zico HW2 100
5 Gates Bill HW1 80
6 Musk Elon HW1 90
6 Musk Elon HW2 0

Second, let’s try left join. Get all the students with their grade. We need Person left join Grade here.

We still showcase two methods here.

pd.merge(Person, Grade, how='left',left_index=True, right_index=True).rename_axis('ID')
Last Name First Name HW Grade
ID
1 Kolter Zico HW1 90.0
1 Kolter Zico HW2 100.0
2 Xi Edgar NaN NaN
3 Lee Mark NaN NaN
4 Mani Shouvik NaN NaN
5 Gates Bill HW1 80.0
6 Musk Elon HW1 90.0
6 Musk Elon HW2 0.0
pd.merge(Person.reset_index(),Grade.reset_index(),how='left',left_on = "ID", right_on="PersonId").set_index('ID')
Last Name First Name PersonId HW Grade
ID
1 Kolter Zico 1.0 HW1 90.0
1 Kolter Zico 1.0 HW2 100.0
2 Xi Edgar NaN NaN NaN
3 Lee Mark NaN NaN NaN
4 Mani Shouvik NaN NaN NaN
5 Gates Bill 5.0 HW1 80.0
6 Musk Elon 6.0 HW1 90.0
6 Musk Elon 6.0 HW2 0.0

Last, let’s try Outer join. Get all the students(also the ones in Grade dataset) with their grade. We need Person outer join Grade here.

We still showcase two methods here.

pd.merge(Person, Grade, how='outer',left_index=True, right_index=True).rename_axis('ID')
Last Name First Name HW Grade
ID
1 Kolter Zico HW1 90.0
1 Kolter Zico HW2 100.0
2 Xi Edgar NaN NaN
3 Lee Mark NaN NaN
4 Mani Shouvik NaN NaN
5 Gates Bill HW1 80.0
6 Musk Elon HW1 90.0
6 Musk Elon HW2 0.0
100 NaN NaN HW1 100.0
pd.merge(Person.reset_index(),Grade.reset_index(),how='outer',left_on = "ID", right_on="PersonId").set_index('ID')
Last Name First Name PersonId HW Grade
ID
1.0 Kolter Zico 1.0 HW1 90.0
1.0 Kolter Zico 1.0 HW2 100.0
2.0 Xi Edgar NaN NaN NaN
3.0 Lee Mark NaN NaN NaN
4.0 Mani Shouvik NaN NaN NaN
5.0 Gates Bill 5.0 HW1 80.0
6.0 Musk Elon 6.0 HW1 90.0
6.0 Musk Elon 6.0 HW2 0.0
NaN NaN NaN 100.0 HW1 100.0

End