Relational Data
This notebook will go through merge function in pandas. We all know there are joins in SQL query. In pandas, you can achieve the same results.
This notebook is just for personal review for class 15688 Practical Data science. For the class notes, you can go to Relational data Note
First, let’s create some fake datasets.
import pandas as pd
Person = pd.DataFrame([(1, 'Kolter', 'Zico'),
(2, 'Xi', 'Edgar'),
(3, 'Lee', 'Mark'),
(4, 'Mani', 'Shouvik'),
(5, 'Gates', 'Bill'),
(6, 'Musk', 'Elon')],
columns=["ID", "Last Name", "First Name"])
Person.set_index("ID", inplace=True)
Person
|
Last Name |
First Name |
ID |
|
|
1 |
Kolter |
Zico |
2 |
Xi |
Edgar |
3 |
Lee |
Mark |
4 |
Mani |
Shouvik |
5 |
Gates |
Bill |
6 |
Musk |
Elon |
Grade = pd.DataFrame([(5,'HW1',80),
(6,'HW1',90),
(6,'HW2',0),
(100,'HW1',100),
(1,'HW1',90),
(1,'HW2',100)],
columns=['PersonId','HW','Grade']
)
Grade.set_index("PersonId", inplace=True)
Grade
|
HW |
Grade |
PersonId |
|
|
5 |
HW1 |
80 |
6 |
HW1 |
90 |
6 |
HW2 |
0 |
100 |
HW1 |
100 |
1 |
HW1 |
90 |
1 |
HW2 |
100 |
Description
Person dataset restores each student’s informtion. ID is its primary id in the SQL context. In pandas, it is index.
Grade dataset restores each student’s grade. It can be multiple inputs for each person for they can have multiple grades.
Frist, we try to inner join to select all the entries both Person and Grade have.
|
ID |
Last Name |
First Name |
0 |
1 |
Kolter |
Zico |
1 |
2 |
Xi |
Edgar |
2 |
3 |
Lee |
Mark |
3 |
4 |
Mani |
Shouvik |
4 |
5 |
Gates |
Bill |
5 |
6 |
Musk |
Elon |
Method 1: merge two dataframes by their columns
pd.merge(Person.reset_index(),Grade.reset_index(),left_on = "ID", right_on="PersonId")
|
ID |
Last Name |
First Name |
PersonId |
HW |
Grade |
0 |
1 |
Kolter |
Zico |
1 |
HW1 |
90 |
1 |
1 |
Kolter |
Zico |
1 |
HW2 |
100 |
2 |
5 |
Gates |
Bill |
5 |
HW1 |
80 |
3 |
6 |
Musk |
Elon |
6 |
HW1 |
90 |
4 |
6 |
Musk |
Elon |
6 |
HW2 |
0 |
Method 2: merge two dataframes by their indexes
pd.merge(Person, Grade, left_index=True, right_index=True)
|
Last Name |
First Name |
HW |
Grade |
1 |
Kolter |
Zico |
HW1 |
90 |
1 |
Kolter |
Zico |
HW2 |
100 |
5 |
Gates |
Bill |
HW1 |
80 |
6 |
Musk |
Elon |
HW1 |
90 |
6 |
Musk |
Elon |
HW2 |
0 |
Second, let’s try left join. Get all the students with their grade. We need Person left join Grade here.
We still showcase two methods here.
pd.merge(Person, Grade, how='left',left_index=True, right_index=True).rename_axis('ID')
|
Last Name |
First Name |
HW |
Grade |
ID |
|
|
|
|
1 |
Kolter |
Zico |
HW1 |
90.0 |
1 |
Kolter |
Zico |
HW2 |
100.0 |
2 |
Xi |
Edgar |
NaN |
NaN |
3 |
Lee |
Mark |
NaN |
NaN |
4 |
Mani |
Shouvik |
NaN |
NaN |
5 |
Gates |
Bill |
HW1 |
80.0 |
6 |
Musk |
Elon |
HW1 |
90.0 |
6 |
Musk |
Elon |
HW2 |
0.0 |
pd.merge(Person.reset_index(),Grade.reset_index(),how='left',left_on = "ID", right_on="PersonId").set_index('ID')
|
Last Name |
First Name |
PersonId |
HW |
Grade |
ID |
|
|
|
|
|
1 |
Kolter |
Zico |
1.0 |
HW1 |
90.0 |
1 |
Kolter |
Zico |
1.0 |
HW2 |
100.0 |
2 |
Xi |
Edgar |
NaN |
NaN |
NaN |
3 |
Lee |
Mark |
NaN |
NaN |
NaN |
4 |
Mani |
Shouvik |
NaN |
NaN |
NaN |
5 |
Gates |
Bill |
5.0 |
HW1 |
80.0 |
6 |
Musk |
Elon |
6.0 |
HW1 |
90.0 |
6 |
Musk |
Elon |
6.0 |
HW2 |
0.0 |
Last, let’s try Outer join. Get all the students(also the ones in Grade dataset) with their grade. We need Person outer join Grade here.
We still showcase two methods here.
pd.merge(Person, Grade, how='outer',left_index=True, right_index=True).rename_axis('ID')
|
Last Name |
First Name |
HW |
Grade |
ID |
|
|
|
|
1 |
Kolter |
Zico |
HW1 |
90.0 |
1 |
Kolter |
Zico |
HW2 |
100.0 |
2 |
Xi |
Edgar |
NaN |
NaN |
3 |
Lee |
Mark |
NaN |
NaN |
4 |
Mani |
Shouvik |
NaN |
NaN |
5 |
Gates |
Bill |
HW1 |
80.0 |
6 |
Musk |
Elon |
HW1 |
90.0 |
6 |
Musk |
Elon |
HW2 |
0.0 |
100 |
NaN |
NaN |
HW1 |
100.0 |
pd.merge(Person.reset_index(),Grade.reset_index(),how='outer',left_on = "ID", right_on="PersonId").set_index('ID')
|
Last Name |
First Name |
PersonId |
HW |
Grade |
ID |
|
|
|
|
|
1.0 |
Kolter |
Zico |
1.0 |
HW1 |
90.0 |
1.0 |
Kolter |
Zico |
1.0 |
HW2 |
100.0 |
2.0 |
Xi |
Edgar |
NaN |
NaN |
NaN |
3.0 |
Lee |
Mark |
NaN |
NaN |
NaN |
4.0 |
Mani |
Shouvik |
NaN |
NaN |
NaN |
5.0 |
Gates |
Bill |
5.0 |
HW1 |
80.0 |
6.0 |
Musk |
Elon |
6.0 |
HW1 |
90.0 |
6.0 |
Musk |
Elon |
6.0 |
HW2 |
0.0 |
NaN |
NaN |
NaN |
100.0 |
HW1 |
100.0 |
End