Pandas Selecting document summary
6/5/2018
This walk-through jupyter notebook outlined the important techniques shown in pandas document for selecting and modifying. The website can be referred as: Pandas advanced indexing with labels
Dataset uses the housing price competition from Kaggle.com. The website is Housing price competition
Selecting and modifying with Pandas is very useful tools for data scientists to grasp. Let’s start.
import pandas as pd
test = pd.read_csv('C:/Users/t_louji/Downloads/test.csv',index_col='Id')
test.head()
#test.info()
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 79 columns
Selecting
1. .loc (select by label)
.loc is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
-
A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
-
A list or array of labels [‘a’, ‘b’, ‘c’].
-
A slice object with labels ‘a’:’f’ (Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See Slicing with labels.).
-
A boolean array
-
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).
# test.loc[0] #key error
test.loc[1461:1500]
test.loc[test.MSZoning=='RH']
test.loc[test.MSZoning!='RH','Alley'].head()
Id
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
Name: Alley, dtype: object
df1.loc['a', 'A'] #row label a and column A
**Note: ** When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.
2. .iloc (select by position)
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
-
An integer e.g. 5.
-
A list or array of integers [4, 3, 0].
-
A slice object with ints 1:7.
-
A boolean array.
-
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).
test.iloc[0:3]
test.iloc[0:3,0:3]
test.iloc[3,3] # a scalar value
test.iloc[1]
#Out of range slice indexes are handled gracefully just as in Python/Numpy.
test.iloc[3000:30001]
#A single indexer that is out of bounds will raise an IndexError.
#A list of indexers where any element is out of bounds will raise an IndexError.
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id |
0 rows × 79 columns
** Note ** :
Object Type Indexers
Series s.loc[indexer]
DataFrame df.loc[row_indexer,column_indexer]
3. [ ]
Series series[label] scalar value
DataFrame frame[colname] Series corresponding to colname
test['MSZoning'][test.index[0]] #get scala element
'RH'
4. Swap two columns
# The correct way to swap column values is by using raw values:
#df.loc[:,['B', 'A']] = df[['A', 'B']].values
5. Attribute Access
.COL_Name
test1=test.copy()
test1.Street.head()
Id
1461 Pave
1462 Pave
1463 Pave
1464 Pave
1465 Pave
Name: Street, dtype: object
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x.iloc[1] = dict(x=9, y=99) # assign a dict to a row of a dataframe
x
x | y | |
---|---|---|
0 | 1 | 3 |
1 | 9 | 99 |
2 | 3 | 5 |
**Note: ** You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column.
6. Slicing ranges
df[:n]
test[::2].head() ##every odd column
s2[:5] = 0
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
1467 | 20 | RL | NaN | 7980 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal |
1469 | 20 | RL | 85.0 | 10176 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2010 | WD | Normal |
5 rows × 79 columns
ix indexer is deprecated
7. Scalar value getting and setting
Similarly to loc, at provides label based scalar lookups, while, iat provides integer based lookups analogously to iloc
test.at[1461,'MSZoning']
'RH'
test.iat[0,1]
'RH'
8. Boolean indexing
The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3). |
test[test.MSSubClass>100].head()
##multiple conditions
test[(test.MSSubClass>100)&(test.MSSubClass<160)].head()
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
1471 | 120 | RH | 26.0 | 5858 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
1475 | 120 | RL | 24.0 | 2280 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
1482 | 120 | RL | 41.0 | 7132 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal |
1484 | 120 | RL | 43.0 | 3203 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 79 columns
Comprehensive methods
criterion = test['Street'].map(lambda x: x.startswith('P'))
test[criterion].head()
# equivalent but slower
# test[[x.startswith('t') for x in df2['a']]]
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 79 columns