Pandas Selecting document summary

6/5/2018

This walk-through jupyter notebook outlined the important techniques shown in pandas document for selecting and modifying. The website can be referred as: Pandas advanced indexing with labels

Dataset uses the housing price competition from Kaggle.com. The website is Housing price competition

Selecting and modifying with Pandas is very useful tools for data scientists to grasp. Let’s start.

import pandas as pd
test = pd.read_csv('C:/Users/t_louji/Downloads/test.csv',index_col='Id')
test.head()
#test.info()
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id
1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 79 columns

Selecting

1. .loc (select by label)

.loc is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

  • A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).

  • A list or array of labels [‘a’, ‘b’, ‘c’].

  • A slice object with labels ‘a’:’f’ (Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See Slicing with labels.).

  • A boolean array

  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).

# test.loc[0] #key error
test.loc[1461:1500]
test.loc[test.MSZoning=='RH']
test.loc[test.MSZoning!='RH','Alley'].head()
Id
1462    NaN
1463    NaN
1464    NaN
1465    NaN
1466    NaN
Name: Alley, dtype: object
df1.loc['a', 'A']   #row label a and column A

**Note: ** When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

2. .iloc (select by position)

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:

  • An integer e.g. 5.

  • A list or array of integers [4, 3, 0].

  • A slice object with ints 1:7.

  • A boolean array.

  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).

test.iloc[0:3]
test.iloc[0:3,0:3]
test.iloc[3,3]  # a scalar value
test.iloc[1]

#Out of range slice indexes are handled gracefully just as in Python/Numpy.
test.iloc[3000:30001]
#A single indexer that is out of bounds will raise an IndexError. 
#A list of indexers where any element is out of bounds will raise an IndexError.

MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id

0 rows × 79 columns

** Note ** :
Object Type Indexers

        Series	      s.loc[indexer]

        DataFrame	df.loc[row_indexer,column_indexer]

3. [ ]

Series series[label] scalar value

DataFrame frame[colname] Series corresponding to colname

test['MSZoning'][test.index[0]]  #get scala element
'RH'

4. Swap two columns

# The correct way to swap column values is by using raw values:

#df.loc[:,['B', 'A']] = df[['A', 'B']].values

5. Attribute Access

.COL_Name

test1=test.copy()
test1.Street.head()
Id
1461    Pave
1462    Pave
1463    Pave
1464    Pave
1465    Pave
Name: Street, dtype: object
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x.iloc[1] = dict(x=9, y=99) # assign a dict to a row of a dataframe
x
x y
0 1 3
1 9 99
2 3 5

**Note: ** You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column.

6. Slicing ranges

df[:n]

test[::2].head() ##every odd column
s2[:5] = 0
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id
1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside ... 144 0 NaN NaN NaN 0 1 2010 WD Normal
1467 20 RL NaN 7980 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN GdPrv Shed 500 3 2010 WD Normal
1469 20 RL 85.0 10176 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 2 2010 WD Normal

5 rows × 79 columns

ix indexer is deprecated

7. Scalar value getting and setting

Similarly to loc, at provides label based scalar lookups, while, iat provides integer based lookups analogously to iloc

test.at[1461,'MSZoning']
'RH'
test.iat[0,1]
'RH'

8. Boolean indexing

The operators are: for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).
test[test.MSSubClass>100].head()
##multiple conditions
test[(test.MSSubClass>100)&(test.MSSubClass<160)].head()
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id
1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside ... 144 0 NaN NaN NaN 0 1 2010 WD Normal
1471 120 RH 26.0 5858 Pave NaN IR1 Lvl AllPub FR2 ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
1475 120 RL 24.0 2280 Pave NaN Reg Lvl AllPub FR2 ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
1482 120 RL 41.0 7132 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 4 2010 WD Normal
1484 120 RL 43.0 3203 Pave NaN Reg Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 79 columns

Comprehensive methods

criterion = test['Street'].map(lambda x: x.startswith('P'))
test[criterion].head()

# equivalent but slower
# test[[x.startswith('t') for x in df2['a']]]
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
Id
1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub Inside ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 79 columns