Pandas Selecting document summary

6/5/2018

This walk-through jupyter notebook outlined the important techniques shown in pandas document for selecting and modifying. The website can be referred as: Pandas advanced indexing with labels

Dataset uses the housing price competition from Kaggle.com. The website is Housing price competition

Selecting and modifying with Pandas is very useful tools for data scientists to grasp. Let’s start.

import pandas as pd
test = pd.read_csv('C:/Users/t_louji/Downloads/test.csv',index_col='Id')
test.head()
#test.info()

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	Inside	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1462	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	Corner	...	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
1464	60	RL	78.0	9978	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	Inside	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal

5 rows × 79 columns

Selecting

1. .loc (select by label)

.loc is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
A list or array of labels [‘a’, ‘b’, ‘c’].
A slice object with labels ‘a’:’f’ (Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See Slicing with labels.).
A boolean array
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).

# test.loc[0] #key error
test.loc[1461:1500]
test.loc[test.MSZoning=='RH']
test.loc[test.MSZoning!='RH','Alley'].head()

Id
  NaN
  NaN
  NaN
  NaN
  NaN
Name: Alley, dtype: object

df1.loc['a', 'A']   #row label a and column A

**Note: ** When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

2. .iloc (select by position)

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:

An integer e.g. 5.
A list or array of integers [4, 3, 0].
A slice object with ints 1:7.
A boolean array.
A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above).

test.iloc[0:3]
test.iloc[0:3,0:3]
test.iloc[3,3]  # a scalar value
test.iloc[1]

#Out of range slice indexes are handled gracefully just as in Python/Numpy.
test.iloc[3000:30001]
#A single indexer that is out of bounds will raise an IndexError. 
#A list of indexers where any element is out of bounds will raise an IndexError.

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id

0 rows × 79 columns

** Note ** :
Object Type Indexers

        Series	      s.loc[indexer]

        DataFrame	df.loc[row_indexer,column_indexer]

3. [ ]

Series series[label] scalar value

DataFrame frame[colname] Series corresponding to colname

test['MSZoning'][test.index[0]]  #get scala element

'RH'

4. Swap two columns

# The correct way to swap column values is by using raw values:

#df.loc[:,['B', 'A']] = df[['A', 'B']].values

5. Attribute Access

.COL_Name

test1=test.copy()
test1.Street.head()

Id
  Pave
  Pave
  Pave
  Pave
  Pave
Name: Street, dtype: object

x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x.iloc[1] = dict(x=9, y=99) # assign a dict to a row of a dataframe

	x	y
0	1	3
1	9	99
2	3	5

**Note: ** You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column.

6. Slicing ranges

df[:n]

test[::2].head() ##every odd column
s2[:5] = 0

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	Inside	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	Inside	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal
1467	20	RL	NaN	7980	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	GdPrv	Shed	500	3	2010	WD	Normal
1469	20	RL	85.0	10176	Pave	NaN	Reg	Lvl	AllPub	Inside	...	0	0	NaN	NaN	NaN	0	2	2010	WD	Normal

5 rows × 79 columns

ix indexer is deprecated

7. Scalar value getting and setting

Similarly to loc, at provides label based scalar lookups, while, iat provides integer based lookups analogously to iloc

test.at[1461,'MSZoning']

'RH'

test.iat[0,1]

'RH'

8. Boolean indexing

The operators are:

for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

test[test.MSSubClass>100].head()
##multiple conditions
test[(test.MSSubClass>100)&(test.MSSubClass<160)].head()

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	Inside	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal
1471	120	RH	26.0	5858	Pave	NaN	IR1	Lvl	AllPub	FR2	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
1475	120	RL	24.0	2280	Pave	NaN	Reg	Lvl	AllPub	FR2	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
1482	120	RL	41.0	7132	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	NaN	NaN	0	4	2010	WD	Normal
1484	120	RL	43.0	3203	Pave	NaN	Reg	Lvl	AllPub	Inside	...	0	0	NaN	NaN	NaN	0	1	2010	WD	Normal

5 rows × 79 columns

Comprehensive methods

criterion = test['Street'].map(lambda x: x.startswith('P'))
test[criterion].head()

# equivalent but slower
# test[[x.startswith('t') for x in df2['a']]]

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	Inside	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1462	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	Corner	...	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
1464	60	RL	78.0	9978	Pave	NaN	IR1	Lvl	AllPub	Inside	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	Inside	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal

5 rows × 79 columns

Pandas Intro Modify and Select

Starter package to Pandas

Pandas Selecting document summary

Selecting

1. .loc (select by label)

2. .iloc (select by position)

3. [ ]

4. Swap two columns

5. Attribute Access

6. Slicing ranges

ix indexer is deprecated

7. Scalar value getting and setting

8. Boolean indexing