This note is updated frequently without notice!

In this note, df denotes a DataFrame and s denotes a Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

Other tasks

Deal with columns

Remove or keep some

Remove columns,

df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)

Only keep some,

kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]
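
As a rough illustration of both approaches, here is a minimal sketch on a small made-up frame (the values are hypothetical). It uses the non-inplace form of drop, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'New': [5, 6]})

# Without inplace=True, drop returns a new DataFrame
dropped = df.drop(columns=['New'])

# Selecting with a list of names keeps only those columns, in the listed order
kept_cols = ['col2', 'col1']
kept = df[kept_cols]

print(list(dropped.columns))  # ['col1', 'col2']
print(list(kept.columns))     # ['col2', 'col1']
```

Note that drop(columns=[...]) is equivalent to drop([...], axis=1).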

Rename columns

In this part, we use the DataFrame df below.

   Name  Ages  Marks    Place
0  John    10      8  Ben Tre
1   Thi    20      9    Paris

# implicitly
df.columns = ['Surname', 'Years', 'Grade', 'Location']

# explicitly
df.rename(columns={
  'Name': 'Surname',
  'Ages': 'Years',
  ...
}, inplace=True)
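
To see the explicit form end to end, here is a small sketch that rebuilds the two-row frame above and renames two of its columns. Which columns to rename is my choice for the example; the non-inplace form is used so the original frame stays available for comparison.

```python
import pandas as pd

# The example frame from this section
df = pd.DataFrame({
    'Name': ['John', 'Thi'],
    'Ages': [10, 20],
    'Marks': [8, 9],
    'Place': ['Ben Tre', 'Paris'],
})

# Only the columns listed in the mapping are renamed; the rest are left as-is
renamed = df.rename(columns={'Name': 'Surname', 'Ages': 'Years'})

print(list(renamed.columns))  # ['Surname', 'Years', 'Marks', 'Place']
```

Assigning to df.columns, by contrast, requires a name for every column, in order.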

Make index

Check if a column has unique values (so that it can be an index)

df['col'].is_unique # True if yes

Transform an index into a normal column,

df.reset_index(inplace=True)

Make a column the index,[ref]

df.set_index('column') # returns a new DataFrame; assign it or pass inplace=True
df.set_index(['col1', 'col2']) # several columns give a MultiIndex
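
The index operations in this section can be combined into a round trip. This is only a sketch on a hypothetical frame (the 'id' and 'value' columns are made up).

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'id': [10, 20, 30], 'value': ['a', 'b', 'c']})

# Check uniqueness before promoting the column to an index
print(df['id'].is_unique)  # True

indexed = df.set_index('id')      # new DataFrame; df itself is unchanged
restored = indexed.reset_index()  # 'id' becomes a normal column again

print(indexed.index.name)      # id
print(list(restored.columns))  # ['id', 'value']
```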

Deal with NaN

Drop if NaN

# Drop any rows which have any nans
df.dropna()

# Drop columns that have any nans
df.dropna(axis=1)

# Keep only columns which have at least 90% non-NaNs (drop the others)
df.dropna(thresh=int(df.shape[0] * .9), axis=1)
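
The thresh argument is the minimum number of non-NaN values a column needs in order to be kept. A minimal sketch on a hypothetical 10-row frame shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' is complete, 'b' has 1 NaN, 'c' has 5 NaNs
df = pd.DataFrame({
    'a': range(10),
    'b': [np.nan] + list(range(9)),
    'c': [np.nan] * 5 + list(range(5)),
})

min_non_nan = int(df.shape[0] * 0.9)  # 9 non-NaN values required
result = df.dropna(thresh=min_non_nan, axis=1)

print(list(result.columns))  # ['a', 'b']
```

Column 'b' has exactly 9 non-NaN values, so it meets the threshold and stays. Column 'c' has only 5 and is dropped.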

Fill NaN with others

Check other methods of fillna here.

# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')

# Fill NaN with 99
df['col'] = df['col'].fillna(99)

# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())
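
For a quick check of what these calls produce, here is a small sketch on a hypothetical Series s. Note that mean() skips NaN by default, so the fill value is the mean of the values that are present.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
s = pd.Series([1.0, np.nan, 3.0])

filled_const = s.fillna(99)
filled_mean = s.fillna(s.mean())  # mean of the non-NaN values: (1 + 3) / 2

print(filled_const.tolist())  # [1.0, 99.0, 3.0]
print(filled_mean.tolist())   # [1.0, 2.0, 3.0]
```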

Assign values by condition

np.where(if_this_condition_is_true, do_this, else_this)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar') # example
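
A runnable version of the same pattern, on a hypothetical one-column frame, makes the element-wise behaviour visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'col': [5, 10, 15]})

# 'foo' where the condition holds, 'bar' everywhere else
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')

print(df['new_column'].tolist())  # ['bar', 'bar', 'foo']
```

The comparison is strict, so the row with value 10 gets 'bar'.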
