Get duplicated index values - useful when debugging stuff like “ValueError: cannot reindex from a duplicate axis”

df[df.index.duplicated()]

– via StackOverflow

Drop duplicated index values

df = df[~df.index.duplicated(keep='first')]

– via StackOverflow
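To see what the two snippets above actually select, here is a minimal sketch on a made-up frame (the data and variable names are just for illustration). Note that `duplicated()` defaults to `keep='first'`, so it only flags the later repeats; pass `keep=False` if you want every row that shares a label.

```python
import pandas as pd

# Toy frame (hypothetical data) with the label "a" appearing twice in the index
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])

# Default keep='first': only the second and later occurrences are flagged
repeats = df[df.index.duplicated()]

# keep=False: every row whose label occurs more than once is flagged
all_dupes = df[df.index.duplicated(keep=False)]

# Keep the first row for each label and drop the rest
deduped = df[~df.index.duplicated(keep="first")]
```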

Drop/filter out rows based on a list of values

df = df[~df['col'].isin(['a', 'b'])]

– via StackOverflow
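A quick sketch of the filter on a toy frame (the column contents and the `unwanted` name are made up for this example):

```python
import pandas as pd

# Hypothetical data: drop every row whose 'col' value is in the unwanted list
df = pd.DataFrame({"col": ["a", "b", "c", "d"], "n": [1, 2, 3, 4]})
unwanted = ["a", "b"]

# isin() builds a boolean mask; ~ inverts it so only non-matching rows remain
kept = df[~df["col"].isin(unwanted)]
```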

Display a correlation matrix using pandas

import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))

corr = df.corr()

# change the color map
corr.style.background_gradient(cmap='coolwarm')

# ..and only display two decimals
# (Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas 1.3+)
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

# compute the colors based on the entire matrix and not per column or per row
corr.style.background_gradient(cmap='coolwarm', axis=None)

– via StackOverflow

Calculate the difference in months between two dates

df['car-age-in-months'] = (
    (df['date-of-visit'].dt.year - df['date-bought-car'].dt.year) * 12
    + (df['date-of-visit'].dt.month - df['date-bought-car'].dt.month)
)

It’s messy, I know. If you find a cleaner way to do this, ping me.
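One possible tidy-up, as a sketch rather than a definitive answer: pull the year/month arithmetic into a small helper (`month_index` is a name invented here, and the dates are made up). It computes exactly the same thing, so it still counts calendar-month boundaries and ignores the day of the month.

```python
import pandas as pd

# Hypothetical data using the same column names as the snippet above
df = pd.DataFrame({
    "date-bought-car": pd.to_datetime(["2019-11-15", "2020-01-31"]),
    "date-of-visit": pd.to_datetime(["2020-02-01", "2020-03-01"]),
})

def month_index(dates):
    """Number each calendar month consecutively (year * 12 + month)."""
    return dates.dt.year * 12 + dates.dt.month

# Difference in calendar months; the day of the month is ignored,
# so 2020-01-31 -> 2020-03-01 counts as 2 months
df["car-age-in-months"] = (
    month_index(df["date-of-visit"]) - month_index(df["date-bought-car"])
)
```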

Create a datetime Series from year/month numeric columns

We have two main options here:

  1. Use predefined column names - at a minimum you need year, month, and day. You can also add hour, minute, second, etc.

    df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
    
  2. ⭐️ Use a dict and avoid the need to have predefined column names

    df['date'] = pd.to_datetime(dict(year=df['y'], month=df['m'], day=1))

– via StackOverflow and pandas API reference
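Both options side by side, as a minimal sketch on made-up data (the `y` and `m` column names follow the snippet above; everything else is illustrative). Since there is no day column, the day is fixed to 1 in both cases.

```python
import pandas as pd

# Hypothetical frame with year/month stored under non-standard names
df = pd.DataFrame({"y": [2020, 2021], "m": [1, 12]})

# Option 2: map the standard names onto your own columns with a dict
df["date"] = pd.to_datetime(dict(year=df["y"], month=df["m"], day=1))

# Option 1 on the same data: rename to the predefined names first
renamed = df[["y", "m"]].rename(columns={"y": "year", "m": "month"}).assign(day=1)
date_from_renamed = pd.to_datetime(renamed)
```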

Return value counts for NumPy array

np.unique(my_array, return_counts=True)

– via StackOverflow
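`np.unique` hands back two parallel arrays (sorted unique values, and their counts), so it is often handy to zip them together. A small sketch with made-up data:

```python
import numpy as np

# Hypothetical input array
my_array = np.array(["b", "a", "b", "c", "b"])

# values is sorted; counts[i] is how often values[i] occurs
values, counts = np.unique(my_array, return_counts=True)

# Pair them up into a plain dict of Python types
value_counts = dict(zip(values.tolist(), counts.tolist()))
```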

Avoid “TypeError: float() argument must be a string or a number, not ‘Period’” errors when plotting with pandas

You’ve most likely forgotten to register the matplotlib converters.

import pandas as pd
pd.plotting.register_matplotlib_converters()

– via StackOverflow

Flatten hierarchical index (MultiIndex) in columns

df.columns = df.columns.get_level_values(0)

– via StackOverflow
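Keep in mind that `get_level_values(0)` keeps only the top level, which can leave you with duplicate column names. If you need both levels, one alternative (not from the linked answer, just a sketch assuming all level values are strings) is to join them:

```python
import pandas as pd

# Hypothetical data: aggregating with several functions yields MultiIndex columns
df = pd.DataFrame({"k": ["x", "x", "y"], "v": [1, 2, 3]})
agg = df.groupby("k").agg({"v": ["min", "max"]})

# Keeping only level 0 gives the same name twice here
first_level = agg.columns.get_level_values(0)

# Alternative: join the levels into a single name per column
agg.columns = ["_".join(col) for col in agg.columns]
```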

Drop NaNs from specific columns

df = df.dropna(subset=['col1', 'col2'])

– via StackOverflow
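A small sketch of what `subset` does, on made-up data: only NaNs in the listed columns cause a row to be dropped, while NaNs elsewhere are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with NaNs scattered across three columns
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0],
    "col2": [1.0, 2.0, np.nan],
    "col3": [np.nan, 2.0, 3.0],
})

# Rows 1 and 2 go (NaN in col1 / col2); row 0 stays despite its NaN in col3
cleaned = df.dropna(subset=["col1", "col2"])
```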

Trying to jsonify a numpy array and getting “TypeError: Object of type ndarray is not JSON serializable”

Use .tolist().

# json.dumps returns a string; json.dump would also need a file object
json.dumps(myarray.tolist())

– via StackOverflow
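A minimal round trip, with a made-up array, showing that `.tolist()` turns the array into nested Python lists that the `json` module can handle:

```python
import json

import numpy as np

# Hypothetical 2x2 array
myarray = np.array([[1, 2], [3, 4]])

# tolist() converts to nested lists of plain Python ints
encoded = json.dumps(myarray.tolist())

# Decoding gives the nested lists back; wrap in np.array() if you need an array
decoded = np.array(json.loads(encoded))
```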

Setting a value on a slice

# BAD, don't use this
df[df['name'] == 'John'].loc[:,'id'] = 1

# GOOD, go for it
df.loc[df['name'] == 'John','id'] = 1

– via StackOverflow
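The reason the first form is bad: `df[mask]` returns a copy, so the chained `.loc` assignment writes to that temporary copy and `df` itself stays unchanged (pandas usually warns about this). A sketch of the good form on made-up data:

```python
import pandas as pd

# Hypothetical frame
df = pd.DataFrame({"name": ["John", "Jane", "John"], "id": [0, 0, 0]})

# A single .loc call selects rows and column together and assigns in place
df.loc[df["name"] == "John", "id"] = 1
```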

Set all dtypes for a DataFrame

# astype returns a new DataFrame, so assign the result
dummy_df = dummy_df.astype(data_df.dtypes)

Just pray that you won’t have NaN values in a column that gets cast to a NumPy integer dtype such as int64: that cast fails with a ValueError.

– via StackOverflow
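If NaNs in integer columns are a real possibility, one way around it (an assumption about your setup, not part of the linked answer) is to have the reference frame use pandas' nullable `Int64` dtype, which stores missing values as `<NA>`. A sketch with made-up frames:

```python
import numpy as np
import pandas as pd

# Hypothetical reference frame whose dtypes we want to copy;
# "Int64" (capital I) is the nullable integer dtype
data_df = pd.DataFrame({
    "id": pd.array([1, 2], dtype="Int64"),
    "score": [0.5, 1.5],
})

# Hypothetical frame to convert: 'id' holds floats with a NaN
dummy_df = pd.DataFrame({"id": [3.0, np.nan], "score": [2, 3]})

# Casting the NaN to plain int64 fails...
try:
    dummy_df.astype({"id": "int64"})
    int_cast_failed = False
except ValueError:
    int_cast_failed = True

# ...while the nullable dtype from data_df keeps it as <NA>
converted = dummy_df.astype(data_df.dtypes)
```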