Connect to Teradata using Pandas

pip install --upgrade 'sqlalchemy<2.0'
pip install teradatasqlalchemy
pip install python-dotenv

Note: I’m using sqlalchemy<2.0 in order to avoid getting the dreaded AttributeError: 'OptionEngine' object has no attribute 'execute' error, caused by pandas 1.x not being updated to support sqlalchemy 2.x. When pandas 2.x is released, this should no longer be an issue.

from sqlalchemy import create_engine
from dotenv import dotenv_values

config = dotenv_values(".env")

engine = create_engine(f'teradatasql://{config["teradata-user"]}:{config["teradata-password"]}@{config["teradata-host"]}')
pd.read_sql("SELECT * FROM my_teradata_table", engine)

– sqlalchemy tip via Giorgos Myrianthous

Slice by second (or third) level of a MultiIndex

df.loc[pd.IndexSlice[:, 't'], :]

– via this great StackOverflow answer

Get duplicated index values - useful when debugging stuff like “ValueError: cannot reindex from a duplicate axis”

df[df.index.duplicated()]

– via StackOverflow

Drop duplicated index values

df = df[~df.index.duplicated(keep='first')]

– via StackOverflow

Drop/filter out rows based on a list of values

df = df[~df['col'].isin(['a', 'b'])]

– via StackOverflow

Display a correlation matrix using pandas

rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))

corr = df.corr()

# change the color map
corr.style.background_gradient(cmap='coolwarm')

# ..and only display two decimals
corr.style.background_gradient(cmap='coolwarm').set_precision(2)

# compute the colors based on the entire matrix and not per column or per row
corr.style.background_gradient(cmap='coolwarm', axis=None)

– via StackOverflow

Return value counts for NumPy array

np.unique(my_array, return_counts=True)

– via StackOverflow

Avoid “TypeError: float() argument must be a string or a number, not ‘Period’” errors when plotting with pandas

You’ve most likely forgotten to register the matplotlib converters.

import pandas as pd
pd.plotting.register_matplotlib_converters()

– via StackOverflow

Flatten hierarchical index (MultiIndex) in columns

df.columns = df.columns.get_level_values(0)

– via StackOverflow

Drop NaNs from specific columns

df = df.dropna(subset=['col1', 'col2'])

– via StackOverflow

Trying to jsonify a numpy array and getting “TypeError: Object of type ndarray is not JSON serializable”

Use .tolist().

json.dump(myarray.tolist())

– via StackOverflow

Setting a value on a slice

# BAD, don't use this
df[df['name'] == 'John'].loc[:,'id'] = 1

# GOOD, go for it
df.loc[df['name'] == 'John','id'] = 1

– via StackOverflow

Set all dtypes for a DataFrame

dummy_df.astype(data_df.dtypes)

Just pray that you won’t have NaN values in an integer column.

– via StackOverflow

Replace values in a column for rows identified by a .loc condition

df.loc[df['Job?'] == 'Other', 'Job?'] = df['Job?'] + ': ' + df['Other']

– via StackOverflow

Type annotations in for loops

# Option A
for key, frame in my_dict.items():
   key: str
   frame: pd.DataFrame

   pass

# Option B
for key, frame in my_dict.items():   # type: (str, pd.DataFrame)
   pass

I’m leaning towards Option A to be honest, I don’t really like counting on code comments for this.

– via Stack Overflow

Truncate floats to a fixed number of decimals, without having to mess with formatting and the like

Use numpy.trunc

# 3 decimals, for example
import numpy as np

# df remains a pd.DataFrame
df = np.trunc(1000 * df) / 1000

– via StackOverflow

Date/Time

Calculate the difference in months between two dates

df['car-age-in-months'] = (df['date-of-visit'].dt.year - df['date-bought-car'].dt.year) * 12 + 
    (df['date-of-visit'].dt.month - df['date-bought-car'].dt.month)

It’s messy I know, if you find a cleaner way to do this ping me.

Create a datetime Series from year/month numeric columns

We have two main options here:

  1. Use predefined column names - at a minimum you need year, month, and day. You can also add hour, minute, second, etc.

    df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
    
  2. ⭐️ Use a dict and avoid the need to have predefined column names

df['date'] = pd.to_datetime(dict(year=df['y'], month=df['m'], day=1))

– via StackOverflow and pandas API reference

Easily convert a datetime to its month representation

Keep in mind that I haven’t seen datetime64[M] mentioned in the docs. 🤷🏻‍♂️

df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

– via StackOverflow

Detecting missing entries in a time series

df = #...
df_reference = pd.DataFrame(index=pd.date_range(start, end, freq='1H'))
missing_dates = df_reference.index[~df_reference.index.isin(df.index)]

print(missing_dates)

– via StackOverflow

Convert a DateTimeIndex to isoformat-like strings

df.index = df.index.strftime('%Y-%m-%dT%H:%M:%S.%f%z')

Adding months, years to a datetime object

datetime.timedelta won’t cut it, use either dateutil.relativedelta.relativedelta, or pandas.tseries.offsets.DateOffset.

from datetime import datetime

from dateutil.relativedelta import relativedelta
from pandas.tseries.offsets import DateOffset

today = datetime.today()

month_from_now = today + relativedelta(months=1)
year_from_now = today + relativedelta(years=1)

# or...

month_from_now = today + DateOffset(months=1)
year_from_now = today + DateOffset(years=1)

– via StackOverflow and Pandas Docs

Group DateTimeIndex by year/month/etc, and optionally get the values

years = df.groupby(df.index.year).first().index.values

– via StackOverflow

Replace year/month/etc in a DateTimeIndex with a new value

df.index = df.index.map(lambda t: t.replace(year=2022))

– via StackOverflow

Excel via XlsxWriter

Write two frames in different sheets of the same Excel document, using XlsxWriter

import pandas as pd

writer = pd.ExcelWriter('./report.xlsx', engine='xlsxwriter')

df_1.to_excel(writer, sheet_name='Data1')
df_2.to_excel(writer, sheet_name='Data2')

writer.save()

– via XlsxWriter Docs

Set column width when using XlsxWriter

with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
   sheet_name = 'whats_in_a_name'
   df.to_excel(writer, sheet_name=sheet_name)

   worksheet = writer.sheets[sheet_name]
   worksheet.set_column(0, 0, 17)
   worksheet.set_column(1, 1, 13)

– via StackOverflow

Prevent XlsxWriter from automatically converting strings that start with the equals sign (=) to formulas

Which in turn helps with generating corrupt Excel files 🙂.

pd.ExcelWriter('./report.xlsx', engine='xlsxwriter', engine_kwargs={'options':{'strings_to_formulas': False})

– via XlsxWriter GitHub

Other tips for not getting corrupt Excel files when using XlsxWriter

Set a temp dir

xlsxwriter.Workbook('hello_world3.xlsx', {'tmpdir': 'tmp'})

Use the in_memory option

xlsxwriter.Workbook('hello_world.xlsx', {'in_memory': True})

– via XlsxWriter GitHub

Apply an async function to a dataframe

import asyncio

async def my_async_func():
   pass

df['my_new_col'] = await asyncio.gather(*(my_async_func(row) for index, row in df.iterrows()))

– via StackOverflow