Connect to Teradata using Pandas
pip install --upgrade 'sqlalchemy<2.0'
pip install teradatasqlalchemy
pip install python-dotenv
Note: I’m using sqlalchemy<2.0
in order to avoid getting the dreaded AttributeError: 'OptionEngine' object has no attribute 'execute'
error, caused by pandas 1.x not being updated to support sqlalchemy 2.x. When pandas 2.x is released, this should no longer be an issue.
from sqlalchemy import create_engine
from dotenv import dotenv_values
config = dotenv_values(".env")
engine = create_engine(f'teradatasql://{config["teradata-user"]}:{config["teradata-password"]}@{config["teradata-host"]}')
pd.read_sql("SELECT * FROM my_teradata_table", engine)
– sqlalchemy tip via Giorgos Myrianthous
Slice by second (or third) level of a MultiIndex
df.loc[pd.IndexSlice[:, 't'], :]
– via this great StackOverflow answer
Get duplicated index values - useful when debugging stuff like “ValueError: cannot reindex from a duplicate axis”
df[df.index.duplicated()]
– via StackOverflow
Drop duplicated index values
df = df[~df.index.duplicated(keep='first')]
– via StackOverflow
Drop/filter out rows based on a list of values
df = df[~df['col'].isin(['a', 'b'])]
– via StackOverflow
Display a correlation matrix using pandas
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(10, 10))
corr = df.corr()
# change the color map
corr.style.background_gradient(cmap='coolwarm')
# ..and only display two decimals
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
# compute the colors based on the entire matrix and not per column or per row
corr.style.background_gradient(cmap='coolwarm', axis=None)
– via StackOverflow
Return value counts for NumPy array
np.unique(my_array, return_counts=True)
– via StackOverflow
Avoid “TypeError: float() argument must be a string or a number, not ‘Period’” errors when plotting with pandas
You’ve most likely forgotten to register the matplotlib converters.
import pandas as pd
pd.plotting.register_matplotlib_converters()
– via StackOverflow
Flatten hierarchical index (MultiIndex) in columns
df.columns = df.columns.get_level_values(0)
– via StackOverflow
Drop NaNs from specific columns
df = df.dropna(subset=['col1', 'col2'])
– via StackOverflow
Trying to jsonify a numpy array and getting “TypeError: Object of type ndarray is not JSON serializable”
Use .tolist()
.
json.dump(myarray.tolist())
– via StackOverflow
Setting a value on a slice
# BAD, don't use this
df[df['name'] == 'John'].loc[:,'id'] = 1
# GOOD, go for it
df.loc[df['name'] == 'John','id'] = 1
– via StackOverflow
Set all dtypes for a DataFrame
dummy_df.astype(data_df.dtypes)
Just pray that you won’t have NaN values in an integer column.
– via StackOverflow
Replace values in a column for rows identified by a .loc condition
df.loc[df['Job?'] == 'Other', 'Job?'] = df['Job?'] + ': ' + df['Other']
– via StackOverflow
Type annotations in for loops
# Option A
for key, frame in my_dict.items():
key: str
frame: pd.DataFrame
pass
# Option B
for key, frame in my_dict.items(): # type: (str, pd.DataFrame)
pass
I’m leaning towards Option A to be honest, I don’t really like counting on code comments for this.
Truncate floats to a fixed number of decimals, without having to mess with formatting and the like
Use numpy.trunc
# 3 decimals, for example
import numpy as np
# df remains a pd.DataFrame
df = np.trunc(1000 * df) / 1000
– via StackOverflow
Date/Time
Calculate the difference in months between two dates
df['car-age-in-months'] = (df['date-of-visit'].dt.year - df['date-bought-car'].dt.year) * 12 +
(df['date-of-visit'].dt.month - df['date-bought-car'].dt.month)
It’s messy I know, if you find a cleaner way to do this ping me.
Create a datetime Series from year/month numeric columns
We have two main options here:
-
Use predefined column names - at a minimum you need
year
,month
, andday
. You can also addhour
,minute
,second
, etc.df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
-
⭐️ Use a
dict
and avoid the need to have predefined column names
df['date'] = pd.to_datetime(dict(year=df['y'], month=df['m'], day=1))
– via StackOverflow and pandas API reference
Easily convert a datetime to its month representation
Keep in mind that I haven’t seen datetime64[M]
mentioned in the docs. 🤷🏻♂️
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
– via StackOverflow
Detecting missing entries in a time series
df = #...
df_reference = pd.DataFrame(index=pd.date_range(start, end, freq='1H'))
missing_dates = df_reference.index[~df_reference.index.isin(df.index)]
print(missing_dates)
– via StackOverflow
Convert a DateTimeIndex to isoformat-like strings
df.index = df.index.strftime('%Y-%m-%dT%H:%M:%S.%f%z')
Adding months, years to a datetime object
datetime.timedelta
won’t cut it, use either dateutil.relativedelta.relativedelta
, or pandas.tseries.offsets.DateOffset
.
from datetime import datetime
from dateutil.relativedelta import relativedelta
from pandas.tseries.offsets import DateOffset
today = datetime.today()
month_from_now = today + relativedelta(months=1)
year_from_now = today + relativedelta(years=1)
# or...
month_from_now = today + DateOffset(months=1)
year_from_now = today + DateOffset(years=1)
– via StackOverflow and Pandas Docs
Group DateTimeIndex by year/month/etc, and optionally get the values
years = df.groupby(df.index.year).first().index.values
– via StackOverflow
Replace year/month/etc in a DateTimeIndex with a new value
df.index = df.index.map(lambda t: t.replace(year=2022))
– via StackOverflow
Excel via XlsxWriter
Write two frames in different sheets of the same Excel document, using XlsxWriter
import pandas as pd
writer = pd.ExcelWriter('./report.xlsx', engine='xlsxwriter')
df_1.to_excel(writer, sheet_name='Data1')
df_2.to_excel(writer, sheet_name='Data2')
writer.save()
– via XlsxWriter Docs
Set column width when using XlsxWriter
with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
sheet_name = 'whats_in_a_name'
df.to_excel(writer, sheet_name=sheet_name)
worksheet = writer.sheets[sheet_name]
worksheet.set_column(0, 0, 17)
worksheet.set_column(1, 1, 13)
– via StackOverflow
Prevent XlsxWriter from automatically converting strings that start with the equals sign (=) to formulas
Which in turn helps with generating corrupt Excel files 🙂.
pd.ExcelWriter('./report.xlsx', engine='xlsxwriter', engine_kwargs={'options':{'strings_to_formulas': False})
– via XlsxWriter GitHub
Other tips for not getting corrupt Excel files when using XlsxWriter
Set a temp dir
xlsxwriter.Workbook('hello_world3.xlsx', {'tmpdir': 'tmp'})
Use the in_memory option
xlsxwriter.Workbook('hello_world.xlsx', {'in_memory': True})
– via XlsxWriter GitHub
Apply an async function to a dataframe
import asyncio
async def my_async_func():
pass
df['my_new_col'] = await asyncio.gather(*(my_async_func(row) for index, row in df.iterrows()))
– via StackOverflow