NumPy vs pandas: .nanquantile(), axis = None

I love Python, particularly pandas’ rich library for data wrangling and mathematical functions.

But today I encountered a limitation of pandas. And it’s predecessor NumPy came through.

I wanted to calculate the average, median, and count of non-null values in a dataset. My dataset is messy and I need to calculate over different columns that aren’t adjacent to each other. For example, I want the average, median and count of all the values in columns 16 and 20. Not 16 through 20. Not the average of column 16, and the average of 20. One single average for the values in both columns 16 and 20.

This is where the “axis” parameter comes in. It usually defaults to axis = 1, ie df.mean(axis = 1), to indicate we are performing the calculation over a single column. For pd.mean(), I used axis = None to get a single mean over two non-adjacent columns. (double-checked it in Excel!)

import pandas as pd
import numpy as np

# df is the dataframe of my full dataset. Here we'll work with a subset of two columns, rows 4 through 54.
hello = df.iloc[4:55, [16, 20]]

# Get mean of the two columns using pandas.mean
calc1 = hello.mean(axis=None)

But pandas doesn’t have an axis = None option for it’s functions to get the median or counts. It only has axis = 0 (over the row) or axis = 1 (over the column) as options, which is inconsistent with the .mean() option.

So this doesn’t work:

calc2 = hello.quantile(0.5, axis=None)
>>> ValueError: No axis named None for object type DataFrame

But hello NumPy! You do have axis=None available for these functions! So let’s import numpy.
My dataset has more than half of NaNs (null values) which I didn’t want to include for the median calculation. So I used np.nanquantile() in order not to count them. The np.quantile() function does count them and was returning a median value of ‘NaN’, which wasn’t what I wanted.

For the count function, we are getting a little creative by first counting all of the non-null values in the hello df (using pd.count()), then summing them up so that we can count across all multiple columns.

calc2 = np.nanquantile(hello, 0.5, axis=None) #numpy quantile allows axis of None
calc3 = np.sum(hello.count())

Thank you NumPy for saving the day! Although pandas is built on NumPy, I’m glad the libraries’ distinct features are all still accessible and that NumPy hasn’t deprecated.

I love Python! And now I love NumPy!

Share this:

Related

Leave a comment Cancel reply