You delegate tasks to a machine rather than doing them yourself so that it can do them automatically. You give it precise instructions, and that is your code.
I love Python, particularly pandas’ rich library for data wrangling and mathematical functions.
But today I encountered a limitation of pandas. And its predecessor NumPy came through.
I wanted to calculate the average, median, and count of non-null values in a dataset. My dataset is messy and I need to calculate over different columns that aren’t adjacent to each other. For example, I want the average, median and count of all the values in columns 16 and 20. Not 16 through 20. Not the average of column 16, and the average of 20. One single average for the values in both columns 16 and 20.
This is where the “axis” parameter comes in. It defaults to axis = 0, i.e. df.mean(axis = 0), which calculates down the rows and returns one value per column. For DataFrame.mean(), I used axis = None to get a single mean over two non-adjacent columns. As far as I can tell this needs pandas 2.0 or later. (I double-checked the result in Excel!)
import pandas as pd
import numpy as np
# df is the dataframe of my full dataset. Here we'll work with a subset of two columns, rows 4 through 54.
hello = df.iloc[4:55, [16, 20]]
# Get mean of the two columns using pandas.mean
calc1 = hello.mean(axis=None)
But pandas doesn’t have an axis = None option for the functions I reached for to get the median and the counts, quantile() and count(). They only take axis = 0 (down the rows, one result per column) or axis = 1 (across the columns, one result per row), which is inconsistent with .mean().
So this doesn’t work:
calc2 = hello.quantile(0.5, axis=None)
>>> ValueError: No axis named None for object type DataFrame
But hello NumPy! You do have axis=None available for these functions! So let’s import numpy. More than half of my dataset is NaNs (null values), which I didn’t want to include in the median calculation. So I used np.nanquantile(), which ignores them. The np.quantile() function does count them and was returning a median value of ‘NaN’, which wasn’t what I wanted.
For the count, we get a little creative: first count the non-null values in each column of the hello df (using DataFrame.count()), then sum those counts to get one count across both columns.
Thank you NumPy for saving the day! Although pandas is built on NumPy, I’m glad the libraries’ distinct features are all still accessible and that NumPy hasn’t been deprecated.
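Here is a minimal sketch of the NumPy side, using a tiny made-up DataFrame in place of my real one (the column names col16 and col20 and the values are invented for illustration). It relies on NumPy's default axis=None, which pools both columns into one set of values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two non-adjacent columns of the real dataset
hello = pd.DataFrame({
    "col16": [1.0, np.nan, 3.0, np.nan],
    "col20": [np.nan, 2.0, np.nan, 10.0],
})

# Convert to a plain 2-D float array so NumPy treats both columns as one pool
values = hello.to_numpy(dtype=float)

pooled_mean = np.nanmean(values)             # mean of 1, 3, 2, 10, ignoring NaN
pooled_median = np.nanquantile(values, 0.5)  # median of the same four values
pooled_count = int(hello.count().sum())      # non-null count per column, then summed

print(pooled_mean, pooled_median, pooled_count)  # 4.0 2.5 4

# np.quantile does not skip NaN, so the result is NaN as soon as one is present
print(np.quantile(values, 0.5))                  # nan
```

np.nanmedian(values) should give the same median as np.nanquantile(values, 0.5), if you prefer the more direct name.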
It’s been one year since I started studying programming using Codecademy.com. I set out to study 4 to 5 times a week, every week, 1 lesson page at a time. My longest streak on record is 12 weeks in a row. I’ve completed 86% of the Learn Python 3 course (a hefty course that covers programming fundamentals) and finished the Command Line course too (Linux terminal is not so scary anymore!)
I just finished an online project called ‘Fending the Hacker’ where I read and write to CSV and JSON files programmatically with Python. I didn’t realize this till the end, but this project built on prior lesson topics:
Functions
Loops
Lists
Dictionaries
Modules (JSON, CSV)
Files – Programmatic File Reading/Writing
Looking back on what I’m comfortable with now and how much I’ve learned in one year amazes me. I don’t look back much nor often. But I recall a sinking, confused feeling about not understanding loops, when to use a function, and the purpose of lists and dictionaries. Now I can’t imagine doing any Python analysis or mini project without loops and lists at a minimum. I’m comfortable using them, something distinctly different from before.
This shows me the power of bite-sized but consistent practice. Most lesson topics are divided into about a dozen pages, and I do the reading and practice for 1-2 lesson pages each sitting. That’s 10 minutes or less of light and easy studying. I don’t let long stretches of days pass between each sitting. Recently I’ve shifted my Python study time to earlier in the day to ensure I get it done. I feel the power of compounding knowledge and love it. Is this what the power of compounding interest is also like? The journey along the way has actually been fun.
Onward to the next and final lesson of Python 3, Classes!
The previous post on falsiness (which should be “falseness”, but I’ll continue with the ‘i’ since “Truthiness” is the conceptual term instead of “Truthfulness”) has me thinking and steam’s coming out of the engine. I wanted to see for myself these different flavors of False in action, as well as variants of Truthiness.
See the results for yourself running this code in a Python IDE. Experimenting with this made me discover {} is part of the falsiness group, too.
# Values for test: False, 0, None, [], {}
test = []
if test:
    print("True. Condition passed. If statement succeeded.")
else:
    print("False. Condition did not pass. If statement failed.")
>>> False. Condition did not pass. If statement failed.
test = [1]
if test:
    print("True. Condition passed. If statement succeeded.")
else:
    print("False. Condition did not pass. If statement failed.")
>>> True. Condition passed. If statement succeeded.
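Instead of editing the test value by hand each time, a short loop can run the whole experiment in one go. This is just a sketch: the list of candidates is my own pick (the five values from the comment above plus a few non-empty ones), and bool() is the built-in that reports how an if statement would treat each value.

```python
# Values to check: the "falsy" suspects first, then some non-empty counterparts
candidates = [False, 0, None, [], {}, "", [1], {"a": 1}, "hi", 1]

# Map a printable form of each value to how an if statement would treat it
results = {repr(value): bool(value) for value in candidates}

for shown, truthy in results.items():
    print(shown, "->", truthy)
```

Everything empty or zero-like should come out False, and anything with content should come out True.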
Here’s a lesson on “falseness”- that is, whether values are classified as True or False in Python.
I’m working on a Codecademy project (Abruptly Goblins) where there’s a gamer named Kimberly who is available to play on Monday, Tuesday and Friday. There will be other gamers added in later.
Let’s make a dictionary with name and availability as keys. We’ll also make an empty list called gamers to store valid gamer details.
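A minimal sketch of that setup, assuming the dictionary only needs the two keys described above (the exact values in the project may differ):

```python
# Empty list that will hold every valid gamer dictionary
gamers = []

# Kimberly's details: a "name" key and an "availability" key
kimberly = {
    "name": "Kimberly",
    "availability": ["Monday", "Tuesday", "Friday"],
}

print(list(kimberly.keys()))
```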
The project instructions say to: Create a function called add_gamer that takes two parameters: gamer and gamers_list. The function should check that the argument passed to the gamer parameter has both "name" and a "availability" as keys and if so add gamer to gamers_list.
The number of times ‘gamer’ and variants are being tossed around make these instructions confusing as heck! But I plow through. Here’s what I came up with:
def add_gamer(gamer, gamers_list):  # gamers_list is the parameter. The gamers list (gamers = []) is the argument passed in.
    # .get() returns the value for a key if it exists. If the key is not found, it returns None.
    if gamer.get("name") and gamer.get("availability"):
        gamers_list.append(gamer)
    else:
        print("Failure, Gamer doesn't have name or availability")
Notice the if statement here. It seems incomplete to me: if gamer.get("name") and gamer.get("availability"):
We will be passing in gamer dictionaries as arguments. If one doesn’t contain “name” or “availability” as a key, the .get() method will return None (because it did not find the key, and thus has no corresponding value).
But there is something weird assumed here. If the gamer argument does contain keys of “name” and “availability”, the if statement is True, so proceed with the function (appending the player’s details to the gamer list).
1. Why do the two .get() statements result in a True / pass go, collect $200?
2. If any of the .get() statements results in a None, why is that a False / do not pass go, do not collect $200?
The answer to #1 is still unknown to me, but I did find out #2 from Stack Overflow:
The expression x or y evaluates to x if x is true, or y if x is false.
Note that “true” and “false” in the above sentence are talking about “truthiness”, not the fixed values True and False. Something that is “true” makes an if statement succeed; something that’s “false” makes it fail. “false” values include False, None, 0 and [] (an empty list).
When any of the .get() calls returns None, that falls in the “false” category in Python, so the if block will not run. I tested this out by running:
gamers = []
def add_gamer(gamer, gamers_list):
    if gamer.get("name") and gamer.get("availability"):
        gamers_list.append(gamer)
    else:
        print("Failure, Gamer doesn't have name or availability")
kimberly = {"notname":"Kimberly Chook", "availability": ["Monday", "Tuesday", "Friday"]}
print(kimberly.get("name") and kimberly.get("availability"))
>>> None
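A small experiment that may shed light on question #1 as well. As far as I understand it, "and" hands back one of its operands rather than a literal True or False, and a non-empty string or list counts as “true”. The dictionaries and the has_required_keys helper below are made up for this sketch and are not part of the project:

```python
kimberly = {"name": "Kimberly", "availability": ["Monday", "Tuesday", "Friday"]}
no_name = {"notname": "Kimberly", "availability": ["Monday", "Tuesday", "Friday"]}

# "and" returns the first falsy operand, or the last operand if all are truthy
both_present = kimberly.get("name") and kimberly.get("availability")
name_missing = no_name.get("name") and no_name.get("availability")

print(both_present)   # ['Monday', 'Tuesday', 'Friday'] (a non-empty list, so truthy)
print(name_missing)   # None (falsy, so the if block is skipped)

# Hypothetical alternative: test for the keys themselves instead of their values
def has_required_keys(gamer):
    return "name" in gamer and "availability" in gamer

# An empty name is falsy, so .get() would reject this gamer even though both keys exist
blank = {"name": "", "availability": []}
print(bool(blank.get("name") and blank.get("availability")), has_required_keys(blank))
```

So the .get() version quietly checks two things at once: that the key exists and that its value is not empty.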
So when you ask your partner “Did you clean the bathroom yet?” and get no response for an answer (none, nada, nothing), you can interpret that as: status_bathroom_is_clean = False.
I’ve been gathering data about my hens’ eggs, like how many eggs are laid per day and by whom. One of my baby hens ‘Ramsey’ started laying eggs on March 21st. I weighed the eggs each day and recorded the data. The weight appears to increase gradually over time.
Day | Egg Weight (grams)
0 | 39
1 | 42
2 | 42
3 | 43
4 | 47
5 | 44
6 | 44
7 | 43
8 | 44
9 | 46
10 | 50
11 | 55
I experimented with creating a linear regression (y = mx + b) to find the line of best fit using Python. I plotted the data and could tell this was not linear, so then I constructed a quadratic regression (y = ax^2 + bx + c).
# Set up Quadratic Regression
def calculate_error(a, b, c, point):
    (x_point, y_point) = point
    y = a * x_point**2 + b*x_point + c # Quadratic
    distance = abs(y - y_point)
    return distance
def calculate_all_error(a, b, c, points):
    total_error = 0 # Set initial value before starting loop calculation
    for point in points:
        total_error += calculate_error(a, b, c, point)
    return total_error
I entered the egg weight data as a list (datapoints), and iterated over a range of a, b, and c values to find what combination of a, b, and c would give the smallest error possible (smallest absolute distance between the regression line and actual values). I set initial values of a, b, and c = 0 and smallest_error = infinity and updated (replaced) them each time the error value was smaller than before.
# Ramsey Egg Data
datapoints = [
(0,39),
(1,42),
(2,42),
(3,43),
(4,47),
(5,44),
(6,44),
(7,43),
(8,44),
(9,46),
(10,50),
(11,55)
]
# Candidate values to try: a from 0.080 to 0.099, b from -0.010 to 0.009, c from 40.0 to 43.9
a_list = list(range(80,100))
possible_as = [num * .001 for num in a_list]
b_list = list(range(-10,10))
possible_bs = [num * .001 for num in b_list]
c_list = list(range(400,440))
possible_cs = [num * .1 for num in c_list]
# Start at infinity so the first combination tried always becomes the best so far
smallest_error = float("inf")
best_a = 0
best_b = 0
best_c = 0
for a in possible_as:
    for b in possible_bs:
        for c in possible_cs:
            loop_error_calc = calculate_all_error(a, b, c, datapoints)
            if loop_error_calc < smallest_error:
                best_a = a
                best_b = b
                best_c = c
                smallest_error = loop_error_calc
print(smallest_error, best_a, best_b, best_c)
print("y = ",best_a,"x^2 + ",best_b,"x + ", best_c)
Ultimately I got the following results:
y = 0.084x^2 - 0.01x + 41.7, which gives a total error of 19.828.
This error feels big to me. I would like to get it as close to 0 as possible, or within single digits. One thing I may do is remove the data point of day 4, 47 grams, which was unusually large.
I plotted the data in an Excel graph and added a quadratic regression line as well. The resulting regression line is y = 0.0972x^2 - 0.1281x + 41.525. This is close to my Python quadratic regression, but not the same. I’d like to figure out why these differ when the model is similar. I believe this may have to do with the formula used for the error calculation: I am using Total Absolute Error, whereas the more common standard is Mean Squared Error.
Note how the data points do not follow linear growth, hence quadratic time!
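To test the hunch about the error formula, here is a sketch that fits the same egg data with NumPy's polyfit, which does a least-squares fit (it minimises the sum of squared errors). I am assuming Excel's polynomial trendline is also a least-squares fit. The two helper functions are mine, added only to compare the fits under both error measures:

```python
import numpy as np

days = np.arange(12)
weights = np.array([39, 42, 42, 43, 47, 44, 44, 43, 44, 46, 50, 55])

# Least-squares quadratic fit; coefficients come back highest power first
a, b, c = np.polyfit(days, weights, deg=2)
print(f"least squares: y = {a:.4f}x^2 + {b:.4f}x + {c:.3f}")

def total_absolute_error(a, b, c):
    """Sum of |predicted - actual|, the measure used in my grid search."""
    return float(np.abs(a * days**2 + b * days + c - weights).sum())

def sum_squared_error(a, b, c):
    """Sum of (predicted - actual)^2, the measure least squares minimises."""
    return float(((a * days**2 + b * days + c - weights) ** 2).sum())

# Compare the grid-search result with the least-squares result under both measures
for label, coeffs in [("grid search", (0.084, -0.01, 41.7)), ("least squares", (a, b, c))]:
    print(label, total_absolute_error(*coeffs), sum_squared_error(*coeffs))
```

The least-squares coefficients should land very close to the Excel trendline. One more thing worth noticing: my best b of -0.01 sits right at the lower edge of possible_bs, so the grid search may simply not have been allowed to look far enough.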
Something that boggles me is why range(a, b) in Python includes the a value, but not b. In math, the interval (a, b) implies neither a nor b is included: the parentheses are exclusive. Square brackets, as in [a, b], are inclusive. So why is Python’s range(a, b) part inclusive, part exclusive? It doesn’t follow the math rules I’d expect.
I did some research and came across this snippet:
Python range is inclusive because it starts with the first argument of the range() method, but it does not end with the second argument of the range() method; it ends with the end – 1 index. The reason is zero-based indexing.
Now I have a lead, but still want to understand: How does zero-based indexing affect range inclusion? Here’s an explanation that made things *click* for me.
I think it may help to add some simple ‘real life’ reasoning as to why it works this way, which I have found useful when introducing the subject to young newcomers:
With something like ‘range(1,10)’ confusion can arise from thinking that pair of parameters represents the “start and end”.
It is actually start and “stop”.
Now, if it were the “end” value then, yes, you might expect that number would be included as the final entry in the sequence. But it is not the “end”.
Others mistakenly call that parameter “count” because if you only ever use ‘range(n)’ then it does, of course, iterate ‘n’ times. This logic breaks down when you add the start parameter.
So the key point is to remember its name: “stop“. That means it is the point at which, when reached, iteration will stop immediately. Not after that point.
So, while “start” does indeed represent the first value to be included, on reaching the “stop” value it ‘breaks’ rather than continuing to process ‘that one as well’ before stopping.
One analogy that I have used in explaining this to kids is that, ironically, it is better behaved than kids! It doesn’t stop after it supposed to – it stops immediately without finishing what it was doing. (They get this 😉 )
Another analogy – when you drive a car you don’t pass a stop/yield/’give way’ sign and end up with it sitting somewhere next to, or behind, your car. Technically you still haven’t reached it when you do stop. It is not included in the ‘things you passed on your journey’.
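Here is a quick sketch of the start/stop idea in code. The variable names are mine, and the last part is my own guess at why the convention plays nicely with zero-based indexing:

```python
start, stop = 1, 10
values = list(range(start, stop))
print(values)       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(values))  # 9, which is stop - start

# Back-to-back ranges share a boundary without overlapping or leaving a gap
first_half = list(range(0, 5))
second_half = list(range(5, 10))
print(first_half + second_half == list(range(0, 10)))  # True

# With zero-based indexing, range(len(items)) yields exactly the valid indexes
letters = ["a", "b", "c"]
indexes = list(range(len(letters)))
print(indexes)      # [0, 1, 2]
```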
I’ve been coding! Like the slow erosion of a river forming a canyon, I am steadily pecking away at Python to become a better programmer. Here is a lil project I did today. Why chickens? I’ll explain in a future post. Stay tuned! Bok bok bok!
# Magic 8 Ball - Ask a question, reveal an answer.
import random
name = "Heeju"
question = "Should I get hens this weekend?"
answer = ""
answer_2 = ""
# First question random answer generation
random_number = random.randint(1,10)
if random_number == 1:
    answer = "Yes - definitely."
elif random_number == 2:
    answer = "It is decidedly so."
elif random_number == 3:
    answer = "Without a doubt."
elif random_number == 4:
    answer = "Reply hazy, try again."
elif random_number == 5:
    answer = "Ask again later."
elif random_number == 6:
    answer = "Better not to tell you now."
elif random_number == 7:
    answer = "My sources say no."
elif random_number == 8:
    answer = "Outlook not so good."
elif random_number == 9:
    answer = "Very doubtful."
elif random_number == 10:
    answer = "Don't rush it. Give it some time."
else:
    answer = "Error (number outside of range)"
# Second question random answer generation
random_number_2 = random.randint(1,9)
if random_number_2 == 1:
    answer_2 = "Yes - definitely."
elif random_number_2 == 2:
    answer_2 = "It is decidedly so."
elif random_number_2 == 3:
    answer_2 = "Without a doubt."
elif random_number_2 == 4:
    answer_2 = "Reply hazy, try again."
elif random_number_2 == 5:
    answer_2 = "Ask again later."
elif random_number_2 == 6:
    answer_2 = "Better not to tell you now."
elif random_number_2 == 7:
    answer_2 = "My sources say no."
elif random_number_2 == 8:
    answer_2 = "Outlook not so good."
elif random_number_2 == 9:
    answer_2 = "Very doubtful."
else:
    answer_2 = "Error (number outside of range)"
if question == "":
    print("You didn't ask a question. Please ask one!")
elif name == "":
    print(question)
else:
    print(name,"asks:", question)
print("Magic 8-ball's answer:", answer)
print("Is this truly random?", answer_2)
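For comparison, here is a shorter sketch of the same idea that keeps the answers in a list and lets random.choice pick one, instead of pairing randint with a long if/elif chain. The magic_8_ball function is a hypothetical helper of my own, and unlike the version above it skips the answer when no question is asked:

```python
import random

# Same ten answers as above, stored in a list instead of an if/elif chain
ANSWERS = [
    "Yes - definitely.",
    "It is decidedly so.",
    "Without a doubt.",
    "Reply hazy, try again.",
    "Ask again later.",
    "Better not to tell you now.",
    "My sources say no.",
    "Outlook not so good.",
    "Very doubtful.",
    "Don't rush it. Give it some time.",
]

def magic_8_ball(question, name=""):
    """Return the lines the Magic 8-Ball would print (hypothetical helper)."""
    if question == "":
        return ["You didn't ask a question. Please ask one!"]
    # Only mention the name when one was given
    asked = f"{name} asks: {question}" if name else question
    return [asked, "Magic 8-ball's answer: " + random.choice(ANSWERS)]

for line in magic_8_ball("Should I get hens this weekend?", "Heeju"):
    print(line)
```

Adding or removing an answer then only means editing the list, with no numbers to keep in sync.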
As part of my Python programming practice, I came up with a module and function that randomly generates 5 USDA Plant Hardiness Zones / Garden Zones.
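from random import randint, choice

def randomgardenzone():
    test_list = ['a', 'b']
    # Print 5 random zones, e.g. "7 b"
    for i in range(1, 5+1):
        x = randint(1, 13)        # zone number, 1 through 13 inclusive
        res = choice(test_list)   # zone letter, 'a' or 'b'
        print(x, res)
randomgardenzone()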
For context: the US Department of Agriculture has 13 designated “zones” for the country, based on the average annual minimum winter temperature. Each zone number is 10 degrees F apart. There is a further subdivision of zones with a letter ‘a’ or ‘b’, where ‘b’ is 5 degrees F warmer than ‘a’.
These zones are useful for gardeners because we can confidently plant specimens that are hardy (cold/frost/freezing temperature tolerant) to their zone. This is why mangoes and bananas don’t grow in Minnesota, while they may thrive in a Floridian garden. The Minnesotan would need a toasty heated greenhouse in order to cultivate mango trees or bananas through their winter. (Did you know the banana plant is actually an herb, not a tree?)
I’m curious how some plants are able to be cold-hardy and resist freezing. When it gets below 32°F, the water in the plant cells wants to freeze and expand. This would rupture the cell walls and make the plant lose its structure, becoming frost damaged, mushy, and sadly, not salvageable. I heard that cold-hardy plants contain a natural antifreeze that prevents this. I’m curious how antifreeze works, and if it’s similar to what’s used in automobiles. Dianthus is an example of a common plant that is cold-hardy (you can grow them in Alaska), and in fact they need a cold season in order to thrive.
Python Things I learned:
Use “for i in range(1, 5)”, not just “for i in (1,5)”. A simple doh!-type mistake!
range(a, b) works like [a, b) – it is exclusive of the b value. However, randint(a, b) is inclusive of the b value.
The “choice( )” function from the random module lets you pick a random item from a list. This was useful to pick the zone letters ‘a’ or ‘b’, since randint( ) is only used to pick a random integer.
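A short sketch that puts the three lessons side by side. The random_garden_zones helper is a hypothetical variant of my function that returns the zones as strings instead of printing them:

```python
from random import choice, randint

# Looping over a tuple visits only the two values written in it
looped_over_tuple = [i for i in (1, 5)]
# Looping over a range visits every integer from 1 up to, but not including, 5
looped_over_range = [i for i in range(1, 5)]
print(looped_over_tuple)  # [1, 5]
print(looped_over_range)  # [1, 2, 3, 4]

def random_garden_zones(count=5):
    """Return `count` random zone labels such as '7b' (hypothetical helper)."""
    # randint(1, 13) can return 13 itself; choice picks the letter
    return [f"{randint(1, 13)}{choice(['a', 'b'])}" for _ in range(count)]

zones = random_garden_zones()
print(zones)
```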