Python Data Science Documentation

Data Science & Python Documentation

Comments

We use comments to leave notes about the code for the reader. Python does not run comments; they are only there to help us read the code.

We can make multiline comments with """ and single line comments with #.

"""
A multi-line comment describes your code
to someone who is reading it.
"""

Example:

"""
This program will ask the user for two numbers.
Then it will add the numbers and print the final value.
"""
number_one = int(input("Enter a number: "))
number_two = int(input("Enter a second number: "))
print("Sum: " + str(number_one + number_two))

# Use single line comments to clarify parts of code.

Example:

# This program adds 1 and 2
added = 1 + 2
print(added)

Variables

We use variables to store values that can be used to control commands in our code. We can also alter these values throughout the code.

# Make a variable to store text
name = "Zach"

# Create variables that are numbers
num_one = 3
num_two = 4
sum = num_one + num_two

# We can also assign multiple variables at once
num_one, num_two = 3, 4

# The value of a variable can be changed after it has been 
# created
num_one = num_one + 1

Printing

We can print elements to the screen by using the print command. If we want to print text, we need to surround the text with quotation marks " ".

print("Hello world")
print(2 + 2)
print(10)

Casting as a String

To print integers or floats together with strings, the integer or float must be cast as a string using the str() function. The strings are concatenated with a plus symbol.

print("The mean is " + str(my_series.mean()) + ".")

Mathematical Operators

Use mathematical operators to alter values.

+   Addition
-   Subtraction
*   Multiplication
/   Division
//  Floor Division (Rounds down to a whole number)
%   Modulus (Remainder)
()  Parentheses (For order of operations)

# Examples
z = x + y
w = x * y

# Division
a = 5.0 / 2                     # Returns 2.5
b = 5.0 // 2                    # Returns 2.0
c = 5/2                         # Returns 2.5
d = 5 // 2                      # Returns 2

# Increment (add one)
x += 1

# Decrement (subtract one)
x -= 1

# Absolute value
absolute_value = abs(x)

abs_val = abs(-5)               # Returns 5

# Square root
import math
square_root = math.sqrt(x)

# Raising to a power
power = math.pow(x, y)          # Calculates x^y

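# Rounding
# Floats are stored in binary, so 2.675 is actually stored as
# slightly less than 2.675 and rounds down
rounded_num = round(2.675, 2)   # Returns 2.67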
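
As a small illustration of how floor division and modulus work together, the sketch below converts a number of minutes into whole hours and leftover minutes. The variable names and the value 135 are made up for this example.

```python
# Hypothetical value chosen for illustration
total_minutes = 135

# Floor division gives the number of whole hours
hours = total_minutes // 60

# Modulus gives the minutes left over
minutes = total_minutes % 60

print(str(hours) + " hours and " + str(minutes) + " minutes")
```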

Random Numbers

To be able to use the randint or choice functions, you must use import random at the beginning of your code.

# Random integer between (and including) low and high
import random
random_num = random.randint(low, high)
random_element = random.choice(string)

# Example:
# Returns random number within and including 0 and 10.
random_num = random.randint(0,10)

# Random element in a string
random_element = random.choice('abcdefghij')

Comparison Operators

Use comparison operators to compare elements in order to make decisions in your code. Comparison operators return booleans (True/False).

x == y      # is x equal to y
x != y      # is x not equal to y
x > y       # is x greater than y
x >= y      # is x greater than or equal to y
x < y       # is x less than y
x <= y      # is x less than or equal to y

# Comparison operators in if statements
if x == y:
    print("x and y are equal")

if x > 5:
    print("x is greater than 5.")

Logical Operators

Use logical operators to check multiple conditions at once or one condition out of multiple.

# And Operator
and_expression = x and y

# Or Operator
or_expression = x or y

# You can combine many booleans!
boolean_expression = x and (y or z)
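
The sketch below shows one way these operators might be combined with comparison operators. All variable names and values are hypothetical and only serve the illustration.

```python
# Hypothetical values chosen for illustration
age = 16
has_permit = True
is_weekend = False
is_holiday = True

# and: True only when both conditions are True
can_drive = age >= 16 and has_permit

# or: True when at least one condition is True
day_off = is_weekend or is_holiday

# Parentheses control which part is evaluated first
drive_on_day_off = can_drive and (is_weekend or is_holiday)

print(can_drive)
print(day_off)
print(drive_on_day_off)
```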

Functions

Writing a function is like teaching the computer a new word.

Naming Functions: You can name your functions whatever you want, but you can't have spaces in the function name. Instead of spaces, use underscores ( _ ) like_this_for_example

Make sure that all the code inside your function is indented one level!

Defining a Function

We define a function to teach the computer the instructions for a new word. We need to use the term def to tell the computer we’re creating a function.

def name_of_your_function():
    # Code that will run when you make a call to
    # this function.
    pass

# Example:

# Teach the computer to add two numbers
num_one = 1
num_two = 2
def add_numbers():
    sum = num_one + num_two

Returning Values in Functions

We can use the command return to have a function give a value back to the code that called it. Without the return command, we could not use any altered values that were determined by the function.

# We add a return statement in order to use the value of the 
# sum variable
num_one = 1
num_two = 2
def add_numbers():
    sum = num_one + num_two
    return sum
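
To see why the return statement matters, the sketch below compares two versions of the same function. The function names are hypothetical. A function without a return statement gives back the special value None.

```python
# Hypothetical function that computes a total but never returns it
def add_without_return(first, second):
    total = first + second

# Hypothetical function that returns the total to the caller
def add_with_return(first, second):
    total = first + second
    return total

missing_result = add_without_return(1, 2)
result = add_with_return(1, 2)

print(missing_result)   # None, because nothing was returned
print(result)           # 3
```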

Calling a Function

We call a function to tell the computer to actually carry out the new command.

# Call the add_numbers() function once
# The computer will return a value of 3
add_numbers()

# Call the add_numbers() function 3 times and print the output
# The output will be the number 3 printed on 3 separate lines
print(add_numbers())
print(add_numbers())
print(add_numbers())

Using Parameters in Functions

We can use parameters to alter certain commands in our function. We have to include arguments for the parameters in our function call.

# In this program, parameters are used to give two numbers
def add_numbers(num_one, num_two):
    sum = num_one + num_two
    return sum

# We call the function with values inside the parentheses
# This program will print 7
print(add_numbers(3, 4))

# If a list has the same number of items as the function has
# parameters, we can unpack its items as arguments using an asterisk
my_list = [3, 4]
print(add_numbers(*my_list))

Creating a List

We create a list by listing items inside square brackets. We can include elements of any type.

# Create an empty list
my_list = []

# Create a list with any number of items
my_list = [item1, item2, item3]
# Example:
number_list = [1, 2, 4]

# A list can have any type
my_list = [integer, string, boolean]
# Example:
a_list = ["hello", 4, True]

Altering a List

Due to the mutable nature of lists, we can alter individual elements in the list.

# Access an element in a list
a_list = ["hello", 4, True]
first_element = a_list[0]    # Returns "hello"

# Set an element in a list
a_list = ["hello", 4, True]
a_list[0] = 9       # Changes a_list to be [9, 4, True]

# Looping over a list
# Prints each item on a separate line (9, then 4, then True)
a_list = [9, 4, True]
for item in a_list:
    print(item)

# Length of a list
a_list = [9, 4, True]
a_list_length = len(a_list)  # Returns 3

# List comprehension: builds a list by evaluating the first
# expression once for each value in the loop
# This will create a list with numbers 0 to 4
a_list = [x for x in range(5)]
# This will create a list with multiples of 2 from 0 to 8
list_of_multiples = [2*x for x in range(5)]

Series

A Series is a one-dimensional array. It is formatted similarly to one column in a table. By default, a Series includes indices that start at 0 and number the rows.

# Import the pandas library
import pandas as pd

# Creates a Series using a list

scores = pd.Series([96, 88, 89, 90])

# Creates a Series using a list AND specifying the indices

ingredients = pd.Series(["6 ounces", "1 cup", 
"2 large", "1 cup"], index=["Coffee", "Milk", 
"Eggs", "Sugar"])

# Creates a Series using a Python dictionary.
# The key becomes the index.

s = {"Los Angeles Dodgers": 2020, "New York Yankees": 2009, 
    "Boston Red Sox": 2018, "Chicago Cubs": 2016, 
    "San Francisco Giants": 2014, "Colorado Rockies": None}
    
world_series = pd.Series(s)

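Searches for an item in the Series

# The in keyword checks the index labels of a Series
"mouse" in name_of_series # Returns True or False

# To search the values instead, use .values
2002 in name_of_series.values # Returns True or False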

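One detail worth checking when searching a Series: the in keyword tests the index labels, not the stored values. The sketch below uses a shortened, hypothetical version of a World Series dictionary to show the difference.

```python
import pandas as pd

# Hypothetical two-team Series; the team names become the index
world_series = pd.Series({"Chicago Cubs": 2016, "Boston Red Sox": 2018})

# in checks the index labels
label_found = "Chicago Cubs" in world_series

# 2016 is a value, not a label, so this is False
value_in_index = 2016 in world_series

# Use .values to search the stored values
value_found = 2016 in world_series.values

print(label_found)
print(value_in_index)
print(value_found)
```
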
Statistics

The following functions return summary statistics using data in a Series or DataFrame.

# Returns all statistics at one time
df.describe()

# Or return each measure separately
df.mean()
df.median()
df.mode()
df.min()
df.max()
df.count()

The following functions return measures of spread for the dataset.

# Returns the variance and the standard deviation 
df.var()
df.std()

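# Find the range using the max and min values
# (avoid naming variables max, min, or range, since those
# names belong to built-in Python functions)
max_value = people_named_anna.max()
min_value = people_named_anna.min()
value_range = max_value - min_value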

# Find the interquartile range using the first and third 
# quartile values
Q1 = people_named_anna.quantile(0.25)
Q3 = people_named_anna.quantile(0.75)
IQR = Q3 - Q1
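
As a quick check of these measures, the sketch below computes the range and the interquartile range for a small Series. The data and variable names are hypothetical, and the quartiles use the default interpolation of the quantile function.

```python
import pandas as pd

# Hypothetical data chosen for illustration
values = pd.Series([2, 4, 6, 8, 10])

# Range: largest value minus smallest value
value_range = values.max() - values.min()

# Interquartile range: third quartile minus first quartile
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

print(value_range)
print(iqr)
```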

Dictionaries

Dictionaries hold a collection of key-value pairs.

a_dictionary = {key1:value1, key2:value2}
# Example:
# This dictionary keeps a farm's animal count
my_farm = {"pigs": 2, "cows": 4}

# Creates an empty dictionary
a_dictionary = {}

# Inserts a key-value pair
a_dictionary[key] = value
my_farm["horses"] = 1      # The farm now has one horse

# Gets a value for a key
my_dict[key] # Will return the value stored for that key
my_farm["pigs"]           # Will return 2, the value of "pigs"

# Using the 'in' keyword
my_dict = {"a": 1, "b": 2}
print("a" in my_dict)   # Returns True
print("z" in my_dict)   # Returns False
print(2 in my_dict)     # Returns False, 2 is not a key

# Iterating through a dictionary
for key in my_dict:
    print("key: " + str(key))
    print("value: " + str(my_dict[key]))
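
The operations above can be combined to count how often each item appears in a list. This is only a sketch, and the list of animals is made up for the example.

```python
# Hypothetical list of animals seen on a farm
animals = ["pig", "cow", "pig", "horse", "pig"]

counts = {}
for animal in animals:
    if animal in counts:
        # The key already exists, so add one to its value
        counts[animal] = counts[animal] + 1
    else:
        # First time we see this key, so insert it
        counts[animal] = 1

for key in counts:
    print(key + ": " + str(counts[key]))
```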

DataFrames

A DataFrame is a two-dimensional data structure. The data is aligned in a tabular fashion in rows and columns. By default, DataFrames include indices that start at 0 and number the rows.

# Creates a DataFrame using a Python dictionary.

data = {"mammal": ["African Elephant", "Bottlenose Dolphin", 
        "Cheetah", "Domestic Cat"],
        "life_span": [70, 25, 14, 16]
    }

mammals = pd.DataFrame(data)

DataFrame Functions

# Returns the data type of each column
df.dtypes

# Returns the number of rows and columns as (rows, columns)
df.shape

# Returns summary statistics about each column 
df.describe()

# Returns summary statistics, rounding to one decimal
round(df.describe(), 1)

Filtering

iloc

Index-based selection (iloc) selects rows and columns by their index location or address in the table.

# Returns rows from index location 0 to 1 
# and columns from index location 3 to 6
df.iloc[0:2, 3:7]

loc

Label-based selection (loc) selects rows and columns by their label or name in the table.

# Returns rows with the index 8 through 12 
# and columns named "country" and "score"
df.loc[8:12, ["country","score"]]

Conditional Filtering

Conditions can be used together with loc to keep only the rows whose values meet the condition.

# Returns only rows with a score higher than 7
# and only the score column
df.loc[df.score > 7, ["score"]]
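
To compare the three kinds of selection side by side, the sketch below builds a small DataFrame and filters it each way. The country names and scores are hypothetical. Note that iloc excludes the end position while loc includes the end label.

```python
import pandas as pd

# Hypothetical data chosen for illustration
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "score": [6.5, 7.2, 8.1, 5.9]
})

# iloc: rows at positions 0 and 1 (position 2 is excluded)
first_two = df.iloc[0:2, 0:2]

# loc: rows labeled 1 through 2 (label 2 is included)
middle_rows = df.loc[1:2, ["country", "score"]]

# Condition: only rows with a score higher than 7
high_scores = df.loc[df.score > 7, ["score"]]

print(first_two)
print(middle_rows)
print(high_scores)
```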

Boxplots

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Plots all data

df.plot(kind="box")
plt.show()

Plots only one column

df["column"].plot(kind="box")
plt.show()

Plots a list of specific columns

df[["column1", "column2"]].plot(kind="box")
plt.show()

An alternative way to plot boxplots

df.boxplot(column=["column1", "column2"])
plt.show()

Histograms

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Plots one column, adds a title and an edgecolor

df["column1"].plot(kind="hist", title="Histogram", edgecolor="black")
plt.show()

Plots two columns on one grid

plt.hist(df[["column1", "column2"]])
plt.show()

Plots two columns on separate grids

df.hist(column=["column1", "column2"])
plt.show()

Pie Charts

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Simple Pie Chart

# Groups by a specific column and sums up the total

df1 = df.groupby("column1").sum()

# Plots using the sums and another column

df1.plot.pie(y="column2", labels=df1.index)
plt.show()

Advanced Pie Chart

# Specify the colors used
colors = ["lightcoral", "lightskyblue", "gold"]

# Set the middle section to "explode"
explode = [0, 0.1, 0]

# Plot the pie chart using the data frame
# Organize it by a specific column
# Set the angle at which the first wedge starts
# Display percentages
df.plot.pie(y="column", colors=colors, explode=explode, 
startangle=45, autopct="%1.1f%%")

# Place the legend in the upper right corner
plt.legend(loc="upper right")
plt.show()

Scatterplots

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Plots a scatterplot that relates two columns

df.plot(kind="scatter", x="column1", y="column2")
plt.show()

Styling Options

# Sets a color and size (s) for the points

df.plot(kind="scatter", x="column1", y="column2", 
color="orange", s = 10)
plt.show()

Line Charts

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Plot line chart using two sets of data (x1, y1 and x2, y2)

You may need to sort the data first. See the Sorting section.

# Set x1 and y1

x1 = df.age.loc[df.sex == "f"]
y1 = df.height.loc[df.sex == "f"]

# Set x2 and y2
x2 = df.age.loc[df.sex == "m"]
y2 = df.height.loc[df.sex == "m"]

# Plot and customize each line
plt.plot(x1, y1)
plt.plot(x2, y2)

plt.show()    

More Options

# Add labels
plt.xlabel("Age")
plt.ylabel("Height")
plt.title("Height of School Children")

# Add a legend
plt.legend(["Females", "Males"])

Bar Charts

Be sure to import the Matplotlib library for visualizations

import matplotlib.pyplot as plt

Plot bar chart using two columns of data (x and y)

# Set color, width, and edgecolor of bars

plt.bar(x=df.column1, height=df.column2, width=1, 
edgecolor="black", color="#EA638C")
plt.show()   

More Options

# Add labels and a title

plt.xlabel("Month")
plt.ylabel("Temperature (°F)")
plt.title("Average GA Temps", fontsize=22)

# Adjust grid and rotation of x ticks

plt.grid(False)
plt.xticks(rotation=45)

Plot bar chart using three columns of data (x1, x2 and y)

# Set the width of the bar

bar_width = 0.4

# Plot first dataset

plt.bar(x=df.column1, height=df.column2, 
width=bar_width, color="#EA638C")

# Plot second data set.
# Add the bar width to the x value so that the bars 
# do not overlap 

plt.bar(x=df.column3 + bar_width, height=df.column2, 
width=bar_width, color="#190E4F")

# Add a legend

plt.legend(["First Column", "Second Column"])
plt.show() 

Normal Distribution

Be sure to import the Matplotlib library and the SciPy library

import matplotlib.pyplot as plt
from scipy.stats import norm
You will also need to include scipy in the requirements.txt file

Plot the Data

# Set data to be the values in a specific column

data = df.column

# Plot the histogram (w/density)

plt.hist(data, bins=10, density=True) 
plt.show()

Plot the Normal Distribution Curve

# Determine the mean, median and std

mean = data.mean()
median = data.median()
std = data.std()

# Set up min and max of the x-axis using the mean and standard deviation

xmin = mean - 3 * std
xmax = mean + 3 * std

# Define the x-axis values

x = range(int(xmin), int(xmax)) 

# "Norm" the y-axis values based on the x-axis values, the mean and the std

y = norm.pdf(x, mean, std)

# Plot the graph using the x and the y values

plt.plot(x, y, color="orange", linewidth=2) 
plt.show()

Determine the Likelihood

The pdf finds the likelihood of an exact event. The value is used to graph the normal distribution, but is not typically used in determining likelihood since it is usually a very low number.
pdf = norm.pdf(x_value, mean, std) 
print(pdf)
The cdf finds the cumulative likelihood.
What is the likelihood that a value is less than the x_value?
cdf = norm.cdf(x_value, mean, std) 
print(cdf)
What is the likelihood that a value is more than the x_value?
more_than_cdf = 1 - norm.cdf(x_value, mean, std) 
print(more_than_cdf)
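
A common follow-up question is the likelihood that a value falls between two numbers, which is the difference of two cdf values. The sketch below uses NormalDist from Python's built-in statistics module instead of SciPy's norm, so it runs without extra packages; its cdf method plays the same role as norm.cdf. The mean of 100 and standard deviation of 15 are made-up values.

```python
from statistics import NormalDist

# Hypothetical mean and standard deviation
mean = 100
std = 15
distribution = NormalDist(mean, std)

# Likelihood that a value is less than each boundary
less_than_85 = distribution.cdf(85)
less_than_115 = distribution.cdf(115)

# Likelihood that a value falls between 85 and 115
between = less_than_115 - less_than_85
print(round(between, 3))
```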

Linear Regression

Be sure to import the Matplotlib library and the NumPy library

import numpy as np
import matplotlib.pyplot as plt

Determining Correlation

Linear regression is only meaningful for values that are correlated.
# Set the x and y values

x = df.column1
y = df.column2

# Determine and display the correlation

correlation = y.corr(x)
print(correlation)    

Determining the Line of Best Fit

The model will print as an array that includes the slope and the y-intercept: [ m, b ]
# Determine the model equation

model = np.polyfit(x, y, 1)
print(model)

Predicting Using a Model

# Create the predict function

predict = np.poly1d(model)

# Use the predict function

value = 60

prediction = predict(value)

print(prediction)   
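
To check that the model behaves as expected, the sketch below fits a line to made-up data that follows y = 2x + 1 exactly, so the slope should come out close to 2 and the y-intercept close to 1. The data values are hypothetical.

```python
import numpy as np

# Hypothetical data that follows y = 2x + 1 exactly
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

# Fit a line (degree 1); the result holds the slope, then the y-intercept
model = np.polyfit(x, y, 1)
slope = model[0]
intercept = model[1]

# Build the predict function and use it on a new x value
predict = np.poly1d(model)
prediction = predict(60)

print(round(slope, 2))
print(round(intercept, 2))
print(round(prediction, 2))
```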

Plotting the Line of Best Fit

# Determine the min and max values of the x-axis

x_min = int(df.wait_time.min())
x_max = int(df.wait_time.max())

# Create the line of best fit
# range is based on the min and max values determined above

x_lin_reg = range(x_min, x_max + 1)
y_lin_reg = predict(x_lin_reg)
plt.plot(x_lin_reg, y_lin_reg)
  
plt.show()  

User Input

We can use input from the user to control our code. The input is saved as a string by default.

# If the input is a string.
name = input("What is your name? ")

# If the input needs to be used as a number include 
# the term 'int' or 'float'
num_one = int(input("Enter a number: "))
num_two = int(input("Enter a second number: "))
num_three = float(input("Enter a third number: "))

If/Else Statements

We can tell the computer how to make decisions using if/else statements. Make sure that all the code inside your if/else statement is indented one level!

If Statements

Use an if statement to instruct the computer to do something only when a condition is true. If the condition is false, the command indented underneath will be skipped.

if BOOLEAN_EXPRESSION:
    print("This executes if BOOLEAN_EXPRESSION is True")

# Example:

# This will only print if the user enters a negative number
number = int(input("Enter a number: "))
if number < 0:
    print(str(number) + " is negative!")

If/Else Statements

Use an if/else statement to force the computer to make a decision between multiple conditions. If the first condition is false, the computer will skip to the next condition until it finds one that is true. If no conditions are true, the commands inside the else block will be performed.

if condition_1:
    print("This executes if condition_1 evaluates to True")
elif condition_2:
    print("This executes if condition_2 evaluates to True")
else:
    print("This executes if no prior conditions are True")

# Example:

# This program will print that the color is secondary
color = "purple"
if color == "red" or color == "blue" or color == "yellow":
    print("Primary color.")
elif color == "green" or color == "orange" or color == "purple":
    print("Secondary color.")
else:
    print("Not a primary or secondary color.")
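
The order of the conditions matters, because only the first true condition runs. The sketch below uses a hypothetical letter_grade function with made-up cutoffs to show this: a score of 95 also satisfies score >= 80, but the first matching branch is the one that runs.

```python
# Hypothetical function and grade cutoffs chosen for illustration
def letter_grade(score):
    # Conditions are checked from top to bottom; the first True one runs
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    else:
        return "Needs improvement"

print(letter_grade(95))
print(letter_grade(85))
print(letter_grade(42))
```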

Loops

Loops help us repeat commands which makes our code much shorter. Make sure everything inside the loop is indented one level!


For Loops

Use for loops when you want to repeat something a fixed number of times.

# This for loop will print "hello" 5 times
for i in range(5):
    print("hello")

# This for loop will print out even numbers 2 through 10
for number in range(2, 11, 2):
    print(number)

# This code executes on each item in my_list
# This loop will print 1, then 5, then 10, then 15
my_list = [1, 5, 10, 15]
for item in my_list:
    print(item)

While Loops

Use while loops when you want to repeat something an unknown number of times or until a condition becomes false. If there is no point where the condition becomes false, you will create an infinite loop which should always be avoided!

# This program will run as long as the variable 'number' is greater
# than or equal to 0
# Counts down from 10 to 0
number = 10
while number >= 0:
    print(number)
    number -= 1

# You can also use user input to control a while loop
# This code will continue running while the user answers 'Yes'
# (continue is a reserved word in Python, so it cannot be used
# as a variable name)
keep_going = input("Continue code?: ")
while keep_going == "Yes":
    keep_going = input("Continue code?: ")
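
A while loop is also useful when the number of repetitions depends on a calculation rather than on user input. The sketch below doubles a value until it reaches at least 1000 and counts the steps; the starting value and the limit are made up for this example.

```python
# Hypothetical starting value and limit
value = 1
steps = 0

# The condition becomes False once value reaches 1000 or more,
# so the loop is guaranteed to end
while value < 1000:
    value = value * 2
    steps += 1

print(value)
print(steps)
```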

Strings

Strings are pieces of text. We can gain much information about strings and alter them in many ways using various methods.

Indexing a String

We use indexing to find or take certain portions of a string. Index values always start at 0 for the first character and increase by 1 as we move to the right. From the end of the string, the final value also has an index of -1 with the values decreasing by 1 as we move to the left.

# Prints a character at a specific index
my_string = "hello!"
print(my_string[0])       # print("h")
print(my_string[5])       # print("!")

# Prints all the characters from the specific index onward
my_string = "hello world!"
print(my_string[1:])      # print("ello world!")
print(my_string[6:])      # print("world!")

# Prints all the characters before the specific index
my_string = "hello world!"
print(my_string[:6])     # print("hello ")
print(my_string[:1])     # print("h")

# Prints the characters from the first index up to, but not
# including, the second index
my_string = "hello world!"
print(my_string[1:6])      # print("ello ")
print(my_string[4:7])      # print("o w")

# Iterates through every character in the string
# Will print one letter of the string on each line in order
my_string = "Turtle"
for c in my_string:
    print(c)

# Completes commands if the string is found inside the given string
my_string = "hello world!"
if "world" in my_string:
    print("world")

# Concatenation
my_string = "Tracy the"
print(my_string + " turtle")    # print("Tracy the turtle")

# Splits the string into a list of letters
my_string = "Tracy"
my_list = list(my_string)       # my_list = ['T', 'r', 'a', 'c', 'y']

# Using enumerate will print the index number followed by a colon and the
# word at that index for each word in the list
my_string = "Tracy is a turtle"
for index, word in enumerate(my_string.split()):
    print(str(index) + ": " + word)
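
Negative indices count from the end of the string, as described above. The short sketch below shows them on the same "hello!" string.

```python
my_string = "hello!"

# -1 is the final character, -2 is the one before it
last_character = my_string[-1]
second_to_last = my_string[-2]

# A negative index also works as the start of a slice
last_three = my_string[-3:]

print(last_character)
print(second_to_last)
print(last_three)
```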

String Methods

There are many methods that can be used to alter strings.

# upper: To make a string all uppercase
my_string = "Hello"
my_string = my_string.upper()     # returns "HELLO"

# lower: To make a string all lowercase
my_string = "Hello"
my_string = my_string.lower()     # returns "hello"

# isupper: Returns True if a string is all uppercase letters and False otherwise
my_string = "HELLO"
print(my_string.isupper())        # returns True

# islower: Returns True if a string is all lowercase letters and False otherwise
my_string = "Hello"
print(my_string.islower())         # returns False

# swapcase: Returns a string where each letter is the opposite case from original
my_string = "PyThOn"
my_string = my_string.swapcase()  # returns "pYtHoN"

# strip: Returns a copy of the string without any whitespace at beginning or end
my_string = "       hi there       "
my_string = my_string.strip()     # returns "hi there"

# find: Returns the lowest index in the string where substring is found
# Returns -1 if substring is not found
my_string = "eggplant"
index = my_string.find("plant")   # returns 3
index = my_string.find("Tracy")   # returns -1

# split: Splits the string into a list of words at whitespace
my_string = "Tracy is a turtle"
my_list = my_string.split()       # Returns ['Tracy', 'is', 'a', 'turtle']
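
String methods can be chained, since each one returns a new string. The sketch below cleans up a messy piece of text before searching and splitting it; the sample text is made up for this example.

```python
# Hypothetical messy text
raw_text = "   Tracy Is A Turtle   "

# Remove the outer whitespace, then make everything lowercase
cleaned = raw_text.strip().lower()

# Search and split the cleaned string
position = cleaned.find("turtle")
words = cleaned.split()

print(cleaned)
print(position)
print(words)
```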

Set Index

View a dataframe using a different index. This will not change the data frame.

df.set_index("column")

Modify and change the data frame to use a new column as the index.

df.set_index("column", inplace=True)

Reset the index by renumbering the rows.

df.reset_index(inplace=True)

Creating Columns

Create a new column with new values.

df["new_column"] = [1, 2, 3, 4, 5, 6]  

Create a new column using a function.

df["new_column"] = function(df["column1"], df["column2"])   

Importing Data

Import data from a CSV file.

# Import the data
df = pd.read_csv("data.csv")

# Remove max columns limitation and show all columns. 
pd.set_option("display.max_columns", None)   

Data Cleaning

Dropping Data

# Drop unnecessary columns
df = df.drop(["column1", "column2"], axis=1)   

Determine missing values in each column.

df.isnull().sum()    

Drop missing values.

# Drop rows that contain missing values
df.dropna()

# Drop columns that contain missing values
df.dropna(axis=1)

Fill in missing values.

# Fill in with a specific value
df.fillna(0, inplace=True)

# Fill in with the value from the next row (backward fill)
df.bfill()

# Fill in with the value from the column before it (forward fill)
df.ffill(axis=1)

Determine the number of duplicate rows.

df.duplicated().sum()   

Find the duplicated row(s).

df.loc[df.duplicated()]  

Drop duplicate rows.

df.drop_duplicates(inplace=True)  

Change the data type of a column.

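# Change the data type to a specific data type
# (astype returns a new column, so assign it back)
df["column"] = df["column"].astype(data_type)

# Change the data type to a numeric type (integer or float)
df["column"] = pd.to_numeric(df["column"])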
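
The cleaning steps above might be combined as in the sketch below. The DataFrame is hypothetical and contains one duplicate row and one missing value.

```python
import pandas as pd

# Hypothetical data with one duplicate row and one missing value
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "score": [90.0, 85.0, 85.0, None]
})

# Count the missing values in the score column
missing_scores = df.isnull().sum()["score"]

# Count the duplicate rows
duplicate_rows = df.duplicated().sum()

# Drop the duplicate rows, then fill the missing value with 0
cleaned = df.drop_duplicates().fillna(0)

print(missing_scores)
print(duplicate_rows)
print(cleaned)
```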

Grouping/Sorting

Grouping

# Groups and returns the count 

df.groupby("value_to_group_by").column.count()

# Groups and returns the maximum value in two columns

df.groupby("value_to_group_by")[["column1", "column2"]].max()

# Groups and returns the min, max and sum of the 
# values in a column 

df.groupby("value_to_group_by").column.agg(["min", "max", "sum"])

# Groups and returns the sorted list of values in a column 

df.groupby("value_to_group_by").column.agg([sorted])

Sorting

# Sort values (increasing/ascending)

df.sort_values(by="sorting_value")

# Sorts one column of values (decreasing/descending) and 
# then by another (increasing/ascending)

df.sort_values(by=["sort1", "sort2"], ascending=[False, True])
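
As an illustration of grouping and sorting on real values, the sketch below uses a hypothetical DataFrame of teams and points.

```python
import pandas as pd

# Hypothetical data chosen for illustration
df = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [3, 5, 7, 1]
})

# Group by team and add up the points for each team
totals = df.groupby("team").points.sum()

# Group by team and find the smallest and largest value
summary = df.groupby("team").points.agg(["min", "max"])

# Sort all rows from the most points to the fewest
ranked = df.sort_values(by="points", ascending=False)

print(totals)
print(summary)
print(ranked)
```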

Combining Datasets

To concatenate or merge datasets, make sure that the column names match between the different datasets.

# Concatenating two datasets:
# add second data set on as new rows
# use the reset_index function to renumber the rows
# (drop=True discards the old row numbers)

combined_df = pd.concat([df1, df2]).reset_index(drop=True)

# Merging/Joining two datasets:
# Merge everything from both data sets
pd.merge(df1, df2, on="name", how="outer")

# Merge only values that exist in BOTH data sets
pd.merge(df1, df2, on="name", how="inner")

# Keep everything in the first data set and 
# merge in matching values from the second 
pd.merge(df1, df2, on="name", how="left")

# Keep everything in the second data set and 
# merge in matching values from the first
pd.merge(df1, df2, on="name", how="right")
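
To see how the how option changes the result, the sketch below merges two small, hypothetical data sets that share only one name ("Bob").

```python
import pandas as pd

# Hypothetical data sets; only "Bob" appears in both
df1 = pd.DataFrame({"name": ["Ann", "Bob"], "age": [14, 15]})
df2 = pd.DataFrame({"name": ["Bob", "Cy"], "height": [160, 170]})

# outer keeps every name from both data sets
outer = pd.merge(df1, df2, on="name", how="outer")

# inner keeps only the names found in both data sets
inner = pd.merge(df1, df2, on="name", how="inner")

# left keeps every name from the first data set
left = pd.merge(df1, df2, on="name", how="left")

print(len(outer))
print(len(inner))
print(len(left))
```

Rows without a match in the other data set are filled with missing values (NaN) in the columns that come from it.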