Introduction to Python
Week 4
ZOOM LINK: https://ualberta-ca.zoom.us/j/94407584174
This week we’re going to do a very brief introduction to working with data in Python. While there are lots of ways to deal with data in Python, the most common is the library Pandas. You can learn more about Pandas [here](https://pandas.pydata.org) if you’re interested. We’re going to go over some basic data operations and how to deal with larger datasets in Python.
In this video, Cole will walk you through how we’ll be loading data into Google Colab so it’s accessible to us to work with.
In this video, Peter will go over how to use the Pandas library to manipulate data and work more effectively with dataframes.
Please click here to fill out the post-session feedback form! We take your feedback into account in real time, so if there are any changes we can make to help you learn better next week, we will make them!
Content Summary
Importing Data
Importing data via Google Colab is slightly different than doing it in a desktop text editor. However, it’s still very easy. We’ll load in the file we need via the `google.colab` library.
from google.colab import files
data_to_load = files.upload()
Now that we have manually navigated to our file and uploaded it, we’ll use the `pandas` library and the `io` library to read in our CSV so we can use it.
import io
import pandas as pd
df = pd.read_csv(io.BytesIO(data_to_load['penguins.csv']))
We can check out our dataframe by just printing it in a cell.
df
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |
343 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | female | 2009 |
344 rows × 8 columns
Data manipulation with pandas
As data scientists, the `pandas` library in Python is incredibly important, not just for reading in data but for manipulating and exporting that data using Python’s computational power. Let’s begin by reading in the file we worked with in the previous document:
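The read-in step itself isn’t shown here, so below is a minimal sketch of how the `penguins` frame used in the rest of this document could be created. The inline CSV is a stand-in so the snippet runs anywhere; in the lesson, the data would come from the Colab upload shown earlier (`pd.read_csv(io.BytesIO(data_to_load['penguins.csv']))`).

```python
import io
import pandas as pd

# Tiny stand-in for penguins.csv so this sketch is self-contained;
# outside Colab you could equally pass a local path or a URL to read_csv.
csv_text = """species,island,bill_length_mm
Adelie,Torgersen,39.1
Gentoo,Biscoe,46.1
"""

penguins = pd.read_csv(io.StringIO(csv_text))
print(penguins.shape)  # (2, 3)
```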
We will quickly introduce you to a few functions that will be vital for viewing, organizing, and manipulating your data sets by performing them on the penguin data.
Viewing data
One of the simpler functions is `head`. Just like in R, this function allows you to take a quick peek at your data by viewing the first few entries. When you’re working with very big data and would like a quick check on the structure of your data, this function is very useful.
penguins.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year | |
---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
It is typical to use `head` as a quick sanity check whenever you import data to ensure it’s been imported correctly. There is also a `tail` function that does the same, but with the end of your data. To get a more comprehensive summary of your data, you can use the `info` function, which gives a quick overview of every variable in your data frame. This is especially important if your data has any null values hiding in the file; you can quickly sniff them out using `info`.
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
7 year 344 non-null int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
As you can see, we have a few null values hiding in our penguin data here. We can deal with those later.
For continuous variables, we can get a more thorough statistical summary of the data using the `describe` function.
penguins.describe()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year | |
---|---|---|---|---|---|
count | 342.000000 | 342.000000 | 342.000000 | 342.000000 | 344.000000 |
mean | 43.921930 | 17.151170 | 200.915205 | 4201.754386 | 2008.029070 |
std | 5.459584 | 1.974793 | 14.061714 | 801.954536 | 0.818356 |
min | 32.100000 | 13.100000 | 172.000000 | 2700.000000 | 2007.000000 |
25% | 39.225000 | 15.600000 | 190.000000 | 3550.000000 | 2007.000000 |
50% | 44.450000 | 17.300000 | 197.000000 | 4050.000000 | 2008.000000 |
75% | 48.500000 | 18.700000 | 213.000000 | 4750.000000 | 2009.000000 |
max | 59.600000 | 21.500000 | 231.000000 | 6300.000000 | 2009.000000 |
Notice that the `species`, `island`, and `sex` columns are automatically left out, and that null values are excluded column by column (hence the counts of 342). This can be quite handy for getting a quick summary of your continuous data before really diving into it.
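If you do want the string columns summarized as well, `describe` can be pointed at them explicitly with its `include` parameter. A quick sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 46.1],
})

# include="object" summarizes the string columns that describe() skips
# by default, reporting count, unique, top, and freq instead of statistics.
summary = df.describe(include="object")
print(summary.loc["unique", "species"])  # 2
```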
Manipulating variable names
For now, let’s focus on one of the most important features of data frames, something that makes them superior to plain arrays in Python. Notice when we took a quick look at our penguin data that each variable is given a string name. We can look at these names again using the `columns` attribute.
penguins.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
dtype='object')
Renaming these columns is very easy, and we often might want to do so depending on our question. Let’s just try a quick example; maybe we want to shorten the name “year” to “yr” for conciseness. We use the `rename` function to accomplish this:
penguins.rename(columns = {'year': 'yr'}, inplace = True)
penguins.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex', 'yr'],
dtype='object')
A couple quick notes here: the `inplace = True` parameter modifies the object `penguins` directly, without us needing an assignment statement anywhere in our code. In other words, it saves us a bit of typing, as otherwise we would have to write `penguins = penguins.rename(columns = {'year': 'yr'})`. Additionally, the notation used to specify the renaming is known as a dictionary in Python; we won’t cover dictionaries in detail here, but they are used for many things in Python.
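For what it’s worth, that same dictionary notation lets you rename several columns in one call, since the dict simply maps each old name to its new name. A quick sketch on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame(columns=["bill_length_mm", "bill_depth_mm"])

# One dict, several renames: each key is an old name, each value the new one.
df = df.rename(columns={"bill_length_mm": "bill_len",
                        "bill_depth_mm": "bill_dep"})
print(list(df.columns))  # ['bill_len', 'bill_dep']
```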
We can also select variables individually using the column name for any variable. Here, the notation is similar to traditional arrays, as we use the square brackets and the name of the column we would like to select. We can use this to get the statistics for any variable, including discrete variables such as `species`.
penguins["species"].describe()
count 344
unique 3
top Adelie
freq 152
Name: species, dtype: object
More thoroughly, we can get a frequency table for every value of `species`.
penguins["species"].value_counts()
Adelie 152
Gentoo 124
Chinstrap 68
Name: species, dtype: int64
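`value_counts` can also report proportions instead of raw counts via its `normalize` parameter. A sketch on a toy series standing in for `penguins["species"]` (values made up for illustration):

```python
import pandas as pd

species = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

# normalize=True divides each count by the total, giving proportions.
props = species.value_counts(normalize=True)
print(props["Adelie"])  # 0.5
```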
We can also very easily make new columns using the same notation. Let’s say we want to know if a penguin is above average in size. Maybe we want to make a new boolean column displaying this. We can do that easily using the notation below.
penguins["above_avg_mass_bool"] = penguins["body_mass_g"] > penguins["body_mass_g"].mean()
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
7 yr 344 non-null int64
8 above_avg_mass_bool 344 non-null bool
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 22.0+ KB
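The same bracket notation handles ordinary arithmetic too, since an operation on a column is applied to every row at once. For instance, a hypothetical mass-in-kilograms column, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# Column arithmetic is vectorized: this divides every row in one step.
df["body_mass_kg"] = df["body_mass_g"] / 1000

print(df["body_mass_kg"].tolist())  # [3.75, 3.8, 3.25]
```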
Dealing with bad data
Now, let’s get back to those missing values. We did have some NAs in our data, and depending on what task you are looking to complete, these values can be dealt with in many different ways. Sometimes we want to set them to a different value (e.g., 0), but sometimes it is just best to leave them as is. More often though, it is customary to simply remove these rows from the data frame as a means to smooth out future analysis. This is done rather simply in Python, using the `dropna` command.
penguins_good = penguins.dropna()
penguins_good.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 333 non-null object
1 island 333 non-null object
2 bill_length_mm 333 non-null float64
3 bill_depth_mm 333 non-null float64
4 flipper_length_mm 333 non-null float64
5 body_mass_g 333 non-null float64
6 sex 333 non-null object
7 yr 333 non-null int64
8 above_avg_mass_bool 333 non-null bool
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 23.7+ KB
Notice how the `inplace = True` argument is left out here; we could have included it in the call to `dropna`, but we are already assigning this revised data frame to a new variable, and we don’t want to modify `penguins` just yet. More importantly, notice how we have 333 entries in total (based on the second line of the printed output) and, for each variable, all 333 are non-null!
As mentioned before, we don’t always want to simply remove null values; we might occasionally want to replace them with something else. Here I think it would make the most sense to remove them given the data, but for pedagogical purposes, let’s replace them instead.
penguin_bill_length = penguins["bill_length_mm"]
penguin_bill_length.head()
0 39.1
1 39.5
2 40.3
3 NaN
4 36.7
Name: bill_length_mm, dtype: float64
Let’s pretend we want to replace null values with the mean bill length from the entire database. First we need to get the mean, and then we need to replace these null values with that number. This takes a couple lines of code.
mean_bill_length = penguin_bill_length.mean()
penguin_bill_length.fillna(mean_bill_length, inplace = True)
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 344 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
7 yr 344 non-null int64
8 above_avg_mass_bool 344 non-null bool
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 22.0+ KB
Look closely: now the `bill_length_mm` column of the data frame doesn’t have any null values anymore! That being said, I am going to remove all the null values going forward, using the same command as earlier but with `inplace = True` this time.
penguins.dropna(inplace = True)
penguins.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 333 non-null object
1 island 333 non-null object
2 bill_length_mm 333 non-null float64
3 bill_depth_mm 333 non-null float64
4 flipper_length_mm 333 non-null float64
5 body_mass_g 333 non-null float64
6 sex 333 non-null object
7 yr 333 non-null int64
8 above_avg_mass_bool 333 non-null bool
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 23.7+ KB
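A handy companion to `info` for missing data is `isna().sum()`, which counts the nulls in each column directly rather than making you subtract non-null counts by eye. A sketch with a small frame containing deliberate gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["male", None, "female"],
})

# isna() marks each missing cell True; summing the booleans counts them
# per column.
null_counts = df.isna().sum()
print(null_counts["bill_length_mm"])  # 1
```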
Manipulating rows of data
We discussed how to rename columns, but selecting specific rows (or observations) in your data frame can also be important. We do this using either the `loc` or `iloc` indexer. `loc` selects rows by label (here, our rows only carry the default integer labels, so it is less interesting), while `iloc` selects by position.
print(penguins.iloc[0])
# gets just one row
print(penguins.iloc[0:3])
# gets the rows at positions 0 to 2; note that position 3 is not included
species Adelie
island Torgersen
bill_length_mm 39.1
bill_depth_mm 18.7
flipper_length_mm 181
body_mass_g 3750
sex male
yr 2007
above_avg_mass_bool False
Name: 0, dtype: object
species island bill_length_mm ... sex yr above_avg_mass_bool
0 Adelie Torgersen 39.1 ... male 2007 False
1 Adelie Torgersen 39.5 ... female 2007 False
2 Adelie Torgersen 40.3 ... female 2007 False
[3 rows x 9 columns]
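Both indexers also accept a second argument for columns. A sketch on a toy frame, which also shows a subtle difference: `iloc` slices exclude the endpoint (like regular Python slicing), while `loc` slices include it.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 46.1],
})

# iloc: rows at positions 0-1, first two columns (endpoint excluded).
subset = df.iloc[0:2, 0:2]
print(subset.shape)  # (2, 2)

# loc: selects by label; with the default index the labels are integers,
# but the slice end IS included, so this returns two rows.
print(df.loc[0:1, "species"].tolist())  # ['Adelie', 'Adelie']
```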
This is helpful if we know the exact numerical position of our data, but oftentimes we do not. Sometimes we need to sift through an entire data set just to find a few rows matching specific conditions we are interested in. How do we do this without manually scrolling through the entire data set in Excel? Since that is nearly impossible with large data sets, we are lucky to have a tool for conditional selection in Python: we use the square brackets once again, but this time we include a boolean condition inside the brackets.
adelies = penguins[penguins["species"] == "Adelie"]
adelies["species"].describe() # The only value is "Adelie" now!
count 146
unique 1
top Adelie
freq 146
Name: species, dtype: object
As you can see, this refined data frame only includes the Adelie penguins. This subset behaves just like any other data frame, so we can call all the same functions on it. We can now subset even further to conduct additional analyses.
mean_adelie_bill_size = adelies["bill_length_mm"].mean()
small_billed_adelies = adelies[adelies["bill_length_mm"] < mean_adelie_bill_size]
small_billed_adelies.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | yr | above_avg_mass_bool | |
---|---|---|---|---|---|---|---|---|---|
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | False |
13 | Adelie | Torgersen | 38.6 | 21.2 | 191.0 | 3800.0 | male | 2007 | False |
14 | Adelie | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | male | 2007 | True |
15 | Adelie | Torgersen | 36.6 | 17.8 | 185.0 | 3700.0 | female | 2007 | False |
16 | Adelie | Torgersen | 38.7 | 19.0 | 195.0 | 3450.0 | female | 2007 | False |
Notice how here we used the comparison operator `<` for numerical values, which works just as easily as comparing strings. We can also filter on multiple conditions at once using the `&` and `|` operators, which correspond to “and” and “or” respectively. For example, what if we only wanted to analyze data for the female Adelie penguins? We can do so below.
female_adelies = penguins[(penguins["species"] == "Adelie") & (penguins["sex"] == "female")]
female_adelies.head()
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | yr | above_avg_mass_bool | |
---|---|---|---|---|---|---|---|---|---|
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 | False |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 | False |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 | False |
6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female | 2007 | False |
12 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | female | 2007 | False |
A small note: we actually need the parentheses around each boolean condition here so the order of operations works out. If you get an error in your code when trying this, that might be why!
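When a filter is really a membership test, `Series.isin` can replace a chain of `|` conditions with a single call. A quick sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"]})

# isin keeps every row whose value appears in the given list — tidier than
# (df["species"] == "Adelie") | (df["species"] == "Gentoo").
two_species = df[df["species"].isin(["Adelie", "Gentoo"])]
print(len(two_species))  # 3
```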
One last helpful function is `append`. This function adds the rows of one data frame onto another, assuming both data frames have the same columns. Let’s pretend we didn’t get the penguin data all at once, and instead got it in species groups. Maybe we want to bind all the data frames together so the data is all in one place. We can do that below:
chinstraps = penguins[penguins["species"] == "Chinstrap"]
gentoos = penguins[penguins["species"] == "Gentoo"]
#Pretend you didn't see these! :)
penguins_bound = adelies.append(chinstraps.append(gentoos))
penguins_bound.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 275
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 333 non-null object
1 island 333 non-null object
2 bill_length_mm 333 non-null float64
3 bill_depth_mm 333 non-null float64
4 flipper_length_mm 333 non-null float64
5 body_mass_g 333 non-null float64
6 sex 333 non-null object
7 yr 333 non-null int64
8 above_avg_mass_bool 333 non-null bool
dtypes: bool(1), float64(4), int64(1), object(3)
memory usage: 23.7+ KB
Using the `append` function inside itself is a bit of a slick programming maneuver, but it’s completely allowed.
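One caveat worth knowing: `DataFrame.append` was deprecated and later removed in pandas 2.0, so on a recent install the equivalent operation is `pd.concat`, which takes a list of frames and therefore needs no nesting. A minimal sketch on toy frames:

```python
import pandas as pd

a = pd.DataFrame({"species": ["Adelie"], "yr": [2007]})
b = pd.DataFrame({"species": ["Gentoo"], "yr": [2008]})

# concat stacks any number of frames in one call; ignore_index=True
# rebuilds the row index from 0 instead of keeping duplicate labels.
combined = pd.concat([a, b], ignore_index=True)
print(len(combined))  # 2
```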
`pandas` has a ton of useful functions, and while what we’ve covered here may seem a bit overwhelming, there is in fact even more that we chose not to get into. In all honesty, you may never need some of the things covered here, but in our eyes the most important thing is being familiar with the basics of the library and understanding how to learn about the other functions yourself. That being said, there are some great resources to aid your `pandas` introduction; this website does a particularly nice job of summarizing all the important points, but a simple Google search of “python pandas tutorial” will return a vast array of helpful results.