Discovering Data Magic: A Guide to Pandas for Beginners

- March 28, 2024

Pandas Primer

What kind of data does pandas handle?

Pandas, a Python library, excels in managing tabular data like spreadsheets or database tables. It offers a DataFrame object for easy exploration, cleaning, and processing of various data formats, from CSVs to SQL tables, making it an indispensable tool for data analysis.

I want to start using pandas!!!

import pandas as pd alias for pandas is pd

pandas data table representation

A Data Frame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the `data.frame` in R.

To manually store data in a table, create `DataFrame`. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the `DataFrame`.

Each column in a `DataFrame` is a `Series`

I’m just interested in working with the data in the column Age

dataframe_try["Age"]

Do something with a DataFrame or Series

We can find maximum and minimum on the DataFrame by selecting the Age column and applying max() and min() on the series

I’m interested in some basic statistics of the numerical data of my data table

The describe() method provides a quick overview of the numerical data in a DataFrame

How do I read and write tabular data?

pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_.

A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:

Whereas read_ functions are used to read data to pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.

The equivalent read function read_excel() will reload the data to a DataFrame:

The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:

It is indeed a DataFrame.
There are 2000 entries, i.e. 2000 rows.
Each row has a row label (aka the index) with values ranging from 0 to 1999.
The table has 17 columns. Most columns have a value for each of the rows (all 2000 values are non-null). Some columns do have missing values and less than 2000 non-null values.
The columns Name, Gender, the other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).
The kind of data (characters, integers…) in the different columns are summarized by listing the dtypes.
The approximate amount of RAM used to hold the DataFrame is provided as well.
How do I select a subset of a DataFrame?
How do I select specific columns from a DataFrame?
To select a single column, use square brackets [] with the column name of the column of interest.

Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series. We can verify this by checking the type of the output:

And have a look at the shape of the output:

DataFrame.shape is an attribute (remember tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.

To select multiple columns, use a list of column names within the selection brackets [].

The returned data type is a pandas DataFrame

The selection returned a DataFrame with 2000 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

How do I filter specific rows from a `DataFrame`?

To select rows based on a conditional expression, use a condition inside the selection brackets `[]`.
The condition inside the selection brackets Student`["Age"] > 17` checks for which rows the `Age` column has a value larger than 17
/

The output of the conditional expression (`>`, but also `==`, `!=`, `<`, `<=`,… would work) is actually a pandas `Series` of boolean values (either `True` or `False`) with the same number of rows as the original `DataFrame`. Such a `Series` of boolean values can be used to filter the `DataFrame` by putting it in between the selection brackets `[]`. Only rows for which the value is `True` will be selected.
We know from before that the original student-score`DataFrame` consists of 2000
rows. Let’s have a look at the number of rows which satisfy the condition by checking the `shape` attribute of the resulting `DataFrame`:

Similar to the conditional expression, the `isin()` conditional function returns a `True` for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets `[]`. In this case, the condition inside the selection brackets `titanic["Pclass"].isin([2, 3])` checks for which rows the `Pclass` column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 1 or 3 and combining the two statements with an `|` (or) operator:

The `notna()` conditional function returns a `True` for each row the values are not a `Null` value. As such, this can be combined with the selection brackets `[]` to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:

How do I select specific rows and columns from a `DataFrame`?

In this case, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. The `loc`/`iloc` operators are required in front of the selection brackets `[]`. When using `loc`/`iloc`, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the `loc` operator in front of the selection brackets `[]`. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.

Again, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`.
When selecting specific rows and/or columns with `loc` or `iloc`, new values can be assigned to the selected data. For example, to assign the name `anonymous` to the first 3 elements of the fourth column:

----------Await the dawn, the next chapter near----------

Comments