Discovering Data Magic: A Guide to Pandas for Beginners
Pandas Primer
What kind of data does pandas handle?
pandas data table representation
A Data Frame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame
in R.
data.frame
in R.To manually store data in a table, create DataFrame
. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame
.
DataFrame
. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame
.
I’m just interested in working with the data in the column Age
dataframe_try["Age"]
Do something with a DataFrame or Series
We can find maximum and minimum on the DataFrame by selecting the Age column and applying max() and min() on the series
I’m interested in some basic statistics of the numerical data of my data table
The describe() method provides a quick overview of the numerical data in a DataFrame
How do I read and write tabular data?
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:
Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.
The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:
- It is indeed a DataFrame.
- There are 2000 entries, i.e. 2000 rows.
- Each row has a row label (aka the index) with values ranging from 0 to 1999.
- The table has 17 columns. Most columns have a value for each of the rows (all 2000 values are non-null). Some columns do have missing values and less than 2000 non-null values.
- The columns Name, Gender, the other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).
- The kind of data (characters, integers…) in the different columns are summarized by listing the dtypes.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
How do I select a subset of a DataFrame?
How do I select specific columns from a DataFrame?
To select a single column, use square brackets [] with the column name of the column of interest.
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series. We can verify this by checking the type of the output:
And have a look at the shape of the output:
DataFrame.shape is an attribute (remember tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.
To select multiple columns, use a list of column names within the selection brackets [].
The returned data type is a pandas DataFrame
The selection returned a DataFrame with 2000 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.
How do I filter specific rows from a DataFrame
?
- To select rows based on a conditional expression, use a condition inside the selection brackets
[]
. - The condition inside the selection brackets Student
["Age"] > 17
checks for which rows the Age
column has a value larger than 17
/
The output of the conditional expression (>
, but also ==
, !=
, <
, <=
,… would work) is actually a pandas Series
of boolean values (either True
or False
) with the same number of rows as the original DataFrame
. Such a Series
of boolean values can be used to filter the DataFrame
by putting it in between the selection brackets []
. Only rows for which the value is True
will be selected.
We know from before that the original student-scoreDataFrame
consists of 2000
rows. Let’s have a look at the number of rows which satisfy the condition by checking the shape
attribute of the resulting DataFrame
:
Similar to the conditional expression, the isin()
conditional function returns a True
for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []
. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3])
checks for which rows the Pclass
column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 1 or 3 and combining the two statements with an |
(or) operator:
The notna()
conditional function returns a True
for each row the values are not a Null
value. As such, this can be combined with the selection brackets []
to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
How do I select specific rows and columns from a DataFrame
?
In this case, a subset of both rows and columns is made in one go and just using selection brackets []
is not sufficient anymore. The loc
/iloc
operators are required in front of the selection brackets []
. When using loc
/iloc
, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc
operator in front of the selection brackets []
. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
Do something with a DataFrame or Series
How do I read and write tabular data?
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:
Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.
- It is indeed a DataFrame.
- There are 2000 entries, i.e. 2000 rows.
- Each row has a row label (aka the index) with values ranging from 0 to 1999.
- The table has 17 columns. Most columns have a value for each of the rows (all 2000 values are non-null). Some columns do have missing values and less than 2000 non-null values.
- The columns Name, Gender, the other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).
- The kind of data (characters, integers…) in the different columns are summarized by listing the dtypes.
- The approximate amount of RAM used to hold the DataFrame is provided as well.
How do I select a subset of a DataFrame?
How do I select specific columns from a DataFrame?
How do I filter specific rows from a DataFrame
?
- To select rows based on a conditional expression, use a condition inside the selection brackets
[]
. - The condition inside the selection brackets Student
["Age"] > 17
checks for which rows theAge
column has a value larger than 17
/
The output of the conditional expression (>
, but also ==
, !=
, <
, <=
,… would work) is actually a pandas Series
of boolean values (either True
or False
) with the same number of rows as the original DataFrame
. Such a Series
of boolean values can be used to filter the DataFrame
by putting it in between the selection brackets []
. Only rows for which the value is True
will be selected.
DataFrame
consists of 2000rows. Let’s have a look at the number of rows which satisfy the condition by checking the
shape
attribute of the resulting DataFrame
:isin()
conditional function returns a True
for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []
. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3])
checks for which rows the Pclass
column is either 2 or 3.|
(or) operator:The
notna()
conditional function returns aTrue
for each row the values are not aNull
value. As such, this can be combined with the selection brackets[]
to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
How do I select specific rows and columns from a DataFrame
?
In this case, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. Theloc
/iloc
operators are required in front of the selection brackets[]
. When usingloc
/iloc
, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc
operator in front of the selection brackets []
. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
Again, a subset of both rows and columns is made in one go and just using selection brackets []
is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc
operator in front of the selection brackets []
.
When selecting specific rows and/or columns with loc
or iloc
, new values can be assigned to the selected data. For example, to assign the name anonymous
to the first 3 elements of the fourth column:
----------Await the dawn, the next chapter near----------
Again, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use theiloc
operator in front of the selection brackets[]
.
When selecting specific rows and/or columns with loc
or iloc
, new values can be assigned to the selected data. For example, to assign the name anonymous
to the first 3 elements of the fourth column:
Comments
Post a Comment