Discovering Data Magic: A Guide to Pandas for Beginners

 Pandas Primer

What kind of data does pandas handle?

Pandas, a Python library, excels in managing tabular data like spreadsheets or database tables. It offers a DataFrame object for easy exploration, cleaning, and processing of various data formats, from CSVs to SQL tables, making it an indispensable tool for data analysis.

I want to start using pandas!!!

 import pandas as pd                  alias for pandas is pd 


pandas data table representation



A Data Frame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

To manually store data in a table, create DataFrame. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.


   



Each column in a DataFrame is a Series


I’m just interested in working with the data in the column Age

dataframe_try["Age"]


Do something with a DataFrame or Series


We can find maximum and minimum on the DataFrame by selecting the Age column and applying max() and min() on the series


I’m interested in some basic statistics of the numerical data of my data table



The describe() method provides a quick overview of the numerical data in a DataFrame


How do I read and write tabular data?



pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.


A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:



Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.


The equivalent read function read_excel() will reload the data to a DataFrame:




The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:


  • It is indeed a DataFrame.
  • There are 2000 entries, i.e. 2000 rows.
  • Each row has a row label (aka the index) with values ranging from 0 to 1999.
  • The table has 17 columns. Most columns have a value for each of the rows (all 2000 values are non-null). Some columns do have missing values and less than 2000 non-null values.
  • The columns Name, Gender, the other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).
  • The kind of data (characters, integers…) in the different columns are summarized by listing the dtypes.
  • The approximate amount of RAM used to hold the DataFrame is provided as well.

How do I select a subset of a DataFrame?

How do I select specific columns from a DataFrame?

To select a single column, use square brackets [] with the column name of the column of interest.



Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series. We can verify this by checking the type of the output:


And have a look at the shape of the output:


DataFrame.shape is an attribute (remember tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.


To select multiple columns, use a list of column names within the selection brackets [].


 The returned data type is a pandas DataFrame




The selection returned a DataFrame with 2000 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

How do I filter specific rows from a DataFrame?



  • To select rows based on a conditional expression, use a condition inside the selection brackets [].
  •  The condition inside the selection brackets Student["Age"] > 17 checks for which rows         the Age column has a value larger than 17 

/


The output of the conditional expression (>, but also ==!=<<=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.

We know from before that the original student-scoreDataFrame consists of 2000
rows. Let’s have a look at the number of rows which satisfy the condition by checking the 
shape attribute of the resulting DataFrame:







Similar to the conditional expression, the isin() conditional function returns a True for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 1 or 3 and combining the two statements with an | (or) operator:




  • The notna() conditional function returns a True for each row the values are not a Null value. As such, this can be combined with the selection brackets [] to filter the data table.

You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:



How do I select specific rows and columns from a DataFrame?



  • In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

   When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.




  • Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the fourth column:



----------Await the dawn, the next chapter near----------

































Comments