Hedaro Blog

How to Create a Pandas DataFrame

The most important data structure is the Pandas DataFrame (notice the Camel Case, more on this later). It will also be one of the most commonly used terms when dealing with this library. At a high level, we as analysts, as developers, need to get our data inside a dataframe.

It is when we get our data inside this data structure that we will be able to harness the power of Pandas.

The steps below are meant for someone relatively new to the Pandas world. They show you a few ways to quickly create dataframes. Get your coffee ready.

In [1]:
# import libraries
import pandas as pd
import sys

I like to start by sharing the versions of Python and the relevant libraries I will be using for this tutorial. Different versions of the same library may behave differently, so to avoid issues related to library or Python versions, I make it very clear what I used in this notebook. Instead of banging your head for hours, you can test whether a different version of Pandas or Python is causing issues for you.

In [2]:
print('Python: ' + sys.version.split('|')[0])
print('Pandas: ' + pd.__version__)
Python: 3.5.1 
Pandas: 0.23.4

This tutorial is also available in video form. I try to go into more detail in the notebook, but the video is worth watching.

In [3]:
from IPython.display import YouTubeVideo

Python Lists

Python lists are very commonly used. They are essentially arrays that can hold any kind of data. We can put strings or numbers inside lists. Let's make a simple one and see how they work.

We start by creating a variable named d. Lists start and end with square brackets. If you see square brackets, you are most likely dealing with a list. Inside our list, we placed 4 numbers. Pretty easy.

In [4]:
d = [0,1,2,3]
d
[0, 1, 2, 3]

We can be extra sure by asking Python what kind of object the variable d is. We can also get the length of the list.

In [5]:
print(type(d))
print(len(d))
<class 'list'>
4

List to Dataframe

So how do we get this list into a dataframe? Like I mentioned earlier, if you cannot get your data into a dataframe, then there isn't much Pandas can do for you.

We start by creating a variable named df. In many examples, df is a very common way to name your variables that hold a dataframe.

And if you haven't figured it out, df is short for dataframe.

IMPORTANT: Note that DataFrame is written in camel case, with a capital D and a capital F. Knowing this may save you some frustration.

The key parameter is called data and this is where you are going to place the list we created a few seconds ago. After this is done, all you have to do is print the dataframe.

In [6]:
df = pd.DataFrame(data=d)
df
   0
0  0
1  1
2  2
3  3
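As a quick illustration of the camel case warning above, here is a minimal sketch of what happens when the name is typed in all lowercase. The misspelling pd.dataframe is deliberate and is not a real Pandas name. The exact error text may vary between Pandas versions.

```python
import pandas as pd

d = [0, 1, 2, 3]

# The correctly capitalized name builds the dataframe
df = pd.DataFrame(data=d)
print(df.shape)

# The all-lowercase spelling is not defined in the pandas namespace,
# so Python raises an AttributeError
try:
    pd.dataframe(data=d)
except AttributeError as err:
    print('AttributeError:', err)
```

If you ever see an AttributeError mentioning dataframe, check your capitalization first.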

Labeling DataFrame Columns

Did you notice an issue we have with our dataframe? Yes, the column name is zero. We only have one column. The other column with no column name is not really a column. This is called the index. It is similar to the row numbers in an Excel file. It is also similar to identity columns in a database table. One thing to keep in mind is that the index values do not have to be unique. This won't come into play in this lesson, but I am sharing it for awareness.

Every dataframe will come with an index
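To make the point about non-unique index values concrete, here is a small sketch. The labels 'a', 'b', and 'c' are made up for illustration, and the index parameter is used here only to show the behavior; it is not needed for the rest of this lesson.

```python
import pandas as pd

# Default index: 0, 1, 2, 3 (unique row labels)
default_df = pd.DataFrame(data=[0, 1, 2, 3])
print(default_df.index.is_unique)

# Hypothetical labels where 'a' is repeated on purpose
dup_df = pd.DataFrame(data=[0, 1, 2, 3], index=['a', 'a', 'b', 'c'])
print(dup_df.index.is_unique)

# Selecting the label 'a' gives back both matching rows
print(dup_df.loc['a'])
```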

The dataframe method has a columns parameter and this is the trick to getting your columns named.

As you can see below, not only does the HTML table look much nicer, but it will also make the readers of your future notebooks very happy. I like to pass a Python list to the columns parameter. If you have more than one column, you can create a list of multiple column names.

In [7]:
df = pd.DataFrame(data=d, columns=['Revenue'])
df
   Revenue
0        0
1        1
2        2
3        3
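For the case of more than one column, a rough sketch might look like the one below. It assumes the data is a list of lists where each inner list is one row; the variable names rows and df_two are only for illustration.

```python
import pandas as pd

# Each inner list is one row: [revenue, cost]
rows = [[5, 5.0],
        [6, 6.1],
        [7, 7.2]]

# One column name per position in each row
df_two = pd.DataFrame(data=rows, columns=['Revenue', 'Cost'])
print(df_two)
```

The number of names in the columns list should match the number of items in each row.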

Python Dictionary

The Python dictionary is another commonly used object. It is not as common as the Python list, but you will see it a lot. The advantage of using a dictionary is that it lets us label our columns ahead of time. This means we can skip the step of setting the columns parameter in Pandas.

If you see curly brackets, then you may be looking at a Python dictionary. After the opening curly bracket, you pass in a string. This string will represent the column name of your dataframe. Then we use a colon, and then I like to pass a list. See how lists are everywhere? We finish things up with the closing curly bracket.

In [8]:
d = {'Revenue':[5,6,7]}
d
{'Revenue': [5, 6, 7]}

For the paranoid like myself, we can check the type and size as shown below. Note that we did not get three for the length, as we were not counting the items in the list but the keys in the dictionary. We only have one column, so the length is one. Get it?

In [9]:
print(type(d))
print(len(d))
<class 'dict'>
1

Dict to Dataframe

Luckily, we have already done most of the steps needed to get a dict into a dataframe. It is actually even easier since we are going to ignore the columns parameter. Pandas is smart enough to know the column names are already provided in the Python dictionary.

I didn't mention it before, but pd is the alias for the Pandas library. This alias lets us reach into the Pandas library and gives us access to all the methods and functions Pandas has to offer.

In [10]:
df = pd.DataFrame(d)
df
   Revenue
0        5
1        6
2        7

What about dictionaries with multiple columns? Can we do that? Yes and yes! All we have to do is separate each element in the dictionary with a comma. Remember to follow the same format: name of column, then colon, then a Python list.

In [11]:
d = {'Revenue':[5,6,7],
     'Cost':[5.0,6.1,7.2]}
d
{'Cost': [5.0, 6.1, 7.2], 'Revenue': [5, 6, 7]}

Did you notice that? When I started creating dataframes, I noticed an odd default behavior with the ordering of the columns. The order I placed columns in the Python dictionary did not always match with the dataframe column order. This was a bit annoying but it's something you are going to have to work with.

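The good news is that if you have Python version 3.6+ and Pandas version 0.23.0 or later, this is fixed. The output above came from Python 3.5.1, which is why the columns did not keep the order I typed them in. Below is taken from the Pandas website: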

Until Python 3.6, dicts in Python had no formally defined ordering. For Python version 3.6 and later, dicts are ordered by insertion order, see PEP 468. Pandas will use the dict’s insertion order, when creating a Series or DataFrame from a dict and you’re using Python version 3.6 or higher.

In [12]:
pd.DataFrame(d)
   Cost  Revenue
0   5.0        5
1   6.1        6
2   7.2        7

A workaround is to use the columns parameter and force your columns to be ordered a certain way. I know I mentioned we did not need to use the columns parameter with dictionaries, but I guess I lied. If order matters to you, then you might as well use it.

In [13]:
pd.DataFrame(d, columns=['Revenue','Cost'])
   Revenue  Cost
0        5   5.0
1        6   6.1
2        7   7.2
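If you want to double check the column order without eyeballing the table, a small sketch like this one could help. It simply converts the column labels to a plain list; the variable name ordered is only for illustration.

```python
import pandas as pd

d = {'Revenue': [5, 6, 7],
     'Cost': [5.0, 6.1, 7.2]}

# Passing columns explicitly fixes the order, whatever the Python version
ordered = pd.DataFrame(d, columns=['Cost', 'Revenue'])

# Turn the column labels into a list to inspect the order
print(list(ordered.columns))
```

The list printed here follows the order given in the columns parameter, not the order of the keys in the dictionary.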

This tutorial was created by HEDARO