Solving real world data science tasks with Python Pandas!



Follow me for more tech content!

Practice your Python Pandas data science skills with problems on StrataScratch!

In this video we use Python Pandas & Python Matplotlib to analyze and answer business questions about 12 months worth of sales data. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

Setup!
Github source code & data:
Installing Jupyter Notebook:
Installing Pandas library:

Check out the first video I did on Pandas:

Check out the videos I did on Matplotlib:

Detailed video description! (timeline can be found in comments)

We start by cleaning our data. Tasks during this section include:
– Drop NaN values from DataFrame
– Removing rows based on a condition
– Change the type of columns (to_numeric, to_datetime, astype)

Once we have cleaned up our data a bit, we move the data exploration section. In this section we explore 5 high level business questions related to our data:
– What was the best month for sales? How much was earned that month?
– What city sold the most product?
– What time should we display advertisemens to maximize the likelihood of customer’s buying product?
– What products are most often sold together?
– What product sold the most? Why do you think it sold the most?

To answer these questions we walk through many different pandas & matplotlib methods. They include:
– Concatenating multiple csvs together to create a new DataFrame (pd.concat)
– Adding columns
– Parsing cells as strings to make new columns (.str)
– Using the .apply() method
– Using groupby to perform aggregate analysis
– Plotting bar charts and lines graphs to visualize our results
– Labeling our graphs

If you enjoy this video, make sure to leave it a like and subscribe to not miss any future similar tutorials :).

Check out the new “solving real world data science tasks” video I posted!

———————————————

Follow me on social media!
Instagram |
Twitter |

———————————————

Video Timeline!
0:00 – Intro
1:22 – Downloading the Data
2:57 – Getting started with the code (Jupyter Notebook)

Task #1: Merging 12 csvs into a single dataframe (3:35)
4:25 – Read single CSV file
5:44 – List all files in a directory
7:06 – Concatenating files
11:00 – Reading in Updated dataframe

Task #2: Add a Month column (12:48)
14:12 – Parse string in Pandas cell (.str)

Cleaning our data!
17:31 – Drop NaN values from df
21:25 – Remove rows based on condition

Task #3: Add a sales column (24:58)
25:58 – Another way to convert a column to numeric (ints & floats)

Question #1: What was the best month for sales? (29:20)
30:35 – Visualizing our results with bar chart in matplotlib

Question #2: What city sold the most product? (34:17)
35:32 – Add a city column
36:10 – Using the .apply() method (super useful!!)
40:35 – Why do we use the lambda x ?
40:57 – Dropping a column
46:45 – Answering the question (using groupby)
47:34 – Plotting our results

Question #3: What time should we display advertisements to maximize the likelihood of purchases? (52:13)
53:16 – Using to_datetime() method
56:01 – Creating hour & minute columns
58:17 – Matplotlib line graph to plot our results
1:00:15 – Interpreting our results

Question #4: What products are most often sold together? (1:02:17)
1:03:31 – Finding duplicate values in our DataFrame
1:05:43 – Use transform() method to join values from two rows into a single row
1:08:00 – Dropping rows with duplicate values
1:09:39 – Counting pairs of products (itertools, collections)

Question #5: What product sold the most? Why do you think it did? (1:14:04)
1:15:28 – Graphing data
1:18:41 – Overlaying a second Y-axis on existing chart
1:23:41 – Interpreting our results

———————
If you are curious to learn how I make my tutorials, check out this video:

Join the Python Army to get access to perks!
YouTube –
Patreon –

*I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.

source

29 thoughts on “Solving real world data science tasks with Python Pandas!”
  1. hours = [hour for hour, df in sales.groupby('Hour')]

    plt.plot(hours, sales.groupby(['Hour']).count())
    plt.xticks(hours)
    plt.xlabel('Hour')
    plt.ylabel('Number of Orders')
    plt.grid()
    plt.show()

    The code above is giving me an error. Can someone help please.
    InvalidIndexError: (slice(None, None, None), None)

  2. uh, the data I got had extra columns and I had to use a lambda function in order to remove the extra header columns baked into the data. otherwise I couldnt force columns to be of type int or whatever. so

    for file in files:

    df = pd.read_csv("./SalesAnalysis/Sales_Data/"+file, skip_blank_lines=True)

    df.dropna(how="all", inplace=True)

    df = df[df['Order ID'].apply(lambda x: str(x).isdigit())]

    allmonthsdata = pd.concat([allmonthsdata, df])

  3. I get the feeling in this video that you know more than you're letting on but you're just trying to make things as basic as possible and I love it. I hope to teach others in this same manner. God bless you

  4. I did the following at 48:58:
    sales_grpd_by_city = all_data.groupby("City").sum()
    plt.barh(sales_grpd_by_city.index, sales_grpd_by_city["Total Sales"])

    Thanks a lot Keith, your videos have really helped me. Wairimu, from Kenya

  5. How can we plot the bar chart you visualized using seaborn ?
    sns.barplot( months,results,data =all_months_data) #this lines showing attribute error
    sns.barplot( 'Month',y,data =all_months_data) # this prints a weird graph

  6. i know its been more than 2 years but i got a problem. when i extract the files, it says corrupted. i really want to watch the whole video along with practicing please help 🙁

  7. For the 4rth question

    df['grouped'].value_counts()[:12].plot(kind='barh')

    can i do it like this? because you have introduced a library, so I just took my own way to solve it and don't know if gives the exact answrer

  8. I'm having issues with the default folder to use for the work. I got as far as the 'new_data.csv', but funny it was created in the default directory under :C/ drive and not in the special folder I created for the project. Can you expatiate on how to use a directory for one's project on Notebook and to change the default directory? Thank you.

  9. Great video, thanks for sharing..In splitting city, lambda function is not needed….try this::

    df['City'] = df['Purchase Address'].str.split(",",expand=True)[1]

Leave a Reply

Your email address will not be published.

Captcha loading...