A Free 17-Hour Course to Learn Python for Data Science (for Beginners)
Learn data science with Python from scratch.
Learning Python is a fundamental first step in becoming a data scientist. Python’s simplicity and ecosystem have made it the most popular programming language for data science.
That’s why I created a free 17-hour video course to learn it from scratch!
This complete course for absolute beginners goes from learning basic Python to applying machine learning on a text classification problem. Each topic builds on the previous ones, reflecting the real workflow of data science projects.
Let’s see each module in more detail!
Module 1: Python Basics for Data Science
For those new to programming, a crash course in core Python is crucial before diving into data science tools.
In this introductory module, we’ll learn basic Python syntax, data types (numbers, strings, lists, dictionaries), control structures (loops and conditionals), and functions. Mastering these basics will provide a strong foundation to write and understand code for data analysis tasks.
What does coding in Python look like?
# Basic Python operations
x = 5
y = 3
print("Sum:", x + y)
# Looping through a list
fruits = ["Apple", "Banana", "Cherry"]
for fruit in fruits:
print("I like", fruit)
Here’s the link to watch the first module of this course:
Link: Module 1 - Python Basics for Data Science
Module 2: Introduction to Pandas and NumPy
With core Python skills in place, the next step is learning NumPy and pandas, two libraries that form the backbone of data science in Python. These libraries provide high-performance data structures and functions that make it vastly easier to work with data.
Why are Pandas and NumPy essential? Together, they allow data scientists to load, manipulate, and analyze data efficiently. For example, with pandas one can read a CSV file into a DataFrame in one line of code and then easily filter or summarize it. NumPy provides the numerical foundation, enabling operations like vectorized calculations (applying a formula to an entire array) that are both convenient and fast.
The snippet below shows how one might use NumPy and pandas in practice. We create a NumPy array and compute a statistic, and we create a pandas DataFrame to perform an operation on a column:
import numpy as np
import pandas as pd
# Using NumPy for numerical computations
data = np.array([1, 2, 3, 4])
print("NumPy mean:", data.mean())
# Using pandas for tabular data manipulation
df = pd.DataFrame({'A': [10, 20, 30],
                   'B': [2, 4, 6]})
print("DataFrame sum of A:", df['A'].sum())
Here’s the link to watch the second module of this course:
Link: Module 2 - Introduction to Pandas and NumPy
Module 3: Web Scraping with Pandas (Project #1)
The goal of this first project is to learn how to gather data from the web and import it into Python for analysis using pandas.
In many real-world situations, data is not neatly provided as a file – it might be embedded in web pages (HTML tables, lists, etc.) or available through web APIs. Web scraping is the technique of extracting data from websites, and it’s an essential skill for data scientists to build their own datasets. This project teaches beginners how to perform basic web scraping and load the data into a pandas DataFrame, which they can then analyze.
Pandas provides convenient functions for simple scraping tasks. For example, if a Wikipedia page contains a table, read_html can often fetch it directly. For more complex web scraping tasks – such as handling JavaScript or scraping non-tabular data – we would need more powerful libraries like Selenium or Scrapy. That said, pandas will give you a gentle introduction to web scraping.
Web scraping with pandas is sometimes as easy as this:
import pandas as pd
url = "site link"
tables = pd.read_html(url)
df = tables[0]
print(df.head())
Here’s the link to watch the third module of this course:
Link: Module 3 - Web Scraping with Pandas (Project #1)
Module 4: Filtering Data & Data Extraction
Once data is loaded into a pandas DataFrame, a fundamental operation is filtering the data. “Filtering” means selecting only those subsets of the data (rows or columns) that meet certain criteria.
Why is filtering essential? Real datasets often contain more information than you need for a particular analysis. For example, if you have a dataset of worldwide economic indicators but you are only studying data for a specific country, you would filter the dataset to that country. Similarly, if you have collected time-series data for a range of dates but only need the last year of data, filtering by date is necessary. Also, in data cleaning (a module we’ll see later), filtering is used to remove outliers or invalid entries.
Here’s what filtering data looks like:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cathy'],
                   'Age': [24, 27, 19]})
adults = df[df['Age'] > 20] # filter rows where Age > 20
print(adults)
Here’s the link to watch the fourth module of this course:
Link: Module 4 - Filtering Data & Data Extraction
Module 5: Making Data Visualizations (Project #2)
Data visualization is an important part of data science. It allows us to translate complex datasets into visual representations that are easier for the human brain to understand. By plotting data, we can quickly identify patterns, trends, and outliers that might not be obvious from raw numbers. Visualization is also key to communicating results to others. A well-crafted chart can convey insights that would be hard to glean from tables.
Python offers different libraries for data visualization, such as Matplotlib and Seaborn. That said, probably the easiest way to make visualizations with Python is using the pandas library.
In this second project, we’ll use pandas to make visualizations such as barplots, histograms, scatterplots, and more.
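As a quick taste, here’s a minimal sketch of a barplot built straight from a DataFrame (the data is invented for illustration):
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales data, just to have something to plot
df = pd.DataFrame({'Product': ['A', 'B', 'C'],
                   'Sales': [120, 90, 150]})

# pandas plotting delegates to Matplotlib under the hood
df.plot(kind='bar', x='Product', y='Sales', title='Sales by Product')
plt.show()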
Link: Module 5 - Making Data Visualizations (Project #2)
Module 6: GroupBy, Aggregate Function, and Concatenation
Many analytical questions are of the form “what is the value by some category?” For example: “What is the average income by country?”, “How many transactions per day of the week?”, or “What is the total sales for each product category?”
GroupBy allows you to answer these by aggregating data within each subgroup.
Common aggregation methods include .sum(), .mean(), and .count(), among others. For example, if you want to get the average income for each country present in a DataFrame, you’d do something like this:
df.groupby('Country')['Income'].mean()
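Here’s a small self-contained sketch of that call in action, using made-up numbers:
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'France', 'Spain'],
                   'Income': [30000, 34000, 28000]})

# Average income per country (returns a Series indexed by Country)
print(df.groupby('Country')['Income'].mean())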
Here’s the link to watch the sixth module of this course:
Link: Module 6 - GroupBy and Aggregate Function
Module 7: Regular Expressions
Data scientists often work with textual data, whether it’s extracting specific information from strings or cleaning and preprocessing raw text for analysis. Regular expressions (regex) are a powerful tool for such tasks.
A regular expression is essentially a pattern that describes a set of strings. With regex, you can test if a string matches a pattern, find all substrings that match, or replace parts of a string that match a pattern.
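For instance, Python’s built-in re module covers all three operations. A minimal sketch, with made-up strings:
import re

text = "Order #123 shipped, order #456 pending"
print(re.findall(r'#\d+', text))           # find all matches: ['#123', '#456']
print(re.sub(r'#\d+', '#XXX', text))       # replace matches: 'Order #XXX shipped, order #XXX pending'
print(bool(re.search(r'pending', text)))   # test whether the pattern occurs: True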
Here’s an example of how to use regex to filter data inside a DataFrame:
import pandas as pd
# Small dataset
data = {
    'Name': ['Alan', 'Bob', 'Ann'],
    'Age': [25, 22, 27]
}
df = pd.DataFrame(data)
# Using regex to filter names that start with 'A' and end with 'n'
filtered_df = df[df['Name'].str.contains(r'^A.*n$', regex=True)]
print(filtered_df)
Here’s the link to watch the seventh module of this course:
Link: Module 7 - Regular Expressions
Module 8: Data Cleaning with Pandas (Project #3)
“Garbage in, garbage out” is a saying that highlights the importance of data cleaning.
Real-world data is often noisy, incomplete, or inconsistent. Before any analysis or modeling, this raw data needs to be cleaned and structured properly. Data professionals spend a large portion of their time on data cleaning – by some estimates, around 80% of a data scientist’s time is spent preparing and cleaning data. Data cleaning is thus one of the essential steps in the data science process, as the quality of insights you can draw is directly linked to the quality of your data.
In this third project, we will use pandas to perform common data cleaning tasks on a messy dataset.
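To give a flavour of what the project covers, here’s a minimal sketch of typical cleaning steps (the columns and values are invented for illustration):
import pandas as pd

# A deliberately messy toy dataset
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob', None],
                   'age': ['24', '27', '27', 'unknown']})

df = df.drop_duplicates()        # remove duplicate rows
df = df.dropna(subset=['name'])  # drop rows with a missing name
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # non-numeric ages become NaN
print(df)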
Link: Module 8 - Data Cleaning with Pandas (Project #3)
Module 9: Machine Learning with Python
After cleaning and exploring data, the next major step in a data science journey is often to build predictive models – in other words, to apply machine learning (ML).
Machine learning allows data scientists to extract patterns from data and make predictions or decisions without being explicitly programmed with rules. In practical terms, this is how we go from analyzing what has happened (descriptive analytics) to predicting what might happen (predictive analytics) or making automated decisions (like classifying an email as spam or not spam). Understanding ML opens up a huge range of applications: you can train a model to predict housing prices from past data, classify images, cluster customers into groups, etc.
In the video below, we’ll learn the core concepts of machine learning while building a linear regression model in Python. We’ll use two foundational machine learning libraries: statsmodels and scikit-learn.
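As a preview, here’s a minimal sketch of fitting a linear regression with scikit-learn, on toy data invented for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: predict y from a single feature X
X = np.array([[1], [2], [3], [4]])
y = np.array([2.1, 4.2, 5.9, 8.1])

model = LinearRegression().fit(X, y)
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Prediction for x=5:", model.predict([[5]])[0])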
Link: Module 9 - Machine Learning with Python
Module 10: Text Classification (Project #4)
The goal of this last project is to create a machine learning model that will predict whether a movie review is positive or negative. This is known as binary text classification and will help us explore the scikit-learn library while building a basic machine learning model from scratch.
Text classification is a supervised learning task where the input features are not numeric measurements but text strings. Thus, it introduces the concept of feature extraction from text. Typically, one must convert text into a numerical representation before feeding it to a machine learning model. The most common approach is the Bag-of-Words model or its variant TF-IDF (see the sketch after this list):
Bag-of-Words: We create features from word occurrences: for each document (text example), we count how often each word from a vocabulary appears. This transforms the text into a vector of numbers (word counts). For example, if our vocabulary is {"bad", "good", "great", "terrible"}, a review "This product is good and great" might transform into [0, 1, 1, 0].
TF-IDF: It’s a refinement of word counts that downweights common words and upweights rare but potentially informative words.
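Here’s a minimal sketch of Bag-of-Words with scikit-learn’s CountVectorizer, reusing the vocabulary and review from the example above plus one invented negative review:
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["This product is good and great",  # the example review from above
           "This product is terrible"]        # an invented negative review

# Fix the vocabulary so the output matches the example
vectorizer = CountVectorizer(vocabulary=["bad", "good", "great", "terrible"])
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # ['bad' 'good' 'great' 'terrible']
print(X.toarray())                         # [[0 1 1 0]
                                           #  [0 0 0 1]]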
These and more new concepts are covered in the project below.
Link: Module 10 - Text Classification (Project #4)
That’s it! Let your Python for data science journey begin
If you’re eager to embark on your data science journey, don’t miss this opportunity: join the course and take the leap!