1. Pandas Module – Introduction
What is Pandas?
Pandas is an open-source Python library used for data manipulation and analysis. It makes working with structured data (like tables, spreadsheets, or CSV files) fast and easy.
Why use Pandas?
- It handles large datasets efficiently
- It provides tools to clean, filter, sort, and analyze data
- It works well with other libraries like NumPy, Matplotlib, and Scikit-learn
- It reads and writes data from CSV, Excel, JSON, SQL, and more
How to install Pandas:
pip install pandas
How to import Pandas:
import pandas as pd
The alias pd is a widely accepted convention. You’ll see it in nearly every Pandas tutorial and codebase.
2. Series
What is a Series?
A Series is a one-dimensional labeled array. Think of it as a single column in a spreadsheet. It can hold data of any type — integers, strings, floats, booleans, etc.
- Every item in a Series has an index (a label).
- By default, the index starts at 0 and increases by 1.
- A Series is like a Python list, but with a label for each item and fast, vectorized operations.
Key characteristics:
- One column of data
- Each value has a label (index)
- Supports fast operations like filtering, math, and searching
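To make "filtering, math, and searching" concrete, here is a minimal sketch. The prices values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
prices = pd.Series([120, 80, 200, 50])

doubled = prices * 2               # math is applied to every element at once
expensive = prices[prices > 100]   # filtering keeps matching values and their labels
has_80 = prices.isin([80]).any()   # searching: is the value 80 present?

print(doubled.tolist())
print(expensive)
print(has_80)
```

Notice that the filtered result keeps the original labels (0 and 2) instead of renumbering them.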
3. Series – Syntax & Parameters
Basic Syntax:
pandas.Series(data, index, dtype, name, copy)
Parameters explained:
- data: The actual data. Can be a list, dictionary, scalar value, or NumPy array.
- index: Labels for each element. If not provided, defaults to 0, 1, 2, …
- dtype: Data type of the values (e.g., int, float, str). Optional.
- name: A name for the Series. Useful when converting to DataFrame.
- copy: If True, makes a copy of the data. Default is False.
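The parameters above can be combined in a single call. This is a small sketch with made-up values; the names s and filled are just illustrative.

```python
import pandas as pd

# Pass the parameters by keyword so each one is easy to spot
s = pd.Series(
    data=[1, 2, 3],
    index=["a", "b", "c"],   # custom labels instead of 0, 1, 2
    dtype="float64",         # store the values as floats
    name="Values",           # name shown when printing
)
print(s)

# A scalar value is repeated once for every label in the index
filled = pd.Series(7, index=["x", "y"])
print(filled)
```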
4. Series – Example
Example 1: Create a simple Series from a list
import pandas as pd
fruits = ["Apple", "Banana", "Cherry"]
s = pd.Series(fruits)
print(s)
Output:
0 Apple
1 Banana
2 Cherry
dtype: object
Example 2: Create a Series from a dictionary
import pandas as pd
data = {"a": 10, "b": 20, "c": 30}
s = pd.Series(data)
print(s)
Output:
a 10
b 20
c 30
dtype: int64
Example 3: Create a Series with a name
import pandas as pd
s = pd.Series([100, 200, 300], name="Scores")
print(s)
Output:
0 100
1 200
2 300
Name: Scores, dtype: int64
5. Labels
What are Labels?
Labels are the index values assigned to each item in a Series. They allow you to access data by a meaningful name instead of just a number.
- By default, Pandas assigns numeric labels (0, 1, 2, …).
- You can define your own custom labels using the index parameter.
- Labels don’t have to be unique, but it’s a good practice to keep them unique.
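Here is a quick sketch of why unique labels are the safer choice. The names and scores are made up; the point is that looking up a repeated label returns several values instead of one.

```python
import pandas as pd

# Hypothetical data: the label "Alice" appears twice
s = pd.Series([85, 90, 78], index=["Alice", "Bob", "Alice"])

alice = s.loc["Alice"]     # a Series with two rows, not a single number
print(alice)
print(s.index.is_unique)   # False, because "Alice" is repeated
```

Code that expects a single value from s.loc["Alice"] would break here, which is why checking index.is_unique is a handy habit.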
Example: Custom labels
import pandas as pd
scores = [85, 90, 78]
s = pd.Series(scores, index=["Alice", "Bob", "Charlie"])
print(s)
Output:
Alice 85
Bob 90
Charlie 78
dtype: int64
Accessing a value by label:
print(s["Bob"]) # Output: 90
Why labels matter:
- They make your data readable and meaningful
- They allow you to access rows by name, not just position
- They are preserved even when you sort or filter the data
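The last point is easy to check. In this small sketch (same scores as the example above), each value keeps its label after sorting and after filtering.

```python
import pandas as pd

s = pd.Series([85, 90, 78], index=["Alice", "Bob", "Charlie"])

ranked = s.sort_values(ascending=False)   # highest score first
passed = s[s >= 80]                       # keep scores of 80 or more

print(ranked)   # labels travel with their values
print(passed)
```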
6. DataFrames
What is a DataFrame?
A DataFrame is a two-dimensional table — just like a spreadsheet or SQL table. It has rows and columns. Each column in a DataFrame is a Series.
- Rows are identified by the index
- Columns are identified by column names
- A DataFrame can hold different data types in different columns
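You can verify both ideas yourself: selecting one column gives back a Series, and each column carries its own data type. The Height column below is a made-up addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Height": [1.62, 1.80],   # hypothetical extra column
})

ages = df["Age"]              # selecting one column returns a Series
print(type(ages).__name__)
print(df.dtypes)              # one data type per column
```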
Example: Create a DataFrame from a dictionary
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["Chennai", "Mumbai", "Delhi"]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 Chennai
1 Bob 30 Mumbai
2 Charlie 35 Delhi
Key points:
- Each key in the dictionary becomes a column
- The index is automatically 0, 1, 2, …
- You can have as many rows and columns as you need
7. Locate Rows
How to access a specific row:
Pandas provides two main methods:
- loc[]: Access by label (index name)
- iloc[]: Access by integer position (0, 1, 2, …)
Example using loc[]:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35]
}
df = pd.DataFrame(data)
print(df.loc[1])
Output:
Name Bob
Age 30
Name: 1, dtype: object
Example using iloc[]:
print(df.iloc[0])
Output:
Name Alice
Age 25
Name: 0, dtype: object
Use loc when your index has labels (names). Use iloc when you want to use position numbers.
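With the default index, labels and positions are the same numbers, so loc and iloc look identical. The difference shows up once they diverge. This sketch assumes a made-up index of 10, 20, 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],       # hypothetical labels that differ from positions
)

by_label = df.loc[20]         # the row whose label is 20
by_position = df.iloc[0]      # the first row, whatever its label

print(by_label["Name"])
print(by_position["Name"])
```

Here df.loc[0] would raise a KeyError, because no row carries the label 0.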
8. Locate Multiple Rows
You can pass a list of index values to loc[] to get multiple rows at once.
Example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "Diana"],
"Age": [25, 30, 35, 28]
}
df = pd.DataFrame(data)
print(df.loc[[0, 2]])
Output:
Name Age
0 Alice 25
2 Charlie 35
Using a range with iloc[]:
print(df.iloc[1:3])
Output:
Name Age
1 Bob 30
2 Charlie 35
Note: With iloc, the end index is exclusive (like Python slicing). iloc[1:3] returns rows at positions 1 and 2.
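One related detail worth knowing: a slice with loc includes the end label, unlike iloc. A minimal comparison using the same four rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
})

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]       # labels 1, 2 and 3 (end included)

print(len(by_position))
print(len(by_label))
```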
9. Named Index
By default, rows are numbered 0, 1, 2, etc. But you can give them meaningful names using the index parameter or the set_index() method.
Example: Set index during creation
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Score": [85, 90, 78]
}
df = pd.DataFrame(data, index=["Student1", "Student2", "Student3"])
print(df)
Output:
Name Score
Student1 Alice 85
Student2 Bob 90
Student3 Charlie 78
Example: Use set_index() on an existing DataFrame
df2 = df.set_index("Name")
print(df2)
Output:
Score
Name
Alice 85
Bob 90
Charlie 78
Accessing a named row:
print(df.loc["Student2"])
10. Load Files in a DataFrame (CSV)
One of the most common tasks in data analysis is loading a CSV file into a DataFrame.
Syntax:
df = pd.read_csv("filename.csv")
Example:
Suppose you have a file called students.csv:
Name,Age,Grade
Alice,20,A
Bob,22,B
Charlie,21,A
Loading it:
import pandas as pd
df = pd.read_csv("students.csv")
print(df)
Output:
Name Age Grade
0 Alice 20 A
1 Bob 22 B
2 Charlie 21 A
Useful parameters for read_csv():
- sep=",": Specify the delimiter (default is comma)
- header=0: Row number to use as column names
- index_col="Name": Set a column as the index
- nrows=100: Load only the first 100 rows (useful for big files)
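These parameters can be tried without creating a file. The sketch below feeds read_csv an in-memory string through io.StringIO; the semicolon-separated text is a made-up variant of students.csv.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data held in memory instead of a file
csv_text = "Name;Age;Grade\nAlice;20;A\nBob;22;B\nCharlie;21;A\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",            # the delimiter used in this text
    header=0,           # first row holds the column names
    index_col="Name",   # use the Name column as the index
    nrows=2,            # read only the first two data rows
)
print(df)
```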
11. Display Data Using to_string()
When you print a large DataFrame, Pandas may truncate it and show ... in the middle. Use to_string() to print the entire DataFrame without truncation.
Example:
import pandas as pd
df = pd.read_csv("students.csv")
print(df.to_string())
This prints every row and column without cutting anything off.
When to use it:
- When your DataFrame has many rows and you want to see all of them
- When saving the full data to a text file
- During debugging to inspect the complete dataset
Example saving to a file:
with open("output.txt", "w") as f:
f.write(df.to_string())
12. JSON File
Pandas can also load JSON (JavaScript Object Notation) data directly into a DataFrame.
Syntax:
df = pd.read_json("filename.json")
Example:
Suppose you have a file called data.json:
{
"Name": {"0": "Alice", "1": "Bob", "2": "Charlie"},
"Age": {"0": 25, "1": 30, "2": 35}
}
Loading it:
import pandas as pd
df = pd.read_json("data.json")
print(df)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
You can also load JSON from a URL:
df = pd.read_json("https://example.com/api/data.json")
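JSON is not always laid out column by column like data.json. Another common layout is a list of records, one object per row. The sketch below assumes such a layout (the text is made up) and reads it from an in-memory string so it runs without a file.

```python
import io
import pandas as pd

# Hypothetical JSON in "records" layout: one object per row
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```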
13. Analyzing the Data – Head, Tail & Info Method
Once you’ve loaded data, the first thing you want to do is explore it. Pandas gives you three key methods for this.
head() — View the first N rows
import pandas as pd
df = pd.read_csv("students.csv")
print(df.head()) # Default: first 5 rows
print(df.head(3)) # First 3 rows
tail() — View the last N rows
print(df.tail()) # Default: last 5 rows
print(df.tail(2)) # Last 2 rows
info() — Get a summary of the DataFrame
df.info() # info() prints the summary itself and returns None, so no print() is needed
Output (example):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Age 3 non-null int64
2 Grade 3 non-null object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
What info() tells you:
- Total number of rows and columns
- Column names
- Number of non-null (non-missing) values per column
- Data type of each column
- Memory usage
14. Data Cleaning
What is Data Cleaning?
Real-world data is rarely perfect. It often has:
- Missing values (empty cells)
- Wrong formats (e.g., dates stored as text)
- Wrong or invalid data (e.g., negative ages)
- Duplicate rows
Data Cleaning is the process of fixing or removing such problems so your analysis is accurate and reliable.
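Before fixing anything, it helps to count how many of each problem you have. This is a rough audit sketch on a made-up DataFrame, not a complete checklist.

```python
import pandas as pd

# Hypothetical messy data: one missing age, one negative age, one repeated row
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Dan"],
    "Age": [25, None, 25, -5],
})

missing = int(df["Age"].isnull().sum())    # empty cells in Age
duplicates = int(df.duplicated().sum())    # rows identical to an earlier row
negative = int((df["Age"] < 0).sum())      # logically invalid ages

print(missing, duplicates, negative)
```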
15. Data Cleaning – Empty Cells
Empty cells (also called null values or NaN) are a very common issue.
Detecting missing values:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull()) # Returns True/False for each cell
print(df.isnull().sum()) # Count of nulls per column
Option 1: Remove rows with empty cells
df_clean = df.dropna()
print(df_clean)
Option 2: Fill empty cells with a value
df_filled = df.fillna(0) # Replace NaN with 0
df_filled = df.fillna("Unknown") # Replace NaN with a string
df["Age"] = df["Age"].fillna(df["Age"].mean()) # Replace NaN in one column with that column's average
Option 3: Fill using forward-fill (use previous row’s value)
df_ffill = df.ffill() # Preferred over fillna(method="ffill"), which is deprecated since pandas 2.1
Use dropna() when missing rows are few and unimportant. Use fillna() when you want to preserve as much data as possible.
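To see the trade-off, compare both approaches on the same data. In this sketch (made-up values), dropna() keeps one row out of three, while fillna() with a per-column dictionary keeps all of them.

```python
import pandas as pd

# Hypothetical data with one gap in each column
df = pd.DataFrame({
    "Name": ["Alice", None, "Charlie"],
    "Age": [25.0, 30.0, None],
})

dropped = df.dropna()   # removes every row that has any empty cell

# A dictionary lets each column get its own replacement value
filled = df.fillna({"Name": "Unknown", "Age": df["Age"].mean()})

print(len(dropped))
print(filled)
```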
16. Data Cleaning – Wrong Format
Sometimes data is in the wrong format. For example, a date column might contain text like "2023/01/15" instead of an actual date object.
Example: Fix date format
import pandas as pd
data = {
"Name": ["Alice", "Bob"],
"JoinDate": ["2023/01/15", "2022/07/20"]
}
df = pd.DataFrame(data)
# Convert string to proper datetime format
df["JoinDate"] = pd.to_datetime(df["JoinDate"])
print(df)
print(df.dtypes)
Output:
Name JoinDate
0 Alice 2023-01-15
1 Bob 2022-07-20
Name object
JoinDate datetime64[ns]
dtype: object
Example: Convert string numbers to integers
df["Age"] = df["Age"].astype(int)
Example: Handle rows where conversion fails
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# Invalid values become NaN instead of raising an error
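Here is a runnable version of that idea on made-up data. The value "thirty" cannot be parsed, so astype(int) would raise a ValueError, while to_numeric with errors="coerce" turns it into NaN.

```python
import pandas as pd

# Hypothetical column where one entry is not a valid number
df = pd.DataFrame({"Age": ["25", "thirty", "35"]})

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
print(df)
print(df["Age"].isnull().sum())   # how many values failed to convert
```

The column becomes float rather than int, because NaN is a floating-point value.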
17. Data Cleaning – Wrong Data
Sometimes data contains logically incorrect values — like a person with age = -5, or a salary of 0 for a CEO. You need to fix or remove these.
Example: Replace a wrong value
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 130, 35] # 130 is clearly wrong
}
df = pd.DataFrame(data)
# Replace the wrong value with a correct one
df.loc[df["Age"] > 100, "Age"] = 30
print(df)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Example: Remove rows with wrong data
df = df[df["Age"] < 100] # Keep only rows where Age is realistic
Example: Replace a specific value
df["Name"] = df["Name"].replace("Boob", "Bob") # Fix a typo
18. Data Cleaning – Duplicates
Duplicate rows can skew your analysis. Pandas makes it easy to find and remove them.
Detect duplicate rows:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Alice", "Charlie"],
"Age": [25, 30, 25, 35]
}
df = pd.DataFrame(data)
print(df.duplicated())
Output:
0 False
1 False
2 True
3 False
dtype: bool
Remove duplicate rows:
df_clean = df.drop_duplicates()
print(df_clean)
Output:
Name Age
0 Alice 25
1 Bob 30
3 Charlie 35
Remove duplicates based on a specific column:
df_clean = df.drop_duplicates(subset=["Name"])
Keep the last occurrence instead of first:
df_clean = df.drop_duplicates(keep="last")
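The subset and keep options can be combined. In this sketch (made-up data), Alice appears twice with different ages, so which row survives depends on keep.

```python
import pandas as pd

# Hypothetical data: same name, different ages
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 26],
})

first = df.drop_duplicates(subset=["Name"])              # keeps row 0 for Alice
last = df.drop_duplicates(subset=["Name"], keep="last")  # keeps row 2 for Alice

print(first)
print(last)
```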
19. Correlation
What is Correlation?
Correlation measures the relationship between two numeric columns. It tells you:
- Whether two variables move together (positive correlation)
- Whether they move in opposite directions (negative correlation)
- Whether there’s no relationship (correlation near 0)
Correlation values range from -1 to +1:
- +1: Perfect positive correlation
- 0: No correlation
- -1: Perfect negative correlation
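By default, Pandas computes the Pearson correlation coefficient. As a rough sketch of what that number means, the code below works it out by hand for two made-up columns and compares the result with the built-in corr() method.

```python
import pandas as pd

# Hypothetical data: sleep drops by one hour for each extra hour studied
hours = pd.Series([1, 2, 3, 4, 5])
sleep = pd.Series([8, 7, 6, 5, 4])

# Distance of each value from its column average
dx = hours - hours.mean()
dy = sleep - sleep.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5
r_pandas = hours.corr(sleep)

print(r_manual, r_pandas)
```

Both values come out at -1 (up to floating-point rounding), a perfect negative correlation.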
Example:
import pandas as pd
data = {
"Hours_Studied": [1, 2, 3, 4, 5],
"Exam_Score": [50, 55, 65, 72, 85],
"Sleep_Hours": [8, 7, 6, 5, 4]
}
df = pd.DataFrame(data)
print(df.corr())
Output:
Hours_Studied Exam_Score Sleep_Hours
Hours_Studied 1.000000 0.989403 -1.000000
Exam_Score 0.989403 1.000000 -0.989403
Sleep_Hours -1.000000 -0.989403 1.000000
Reading the results:
- Hours_Studied and Exam_Score have a correlation of ~0.99 → strong positive (study more, score higher)
- Hours_Studied and Sleep_Hours have a correlation of -1.0 → perfect negative (study more, sleep less)
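corr() works only on numeric columns. In pandas 2.0 and later, call df.corr(numeric_only=True) if the DataFrame also contains non-numeric columns; without it, corr() raises an error instead of skipping them.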
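When a DataFrame mixes text and numbers, passing numeric_only=True makes the restriction to numeric columns explicit, which recent pandas versions require. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Hours": [1, 2, 3],
    "Score": [50, 60, 70],
})

result = df.corr(numeric_only=True)   # the Name column is left out
print(result)
```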
Quick Reference Table
| Method / Function | Description | Example |
|---|---|---|
| pd.Series(data) | Create a 1D labeled array | pd.Series([1, 2, 3]) |
| pd.Series(data, index=[...]) | Series with custom labels | pd.Series([1,2], index=["a","b"]) |
| pd.DataFrame(dict) | Create a 2D table from dictionary | pd.DataFrame({"A":[1,2],"B":[3,4]}) |
| df.loc[label] | Access row by label/index name | df.loc["Student1"] |
| df.iloc[n] | Access row by integer position | df.iloc[0] |
| df.loc[[1, 3]] | Access multiple rows by label | df.loc[[0, 2]] |
| df.iloc[1:3] | Access multiple rows by position range | df.iloc[0:2] |
| pd.DataFrame(data, index=[...]) | Set named index on creation | pd.DataFrame(d, index=["R1","R2"]) |
| df.set_index("col") | Set a column as the index | df.set_index("Name") |
| pd.read_csv("file.csv") | Load CSV into DataFrame | pd.read_csv("data.csv") |
| df.to_string() | Print full DataFrame without truncation | print(df.to_string()) |
| pd.read_json("file.json") | Load JSON file into DataFrame | pd.read_json("data.json") |
| df.head(n) | View first N rows (default 5) | df.head(3) |
| df.tail(n) | View last N rows (default 5) | df.tail(3) |
| df.info() | Summary: columns, types, nulls, memory | df.info() |
| df.isnull() | Detect missing values (True/False) | df.isnull() |
| df.isnull().sum() | Count missing values per column | df.isnull().sum() |
| df.dropna() | Remove rows with any missing values | df.dropna() |
| df.fillna(value) | Fill missing values with a given value | df.fillna(0) |
| df.ffill() | Fill missing values using forward-fill | df.ffill() |
| pd.to_datetime(df["col"]) | Convert column to datetime format | pd.to_datetime(df["Date"]) |
| df["col"].astype(type) | Change column data type | df["Age"].astype(int) |
| pd.to_numeric(df["col"], errors="coerce") | Convert to number, invalid → NaN | pd.to_numeric(df["col"], errors="coerce") |
| df.loc[condition, "col"] = val | Replace values matching a condition | df.loc[df["Age"]>100, "Age"] = 30 |
| df["col"].replace(old, new) | Replace a specific value in a column | df["Name"].replace("Boob","Bob") |
| df.duplicated() | Detect duplicate rows (True/False) | df.duplicated() |
| df.drop_duplicates() | Remove duplicate rows | df.drop_duplicates() |
| df.drop_duplicates(subset=["col"]) | Remove duplicates based on column | df.drop_duplicates(subset=["Name"]) |
| df.drop_duplicates(keep="last") | Keep last occurrence of duplicates | df.drop_duplicates(keep="last") |
| df.corr() | Compute pairwise correlation of columns | df.corr() |
Happy Learning! Pandas is a massive library — but once you’re comfortable with these fundamentals, everything else becomes much easier to pick up. Practice with real datasets (Kaggle has great free ones) and you’ll be analyzing data like a pro in no time.


