Pandas in Python

1. Pandas Module – Introduction

What is Pandas?

Pandas is an open-source Python library used for data manipulation and analysis. It makes working with structured data (like tables, spreadsheets, or CSV files) fast and easy.

Why use Pandas?

  • It handles large in-memory datasets efficiently
  • It provides tools to clean, filter, sort, and analyze data
  • It works well with other libraries like NumPy, Matplotlib, and Scikit-learn
  • It reads and writes data from CSV, Excel, JSON, SQL, and more

How to install Pandas:

pip install pandas

How to import Pandas:

import pandas as pd

The alias pd is a widely accepted convention. You’ll see it in nearly every Pandas tutorial and codebase.


2. Series

What is a Series?

A Series is a one-dimensional labeled array. Think of it as a single column in a spreadsheet. It can hold data of any type — integers, strings, floats, booleans, etc.

  • Every item in a Series has an index (a label).
  • By default, the index starts at 0 and increases by 1.
  • A Series is like a Python list, but each item carries a label and operations apply to all items at once.

Key characteristics:

  • One column of data
  • Each value has a label (index)
  • Supports fast operations like filtering, math, and searching
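
As a minimal sketch of the three kinds of operations listed above, the snippet below uses a made-up prices Series (the values are hypothetical and chosen only for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration
prices = pd.Series([120, 80, 150, 95])

doubled = prices * 2                 # math applies to every element
expensive = prices[prices > 100]     # filtering with a boolean condition
has_80 = (prices == 80).any()        # searching for a value

print(doubled.tolist())     # [240, 160, 300, 190]
print(expensive.tolist())   # [120, 150]
print(has_80)               # True
```

No explicit loop is needed in any of these lines; the operation is applied to the whole Series at once.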

3. Series – Syntax & Parameters

Basic Syntax:

pandas.Series(data, index, dtype, name, copy)

Parameters explained:

  • data — The actual data. Can be a list, dictionary, scalar value, or NumPy array.
  • index — Labels for each element. If not provided, defaults to 0, 1, 2, …
  • dtype — Data type of the values (e.g., int, float, str). Optional.
  • name — A name for the Series. Useful when converting to DataFrame.
  • copy — If True, makes a copy of the input data. Default is False. Only affects Series or one-dimensional NumPy array input.
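
To see several of these parameters working together, here is a small sketch. The labels, the dtype choice, and the name "Values" are arbitrary assumptions made for the example:

```python
import pandas as pd

s = pd.Series(
    data=[1, 2, 3],
    index=["x", "y", "z"],   # custom labels instead of 0, 1, 2
    dtype="float64",         # store the integers as floats
    name="Values",           # shown in the printed footer
)
print(s)
```

The printed Series should show the values as 1.0, 2.0, and 3.0, with the name and dtype on the last line.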

4. Series – Example

Example 1: Create a simple Series from a list

import pandas as pd

fruits = ["Apple", "Banana", "Cherry"]
s = pd.Series(fruits)
print(s)

Output:

0     Apple
1    Banana
2    Cherry
dtype: object

Example 2: Create a Series from a dictionary

import pandas as pd

data = {"a": 10, "b": 20, "c": 30}
s = pd.Series(data)
print(s)

Output:

a    10
b    20
c    30
dtype: int64

Example 3: Create a Series with a name

import pandas as pd

s = pd.Series([100, 200, 300], name="Scores")
print(s)

Output:

0    100
1    200
2    300
Name: Scores, dtype: int64

5. Labels

What are Labels?

Labels are the index values assigned to each item in a Series. They allow you to access data by a meaningful name instead of just a number.

  • By default, Pandas assigns numeric labels (0, 1, 2, …).
  • You can define your own custom labels using the index parameter.
  • Labels don’t have to be unique, but it’s a good practice to keep them unique.

Example: Custom labels

import pandas as pd

scores = [85, 90, 78]
s = pd.Series(scores, index=["Alice", "Bob", "Charlie"])
print(s)

Output:

Alice      85
Bob        90
Charlie    78
dtype: int64

Accessing a value by label:

print(s["Bob"])   # Output: 90

Why labels matter:

  • They make your data readable and meaningful
  • They allow you to access rows by name, not just position
  • They are preserved even when you sort or filter the data
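
The last point can be checked with a short sketch that reuses the scores from the example above. Sorting and filtering reorder or drop items, but each value keeps its own label:

```python
import pandas as pd

s = pd.Series([85, 90, 78], index=["Alice", "Bob", "Charlie"])

sorted_s = s.sort_values()      # sorts by value; each label stays with its value
high = s[s >= 85]               # filtering keeps the matching labels

print(list(sorted_s.index))     # ['Charlie', 'Alice', 'Bob']
print(list(high.index))         # ['Alice', 'Bob']
```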

6. DataFrames

What is a DataFrame?

A DataFrame is a two-dimensional table — just like a spreadsheet or SQL table. It has rows and columns. Each column in a DataFrame is a Series.

  • Rows are identified by the index
  • Columns are identified by column names
  • A DataFrame can hold different data types in different columns

Example: Create a DataFrame from a dictionary

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Chennai", "Mumbai", "Delhi"]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age     City
0    Alice   25  Chennai
1      Bob   30   Mumbai
2  Charlie   35    Delhi

Key points:

  • Each key in the dictionary becomes a column
  • The index is automatically 0, 1, 2, …
  • You can have as many rows and columns as you need
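
Since each column is a Series, selecting one column gives you back a Series with all the usual Series methods. A minimal sketch using the same kind of data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

ages = df["Age"]                 # selecting one column returns a Series
print(type(ages).__name__)       # Series
print(ages.mean())               # 30.0
print(df.shape)                  # (3, 2) means 3 rows and 2 columns
```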

7. Locate Rows

How to access a specific row:

Pandas provides two main methods:

  • loc[] — Access by label (index name)
  • iloc[] — Access by integer position (0, 1, 2, …)

Example using loc[]:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
}
df = pd.DataFrame(data)

print(df.loc[1])

Output:

Name    Bob
Age      30
Name: 1, dtype: object

Example using iloc[]:

print(df.iloc[0])

Output:

Name    Alice
Age        25
Name: 0, dtype: object

Use loc to select by index label and iloc to select by integer position. In the examples above the two look interchangeable only because the default labels are 0, 1, 2.
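
The difference becomes visible once the labels are not 0, 1, 2. The sketch below assumes a made-up index of 10, 20, 30 purely to illustrate this:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"]},
    index=[10, 20, 30],   # hypothetical non-default integer labels
)

print(df.loc[10, "Name"])    # Alice  (row labeled 10)
print(df.iloc[0]["Name"])    # Alice  (first row by position)
# df.loc[0] would raise KeyError here because no row is labeled 0
```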


8. Locate Multiple Rows

You can pass a list of index values to loc[] to get multiple rows at once.

Example:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28]
}
df = pd.DataFrame(data)

print(df.loc[[0, 2]])

Output:

      Name  Age
0    Alice   25
2  Charlie   35

Using a range with iloc[]:

print(df.iloc[1:3])

Output:

      Name  Age
1      Bob   30
2  Charlie   35

Note: With iloc, the end index is exclusive (like Python slicing). iloc[1:3] returns rows at positions 1 and 2.
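
Slicing with loc behaves differently: the end label is included. A short sketch comparing the two on a frame with the default index:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "Diana"]})

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]       # labels 1, 2, and 3 (end included)

print(len(by_position))      # 2
print(len(by_label))         # 3
```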


9. Named Index

By default, rows are numbered 0, 1, 2, etc. But you can give them meaningful names using the index parameter or the set_index() method.

Example: Set index during creation

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78]
}

df = pd.DataFrame(data, index=["Student1", "Student2", "Student3"])
print(df)

Output:

             Name  Score
Student1    Alice     85
Student2      Bob     90
Student3  Charlie     78

Example: Use set_index() on an existing DataFrame

df2 = df.set_index("Name")
print(df2)

Output:

         Score
Name          
Alice       85
Bob         90
Charlie     78

Accessing a named row:

print(df.loc["Student2"])

10. Load Files in a DataFrame (CSV)

One of the most common tasks in data analysis is loading a CSV file into a DataFrame.

Syntax:

df = pd.read_csv("filename.csv")

Example:

Suppose you have a file called students.csv:

Name,Age,Grade
Alice,20,A
Bob,22,B
Charlie,21,A

Loading it:

import pandas as pd

df = pd.read_csv("students.csv")
print(df)

Output:

      Name  Age Grade
0    Alice   20     A
1      Bob   22     B
2  Charlie   21     A

Useful parameters for read_csv():

  • sep="," — Specify the delimiter (default is comma)
  • header=0 — Row number to use as column names
  • index_col="Name" — Set a column as the index
  • nrows=100 — Load only the first 100 rows (useful for big files)
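
Here is a sketch that combines several of these parameters. To keep it runnable without a file on disk, it reads from an in-memory string through io.StringIO, and the semicolon-delimited text is a made-up variant of students.csv:

```python
import io
import pandas as pd

# StringIO stands in for a file on disk so the sketch runs anywhere
csv_text = "Name;Age;Grade\nAlice;20;A\nBob;22;B\nCharlie;21;A\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",            # semicolon-delimited instead of the default comma
    index_col="Name",   # use the Name column as the row index
    nrows=2,            # read only the first two data rows
)
print(df)
```

With nrows=2, only Alice and Bob are loaded, and each can be looked up by name with loc.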

11. Print the Full DataFrame Using to_string()

When you print a large DataFrame, Pandas may truncate it and show ... in the middle. Use to_string() to print the entire DataFrame without truncation.

Example:

import pandas as pd

df = pd.read_csv("students.csv")
print(df.to_string())

This prints every row and column without cutting anything off.

When to use it:

  • When your DataFrame has many rows and you want to see all of them
  • When saving the full data to a text file
  • During debugging to inspect the complete dataset

Example saving to a file:

with open("output.txt", "w") as f:
    f.write(df.to_string())
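
To see the truncation for yourself, the sketch below builds a 100-row DataFrame (an arbitrary size, chosen to exceed the default display limit) and compares the text from a normal print with the text from to_string():

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})   # 100 rows, more than the default display limit

short_text = str(df)          # what print(df) shows; middle rows are elided
full_text = df.to_string()    # every row rendered

print(len(short_text.splitlines()))
print(len(full_text.splitlines()))   # 101: one header line plus 100 rows
```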

12. JSON File

Pandas can also load JSON (JavaScript Object Notation) data directly into a DataFrame.

Syntax:

df = pd.read_json("filename.json")

Example:

Suppose you have a file called data.json:

{
  "Name": {"0": "Alice", "1": "Bob", "2": "Charlie"},
  "Age":  {"0": 25, "1": 30, "2": 35}
}

Loading it:

import pandas as pd

df = pd.read_json("data.json")
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

You can also load JSON from a URL:

df = pd.read_json("https://example.com/api/data.json")
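
JSON data often arrives as a list of records rather than the column-oriented layout shown earlier. As a rough sketch, read_json handles that layout too. The JSON text here is made up, and io.StringIO stands in for a file:

```python
import io
import pandas as pd

# A list of records is another common JSON layout; StringIO stands in for a file
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

df = pd.read_json(io.StringIO(json_text))
print(df)
```

Each record becomes a row, and each key becomes a column.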

13. Analyzing the Data – Head, Tail & Info Method

Once you’ve loaded data, the first thing you want to do is explore it. Pandas gives you three key methods for this.

head() — View the first N rows

import pandas as pd

df = pd.read_csv("students.csv")
print(df.head())       # Default: first 5 rows
print(df.head(3))      # First 3 rows

tail() — View the last N rows

print(df.tail())       # Default: last 5 rows
print(df.tail(2))      # Last 2 rows

info() — Get a summary of the DataFrame

df.info()   # info() prints the summary itself; wrapping it in print() would also print None

Output (example):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Grade   3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes

What info() tells you:

  • Total number of rows and columns
  • Column names
  • Number of non-null (non-missing) values per column
  • Data type of each column
  • Memory usage
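
The three methods can be tried without a CSV file. The sketch below builds the same student data in memory and sends the info() report to a text buffer (the buf argument) so it can be inspected as a string:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [20, 22, 21],
    "Grade": ["A", "B", "A"],
})

first_two = df.head(2)
last_one = df.tail(1)

buffer = io.StringIO()
result = df.info(buf=buffer)   # info() writes its report and returns None

print(first_two["Name"].tolist())   # ['Alice', 'Bob']
print(last_one["Name"].tolist())    # ['Charlie']
print(result)                       # None
```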

14. Data Cleaning

What is Data Cleaning?

Real-world data is rarely perfect. It often has:

  • Missing values (empty cells)
  • Wrong formats (e.g., dates stored as text)
  • Wrong or invalid data (e.g., negative ages)
  • Duplicate rows

Data Cleaning is the process of fixing or removing such problems so your analysis is accurate and reliable.


15. Data Cleaning – Empty Cells

Empty cells (also called null values or NaN) are a very common issue.

Detecting missing values:

import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull())         # Returns True/False for each cell
print(df.isnull().sum())   # Count of nulls per column

Option 1: Remove rows with empty cells

df_clean = df.dropna()
print(df_clean)

Option 2: Fill empty cells with a value

df_zero = df.fillna(0)                  # Replace every NaN with 0
df_unknown = df.fillna("Unknown")       # Replace every NaN with a string
df["Age"] = df["Age"].fillna(df["Age"].mean())  # Replace NaN in Age with the column average

Option 3: Fill using forward-fill (use previous row’s value)

df_ffill = df.ffill()   # fillna(method="ffill") is deprecated since pandas 2.1

Use dropna() when missing rows are few and unimportant. Use fillna() when you want to preserve as much data as possible.
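
The options above can be compared side by side on a small frame. This is a sketch with made-up data (two of the four ages are missing), not the data.csv file used earlier:

```python
import pandas as pd

# Hypothetical data with two missing cells (None becomes NaN in a numeric column)
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, None, 35, None],
})

missing_per_column = df.isnull().sum()
dropped = df.dropna()                                # keeps only complete rows
mean_filled = df["Age"].fillna(df["Age"].mean())     # mean of 25 and 35 is 30
forward_filled = df["Age"].ffill()                   # copies the previous row's value

print(missing_per_column["Age"])    # 2
print(len(dropped))                 # 2
print(mean_filled.tolist())         # [25.0, 30.0, 35.0, 30.0]
print(forward_filled.tolist())      # [25.0, 25.0, 35.0, 35.0]
```

Note that dropping removed half of the rows here, which is why filling is often the safer choice for small datasets.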


16. Data Cleaning – Wrong Format

Sometimes data is in the wrong format. For example, a date column might contain text like "2023/01/15" instead of an actual date object.

Example: Fix date format

import pandas as pd

data = {
    "Name": ["Alice", "Bob"],
    "JoinDate": ["2023/01/15", "2022/07/20"]
}

df = pd.DataFrame(data)

# Convert string to proper datetime format
df["JoinDate"] = pd.to_datetime(df["JoinDate"])
print(df)
print(df.dtypes)

Output:

    Name   JoinDate
0  Alice 2023-01-15
1    Bob 2022-07-20

Name               object
JoinDate    datetime64[ns]
dtype: object

Example: Convert string numbers to integers

df["Age"] = df["Age"].astype(int)

Example: Handle rows where conversion fails

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# Invalid values become NaN instead of raising an error
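
A self-contained sketch of the same idea, using a made-up column of ages stored as text with one entry that cannot be converted:

```python
import pandas as pd

# Hypothetical column where numbers arrived as text, with one bad entry
ages = pd.Series(["25", "30", "unknown"])

converted = pd.to_numeric(ages, errors="coerce")

print(converted.tolist())          # [25.0, 30.0, nan]
print(converted.isnull().sum())    # 1
```

The result is a float column because NaN is a floating-point value. The missing entry can then be handled with dropna() or fillna().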

17. Data Cleaning – Wrong Data

Sometimes data contains logically incorrect values — like a person with age = -5, or a salary of 0 for a CEO. You need to fix or remove these.

Example: Replace a wrong value

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 130, 35]   # 130 is clearly wrong
}

df = pd.DataFrame(data)

# Replace the wrong value with a correct one
df.loc[df["Age"] > 100, "Age"] = 30
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Example: Remove rows with wrong data

df = df[df["Age"] <= 100]   # Keep only rows where Age is realistic

Example: Replace a specific value

df["Name"] = df["Name"].replace("Boob", "Bob")   # Fix a typo
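
When a value can be wrong in both directions (too high or negative), one possible approach is to check a valid range in a single step. The sketch below assumes 0 to 100 is the acceptable range for Age, and the data is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 130, 35, -5],
})

valid = df["Age"].between(0, 100)   # True where 0 <= Age <= 100
print((~valid).sum())               # 2 rows fail the check

cleaned = df[valid]
print(cleaned["Name"].tolist())     # ['Alice', 'Charlie']
```

Counting the failing rows before removing them is a cheap way to notice when a rule is discarding more data than expected.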

18. Data Cleaning – Duplicates

Duplicate rows can skew your analysis. Pandas makes it easy to find and remove them.

Detect duplicate rows:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35]
}

df = pd.DataFrame(data)
print(df.duplicated())

Output:

0    False
1    False
2     True
3    False
dtype: bool

Remove duplicate rows:

df_clean = df.drop_duplicates()
print(df_clean)

Output:

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35

Remove duplicates based on a specific column:

df_clean = df.drop_duplicates(subset=["Name"])

Keep the last occurrence instead of first:

df_clean = df.drop_duplicates(keep="last")
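
The subset and keep options matter most when rows match on one column but not on the others. In this sketch (made-up data), the two Alice rows have different ages, so they are not full duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 26, 35],   # the two Alice rows differ in Age
})

print(df.duplicated().sum())   # 0: no row is identical in every column

by_name_first = df.drop_duplicates(subset=["Name"])               # keeps row 0
by_name_last = df.drop_duplicates(subset=["Name"], keep="last")   # keeps row 2

print(by_name_first.index.tolist())   # [0, 1, 3]
print(by_name_last.index.tolist())    # [1, 2, 3]
```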

19. Correlation

What is Correlation?

Correlation measures the strength and direction of the linear relationship between two numeric columns. It tells you:

  • Whether two variables move together (positive correlation)
  • Whether they move in opposite directions (negative correlation)
  • Whether there’s no relationship (correlation near 0)

Correlation values range from -1 to +1:

  • +1 — Perfect positive correlation
  • 0 — No correlation
  • -1 — Perfect negative correlation

Example:

import pandas as pd

data = {
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Exam_Score":    [50, 55, 65, 72, 85],
    "Sleep_Hours":   [8, 7, 6, 5, 4]
}

df = pd.DataFrame(data)

print(df.corr())

Output:

               Hours_Studied  Exam_Score  Sleep_Hours
Hours_Studied       1.000000    0.989403    -1.000000
Exam_Score          0.989403    1.000000    -0.989403
Sleep_Hours        -1.000000   -0.989403     1.000000

Reading the results:

  • Hours_Studied and Exam_Score have a correlation of ~0.99 → strong positive (study more, score higher)
  • Hours_Studied and Sleep_Hours have a correlation of -1.0 → perfect negative (study more, sleep less)

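corr() works only on numeric columns. Older pandas versions silently skipped non-numeric columns; since pandas 2.0 you must drop them first or pass numeric_only=True, otherwise corr() raises an error.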
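
As a sketch of how to handle a frame that mixes text and numbers, the example below adds a made-up Student column and passes numeric_only=True so that the exclusion is explicit. It also shows how to get the correlation for a single pair of columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["A", "B", "C", "D", "E"],     # non-numeric column
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Sleep_Hours": [8, 7, 6, 5, 4],
})

matrix = df.corr(numeric_only=True)                     # skips the Student column
pair = df["Hours_Studied"].corr(df["Sleep_Hours"])      # one pair of columns

print(list(matrix.columns))    # ['Hours_Studied', 'Sleep_Hours']
print(round(pair, 6))          # -1.0
```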


Quick Reference Table

| Method / Function | Description | Example |
|---|---|---|
| pd.Series(data) | Create a 1D labeled array | pd.Series([1, 2, 3]) |
| pd.Series(data, index=[...]) | Series with custom labels | pd.Series([1,2], index=["a","b"]) |
| pd.DataFrame(dict) | Create a 2D table from dictionary | pd.DataFrame({"A":[1,2],"B":[3,4]}) |
| df.loc[label] | Access row by label/index name | df.loc["Student1"] |
| df.iloc[n] | Access row by integer position | df.iloc[0] |
| df.loc[[1, 3]] | Access multiple rows by label | df.loc[[0, 2]] |
| df.iloc[1:3] | Access multiple rows by position range (end excluded) | df.iloc[0:2] |
| pd.DataFrame(data, index=[...]) | Set named index on creation | pd.DataFrame(d, index=["R1","R2"]) |
| df.set_index("col") | Set a column as the index | df.set_index("Name") |
| pd.read_csv("file.csv") | Load CSV into DataFrame | pd.read_csv("data.csv") |
| df.to_string() | Print full DataFrame without truncation | print(df.to_string()) |
| pd.read_json("file.json") | Load JSON file into DataFrame | pd.read_json("data.json") |
| df.head(n) | View first N rows (default 5) | df.head(3) |
| df.tail(n) | View last N rows (default 5) | df.tail(3) |
| df.info() | Summary: columns, types, nulls, memory | df.info() |
| df.isnull() | Detect missing values (True/False) | df.isnull() |
| df.isnull().sum() | Count missing values per column | df.isnull().sum() |
| df.dropna() | Remove rows with any missing values | df.dropna() |
| df.fillna(value) | Fill missing values with a given value | df.fillna(0) |
| df.ffill() | Fill missing values using forward-fill | df.ffill() |
| pd.to_datetime(df["col"]) | Convert column to datetime format | pd.to_datetime(df["Date"]) |
| df["col"].astype(type) | Change column data type | df["Age"].astype(int) |
| pd.to_numeric(df["col"], errors="coerce") | Convert to number, invalid → NaN | pd.to_numeric(df["col"], errors="coerce") |
| df.loc[condition, "col"] = val | Replace values matching a condition | df.loc[df["Age"]>100, "Age"] = 30 |
| df["col"].replace(old, new) | Replace a specific value in a column | df["Name"].replace("Boob","Bob") |
| df.duplicated() | Detect duplicate rows (True/False) | df.duplicated() |
| df.drop_duplicates() | Remove duplicate rows | df.drop_duplicates() |
| df.drop_duplicates(subset=["col"]) | Remove duplicates based on column | df.drop_duplicates(subset=["Name"]) |
| df.drop_duplicates(keep="last") | Keep last occurrence of duplicates | df.drop_duplicates(keep="last") |
| df.corr() | Compute pairwise correlation of numeric columns | df.corr() |

Happy Learning! Pandas is a massive library — but once you’re comfortable with these fundamentals, everything else becomes much easier to pick up. Practice with real datasets (Kaggle has great free ones) and you’ll be analyzing data like a pro in no time.
