Pandas in Python - Rephel's Assets

1. Pandas Module – Introduction

What is Pandas?

Pandas is an open-source Python library used for data manipulation and analysis. It makes working with structured data (like tables, spreadsheets, or CSV files) fast and easy.

Why use Pandas?

It handles large in-memory datasets efficiently by applying operations to whole columns at once
It provides tools to clean, filter, sort, and analyze data
It works well with other libraries like NumPy, Matplotlib, and Scikit-learn
It reads and writes data from CSV, Excel, JSON, SQL, and more

How to install Pandas:

pip install pandas

How to import Pandas:

import pandas as pd

The alias pd is a widely accepted convention. You’ll see it in most Pandas tutorials and codebases.

2. Series

What is a Series?

A Series is a one-dimensional labeled array. Think of it as a single column in a spreadsheet. It can hold data of any type — integers, strings, floats, booleans, etc.

Every item in a Series has an index (a label).
By default, the index starts at 0 and increases by 1.
A Series is like a Python list, but every value carries a label and operations apply to all values at once.

Key characteristics:

One column of data
Each value has a label (index)
Supports fast operations like filtering, math, and searching
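
The operations listed above can be sketched as follows. This is a minimal illustration; the temperature values are made up and the variable names are assumptions, not part of the original tutorial.

```python
import pandas as pd

# Hypothetical temperatures in Celsius, used only for illustration
temps = pd.Series([20, 25, 15, 30])

fahrenheit = temps * 9 / 5 + 32   # math is applied to every value at once
warm = temps[temps > 20]          # filtering keeps values where the condition is True
has_25 = (temps == 25).any()      # searching: is any value equal to 25?

print(fahrenheit)
print(warm)
print(has_25)
```

Note that the filtered Series keeps the original labels (1 and 3 here) rather than renumbering from 0.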

3. Series – Syntax & Parameters

Basic Syntax:

pandas.Series(data, index, dtype, name, copy)

Parameters explained:

data — The actual data. Can be a list, dictionary, scalar value, or NumPy array.
index — Labels for each element. If not provided, defaults to 0, 1, 2, …
dtype — Data type of the values (e.g., int, float, str). Optional.
name — A name for the Series. Useful when converting to DataFrame.
copy — If True, makes a copy of the data. Default is False.
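
A short sketch of these parameters in use, assuming made-up values and labels. It shows a scalar being repeated across an index, and index, dtype, and name passed together.

```python
import pandas as pd

# A scalar is repeated once for every label in the index
defaults = pd.Series(0, index=["a", "b", "c"])

# index, dtype, and name used together; the values here are made up
prices = pd.Series([10, 20, 30], index=["x", "y", "z"], dtype="float64", name="Price")

print(defaults)
print(prices)
```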

4. Series – Example

Example 1: Create a simple Series from a list

import pandas as pd

fruits = ["Apple", "Banana", "Cherry"]
s = pd.Series(fruits)
print(s)

Output:

0     Apple
1    Banana
2    Cherry
dtype: object

Example 2: Create a Series from a dictionary

import pandas as pd

data = {"a": 10, "b": 20, "c": 30}
s = pd.Series(data)
print(s)

Output:

a    10
b    20
c    30
dtype: int64

Example 3: Create a Series with a name

import pandas as pd

s = pd.Series([100, 200, 300], name="Scores")
print(s)

Output:

0    100
1    200
2    300
Name: Scores, dtype: int64

5. Labels

What are Labels?

Labels are the index values assigned to each item in a Series. They allow you to access data by a meaningful name instead of just a number.

By default, Pandas assigns numeric labels (0, 1, 2, …).
You can define your own custom labels using the index parameter.
Labels don’t have to be unique, but it’s a good practice to keep them unique.

Example: Custom labels

import pandas as pd

scores = [85, 90, 78]
s = pd.Series(scores, index=["Alice", "Bob", "Charlie"])
print(s)

Output:

Alice      85
Bob        90
Charlie    78
dtype: int64

Accessing a value by label:

print(s["Bob"])   # Output: 90

Why labels matter:

They make your data readable and meaningful
They allow you to access rows by name, not just position
They are preserved even when you sort or filter the data
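
The following sketch illustrates two of the points above: labels travel with their values through filtering and sorting, and duplicate labels are allowed but change what a lookup returns. The second Series (dup) is a hypothetical example added for illustration.

```python
import pandas as pd

s = pd.Series([85, 90, 78], index=["Alice", "Bob", "Charlie"])

top = s[s >= 85]           # filtering keeps the original labels
ranked = s.sort_values()   # sorting moves each label together with its value

# Duplicate labels are allowed; a lookup then returns every match
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
matches = dup["a"]         # a Series with two rows, not a single value

print(top)
print(ranked)
print(matches)
```

This is one reason unique labels are good practice: with duplicates, code that expects a single value gets a Series back instead.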

6. DataFrames

What is a DataFrame?

A DataFrame is a two-dimensional table — just like a spreadsheet or SQL table. It has rows and columns. Each column in a DataFrame is a Series.

Rows are identified by the index
Columns are identified by column names
A DataFrame can hold different data types in different columns
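
To see that each column really is a Series, you can select one column and inspect it. A minimal sketch, using the same kind of sample data as the examples in this section:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

ages = df["Age"]             # selecting one column returns a Series
print(type(ages).__name__)   # the class name of the selected column
print(ages.name)             # the Series name is the column name
print(df.shape)              # (number of rows, number of columns)
```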

Example: Create a DataFrame from a dictionary

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Chennai", "Mumbai", "Delhi"]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age     City
0    Alice   25  Chennai
1      Bob   30   Mumbai
2  Charlie   35    Delhi

Key points:

Each key in the dictionary becomes a column
The index is automatically 0, 1, 2, …
You can have as many rows and columns as you need

7. Locate Rows

How to access a specific row:

Pandas provides two main methods:

loc[] — Access by label (index name)
iloc[] — Access by integer position (0, 1, 2, …)

Example using loc[]:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
}
df = pd.DataFrame(data)

print(df.loc[1])

Output:

Name    Bob
Age      30
Name: 1, dtype: object

Example using iloc[]:

print(df.iloc[0])

Output:

Name    Alice
Age        25
Name: 0, dtype: object

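Use loc to select by index label and iloc to select by position. With the default index the labels happen to equal the positions, which is why df.loc[1] and df.iloc[1] return the same row here.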
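
The difference between the two becomes visible once the index is not 0, 1, 2. The sketch below assumes a made-up integer index (10, 20, 30) purely to show that loc follows labels while iloc follows positions.

```python
import pandas as pd

# Integer labels that do not match positions (made-up IDs)
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = df.loc[10]      # the row whose label is 10
by_position = df.iloc[1]   # the second row, whatever its label

print(by_label["Name"])
print(by_position["Name"])
```

Here df.loc[1] would raise a KeyError, because no row carries the label 1.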

8. Locate Multiple Rows

You can pass a list of index values to loc[] to get multiple rows at once.

Example:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28]
}
df = pd.DataFrame(data)

print(df.loc[[0, 2]])

Output:

      Name  Age
0    Alice   25
2  Charlie   35

Using a range with iloc[]:

print(df.iloc[1:3])

Output:

      Name  Age
1      Bob   30
2  Charlie   35

Note: With iloc, the end index is exclusive (like Python slicing). iloc[1:3] returns rows at positions 1 and 2. With loc, a slice includes the end label, so loc[1:3] returns the rows labeled 1, 2, and 3.
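
A quick way to check the slicing behavior is to compare the same slice with both accessors on a default index. This is a small sketch using the sample data from this section:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
})

label_slice = df.loc[1:3]       # labels 1, 2 and 3 (end label included)
position_slice = df.iloc[1:3]   # positions 1 and 2 (end position excluded)

print(len(label_slice), len(position_slice))
```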

9. Named Index

By default, rows are numbered 0, 1, 2, etc. But you can give them meaningful names using the index parameter or the set_index() method.

Example: Set index during creation

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78]
}

df = pd.DataFrame(data, index=["Student1", "Student2", "Student3"])
print(df)

Output:

             Name  Score
Student1    Alice     85
Student2      Bob     90
Student3  Charlie     78

Example: Use set_index() on an existing DataFrame

df2 = df.set_index("Name")
print(df2)

Output:

         Score
Name          
Alice       85
Bob         90
Charlie     78

Accessing a named row:

print(df.loc["Student2"])
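
Once a column has been set as the index, its values can be used as row labels in loc. The sketch below is a self-contained illustration; reset_index() is shown as one way to turn the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78],
})

df2 = df.set_index("Name")            # returns a new DataFrame; df is unchanged
bob_score = df2.loc["Bob", "Score"]   # row label first, then column name
restored = df2.reset_index()          # moves the index back into a column

print(bob_score)
print(restored)
```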

10. Load Files in a DataFrame (CSV)

One of the most common tasks in data analysis is loading a CSV file into a DataFrame.

Syntax:

df = pd.read_csv("filename.csv")

Example:

Suppose you have a file called students.csv:

Name,Age,Grade
Alice,20,A
Bob,22,B
Charlie,21,A

Loading it:

import pandas as pd

df = pd.read_csv("students.csv")
print(df)

Output:

      Name  Age Grade
0    Alice   20     A
1      Bob   22     B
2  Charlie   21     A

Useful parameters for read_csv():

sep="," — Specify the delimiter (default is comma)
header=0 — Row number to use as column names
index_col="Name" — Set a column as the index
nrows=100 — Load only the first 100 rows (useful for big files)
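
These parameters can be tried without a file on disk by wrapping text in io.StringIO, which read_csv accepts in place of a filename. The semicolon-separated text below is made up for illustration.

```python
import io
import pandas as pd

# Semicolon-separated text standing in for a file on disk (made-up data)
raw = "Name;Age;Grade\nAlice;20;A\nBob;22;B\nCharlie;21;A\n"

# sep sets the delimiter, index_col picks the index, nrows limits the rows read
df = pd.read_csv(io.StringIO(raw), sep=";", index_col="Name", nrows=2)
print(df)
```

Only the first two data rows are loaded, and Name becomes the index instead of a regular column.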

11. Display Data Using `to_string()`

When you print a large DataFrame, Pandas may truncate it and show ... in the middle. Use to_string() to print the entire DataFrame without truncation.

Example:

import pandas as pd

df = pd.read_csv("students.csv")
print(df.to_string())

This prints every row and column without cutting anything off.

When to use it:

When your DataFrame has many rows and you want to see all of them
When saving the full data to a text file
During debugging to inspect the complete dataset

Example saving to a file:

with open("output.txt", "w") as f:
    f.write(df.to_string())
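
To see the effect of truncation, you can compare the default printed form with the to_string() result on a DataFrame that is long enough to be shortened. This sketch assumes default display settings and uses a generated 100-row DataFrame.

```python
import pandas as pd

# 100 generated rows, enough to trigger truncation with default settings
df = pd.DataFrame({"Value": range(100)})

short = str(df)         # what print(df) shows; the middle rows are omitted
full = df.to_string()   # every row, plus one header line

print(len(short.splitlines()), len(full.splitlines()))

# to_string() also accepts options such as index=False to hide the index
no_index = df.head(3).to_string(index=False)
print(no_index)
```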

12. JSON File

Pandas can also load JSON (JavaScript Object Notation) data directly into a DataFrame.

Syntax:

df = pd.read_json("filename.json")

Example:

Suppose you have a file called data.json:

{
  "Name": {"0": "Alice", "1": "Bob", "2": "Charlie"},
  "Age":  {"0": 25, "1": 30, "2": 35}
}

Loading it:

import pandas as pd

df = pd.read_json("data.json")
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

You can also load JSON from a URL:

df = pd.read_json("https://example.com/api/data.json")
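
JSON data often arrives as a list of records rather than the column-oriented layout shown above. The sketch below assumes made-up records and wraps the text in io.StringIO so that no file is needed.

```python
import io
import pandas as pd

# A list of records is another common JSON layout (made-up data)
raw = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

df = pd.read_json(io.StringIO(raw))
print(df)
```

Each record becomes one row, and the keys become the column names.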

13. Analyzing the Data – Head, Tail & Info Method

Once you’ve loaded data, the first thing you want to do is explore it. Pandas gives you three key methods for this.

`head()` — View the first N rows

import pandas as pd

df = pd.read_csv("students.csv")
print(df.head())       # Default: first 5 rows
print(df.head(3))      # First 3 rows

`tail()` — View the last N rows

print(df.tail())       # Default: last 5 rows
print(df.tail(2))      # Last 2 rows

`info()` — Get a summary of the DataFrame

df.info()   # info() prints the summary itself and returns None, so print() is not needed

Output (example):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Grade   3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes

What info() tells you:

Total number of rows and columns
Column names
Number of non-null (non-missing) values per column
Data type of each column
Memory usage

14. Data Cleaning

What is Data Cleaning?

Real-world data is rarely perfect. It often has:

Missing values (empty cells)
Wrong formats (e.g., dates stored as text)
Wrong or invalid data (e.g., negative ages)
Duplicate rows

Data Cleaning is the process of fixing or removing such problems so your analysis is accurate and reliable.

15. Data Cleaning – Empty Cells

Empty cells (also called null values or NaN) are a very common issue.

Detecting missing values:

import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull())         # Returns True/False for each cell
print(df.isnull().sum())   # Count of nulls per column

Option 1: Remove rows with empty cells

df_clean = df.dropna()
print(df_clean)

Option 2: Fill empty cells with a value

df_zero = df.fillna(0)               # Replace NaN with 0
df_unknown = df.fillna("Unknown")    # Replace NaN with a string
df["Age"] = df["Age"].fillna(df["Age"].mean())  # Replace NaN in one column with its average

Option 3: Fill using forward-fill (use previous row’s value)

df_ffill = df.ffill()   # fillna(method="ffill") is deprecated in recent Pandas versions

Use dropna() when missing rows are few and unimportant. Use fillna() when you want to preserve as much data as possible.
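
The options above can be compared side by side on a small in-memory DataFrame. This is a sketch with made-up data in which one age is missing; it does not rely on a data.csv file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25.0, np.nan, 35.0],   # Bob's age is missing
})

missing_per_column = df.isnull().sum()   # count of NaN values in each column
dropped = df.dropna()                    # Bob's row is removed entirely

filled = df.copy()                       # work on a copy to keep df intact
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping loses Bob's name along with his missing age, while filling keeps the row at the cost of an estimated value.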

16. Data Cleaning – Wrong Format

Sometimes data is in the wrong format. For example, a date column might contain text like "2023/01/15" instead of an actual date object.

Example: Fix date format

import pandas as pd

data = {
    "Name": ["Alice", "Bob"],
    "JoinDate": ["2023/01/15", "2022/07/20"]
}

df = pd.DataFrame(data)

# Convert string to proper datetime format
df["JoinDate"] = pd.to_datetime(df["JoinDate"])
print(df)
print(df.dtypes)

Output:

    Name   JoinDate
0  Alice 2023-01-15
1    Bob 2022-07-20

Name               object
JoinDate    datetime64[ns]
dtype: object

Example: Convert string numbers to integers

df["Age"] = df["Age"].astype(int)   # assumes an "Age" column holding strings such as "25"

Example: Handle rows where conversion fails

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# Invalid values become NaN instead of raising an error
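
The difference between the two conversions shows up when a value cannot be parsed. In the sketch below the input strings are made up; astype(int) would raise a ValueError on "unknown", while to_numeric with errors="coerce" turns it into NaN.

```python
import pandas as pd

raw = pd.Series(["25", "30", "unknown"])

ages = pd.to_numeric(raw, errors="coerce")   # "unknown" becomes NaN
n_bad = ages.isna().sum()                    # how many values failed to convert
clean = ages.dropna().astype(int)            # safe to cast once NaN rows are gone

print(ages)
print(n_bad)
print(clean)
```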

17. Data Cleaning – Wrong Data

Sometimes data contains logically incorrect values — like a person with age = -5, or a salary of 0 for a CEO. You need to fix or remove these.

Example: Replace a wrong value

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 130, 35]   # 130 is clearly wrong
}

df = pd.DataFrame(data)

# Replace the wrong value with a correct one
df.loc[df["Age"] > 100, "Age"] = 30
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Example: Remove rows with wrong data

df = df[df["Age"] <= 100]   # Keep only rows where Age is realistic

Example: Replace a specific value

df["Name"] = df["Name"].replace("Boob", "Bob")   # Fix a typo
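
When the correct value is unknown, an alternative is to mark the suspicious entry as missing instead of guessing a replacement. The sketch below assumes a valid age range of 0 to 120, which is an illustrative choice rather than a rule from this tutorial; Age_checked is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 130, 35],   # 130 is clearly wrong
})

valid = df["Age"].between(0, 120)          # assumed plausible range, inclusive
df["Age_checked"] = df["Age"].where(valid) # values outside the range become NaN
n_invalid = (~valid).sum()

print(df)
print(n_invalid)
```

The flagged rows can then be handled with the same tools used for empty cells.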

18. Data Cleaning – Duplicates

Duplicate rows can skew your analysis. Pandas makes it easy to find and remove them.

Detect duplicate rows:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Age": [25, 30, 25, 35]
}

df = pd.DataFrame(data)
print(df.duplicated())

Output:

0    False
1    False
2     True
3    False
dtype: bool

Remove duplicate rows:

df_clean = df.drop_duplicates()
print(df_clean)

Output:

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35

Remove duplicates based on a specific column:

df_clean = df.drop_duplicates(subset=["Name"])

Keep the last occurrence instead of first:

df_clean = df.drop_duplicates(keep="last")
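
The subset and keep options matter most when rows match on one column but differ in another. A small sketch with made-up data, in which Alice appears twice with different ages:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 26],
})

full_row = df.drop_duplicates()                        # no row repeats in full
by_name_first = df.drop_duplicates(subset=["Name"])    # first Alice is kept
by_name_last = df.drop_duplicates(subset=["Name"], keep="last")  # last Alice is kept

print(len(full_row))
print(by_name_first)
print(by_name_last)
```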

19. Correlation

What is Correlation?

Correlation measures the relationship between two numeric columns. It tells you:

Whether two variables move together (positive correlation)
Whether they move in opposite directions (negative correlation)
Whether there is no linear relationship (correlation near 0)

Correlation values range from -1 to +1:

+1 — Perfect positive correlation
0 — No correlation
-1 — Perfect negative correlation

Example:

import pandas as pd

data = {
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Exam_Score":    [50, 55, 65, 72, 85],
    "Sleep_Hours":   [8, 7, 6, 5, 4]
}

df = pd.DataFrame(data)

print(df.corr())

Output:

               Hours_Studied  Exam_Score  Sleep_Hours
Hours_Studied       1.000000    0.989403    -1.000000
Exam_Score          0.989403    1.000000    -0.989403
Sleep_Hours        -1.000000   -0.989403     1.000000

Reading the results:

Hours_Studied and Exam_Score have a correlation of ~0.99 → strong positive (study more, score higher)
Hours_Studied and Sleep_Hours have a correlation of -1.0 → perfect negative (study more, sleep less)

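corr() works on numeric columns only. In Pandas 2.0 and later, non-numeric columns are no longer ignored automatically and cause an error, so pass numeric_only=True when the DataFrame also contains text columns.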
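
The sketch below shows corr() on a DataFrame that mixes text and numbers, plus the correlation of a single pair of columns. The Student column is a made-up addition to the sample data, included only to show how text columns are excluded.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["A", "B", "C", "D", "E"],   # text column, not usable in corr()
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Sleep_Hours": [8, 7, 6, 5, 4],
})

matrix = df.corr(numeric_only=True)                  # text columns are skipped
pair = df["Hours_Studied"].corr(df["Sleep_Hours"])   # a single coefficient

print(matrix)
print(pair)
```

Calling Series.corr() on two columns is convenient when you only need one coefficient rather than the full matrix.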

Quick Reference Table

Method / Function	Description	Example
`pd.Series(data)`	Create a 1D labeled array	`pd.Series([1, 2, 3])`
`pd.Series(data, index=[...])`	Series with custom labels	`pd.Series([1,2], index=["a","b"])`
`pd.DataFrame(dict)`	Create a 2D table from dictionary	`pd.DataFrame({"A":[1,2],"B":[3,4]})`
`df.loc[label]`	Access row by label/index name	`df.loc["Student1"]`
`df.iloc[n]`	Access row by integer position	`df.iloc[0]`
`df.loc[[1, 3]]`	Access multiple rows by label	`df.loc[[0, 2]]`
`df.iloc[1:3]`	Access multiple rows by position range	`df.iloc[0:2]`
`pd.DataFrame(data, index=[...])`	Set named index on creation	`pd.DataFrame(d, index=["R1","R2"])`
`df.set_index("col")`	Set a column as the index	`df.set_index("Name")`
`pd.read_csv("file.csv")`	Load CSV into DataFrame	`pd.read_csv("data.csv")`
`df.to_string()`	Print full DataFrame without truncation	`print(df.to_string())`
`pd.read_json("file.json")`	Load JSON file into DataFrame	`pd.read_json("data.json")`
`df.head(n)`	View first N rows (default 5)	`df.head(3)`
`df.tail(n)`	View last N rows (default 5)	`df.tail(3)`
`df.info()`	Summary: columns, types, nulls, memory	`df.info()`
`df.isnull()`	Detect missing values (True/False)	`df.isnull()`
`df.isnull().sum()`	Count missing values per column	`df.isnull().sum()`
`df.dropna()`	Remove rows with any missing values	`df.dropna()`
`df.fillna(value)`	Fill missing values with a given value	`df.fillna(0)`
`df.ffill()`	Fill missing values using forward-fill	`df.ffill()`
`pd.to_datetime(df["col"])`	Convert column to datetime format	`pd.to_datetime(df["Date"])`
`df["col"].astype(type)`	Change column data type	`df["Age"].astype(int)`
`pd.to_numeric(df["col"], errors="coerce")`	Convert to number, invalid → NaN	`pd.to_numeric(df["col"], errors="coerce")`
`df.loc[condition, "col"] = val`	Replace values matching a condition	`df.loc[df["Age"]>100, "Age"] = 30`
`df["col"].replace(old, new)`	Replace a specific value in a column	`df["Name"].replace("Boob","Bob")`
`df.duplicated()`	Detect duplicate rows (True/False)	`df.duplicated()`
`df.drop_duplicates()`	Remove duplicate rows	`df.drop_duplicates()`
`df.drop_duplicates(subset=["col"])`	Remove duplicates based on column	`df.drop_duplicates(subset=["Name"])`
`df.drop_duplicates(keep="last")`	Keep last occurrence of duplicates	`df.drop_duplicates(keep="last")`
`df.corr()`	Compute pairwise correlation of columns	`df.corr()`

Happy Learning! Pandas is a massive library — but once you’re comfortable with these fundamentals, everything else becomes much easier to pick up. Practice with real datasets (Kaggle has great free ones) and you’ll be analyzing data like a pro in no time.
