Implementation of a CSV File

  !pip install Faker


import pandas as pd
import numpy as np

from faker import Faker
import random

# Initialize Faker to generate fake data
fake = Faker()

# Define the number of student records to generate
num_students = 1000

# Generate synthetic data for student records
data = {
    'Student_ID': [fake.unique.random_number(digits=8) for _ in range(num_students)],
    'Age': [random.randint(18, 25) for _ in range(num_students)],
    'Gender': [random.choice(['Male', 'Female']) for _ in range(num_students)],
    'Ethnicity': [fake.random_element(elements=('White', 'Black', 'Asian', 'Hispanic'))
                  for _ in range(num_students)],
    'Parental_Education': [fake.random_element(elements=('High School', 'Bachelor', 'Master', 'PhD'))
                           for _ in range(num_students)],
    'Household_Income': [random.randint(20000, 100000) for _ in range(num_students)],
    'Previous_GPA': [round(random.uniform(2.0, 4.0), 2) for _ in range(num_students)],
    'Attendance_Rate': [round(random.uniform(0.5, 1.0), 2) for _ in range(num_students)],
    'Study_Hours_Per_Week': [random.randint(5, 40) for _ in range(num_students)],
    'Participation_Score': [random.randint(0, 100) for _ in range(num_students)],
    'Midterm_Grade': [round(random.uniform(50, 100), 2) for _ in range(num_students)],
    'Final_Grade': [round(random.uniform(50, 100), 2) for _ in range(num_students)]
}

# We create a DataFrame from the generated data
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('academic_performance_data.csv', index=False)

print("Data generated and saved successfully!")


Code Explanation:


1. Imports: We import pandas for data manipulation, numpy for numerical operations, and Faker together with Python's random module for generating synthetic data.

2. Generate Synthetic Data: We use the Faker library together with Python's random module to generate synthetic academic records. Each record includes a student ID, age, gender, ethnicity, parental education, household income, previous GPA, attendance rate, study hours per week, participation score, midterm grade, and final grade.

3. Create DataFrame: We create a pandas DataFrame from the generated data; the dictionary keys become the column names.

4. Save DataFrame to CSV: Finally, we save the DataFrame to a CSV file named academic_performance_data.csv for further analysis and modeling.
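
One practical note: Faker and random both draw from pseudo-random generators, so the CSV changes on every run. If reproducible output is desired, a minimal sketch (an optional addition, not in the original script) is to seed both generators before building the data dictionary:

from faker import Faker
import random

# Optional reproducibility step (not in the original script):
# fixing both seeds makes every run produce the same records.
Faker.seed(42)      # seeds the shared generator used by all Faker instances
random.seed(42)     # seeds Python's random module used for the numeric columns

fake = Faker()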




import pandas as pd

# Load the data into a DataFrame
data = pd.read_csv('academic_performance_data.csv')

# Display the first few rows of the DataFrame to inspect the data
print(data.head())


Importing the Pandas Library: The first line of the code imports the pandas library and aliases it as pd. This alias is a standard convention in the Python community, making the code shorter and more consistent with common practice. pandas provides the DataFrame, a powerful tool for data manipulation and analysis.

Loading Data into a DataFrame: Here, the read_csv() function is used to read a CSV file into a DataFrame. A CSV (Comma-Separated Values) file is a type of plain text file that uses specific structuring to lay out data. The read_csv() function is versatile and can handle different formats by adjusting its parameters, but in its simplest form, it just needs the filename. The resulting DataFrame, data, is a two-dimensional labeled data structure with columns potentially of different types.

Inspecting the First Few Rows: This line displays the first five rows of the DataFrame using the head() method. By default, head() returns the first five rows, which helps in quickly checking whether the data has been loaded as expected, and gives a glimpse into the structure of the dataset (column names, data types, and a few values).
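
Beyond head(), a quick structural check (an optional follow-up, not in the original post) shows the column dtypes, non-null counts, and overall dimensions:

# Inspect column names, dtypes, and non-null counts; info() prints its report directly
data.info()

# Overall dimensions of the dataset
print("Shape (rows, columns):", data.shape)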

# Load the data into a DataFrame
data = pd.read_csv('academic_performance_data.csv')

# Display the first few rows of the DataFrame to inspect the data
print(data.head())

# Check unique values in columns with potential issues
print("Unique values in 'Attendance_Rate':\n", data['Attendance_Rate'].unique())
print("\nUnique values in 'Participation_Score':\n", data['Participation_Score']
.unique())
print("\nUnique values in 'Midterm_Grade':\n", data['Midterm_Grade'].unique())
print("\nUnique values in 'Final_Grade':\n", data['Final_Grade'].unique())


Loading Data into a DataFrame:
Purpose: This line of code loads a CSV file named 'academic_performance_data.csv' into a pandas DataFrame called data.
Process: The read_csv() function from the pandas library (pd) is used here. It reads the data from the CSV file and converts it into a structured format with rows and columns, making it easier to handle and analyze.

Displaying the First Few Rows:
Purpose: The purpose of this line is to print the first five rows of the DataFrame. This helps in quickly checking the structure of the data (e.g., column headers, types of data, and example values).
Functionality: data.head() fetches the first five rows of the DataFrame. When enclosed in print(), it displays this slice of the dataset in the console.


Checking Unique Values in Specific Columns:
Purpose: These lines are used to print the unique values found in four specific columns: 'Attendance_Rate', 'Participation_Score', 'Midterm_Grade', and 'Final_Grade'. Understanding unique values helps identify the range of data, detect any anomalies or irregularities, and confirm data consistency.
Functionality:
data['Column_Name'].unique(): This method returns an array of the unique values in the specified column ('Column_Name') of the DataFrame. This is crucial for data cleaning and validation, ensuring that entries are consistent and within expected limits.
print(): Outputs the unique values list along with a descriptive message to the console, providing an immediate visual check for unusual or incorrect values.
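
Scanning hundreds of raw unique values can be unwieldy; a compact complement (an optional sketch, not in the original post) is describe(), which summarizes the min, max, mean, and quartiles of each numeric column in one table:

# Summary statistics for the columns being checked; out-of-range values
# (e.g., an attendance rate above 1.0) would show up in the min/max rows.
print(data[['Attendance_Rate', 'Participation_Score',
            'Midterm_Grade', 'Final_Grade']].describe())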


# Check for missing values
missing_values = data.isnull().sum()
print("Missing values:\n", missing_values)

Check for Missing Values:
Purpose: This line of code calculates the number of missing (or null) values in each column of the DataFrame data.
Functionality:
data.isnull(): This function returns a DataFrame where each cell is either True or False, depending on whether the corresponding cell in data is null (i.e., NaN in numerical data or None/empty in object data).
.sum(): This function adds up the True values across each column (since True is treated as 1 and False as 0). The result is a Series where each value represents the total count of missing values in each column.

Print the Missing Values:
Purpose: This prints the counts of missing values for each column, as calculated in the previous step.
Process: By passing the Series missing_values to the print() function along with a descriptive message, you ensure clear output to the console, which makes it easier to quickly understand where data might be incomplete.
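
Since the generator above fills every field, all counts should be zero here. For a real dataset that did contain gaps, two common remedies might look like this (a hypothetical clean-up step, not part of the original pipeline):

# Option 1: drop any row that has at least one missing value
data_clean = data.dropna()

# Option 2: fill numeric gaps with each column's median instead of dropping rows
data_filled = data.fillna(data.median(numeric_only=True))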


# Perform one-hot encoding for categorical variables
data_encoded = pd.get_dummies(data, columns=['Gender', 'Ethnicity', 'Parental_Education'])

# Display the encoded data
print("Encoded data:\n", data_encoded.head())

Perform One-Hot Encoding:
Purpose: This line converts categorical column values into one-hot encoded vectors. It is crucial for preparing categorical data for machine learning models, which typically require numerical input.
Functionality:
pd.get_dummies(): This function is used for converting categorical variable(s) into dummy/indicator variables. It creates new columns for each unique value in the specified categorical columns ('Gender', 'Ethnicity', 'Parental_Education').
columns Parameter: This specifies the list of column names to encode. Only the columns listed here will be transformed, and each unique value in these columns will get its own column in the resulting DataFrame.
Result: The data_encoded DataFrame will have the original columns minus the ones specified, plus additional columns for each unique category in the specified columns, filled with binary values (0 or 1).

Display the Encoded Data:
Purpose: This prints the first five rows of the newly created data_encoded DataFrame to verify the successful application of one-hot encoding.
Functionality:
data_encoded.head(): Fetches the first five rows of the data_encoded DataFrame. This is used to inspect the DataFrame and ensure that the one-hot encoding was applied as expected.
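
To make the transformation concrete, here is a tiny standalone illustration (not part of the pipeline) of what get_dummies does to a single categorical column:

# A three-row toy frame with one categorical column
demo = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})

# Expands into Gender_Female and Gender_Male indicator columns
# (shown as 0/1 or True/False depending on the pandas version)
print(pd.get_dummies(demo, columns=['Gender']))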

from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Select numerical features to scale
numerical_features = ['Age', 'Household_Income', 'Previous_GPA', 'Attendance_Rate',
                      'Study_Hours_Per_Week', 'Participation_Score', 'Midterm_Grade']

# Scale numerical features
data_encoded[numerical_features] = scaler.fit_transform(data_encoded[numerical_features])

# Display the scaled data
print("Scaled data:\n", data_encoded.head())

Importing the MinMaxScaler:
Purpose: Imports the MinMaxScaler class from the sklearn.preprocessing module, which is designed to scale each feature to a given range, often [0, 1].
Functionality: MinMaxScaler transforms features by scaling each feature to a specified range. This is achieved by subtracting the minimum value of the feature and then dividing by the range (maximum minus minimum).
Initialize MinMaxScaler:
Initialization: Creates an instance of MinMaxScaler. This instance will be used to scale numerical data. By default, it scales data to the range [0, 1].
Select Numerical Features to Scale:
Feature Selection: Specifies the columns in the DataFrame that contain numerical data suitable for scaling. These features are selected based on their nature (e.g., continuous variables that benefit from normalization).
Scale Numerical Features:
Purpose: Applies the MinMax scaling to the selected numerical features of the data_encoded DataFrame.
Functionality:
fit_transform(): This method first fits the scaler to the data, which involves calculating the minimum and maximum values of the data. It then transforms the data by scaling it to the default range [0, 1].
Assignment: The scaled data replaces the original values of the numerical features in data_encoded.
Display the Scaled Data:
Purpose: Prints the first five rows of the scaled DataFrame to visually confirm that the scaling operation was performed correctly.
Functionality: The head() method retrieves the first five rows of the DataFrame, providing a quick snapshot of the transformed data.
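
The scaler's arithmetic can be verified by hand on one column, since MinMax scaling is simply (x - min) / (max - min). One caveat worth flagging: fitting the scaler on the full dataset before the train/test split lets the test rows influence the scaling parameters; a stricter pipeline (a common refinement, not done in this post) fits the scaler on the training split only and then applies the same transform to the test split. A quick sketch of the manual check:

# Reproduce the scaler by hand on the original, unscaled 'Age' column
col = data['Age']
manual = (col - col.min()) / (col.max() - col.min())

# These two should match element for element
print(manual.head())
print(data_encoded['Age'].head())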


from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = data_encoded.drop('Final_Grade', axis=1)  # Features (all columns except 'Final_Grade')
y = data_encoded['Final_Grade']  # Target variable ('Final_Grade')

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("Training set - X:", X_train.shape, "y:", y_train.shape)
print("Testing set - X:", X_test.shape, "y:", y_test.shape)


Importing train_test_split:
As the first line of the code block shows, this function is imported from scikit-learn's model_selection module. It is a vital tool in machine learning for evaluating how a model performs on data it has not seen during training.

Define Features (X) and Target Variable (y)
Features (X): This line creates a new DataFrame X by dropping the Final_Grade column from the data_encoded DataFrame. axis=1 indicates that a column (not a row) is to be dropped. This DataFrame X includes all columns that will be used as input variables for a machine learning model.
Target Variable (y): This line sets the Final_Grade column as the target variable, stored in y, which the model will predict.

Split the Data into Training and Testing Sets:
This function call splits the features X and the target y into training and testing sets.
test_size=0.2 specifies that 20% of the data should be reserved for the test set, while the remaining 80% is used for training the model.
random_state=42 ensures that the split is reproducible; the same random state means that the split will be the same each time the code is run, which is important for debugging and comparing model performance across different runs.
Display the Shapes of the Training and Testing Sets:
This prints the shapes of the training and testing datasets.
X_train.shape and y_train.shape show the dimensions of the training datasets for the features and the target variable, respectively.
X_test.shape and y_test.shape do the same for the testing datasets.
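
A small optional check (not in the original post) makes both properties concrete: the two splits should add up to the full dataset, and re-running with the same random_state should select exactly the same rows:

# The split partitions the data: 800 + 200 = 1000 rows
assert len(X_train) + len(X_test) == len(X)

# Repeating the split with the same seed yields identical row selections
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)
assert X_train.index.equals(X_train2.index)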



from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)

Import Necessary Libraries:
LinearRegression: This class from sklearn.linear_model is used to perform linear regression.
mean_squared_error, r2_score: These functions from sklearn.metrics are used to evaluate the performance of the regression model. mean_squared_error measures the average of the squared differences between the estimated values and the actual values. r2_score provides the coefficient of determination, a statistical measure of how well the observed outcomes are replicated by the model.

Initialize the Linear Regression Model:
A LinearRegression instance is created and stored in the variable model. This instance will be used to configure, train, and make predictions.
Train the Model on the Training Data:
fit() method: This method fits the linear model to the training data. It takes the feature matrix X_train and the target vector y_train from the training dataset and finds the coefficients (weights) that minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
Make Predictions on the Testing Data:
predict() method: This method makes predictions using the linear model fitted above. It takes the feature matrix X_test from the testing dataset and predicts the target values y_pred.
Evaluate the Model:
Mean Squared Error (MSE): Calculates the mean squared error of the predictions, providing a measure of the quality of the estimator. It's the average of the squares of the differences between actual and estimated values.
R-squared (R²) Score: Computes the coefficient of determination, the proportion of the total variation in the outcomes that the model explains. A higher R² indicates a better fit and suggests that unseen samples are likely to be predicted well.
These print statements display the computed MSE and R² score, providing quick feedback on the model's accuracy and fit.
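
Because MSE is expressed in squared grade units, a useful follow-up (a small addition, not in the original post) is its square root, which puts the error back on the same scale as the grades themselves:

import numpy as np

# RMSE: the typical size of a prediction error, in grade points
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)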


Trying Different Algorithms:


from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest Regression': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regression': GradientBoostingRegressor(random_state=42)
}

# Train and evaluate each model
for name, model in models.items():
    print("Training", name)
    # Train the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the testing data
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print("Mean Squared Error:", mse)
    print("R-squared Score:", r2)
    print("-----------------------")



Choosing Linear Regression:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Train the model on the training data
linear_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)

print("Linear Regression:")
print("Mean Squared Error:", mse_linear)
print("R-squared Score:", r2_linear)


Import Libraries:

LinearRegression: This class from sklearn.linear_model is used to create a linear regression model. It fits a linear model with coefficients to minimize the residual sum of squares between the observed and predicted values.
mean_squared_error, r2_score: These functions from sklearn.metrics are used to evaluate the performance of the model. The mean squared error (MSE) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual values. The R² score, or coefficient of determination, indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

Initialize and Train the Linear Regression Model:
Initialization: A LinearRegression object named linear_model is instantiated. This object will be used to perform the linear regression.
Training: The fit() method is used to train the linear regression model on the training data. Here, X_train contains the feature data and y_train contains the corresponding target values.
Make Predictions:
Prediction: After the model has been trained, the predict() method is used to make predictions on the test dataset (X_test). The predictions are stored in y_pred_linear, which contains the model's estimated values based on the test features.
Evaluate the Model:
Evaluation: The performance of the model is evaluated using two metrics:
Mean Squared Error (MSE): This calculates the average of the squares of the errors between the actual values (y_test) and the predicted values (y_pred_linear).
R-squared (R²) Score: This provides a measure of how well the variations in the dependent variable are explained by the independent variables in the model. A higher R² score indicates a better fit of the model to the data.
Print the Evaluation Results:
These lines print the evaluation results, displaying the MSE and R² score. This feedback helps to understand the effectiveness of the model in terms of prediction accuracy and how much of the variance in the dependent variable can be explained by the model.



# Get the coefficients of the linear regression model
coefficients = linear_model.coef_

# Create a DataFrame to display the coefficients along with the corresponding feature names
coefficients_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': coefficients
})

# Sort the coefficients by absolute value to identify the most influential features
coefficients_df['Absolute_Coefficient'] = abs(coefficients_df['Coefficient'])
coefficients_df = coefficients_df.sort_values(by='Absolute_Coefficient', ascending=False)

# Display the coefficients DataFrame
print("Coefficients of Linear Regression Model:")
print(coefficients_df)



Get the Coefficients of the Linear Regression Model:
linear_model.coef_: This attribute of the LinearRegression model contains the coefficients for the predictors used in the model. These coefficients quantify the relationship between each feature and the target variable, indicating how much the target variable is expected to increase when the feature increases by one unit, all else being equal.
Create a DataFrame to Display the Coefficients:
Data Frame Creation: A new DataFrame coefficients_df is created using pandas. It maps each feature name from X.columns (the names of the predictors) to its corresponding coefficient from the linear_model.
This format makes the coefficients more interpretable, as you can see which feature corresponds to which coefficient.
Sort the Coefficients by Absolute Value:
Adding Absolute Coefficient Column: This line calculates the absolute value of each coefficient and stores it in a new column Absolute_Coefficient. The absolute value is used because coefficients can be negative or positive, and we are often interested in the magnitude of the influence, regardless of the direction.
Sorting: The DataFrame is then sorted by Absolute_Coefficient in descending order. This sorting helps to quickly identify the features with the most significant impact on the target variable, whether positive or negative.
Display the Coefficients DataFrame:
These print statements are used to display the complete DataFrame with the coefficients, now sorted by their absolute values. This visualization aids in quickly assessing which features are the most influential in the model.
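
A visual companion to the sorted table (a hypothetical addition, not in the original post) is a horizontal bar chart of the absolute coefficients, which makes the ranking immediately apparent:

import matplotlib.pyplot as plt

# Bar chart of coefficient magnitudes, largest at the top
coefficients_df.plot.barh(x='Feature', y='Absolute_Coefficient',
                          figsize=(8, 6), legend=False)
plt.xlabel('Absolute coefficient')
plt.title('Feature Influence in the Linear Model')
plt.gca().invert_yaxis()  # the DataFrame is sorted descending, so flip the axis
plt.tight_layout()
plt.show()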



import matplotlib.pyplot as plt
import seaborn as sns

# Plot scatter plots for numerical features vs. final grade
numerical_features = ['Age', 'Household_Income', 'Previous_GPA', 'Attendance_Rate',
                      'Study_Hours_Per_Week', 'Participation_Score', 'Midterm_Grade']

plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i+1)
    sns.scatterplot(x=feature, y='Final_Grade', data=data_encoded)
    plt.title(f'{feature} vs. Final Grade')
plt.tight_layout()
plt.show()



Import Visualization Libraries:
matplotlib.pyplot: This module in the Matplotlib library provides a MATLAB-like interface for making plots and is commonly used for creating graphs and charts.
seaborn: Seaborn is a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.
Plot Scatter Plots for Numerical Features vs. Final Grade:
Numerical Features: This list contains the names of the features that are to be plotted against the target variable 'Final Grade'. Each of these features is presumed to be a continuous variable that could influence or correlate with the final grade.
Figure Size: Sets up the overall figure size for the scatter plot matrix. The size is 15x10 inches, providing enough space for each subplot to be visible and distinguishable.
Loop through Features: The enumerate function is used here to loop through the numerical_features list. i is the index, and feature is the element (feature name).
Subplot Setup: plt.subplot(3, 3, i+1) dynamically adds a subplot to the figure. The parameters (3, 3) suggest a grid of 3 rows and 3 columns, and i+1 places each subplot in the next position on this grid.
Scatter Plot: sns.scatterplot() is called with the x-axis set to the current feature and the y-axis set to 'Final Grade'. This plots the relationship between each feature and the final grade.
Title: Each subplot is given a title that reflects the data being plotted, aiding in quick identification of the plot during analysis.
Layout Adjustment: plt.tight_layout() adjusts the padding between and around subplots to make them fit well within the figure area.
Display Plot: plt.show() renders the plot. This function is necessary to display the plot when using Matplotlib in a script as opposed to a Jupyter notebook.


# Calculate correlation matrix
correlation_matrix = data_encoded[numerical_features + ['Final_Grade']].corr()

# Plot heatmap of correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()




Calculate the Correlation Matrix:
Correlation Matrix: This line of code calculates the correlation matrix for the selected features plus the target variable (Final_Grade). The correlation matrix is a table where the variables are shown on both rows and columns, and the cell values are the coefficients that measure the correlation between each pair of variables.
corr() Method: This Pandas DataFrame method computes the pairwise correlation of columns, excluding NA/null values. By default, it uses the Pearson correlation coefficient, but other methods like 'spearman' or 'kendall' can also be used if specified.
Plot Heatmap of Correlation Matrix:
Figure Setup: plt.figure(figsize=(10, 8)) sets up a figure object with the dimensions of 10 inches by 8 inches. This size is chosen to ensure that the heatmap is large enough to be easily readable.
Heatmap: sns.heatmap() is a Seaborn function used to plot the correlation matrix.
correlation_matrix: The correlation matrix calculated earlier is passed as the main data to plot.
annot=True: This argument enables annotations inside the squares of the heatmap, displaying the correlation coefficients.
cmap='coolwarm': Specifies the color map. 'Coolwarm' is a visually appealing, diverging color map well-suited for displaying standardized ranges with a midpoint; for correlation matrices, this highlights positive (warm) and negative (cool) correlations.
fmt=".2f": Controls the string formatting for annotations, showing numbers up to two decimal places.
linewidths=0.5: Sets the width of the lines that will divide each cell in the heatmap.
Title and Display: plt.title('Correlation Heatmap') adds a title to the heatmap. plt.show() is used to display the figure. This command ensures that the heatmap is rendered and visible in your output.
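
If the main question is simply which features track the final grade, a compact follow-up (an optional sketch, not in the original post) reads that single row off the matrix and sorts it:

# Correlations of every numerical feature with the target, strongest first
target_corr = correlation_matrix['Final_Grade'].drop('Final_Grade')
print(target_corr.sort_values(key=abs, ascending=False))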



