Pranitha Nekkanti Portfolio

YouTube Trending Data Pipeline and Analytics using AWS

Project Overview

This project processes and transforms YouTube trending data, focusing on automating data ingestion, transformation, and visualization using AWS services. The objective is to combine YouTube data in JSON and CSV formats, transform it into a query-friendly format (Parquet), and enable automated querying and visualization through AWS Athena and Amazon QuickSight.

Key Features:

Data Ingestion: Uploaded the YouTube dataset from Kaggle into an S3 bucket with folders for JSON data and region-specific CSV files.
Data Transformation: Combined data from CSV and JSON files using the common id and category_id columns, and converted them to Parquet format for efficient querying.
Data Processing with AWS:
- AWS Lambda: Used Python scripts to convert JSON files to Parquet.
- AWS Glue: Employed Glue's auto-generated Spark scripts to convert CSV files to Parquet.
- AWS Athena: Used for SQL querying and joining the cleansed Parquet datasets.
Automated Workflow: Set up Lambda triggers to automatically execute the ETL pipeline when new data is added to S3.
Visualization: Visualized the cleansed data using Amazon QuickSight, creating interactive dashboards and visual reports.

Outcomes and Benefits:

Efficient Data Transformation: Parquet format provided faster queries and reduced storage costs, improving the overall performance of data processing and analysis.
Automated ETL Pipeline: Lambda triggers automated the processing of new datasets, ensuring real-time availability for querying.
Improved Data Access: AWS Athena allowed seamless querying of large datasets stored in S3 without needing an extensive ETL infrastructure.
Actionable Insights: QuickSight dashboards provided real-time insights into YouTube trending data, facilitating better data-driven decision-making.

Technical Details:

Dataset: YouTube trending dataset from Kaggle, stored as JSON and region-specific CSV files in an S3 bucket.
Data Transformation:
- AWS Lambda: Used to convert JSON files to Parquet format.
- AWS Glue: Employed Glue-generated Python Spark scripts to convert CSV files to Parquet.
Joining Process: The Parquet outputs from JSON and CSV conversions were joined in AWS Athena using common keys (id, category_id), with type casting to ensure compatibility.
Visualization: Amazon QuickSight was used for building visual dashboards based on the processed data, enabling users to explore insights interactively.

Conclusion:

This project successfully automated the process of ingesting, transforming, and visualizing YouTube trending data using AWS services. By converting large datasets into Parquet format, it enabled faster querying in Athena and real-time visualization in QuickSight. The combination of AWS Glue, Lambda, and S3 provided a scalable and efficient infrastructure, making it easy to manage and process future datasets. Future enhancements could involve integrating machine learning models to predict trends or further optimizing the data pipeline for more complex analytics.

IoT-based Health Monitoring for Elderly People

Project Overview

Led a team of four to develop an innovative hardware model for comprehensive health monitoring of elderly individuals. The system utilized IoT technology to provide real-time health data and enhance remote care capabilities.

Key Achievements:

Designed and implemented a system to monitor vital health parameters including temperature, blood pressure, heart rate, ECG, and body posture.
Integrated Arduino for sensor interfacing, Raspberry Pi for data processing, and Detectron2 for advanced posture analysis.
Incorporated ThingSpeak Cloud for real-time data visualization, analysis, and alert generation.
Developed a user-friendly mobile application for caregivers to access health data and receive alerts.
Implemented GSM module for secure data transmission, ensuring patient privacy.
Created a scalable architecture capable of continuous long-term monitoring.

Technologies Used:

Arduino, Raspberry Pi, Detectron2, ThingSpeak Cloud, IoT sensors, GSM module

Impact:

Awarded best hardware model by the department, recognizing its innovation and practical application.
Demonstrated potential for significant cost reduction in elderly care by enabling remote monitoring and early intervention.
Improved quality of life for elderly individuals by allowing them to stay in their homes while receiving continuous health monitoring.

This project showcases my ability to lead a team, integrate complex technologies, and develop practical solutions that address real-world healthcare challenges using IoT and cloud technologies.

SF Automation

Project Overview

SF Automation is a comprehensive project designed to revolutionize the hiring process by automating candidate management and communication. This solution focuses on enhancing efficiency, accuracy, and speed in the recruitment workflow, particularly for handling large volumes of candidate data. The project ensures a seamless and country-specific hiring process, adhering to unique requirements and regulations of different regions.

Key Features:

Bulk Data Handling: The system allows for the bulk entry of candidate details, equipped with robust validation mechanisms to ensure data accuracy and integrity.
Automated Emails: At every stage of the hiring process, automated emails are sent to candidates, keeping them informed and engaged.
Country-Specific Processes: The hiring process is tailored to meet the specific requirements and regulations of different countries, ensuring compliance and relevance.
Approval and Verification: The solution includes a structured workflow for approvals and verifications, making sure all necessary checks are in place before proceeding.
Tracking and Management: The system tracks all stages of the hiring process, providing real-time updates and insights.

Outcomes and Benefits:

Efficiency and Speed: The fully automated process significantly reduces the turnaround time (TAT) of hiring, allowing for quicker onboarding of candidates.
Accuracy and Compliance: With automated validations and country-specific processes, the system ensures high accuracy and compliance with local laws and regulations.
Enhanced Communication: Automated emails at each stage of the process keep candidates informed, improving their experience and engagement.
Scalability: The solution is designed to handle bulk uploads and manage large volumes of candidate data effortlessly.
Approval Workflow: Streamlined approval processes and verification steps ensure that all necessary checks are completed systematically.

Technical Details:

Technology Stack: The project utilizes a robust technology stack, including modern web frameworks, database management systems, and email automation tools.
Integration: Seamless integration with existing HR systems and databases ensures a smooth transition and adoption.
Security: Strong data security measures are implemented to protect sensitive candidate information.

Conclusion:

The SF Automation project has successfully transformed the hiring process, making it more efficient, accurate, and user-friendly. By automating various stages and ensuring compliance with country-specific regulations, this solution stands out as a critical tool for modern recruitment.

Innovative Aftermarket System: Seamless Data Migration and Processing

Project Overview

Innovative Aftermarket System is designed to facilitate the seamless migration of data between cloud and on-premises environments using Azure Data Factory (ADF). This project addresses the complexities of data migration, ensuring smooth, secure, and efficient transfer and management of data. Additionally, it focuses on preprocessing data to ensure its quality and usability.

Key Features:

Data Migration: Efficient migration of data between cloud and on-premises systems using Azure Data Factory, ensuring minimal disruption and data integrity.
Pipeline Construction: Building robust pipelines in ADF to automate the data migration process, enabling scheduled and real-time data transfers.
Data Preprocessing: Cleaning and preprocessing data within Azure Data Factory to ensure data quality. This includes handling missing values and ensuring the data is ready for use.
Incremental Data Addition: Pipelines in ADF are designed to handle incremental data updates, ensuring that new and modified data is accurately captured and transferred.
Backend Database: Using SQL Server Management Studio (SSMS) as the backend database to store and manage the migrated data, providing a reliable and scalable storage solution.

Outcomes and Benefits:

Efficient Data Migration: Streamlined migration process reduces downtime and ensures data integrity during the transfer between cloud and on-premises environments.
Improved Data Quality: Preprocessing steps like cleaning and adding missing values ensure that the data is accurate, complete, and ready for analysis or operational use.
Automated Workflows: Automated pipelines reduce manual intervention, minimizing errors and increasing efficiency in data management tasks.
Scalability: The system is designed to handle large volumes of data and scale with the growing needs of the organization.
Real-Time Updates: Incremental data addition ensures that the system is always up-to-date with the latest data changes, supporting real-time decision-making.

Technical Details:

Technology Stack: The project utilizes Azure Data Factory for data migration and pipeline construction, along with SQL Server Management Studio (SSMS) for backend database management.
Data Processing: Data cleaning, handling missing values, and other preprocessing steps are performed within Azure Data Factory before data storage.
Integration: Seamless integration with existing data systems and workflows ensures minimal disruption and ease of adoption.
Security: Robust security measures are implemented to protect data during migration and storage, adhering to best practices and compliance standards.

Conclusion:

The Innovative Aftermarket System project effectively addresses the challenges of data migration between cloud and on-premises environments. By leveraging Azure Data Factory and SSMS, the project ensures efficient, secure, and accurate data transfer and management. The inclusion of data preprocessing steps further enhances the quality and usability of the data.

IPL Performance Analyzer

Project Overview

This project involved creating an interactive dashboard using IPL (Indian Premier League) cricket match data from 2008 to 2020. The dashboard visualizes various statistics and insights from IPL matches, providing a comprehensive overview of team and player performances, match outcomes, and other key metrics.

Key Features:

Visualizations of most runs scored by batsmen and most wickets taken by bowlers over the years.
Team performance analysis, including teams with the most wins.
Match venue statistics and toss decision impacts.
Player performance metrics, including Man of the Match awards.
Interactive filters allowing users to explore data by year, player, and other parameters.
Multiple chart types including bar charts, bubble charts, and maps to represent different data aspects

Outcomes and Benefits:

Comprehensive data analysis: The dashboard provides in-depth insights into IPL match statistics, player performances, and team strategies.
User-friendly interface: Interactive elements allow various stakeholders to easily explore and analyze the data.
Multi-purpose tool: The dashboard caters to diverse users including team owners, sports analysts, media, betting companies, fantasy sports enthusiasts, and fans.
Data-driven decision making: Coaches and team management can use the insights for strategic planning and player selection.
Enhanced fan engagement: Cricket enthusiasts can use the dashboard to track their favorite teams and players, enhancing their IPL experience.

Technical Details:

Technical Stack: Utilizes Tableau for data visualization, with CSV datasets and pre-processing within Tableau.
Dataset Details: Two CSV files containing 816 match observations and 193,468 ball-by-ball records, linked by unique match IDs.
Visualization Techniques: Employs bar charts, bubble charts, maps, and color intensity charts for effective data representation.
Interactive Features: Includes date filters, player parameters, and name filters for customized data exploration.
Dashboard Components: Displays match statistics, player performances, team analyses, and toss impact across multiple chart types.
Potential Users: Caters to team owners, sports analysts, media, betting companies, fantasy sports enthusiasts, and cricket fans.

Conclusion:

This IPL Dashboard project demonstrates strong data visualization and analysis skills using Tableau. It showcases the ability to transform complex sports data into an intuitive, interactive tool that provides valuable insights for various stakeholders in the cricket industry. The project highlights proficiency in data preprocessing, dashboard design, and creating user-centric data visualization solutions.

Handwritten Digit Analysis using Naive Bayes Classification for digit 7 and 8

Project Overview

This project aimed to classify handwritten digits "7" and "8" from the MNIST dataset using a Naive Bayes classifier. By leveraging the mean and standard deviation of pixel values as features, the classifier's performance was assessed on a test set, with the final goal of achieving high classification accuracy.

Key Features:

Feature Extraction: Utilized mean pixel value and standard deviation of pixel values for each image.
Visualization: Created scatter plots to illustrate the distribution of features for digits "7" and "8."
Parameter Estimation: Estimated parameters of the 2-D normal distribution for each digit using Maximum Likelihood Estimation (MLE).
Classification: Employed the Naive Bayes classifier to categorize test samples based on calculated probabilities.

Outcomes and Benefits:

Classification Accuracy: Achieved a final accuracy of 82.14% on the test set.
Performance Metrics: Demonstrated that the Naive Bayes classifier effectively utilized the mean and standard deviation of pixel values to distinguish between digits "7" and "8."

Technical Details:

Dataset: MNIST dataset with images of digits "7" (6265 training, 1028 testing) and "8" (5851 training, 974 testing).
Feature Extraction Methods: Mean pixel value and standard deviation of pixel values.
Parameter Estimation: Mean vector (μ) and covariance matrix (Σ) calculated using Maximum Likelihood Estimation (MLE).
- Digit "7": μ_7 = [44.2168, 71.8449], Σ_7 = [[938.9635, 1283.1142], [1283.1142, 1861.3919]]
- Digit "8": μ_8 = [19.3797, 52.9172], Σ_8 = [[157.1901, 385.0217], [385.0217, 962.8436]]
Classifier: Naive Bayes classifier using the probability density function (PDF) of a 2-D normal distribution.

Conclusion:

The Naive Bayes classifier successfully classified digits "7" and "8" from the MNIST dataset with an accuracy of 82.14% using basic features of mean and standard deviation of pixel values. While the results are promising, there is room for improvement by incorporating additional features or exploring more sophisticated classification techniques to enhance accuracy further.

K-Means Clustering: A Comparative Study of Initialization Strategies

Project Overview

This project explores two different initialization techniques for the k-means clustering algorithm on a 2-D dataset. The aim was to evaluate how these initialization strategies impact the clustering performance and the objective function values, which measure the quality of clustering.

Key Features:

Initialization Strategies: Compared Random Initialization and Max Distance Initialization.
Objective Function Analysis: Evaluated how each strategy affects the within-cluster variance.
Cluster Evaluation: Assessed the clustering quality with varying numbers of clusters (2–10).
Performance Metrics: Measured and compared the total squared distances between data points and cluster centers.

Outcomes and Benefits:

Improved Clustering: Max Distance Initialization consistently produced lower objective function values, indicating better cluster compactness and quality.
Faster Convergence: Strategy 2 demonstrated quicker convergence and superior local optima compared to Random Initialization.
Enhanced Initialization: Max Distance Initialization provides a more systematic approach to selecting initial cluster centers, improving clustering performance.
Reduced Variability: Strategy 2 showed less variation in clustering results across different runs, offering more consistent outcomes.
Better Coverage: Max Distance Initialization helps in capturing a broader range of data distribution, leading to more representative clusters.

Technical Details:

Dataset: A 2-D dataset with 300 examples and two features.
Initialization Techniques: Random Initialization and Max Distance Initialization.
Objective Function: Calculated by summing the squared distances between each data point and the center of its assigned cluster, aiming to minimize this value for better clustering quality.

Conclusion:

The project demonstrated that Max Distance Initialization significantly improves clustering performance compared to Random Initialization. By providing better cluster compactness and faster convergence, it offers a more effective method for initializing cluster centers in the k-means algorithm. Future work could explore additional initialization strategies to further enhance clustering results.

About Me

Education

Master of Science in Information Technology

Bachelor of Technology in Computer Science and Engineering

Professional Experience

Company: PricewaterhouseCoopers (PwC)

Project 1: SF Automation

Project 2: Innovative Aftermarket System (IAS)

Company: VERZEO

Full-stack Web Developer - Intern

Technical Skills

Programming/Scripting Languages:

Technologies:

Operating Systems:

Projects

YouTube Trending Data Pipeline and Analytics using AWS

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

IoT-based Health Monitoring for Elderly People

Project Overview

Key Achievements:

Technologies Used:

Impact:

SF Automation

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

Innovative Aftermarket System: Seamless Data Migration and Processing

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

IPL Performance Analyzer

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

Handwritten Digit Analysis using Naive Bayes Classification for digit 7 and 8

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

K-Means Clustering: A Comparative Study of Initialization Strategies

Project Overview

Key Features:

Outcomes and Benefits:

Technical Details:

Conclusion:

Certifications

Resume

Contact