Welcome to my Applied AI repository. This collection of projects demonstrates my approach to solving complex machine learning challenges across different domains, from time-series forecasting of physiological data to computer vision and natural language processing.
Each project prioritizes rigorous data handling, thoughtful feature engineering, and algorithm optimization over plug-and-play solutions.
Notebook: heartrate_forecast.ipynb
This project tackles the difficult task of forecasting a patient's heart rate 20 minutes ahead using only 226 minutes of wearable-sensor data. The raw data presented significant ethical and technical challenges, including severe sensor errors and missing oximeter readings.
* Data Cleaning: Impossible values were replaced with NaNs and backfilled using oximeter pulse data as a biologically consistent proxy, while remaining gaps were filled with a 5-minute cubic spline interpolation.
* Feature Engineering: Constructed 36 features across 9 clinically driven groups, eventually using Mutual Information to select the top 12 features for the K-Nearest Neighbor (KNN) model.
* Optimization: The most impactful optimization was restricting the KNN model's memory to a localized window of the most recent 45 minutes of patient history, preventing older, differing physiological states from contaminating the prediction.
* Performance: Achieved a highly accurate, physiologically plausible forecast with an RMSE of 5.36.
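The localized-window idea can be illustrated with a minimal, standard-library sketch. It is not the notebook's implementation: the function name `knn_forecast`, the lag-vector features, and the parameters `n_lags` and `k` are hypothetical, and it assumes one sample per minute and that the 45-minute restriction applies to where a training target falls.

```python
import math


def knn_forecast(series, horizon=20, n_lags=5, window=45, k=3):
    """Forecast the value `horizon` steps after the end of `series`.

    Assumption: a training example is a block of `n_lags` consecutive
    values, and its target is the value `horizon` steps after the last
    lag. Only examples whose target lies in the most recent `window`
    samples are eligible neighbours.
    """
    n = len(series)
    query = series[n - n_lags:]
    # Earliest usable target index: inside the window, and late enough
    # that a full lag block fits in front of it.
    first_target = max(n - window, n_lags - 1 + horizon)
    candidates = []
    for target_idx in range(first_target, n):
        end = target_idx - horizon + 1  # exclusive end of the lag block
        lags = series[end - n_lags:end]
        candidates.append((math.dist(lags, query), series[target_idx]))
    if not candidates:
        raise ValueError("not enough history for the requested window")
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    # Plain (unweighted) average of the k nearest targets.
    return sum(target for _, target in nearest) / len(nearest)


# Synthetic stand-in for 226 one-minute samples (not real patient data).
history = [70 + 5 * math.sin(t / 10) for t in range(226)]
print(round(knn_forecast(history), 2))
```

Setting `window` to the full series length would recover an unrestricted KNN, which makes it easy to compare the two settings on the same data.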
Notebook: imgprocessing.ipynb
Can traditional machine learning models tell a cat from a dog without relying on CNNs to learn the features automatically? Using a dataset of 10,000 evenly split images, I set a strict target of achieving at least 80% on both accuracy and F1 score.
* Feature Engineering: Generated 10,221 hand-crafted features spanning 8 feature families, including HOG for shape silhouettes, LBP and Gabor for fur texture, and Hu moments for pose-invariant geometry.
* The PCA Bottleneck: Initial model baselines failed to reach the 80% mark. An investigation revealed that PCA compression rotated the feature space, so each component became a mixture of many original features. This conflicted with the axis-aligned splits that tree-based algorithms rely on.
* Optimization: Discarding PCA and feeding the raw high-dimensional features directly into the tree-based models drastically improved performance.
* Performance: The improved XGBoost model exceeded the project goal, delivering an accuracy of 0.8033 and an F1 Macro of 0.8032.
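The conflict between a rotated feature space and axis-aligned splits can be seen on a toy problem. The sketch below is an illustration only, not the notebook's experiment: it uses a plain 45-degree rotation as a stand-in for PCA, a single-split "stump" as a stand-in for a tree, and a synthetic 2-D grid. The helper names `best_stump_accuracy` and `rotate` are hypothetical.

```python
import math


def best_stump_accuracy(points, labels):
    """Accuracy of the best single axis-aligned threshold split."""
    best = 0.0
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            hits = sum((p[f] > t) == bool(y) for p, y in zip(points, labels))
            # The stump may call either side of the split the positive class.
            best = max(best, max(hits, len(points) - hits) / len(points))
    return best


def rotate(points, degrees):
    """Rotate 2-D points about the origin."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Toy grid: the class depends only on the sign of the first coordinate.
points = [(float(i), float(j)) for i in range(-5, 6) if i != 0 for j in range(-5, 6)]
labels = [1 if x > 0 else 0 for x, _ in points]

acc_raw = best_stump_accuracy(points, labels)
acc_rotated = best_stump_accuracy(rotate(points, 45), labels)
print(f"raw axes: {acc_raw:.3f}, rotated 45 degrees: {acc_rotated:.3f}")
```

On the raw axes one split separates the classes perfectly, while after rotation no single split can. A deeper tree can approximate the now-diagonal boundary with a staircase of splits, but it spends many more splits doing so.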
Notebooks: tweet_classification.ipynb and Deep_Learning_tweet_sentiment_analysis.ipynb
This natural language processing challenge involved classifying 13,240 tweets into three categories: Not Offensive (NOT), Targeted Insult (TIN), and Untargeted Insult (UNT). The primary hurdle was extreme class imbalance, as the UNT category contained only 524 instances.
* Preprocessing & Feature Extraction: Utilized a combination of de-censoring, WordNet lemmatization, and entity abstraction to restore masked profanity and create a stable vocabulary.
* Matrix Construction: The most effective approach was a "kitchen-sink" matrix of 278 dimensions, combining TF-IDF, GloVe embeddings, and 28 specifically engineered signals.
* Two-Stage Pipeline: Reframed the classification task into a two-stage binary pipeline to isolate the difficult UNT class. An XGBoost "gatekeeper" first separated NOT tweets from Offensive tweets. Then, a specialized CatBoost classifier split TIN from UNT using targeted class-weight optimization.
* Performance: The final cascade model achieved an accuracy of 0.74 and an F1 Macro of 0.57.
* Deep Learning Cascade Architecture: Upgraded the pipeline to a hierarchical deep learning model to better capture semantic nuance. A RoBERTaTwitter "Gatekeeper" classifier handles the initial NOT vs. OFFENSIVE split, while a specialized SupConTweetClassifier (leveraging Supervised Contrastive Learning and BERTweet) handles only the OFFENSIVE subset, separating the difficult UNT class from TIN.
* Advanced Optimization: The deep learning pipeline tunes the decision thresholds (the specialist threshold was set to 0.80) and explores hybrid architectures that concatenate the 28 engineered features directly with the BERT embeddings to maximize the final F1 Macro score.
* Performance: The final deep learning cascade model achieved an accuracy of 0.81 and an F1 Macro of 0.73.
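The routing logic shared by both cascades, together with the macro-averaged F1 used to score them, can be sketched in plain Python. This is a hedged illustration rather than the notebooks' code: the function names are hypothetical, the gatekeeper threshold of 0.5 is an assumption, and so is the reading that the 0.80 specialist threshold applies to the predicted UNT probability.

```python
def cascade_predict(p_offensive, p_unt, gate_threshold=0.5, specialist_threshold=0.80):
    """Route one tweet through a two-stage cascade.

    p_offensive: gatekeeper's probability that the tweet is offensive.
    p_unt: specialist's probability that an offensive tweet is untargeted.
    """
    if p_offensive < gate_threshold:
        return "NOT"
    # Assumption: UNT is predicted only when the specialist is confident.
    return "UNT" if p_unt >= specialist_threshold else "TIN"


def macro_f1(y_true, y_pred, classes=("NOT", "TIN", "UNT")):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def tune_specialist_threshold(y_true, p_off, p_unt, grid):
    """Pick the specialist threshold from `grid` that maximizes macro F1."""
    def score(threshold):
        preds = [
            cascade_predict(a, b, specialist_threshold=threshold)
            for a, b in zip(p_off, p_unt)
        ]
        return macro_f1(y_true, preds)
    return max(grid, key=score)
```

Because macro F1 weights the three classes equally, missing the small UNT class costs as much as missing the large NOT class. That explains how accuracy can sit well above F1 Macro in the results reported here, and why tuning the specialist threshold is worthwhile. In practice the threshold would be tuned on a held-out validation split rather than on the test set.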