RudimentaryChef/FRED_NLP

FRED_NLP

As of 4/15/24, this is my current project.

To install all necessary packages, run this command:

pip install -r requirements.txt

Repository and File Information

  • code folder: Contains the Jupyter Notebook and all code files. Run the Fred_Runner file to use our BERT model on your data. Please contact adikrish6824@gmail.com for access to our weights file. Use the code in the Jupyter Notebook to generate your own model!

  • sampleData folder: Sample data. Due to FERPA, I cannot publish the real student survey data that we will train the model on. Instead, I used ChatGPT to generate the training data for this demo.

  • ReadME.MD file: This file

  • .ipynb_checkpoints folder: Checkpoints for my Jupyter Notebook file

Project Story

  • Large online courses have surveys that instructors need to sort through manually. This can take hours, and for certain MOOCs it may be borderline impossible.

  • Dr. Mayer, a professor I have done research with and my former Linear Algebra and Multivariable Calculus professor, brought this problem to me.

  • To reduce instructor effort in large courses, this project categorizes survey responses using machine learning, then either automates the appropriate instructor action (sending an email with a response to the FAQ) or flags responses that require instructor attention.

  • I am working with Dr. Mayer on this project.
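
The categorize-then-act flow described above can be sketched as a small routing step. This is only an illustration: the category names, FAQ replies, and the `route` helper are hypothetical, since the real label set and email logic are not listed in this README.

```python
from dataclasses import dataclass

# Hypothetical categories and FAQ replies (assumptions, not the project's real labels).
FAQ_REPLIES = {
    "logistics": "Exam dates and deadlines are posted on the course calendar.",
    "technical": "Please try the troubleshooting steps on the course help page.",
}
NO_ACTION = {"NC"}  # "no concern" responses need no follow-up


@dataclass
class Action:
    kind: str          # "email", "flag", or "none"
    message: str = ""


def route(category):
    """Map a predicted category to an instructor action."""
    if category in FAQ_REPLIES:
        return Action("email", FAQ_REPLIES[category])
    if category in NO_ACTION:
        return Action("none")
    # Concerns and anything unrecognized go to the instructor.
    return Action("flag", "Needs instructor attention")


for category in ["logistics", "NC", "concern"]:
    print(category, "->", route(category).kind)
```

Sending unrecognized categories to the instructor rather than dropping them is a deliberate choice in this sketch: a misclassified concern is more costly than an extra flag.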

Important Dates

  • 10/31/23: First Sprint Deadline (Minimum Viable Product) ✅

  • 11/6/23: Proposal Deadline ✅

  • 4/15/24 -> 4/16/24: University System of Georgia Education Conference ✅

Current Progress

  • Sprint 1 complete: basic three-pronged classifier created with TensorFlow. No major issues were detected in testing.

  • Sprint 2 complete: 4-pronged classifier that combines various flows. Sprint 2 issues: because the majority of the data is labeled NC (no concern), the model overfit and started categorizing every response as NC. Possible options to fix this:

    1. Organically improve my data by adding more responses that have concerns and fewer that are "no concern". (This would be the best fix for JUST this problem BUT LACKS generalization.)
    2. Data augmentation (maybe just cloning?). (I think this might be the best option right now as far as long-term expansion goes.)
    3. Changing to semi-supervised learning. (Not enough data.)
    4. Converting to a model that uses transfer learning instead. (Need to look into this more.)
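
The "cloning" idea in option 2 is usually called random oversampling. Below is a minimal sketch of it, assuming each row is a dict with a `label` key. The `oversample` helper and the toy rows are hypothetical and are not taken from the project's notebook.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, label_key="label", seed=0):
    """Clone minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data in which "NC" (no concern) dominates.
rows = (
    [{"text": f"all good {i}", "label": "NC"} for i in range(8)]
    + [{"text": "I can't access the quiz", "label": "technical"}]
    + [{"text": "When is the exam?", "label": "logistics"}] * 2
)
balanced = oversample(rows)
print(Counter(row["label"] for row in balanced))
```

After balancing, each class in the toy data has 8 rows. Oversampling should be applied only to the training split, since cloned rows that leak into validation data would inflate the measured accuracy.
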
  • Sprint 3 complete: back-translation augmentation attempt. Sprint 3 issues: major roadblock. The Google Translate API is extremely slow and unreliable, so an alternative solution needs to be found.

  • Sprint 4 complete: look into an alternate data augmentation method using the OpenAI API.

    Important resource: https://cookbook.openai.com/examples/how_to_handle_rate_limits

    OpenAI API fine-tuning problems: https://medium.com/@abhishekmazumdar94/fine-tuning-an-open-ai-model-dc78e6ad5a07 This has been deemed feasible. However, it is time-consuming, and I will get back to it after I have a viable first product. I have managed to generate augmented data.

    This method has been successfully implemented on a smaller scale.
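
A common way to cope with API rate limits when generating augmented data is to retry with exponential backoff. The sketch below shows the general pattern using only the standard library. The `retry_with_backoff` helper, the `RateLimited` exception, and `flaky_request` are hypothetical stand-ins, not the OpenAI client or this project's code.

```python
import random
import time


def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


class RateLimited(Exception):
    """Hypothetical error standing in for a rate-limit response."""


calls = {"n": 0}


def flaky_request():
    # Stand-in for a rate-limited API call: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429: slow down")
    return "augmented text"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_request, retry_on=(RateLimited,), sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the example fast to run. In real use the default `time.sleep` would apply, and `retry_on` would name the client library's rate-limit exception.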

  • Sprint 5 complete: use an LLM for text categorization.

    Important resources: https://towardsdatascience.com/choosing-the-right-language-model-for-your-nlp-use-case-1288ef3c4929

    Success with BERT. Added a feature to automatically categorize any column of an Excel sheet.

    https://www.youtube.com/watch?v=IzbjGaYQB-U&ab_channel=PritishMishra
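
The column-categorization feature can be pictured roughly as follows. This sketch assumes the spreadsheet rows have already been loaded as a list of dicts, and it uses a keyword function in place of the BERT model. Both `categorize_column` and `toy_classify` are hypothetical names, not functions from Fred_Runner.

```python
def categorize_column(rows, column, classify, output_column="category"):
    """Return copies of rows with classify(row[column]) stored under output_column."""
    result = []
    for row in rows:
        text = row.get(column)
        # Leave blank cells unlabeled rather than passing them to the model.
        label = classify(str(text)) if text not in (None, "") else None
        result.append({**row, output_column: label})
    return result


def toy_classify(text):
    # Hypothetical keyword rule standing in for the BERT classifier.
    return "technical" if "login" in text.lower() else "NC"


rows = [
    {"student": "A", "feedback": "Login page keeps failing"},
    {"student": "B", "feedback": "Everything is fine"},
    {"student": "C", "feedback": ""},
]
labeled = categorize_column(rows, "feedback", toy_classify)
print([row["category"] for row in labeled])  # ['technical', 'NC', None]
```

Because the classifier is passed in as a plain callable, the same loop works for any column and any model that maps a string to a label.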
    
  • Sprint 6: improve performance on more complicated cases OR implement unsupervised learning.

    Important resources: To be found

    Method: To be determined

    Possible ideas: Augment REAL data using the same method I used to generate augmented data. Reinforcement training of sorts with real data. Transfer learning?
