NDSC: Data Science Track - College Happiness Score

Introduction

To help build your data science skills, we're inviting you to join the National Data Science competition where you will be using real-world data sets and advanced analytical tools. Through this project, you will develop modern data science skills and apply them to a captivating challenge: assessing the "College Happiness Score" and various factors influencing it.

Challenge

Challenge: Predict the "College Happiness Score" across American Universities

As prospective students navigate their higher education options, understanding the value and satisfaction a college provides is crucial. Leveraging data from the College Scorecard Database, Forbes Rankings, HappyScore Data, Crime Data, Undergraduate Enrollment figures, and insights from Columbia University's Advanced Data Analysis Course, students will create models to gauge the quality of university life using any subset or combination of the features provided in our dataset. They will delve into what makes a college environment not just survivable but enjoyable, and predict how various institutions rank on the happiness scale.

<aside> 💡 You can access the data here.

</aside>

What Will Students Learn?

The basics of coding in Python, a language esteemed in both academia and industry.
Techniques for handling and interpreting real, untamed data sets.
Contemporary statistical modeling and machine learning methods, tailored for multidimensional data like the "College Happiness Score."

What You Will Need to Do

We have split the data into two parts: the first part is your training data and the second part is the test data. Both sets have features and labels, but you will only be able to view the labels for the training data; the test labels are withheld. Your objective is to use the training data to build a model, then apply it to the test set features to predict the College Happiness Score for each test example.
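As a rough illustration of this workflow, the sketch below fits a simple linear model on training data, checks it on a held-out slice of that training data (since the test labels are not visible), and then predicts on test features. Everything here is an assumption made for illustration: the arrays are synthetic stand-ins rather than the competition data, the helper names are hypothetical, and ordinary least squares is just one of many methods you could choose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data. The real features and labels
# come from the competition files; these values are made up.
X_train_full = rng.normal(size=(200, 3))
true_weights = np.array([1.5, -2.0, 0.5])
y_train_full = X_train_full @ true_weights + 4.0 + rng.normal(scale=0.1, size=200)

# Hold out part of the training data to estimate the error,
# because the test labels are hidden from participants.
n_val = 40
X_val, y_val = X_train_full[:n_val], y_train_full[:n_val]
X_fit, y_fit = X_train_full[n_val:], y_train_full[n_val:]


def add_intercept(X):
    """Prepend a column of ones so the model can learn a constant offset."""
    return np.column_stack([np.ones(len(X)), X])


def fit_linear(X, y):
    """Fit ordinary least squares and return the coefficient vector."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict scores for the rows of X using fitted coefficients."""
    return add_intercept(X) @ coef


coef = fit_linear(X_fit, y_fit)
val_mse = float(np.mean((predict_linear(coef, X_val) - y_val) ** 2))
print(f"Validation MSE: {val_mse:.4f}")

# Synthetic stand-in for the test features (no labels available).
X_test = rng.normal(size=(50, 3))
test_predictions = predict_linear(coef, X_test)
```

The validation error gives you a way to compare candidate models before submitting, without needing the hidden test labels.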

Submission Requirements

Please submit a Colab notebook that runs without errors. Make the notebook public and submit its shareable link.
The notebook should run without any edits and should show how the model is trained and how the final prediction is created. We will do a “run all”, and points will be deducted if the notebook fails to execute or produces errors.
At the end of the notebook, save the prediction to a .csv file with ONE column called “prediction”. You can do this with the pandas package. The file name must follow the convention “firstname_lastname_age_school.csv”. For example, Tony Stark, 15 years old, from Menlo School should save the file as tony_stark_15_menloschool.csv.
The final prediction should look like y_train.csv, which can be seen here. This file contains the labels (a.k.a. ground truth) for the training data. Your prediction .csv file should follow exactly the format of y_train.csv.
Please also submit a write-up. This can simply be a lab report (no more than 3 pages, Times New Roman, 12-point font, single-spaced).
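One possible way to produce the prediction file with pandas is sketched below. The helper names `prediction_filename` and `save_predictions` are hypothetical, the prediction values are placeholders, and `index=False` is an assumption based on the one-column requirement; compare your output against y_train.csv to confirm the exact format.

```python
import pandas as pd


def prediction_filename(first_name, last_name, age, school):
    """Build a name like firstname_lastname_age_school.csv (lowercase, no spaces in school)."""
    school_part = "".join(school.lower().split())
    return f"{first_name.lower()}_{last_name.lower()}_{age}_{school_part}.csv"


def save_predictions(predictions, path):
    """Write the predictions to a CSV file with a single column named 'prediction'."""
    # index=False keeps pandas from writing the row index as an extra column.
    pd.DataFrame({"prediction": list(predictions)}).to_csv(path, index=False)


filename = prediction_filename("Tony", "Stark", 15, "Menlo School")
save_predictions([3.2, 4.1, 3.8], filename)  # placeholder values, not real predictions
```

In your notebook, you would pass your model's test set predictions instead of the placeholder list.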

<aside> 💡 Notes: Since this is a regression problem, we will use the Mean Squared Error between your predictions and the true happiness scores as the final grading metric. For details about Mean Squared Error, please see here.

You can certainly treat it as a ranking problem. We do not expect you to have the full knowledge of ranking problems - you are welcome to use any methods you like. In any case, please be aware we need the happiness score in the end, not the ranks.

</aside>
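For reference, the Mean Squared Error is the average of the squared differences between the true and predicted scores. The short sketch below computes it by hand; the function name and the example scores are made up for illustration and are not taken from the competition data.

```python
def mean_squared_error(y_true, y_pred):
    """Return the average squared difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up scores for illustration only.
true_scores = [3.0, 4.0, 5.0]
predicted_scores = [2.5, 4.0, 6.0]
mse = mean_squared_error(true_scores, predicted_scores)
print(mse)
```

Because errors are squared, a few large misses are penalized more heavily than many small ones, which is why submitting ranks instead of scores would typically give a poor result under this metric.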

What Students Need

Access to the internet and any standard computer, laptop, or Chromebook.