GSU Robinson College of Business

Internal Sprint Project

Fall 2023 (updated Sep 8, 2023)
  • In-person Section: Mondays 10:00 am - 12:00 pm
  • Online Section: Fridays 6:00 pm - 8:00 pm
  • Saber Soleymani (Instructor): ssoleymani@gsu.edu; office hours Mondays 1:00 pm - 2:00 pm via Webex Meetings
  • Krishna Chaitanya Pulipati (Teaching Assistant): kpulipati2@student.gsu.edu; office hours Tuesdays 1:00 pm - 2:00 pm via Webex Meetings

Learning Objectives

The primary objective of this project is to provide students with hands-on experience in data science, guiding them through each stage of the data analysis pipeline. This encompasses:

  • problem definition,
  • data collection and preprocessing,
  • exploratory data analysis,
  • predictive modeling, and
  • effective communication of findings.

Notes

  • Attendance is mandatory to receive a certificate of completion.
  • Around 10 students will be transferred to an external sprint project.
  • Each GRA student is expected to have an individual poster for the competition.
  • Non-GRA students can form teams of 2 or work individually.
  • There will be an internal competition for the best poster, judged by faculty members.

Project Description

The project is structured into three distinct phases:

  1. Data Collection Phase: This is the foundational stage where students will gather raw data from various sources. The focus will be on understanding the domain, identifying relevant data sets, and collecting them for analysis.

  2. Data Analysis Phase: This phase involves cleaning the data, performing exploratory data analysis (EDA), feature engineering, and building predictive models. Students will apply statistical and machine learning techniques to derive insights from the data.

  3. Presentation Phase: In this final stage, students will synthesize their findings and insights into a coherent narrative. They will design and create a poster that effectively communicates their methodology, results, and implications.

Requirements

Each of these phases contains tasks that are categorized into three levels of complexity: Easy (Level 1), Medium (Level 2), and Challenging (Level 3).

  • Students are required to complete at least one Level 3 task to demonstrate their ability to tackle complex data science challenges.

  • Additionally, they must complete two Level 2 tasks to show that they have a well-rounded skill set that goes beyond basic data manipulation and analysis.

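An illustrative Python sketch follows each task below. These are minimal starting points under stated assumptions, not required or definitive solutions.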
Data Collection
  • Level 1: Use a dataset from a public repository like Kaggle, the UCI Machine Learning Repository, or government websites. Understand the dataset's features, target variables, and any accompanying documentation to get a clear idea of what the data represents.
  • Level 2: Scrape data from a website using tools like BeautifulSoup or Scrapy in Python. Ensure you comply with the website's terms of service and robots.txt file. Clean and preprocess the scraped data to make it suitable for analysis.
  • Level 3: Collect data from multiple sources such as APIs, databases, and spreadsheets, and integrate them into a single dataset. This may involve data cleaning and transformation to ensure consistency and reliability.
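As a starting point for the Level 2 scraping task, here is a minimal sketch using requests and BeautifulSoup. The URL and the assumption that the page contains a single HTML table are hypothetical placeholders; adapt both to your target site, and check its robots.txt and terms of service first.

    # Minimal scraping sketch (Level 2). The URL and single-table layout are
    # hypothetical placeholders; adapt them to the site you actually target.
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = "https://example.com/stats"  # hypothetical page with one HTML table
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in soup.select("table tr")]
    rows = [r for r in rows if r]  # drop empty rows

    df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as header
    df.to_csv("scraped_data.csv", index=False)
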
Data Cleaning
  • Level 1: Handle missing values using basic imputation techniques.
  • Level 2: In addition to handling missing values, identify and treat outliers using methods like the Z-score or IQR. Also, correct any data entry errors or inconsistencies in categorical variables.
  • Level 3: Perform advanced data cleaning techniques, including normalization and transformations. Also, implement data integrity checks to ensure the quality of the dataset.
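A short pandas sketch covering Level 1 median imputation and Level 2 IQR-based outlier clipping; the input file name is a placeholder.

    # Cleaning sketch: median imputation (Level 1) and IQR-based outlier
    # clipping (Level 2). "raw_data.csv" is a placeholder file name.
    import pandas as pd

    df = pd.read_csv("raw_data.csv")
    num = df.select_dtypes(include="number").columns

    # Level 1: fill missing numeric values with each column's median.
    df[num] = df[num].fillna(df[num].median())

    # Level 2: clip values outside 1.5 * IQR, per column.
    q1, q3 = df[num].quantile(0.25), df[num].quantile(0.75)
    iqr = q3 - q1
    df[num] = df[num].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)
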
Exploratory Data Analysis (EDA)
  • Level 1: Generate basic summary statistics such as mean, median, standard deviation, and quartiles. Create simple visualizations like bar charts, pie charts, and line graphs to understand the distribution of and relationships between variables.
  • Level 2: Conduct a correlation analysis to identify relationships between variables using Pearson or Spearman coefficients. Perform feature importance analysis using techniques like univariate feature selection or by leveraging ensemble models like Random Forest.
  • Level 3: Create advanced visualizations such as heatmaps, violin plots, and pair plots. Build interactive dashboards using libraries like Plotly to allow dynamic data exploration.
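One way to touch all three EDA levels in a few lines, assuming a cleaned, mostly numeric dataset: describe() for Level 1 summaries, a Pearson correlation matrix for Level 2, and a seaborn heatmap for Level 3.

    # EDA sketch: summary statistics (Level 1), Pearson correlations (Level 2),
    # and a correlation heatmap (Level 3). "clean_data.csv" is a placeholder.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("clean_data.csv")
    print(df.describe())               # mean, std, quartiles, etc.

    corr = df.corr(numeric_only=True)  # Pearson by default
    sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
    plt.title("Feature correlations")
    plt.tight_layout()
    plt.show()
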
Feature Engineering (OPTIONAL)
  • Level 1: Create new features using basic arithmetic operations like addition and subtraction. For example, if you have a dataset with 'Revenue' and 'Cost', you could create a new feature called 'Profit'.
  • Level 2: Get introduced to the concept of dimensionality reduction. Apply a simple technique like removing highly correlated features to simplify the feature space.
  • Level 3: Apply dimensionality reduction techniques like Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) to simplify the feature space.
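A sketch of the Level 1 and Level 3 tasks; the 'Revenue'/'Cost' column names follow the example above, and standardizing before PCA is an assumed (though common) preprocessing choice.

    # Feature-engineering sketch: a derived 'Profit' feature (Level 1) and PCA
    # (Level 3). Column names follow the example above and are hypothetical.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    df = pd.read_csv("clean_data.csv")
    df["Profit"] = df["Revenue"] - df["Cost"]  # Level 1: arithmetic feature

    # Level 3: standardize, then keep components covering 95% of the variance.
    X = StandardScaler().fit_transform(df.select_dtypes(include="number"))
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(pca.n_components_, "components retained")
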
Model Building
  • Level 1: Choose and implement a basic machine learning model like Linear Regression for regression tasks or Logistic Regression for classification tasks. Focus on understanding the model's parameters and how to train it.
  • Level 2: Implement ensemble methods like Random Forest or Gradient Boosting to improve model performance. Alternatively, use pre-trained deep learning models and adapt them to your specific problem.
  • Level 3: Fine-tune pre-trained models to improve performance, or experiment with different architectures and layers to optimize the model.
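A minimal sketch comparing a Level 1 Logistic Regression with a Level 2 Random Forest on a generic binary classification problem; the file name and 'target' column are placeholders.

    # Model-building sketch: Logistic Regression (Level 1) vs. Random Forest
    # (Level 2). "features.csv" and the "target" column are placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("features.csv")
    X, y = df.drop(columns="target"), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=200, random_state=42)):
        model.fit(X_train, y_train)
        print(type(model).__name__, round(model.score(X_test, y_test), 3))
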
Model Evaluation
  • Level 1: Evaluate the model using basic metrics such as R-squared for regression tasks, and Accuracy and the ROC (Receiver Operating Characteristic) curve for classification tasks. Understand what these metrics signify and how they are calculated.
  • Level 2: Extend the evaluation to include more metrics, such as Precision, Recall, and F1-Score for classification tasks, and Mean Absolute Error (MAE) or Root Mean Square Error (RMSE) for regression tasks. Compare these metrics to get a more nuanced understanding of the model's performance.
  • Level 3: Conduct a comprehensive evaluation that includes advanced techniques like stratified k-fold cross-validation, and metrics like Area Under the Precision-Recall Curve (AUC-PR) or Cohen's Kappa. Use this detailed evaluation to identify areas for model improvement.
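For the Level 3 evaluation, a sketch of stratified 5-fold cross-validation reporting several metrics at once; it assumes the binary X and y from the model-building sketch above.

    # Evaluation sketch: stratified 5-fold cross-validation (Level 3) with
    # accuracy, F1, and ROC AUC. Assumes the binary X, y defined above.
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.ensemble import RandomForestClassifier

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(RandomForestClassifier(random_state=42),
                            X, y, cv=cv,
                            scoring=["accuracy", "f1", "roc_auc"])
    for m in ("accuracy", "f1", "roc_auc"):
        print(m, scores[f"test_{m}"].mean().round(3))
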
Interpretation and Communication
  • Level 1: Create a poster layout that includes the essential sections: Introduction, Data Collection, Data Cleaning, Exploratory Data Analysis, Feature Engineering, Model Building, and Conclusion. Use basic design elements like bullet points and simple charts.
  • Level 2: Enhance the poster by incorporating more advanced visual elements like infographics and heatmaps. Make sure the visual elements are aligned with the content and improve the overall readability and impact of the poster.
  • Level 3: Create a professional-level poster suitable for a data science conference. This includes not just advanced visual elements but also a coherent narrative that guides the viewer through the poster. Pay attention to design aspects like color scheme, typography, and layout to make the poster visually appealing and informative.

Deliverables

The capstone of this 10-week project will be a high-quality poster suitable for presentation at a data science conference. The poster will serve as a visual summary of the student’s journey through the data analysis pipeline and should include the following sections:

  • Introduction: A succinct overview of the problem statement, data, and methodology.
  • Data Collection: Outline the sources of your data and how it was gathered.
  • Data Cleaning: Briefly describe the techniques used for data preprocessing.
  • Exploratory Data Analysis: Highlight key insights gained from the initial data exploration.
  • Feature Engineering: Describe the techniques used to create new features.
  • Model Building: Describe the machine learning model(s) chosen for the project.
  • Model Evaluation: Summarize how the model’s performance was assessed.
  • Conclusion: Conclude with the key findings and potential future work.

Additionally, a brief PowerPoint presentation should be prepared to accompany the poster, summarizing the key points.


Schedule

Online Section: ⏱️ Fridays 6:00 pm - 8:00 pm 🛜 iCollege/Webex

In-person Section: ⏱️ Mondays 10:00 am - 12:00 pm 📍 Buckhead Center, Room 404

Week  🛜 Online
1     Friday 09/08
2     Friday 09/15
3     Friday 09/22
4     Friday 09/29
-     Friday 10/06 (Mid-semester)
5     Friday 10/13
6     Friday 10/20
7     Friday 10/27
-     Friday 11/03 (rescheduled for 12/01)
8     Friday 11/10
9     Friday 11/17
-     Friday 11/24 (Thanksgiving)
10    Friday 12/01
Week  📍 In-person
1     Monday 09/11
2     Monday 09/18
3     Monday 09/25
4     Monday 10/02
-     Monday 10/09 (Mid-semester)
5     Monday 10/16
6     Monday 10/23
7     Monday 10/30
8     Monday 11/06
9     Monday 11/13
-     Monday 11/20 (Thanksgiving)
10    Monday 11/27

Topics per week (roughly)

  • Goal Definition: Weeks 1-2
  • Data Collection: Weeks 2-3
  • Data Cleaning: Weeks 3-5
  • Exploratory Data Analysis: Weeks 5-7
  • Predictive Models: Weeks 7-9
  • Presentation: Week 10

Saber Soleymani

Visiting Assistant Professor | Software Developer | Data Scientist