{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 11 - Chapter 16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Due Date: Monday, April 28th no later than 11:59 p.m.\n", "- Partner Information: You may complete this assignment individually or with exactly one classmate.\n", "- Submission Instructions (working alone): Upload your solution, entitled **YourFirstName-YourLastName-Homework11.ipynb** to the \n", "BrightSpace Homework 11 Dropbox.\n", "- Submission Instructions (working with one classmate): Upload your solution, entitled \n", "**YourFirstName-YourLastName-PartnerFirstName-PartnerLastName-Homework11.ipynb** to the BrightSpace Homework 11 Dropbox. Note: If you \n", "work with a partner, only one person needs to submit a solution. If you both submit a solution, the submission that will be graded is the one \n", "from the partner whose last name comes alphabetically first.\n", "- Deadline Reminder: Once the submission deadline passes, BrightSpace will no longer accept your submission and you will no longer be able to earn credit. \n", "Thus, if you are not able to fully complete the assignment, submit whatever you have before the deadline so that partial credit can be earned." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting Code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import matplotlib.pyplot as plots\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download **playerStatistics.csv**\n", "into the same directory as this Jupyter notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place the CSV file in the same directory as your solution\n", "nba_players = Table.read_table(\"playerStatistics.csv\")\n", "nba_players.show(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1 - 2 Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sports leagues around the world are increasingly relying on large-scale data analysis and advanced statistics to evaluate athlete performance. In the National Basketball Association (NBA), one of the most popular “new-age” stats is plus/minus, which measures the point differential while a player is on the court: how many more (or fewer) points their team scores than the opponent during their playing time.\n", "\n", "In this exercise, we will attempt to predict plus/minus using regression. To start our analysis, create a new table called **normalized_nba** that is made up of the columns of **nba_players** converted to standard units. Display the first 5 rows of this table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place answer here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2a - 1 Point" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To analyze which stat may perform the best in predicting plus/minus, draw a scatterplot for each of the following columns: **points**, **numMinutes**, and **fieldGoalsAttempted**. Each plot should have the explanatory variable on the x-axis and the column **plusMinusPoints** (the response variable) on the y-axis. Each scatterplot should also show the regression line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place answer here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2b - 2 Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the correlation between plusMinusPoints and each of the variables listed in Question 2a. Print each value \n", "in the following format (where ?.??? might be 0.532):\n", "\n", "*The correlation between {explanatory variable} and plusMinusPoints = ?.???* " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place answer here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interlude" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our current method of prediction, simple linear regression, uses only one explanatory variable to predict the response variable. What if there were a way to change this? Since our dataset contains over 20 columns, it could be useful to take advantage of all of that information.\n", "\n", "One way of using more than one attribute is via a process called Principal Component Analysis (PCA). While PCA is outside the scope of \n", "this class, it can be simply described as finding the directions along which the data varies the most. \n", "For more information, watch this [explanatory video](https://www.youtube.com/watch?v=FgakZw6K1QQ).\n", "\n", "After applying PCA to our dataset, we observe that the correlation with plusMinusPoints increases, \n", "allowing us to make more accurate predictions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "# Fit PCA on every column except the response variable\n", "principal = PCA()\n", "x = principal.fit_transform(normalized_nba.drop('plusMinusPoints').to_df())\n", "\n", "# Keep the first two principal components alongside the response variable\n", "pca_nba = Table().with_columns(\n", "    'PC1', x[:, 0],\n", "    'PC2', x[:, 1],\n", "    'plusMinusPoints', normalized_nba.column('plusMinusPoints'),\n", ")\n", "\n", "# This cell expects a correlation(table, x_label, y_label) function, such as one written for Question 2b\n", "print(\"The correlation between Principal Component 1 and plusMinusPoints = {:.3f}\".format(correlation(pca_nba, 'PC1', 'plusMinusPoints')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3 - 3 Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a function called **slope** that calculates the slope of the regression line using the column **PC1** for the x values \n", "and **plusMinusPoints** for the y values. Then, perform a bootstrap simulation by taking a sample with replacement \n", "from **pca_nba**. Repeat this process 1,000 times, saving the slope of each sample in an array. Finally, print \n", "the slope of **pca_nba**, as well as the endpoints of the 90% confidence interval produced by the bootstrap simulation.\n", "The output might look something like this:\n", "\n", "*The actual slope = ?.???* \n", "*The 90% confidence interval is [?.???, ?.???]*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place answer here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 4 - 2 Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine we want to predict the plus/minus of a player who has a Principal Component 1 value of 16. Perform a bootstrap\n", "simulation by taking 1,000 samples with replacement from **pca_nba**. For each sample, fit a linear regression model that uses PC1 to predict plusMinusPoints and use that model to predict the plus/minus for a player with PC1 = 16. 
Store the predicted value from each sample in an array.\n", "Finally, print the predicted value obtained from **pca_nba**, as well as the endpoints of the 99% confidence interval produced by the bootstrap simulation. \n", "The output might look something like this:\n", "\n", "*The predicted value = ?.???* \n", "*The 99% confidence interval is [?.???, ?.???]*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Place answer here." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }