{ "cells": [ { "cell_type": "markdown", "id": "aa71a4bf", "metadata": {}, "source": [ "# Homework 2 - Chapter 14" ] }, { "cell_type": "markdown", "id": "d8b5aac0", "metadata": {}, "source": [ "- Due Date: ?? no later than 11:59 p.m.\n", "- Partner Information: You must complete this assignment individually.\n", "- Submission Instructions: Upload your solution, entitled **YourFirstName-YourLastName-Homework2.ipynb** to the \n", "BrightSpace Homework 2 Dropbox.\n", "- Deadline Reminder: Once the submission deadline passes, BrightSpace will no longer accept your submission and you will no longer be able to earn credit. \n", "Thus, if you are not able to fully complete the assignment, submit whatever you have before the deadline so that partial credit can be earned." ] }, { "cell_type": "markdown", "id": "98f2e2cf", "metadata": {}, "source": [ "## Starting Code" ] }, { "cell_type": "code", "execution_count": 97, "id": "597dfa74", "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "import matplotlib.pyplot as plots\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "275f006d", "metadata": {}, "source": [ "Download the file [top_10_spotify_2025.csv]() into the same directory as this Jupyter notebook." ] }, { "cell_type": "code", "execution_count": 98, "id": "8b8cedc4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name artists daily_rank country danceability
What I Want (feat. Tate McRae) Morgan Wallen, Tate McRae 1 US 0.657
Just In Case Morgan Wallen 2 US 0.649
Ordinary Alex Warren 3 US 0.368
I Got Better Morgan Wallen 4 US 0.598
undressed sombr 5 US 0.642
back to friends sombr 6 US 0.436
luther (with sza) Kendrick Lamar, SZA 7 US 0.707
NOKIA Drake 8 US 0.537
I'm The Problem Morgan Wallen 9 US 0.549
BIRDS OF A FEATHER Billie Eilish 10 US 0.747
\n", "

... (2771 rows omitted)

" ], "text/plain": [ "name | artists | daily_rank | country | danceability\n", "What I Want (feat. Tate McRae) | Morgan Wallen, Tate McRae | 1 | US | 0.657\n", "Just In Case | Morgan Wallen | 2 | US | 0.649\n", "Ordinary | Alex Warren | 3 | US | 0.368\n", "I Got Better | Morgan Wallen | 4 | US | 0.598\n", "undressed | sombr | 5 | US | 0.642\n", "back to friends | sombr | 6 | US | 0.436\n", "luther (with sza) | Kendrick Lamar, SZA | 7 | US | 0.707\n", "NOKIA | Drake | 8 | US | 0.537\n", "I'm The Problem | Morgan Wallen | 9 | US | 0.549\n", "BIRDS OF A FEATHER | Billie Eilish | 10 | US | 0.747\n", "... (2771 rows omitted)" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spotify = Table().read_table('top_10_spotify_2025.csv')\n", "spotify" ] }, { "cell_type": "markdown", "id": "00af1fec", "metadata": {}, "source": [ "*Danceability* is a metric Spotify has developed that estimates how easy it is to dance to a certain song. It is recorded on a scale of 0 - 1, with 0 being the least \"danceable\", and 1 being the most danceable. The selected dataset contains the top 10 songs in the United States and France every day from the beginning of 2025 until the date the assignment was created.\n", "\n", "We will use the *danceability* metric to analyze how means change when sampling, and introduce one of the most important ideas in statistics: the **Central Limit Theorem**." ] }, { "cell_type": "markdown", "id": "afe37e2f", "metadata": {}, "source": [ "## Question 1a - 1 Point" ] }, { "cell_type": "markdown", "id": "071e47e7", "metadata": {}, "source": [ "One of the first steps in many statistical analyses is to visualize the data. This gives us a basic idea of some of the measures we will use in an experiment. Using a histogram, plot the *danceability* column of the **spotify** table. The histogram should have 20 bins ranging from 0 to 1. 
Plot a vertical red line at the mean of the data on the x-axis, and extend it from 0 to 3 on the y-axis. (Hint: check the [plots.vlines()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.vlines.html) documentation)" ] }, { "cell_type": "code", "execution_count": 99, "id": "4143ac2c", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 100, "id": "ffc72771", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plots the histogram of the data with 20 equal-width bins on [0, 1] (21 bin edges),\n", "# then marks the mean with a vertical red line\n", "spotify.hist('danceability', bins=np.linspace(0, 1, 21))\n", "plots.vlines(np.mean(spotify['danceability']), 0, 3, colors='Red');" ] }, { "cell_type": "markdown", "id": "7c118587", "metadata": {}, "source": [ "## Question 1b - 1 Point" ] }, { "cell_type": "markdown", "id": "788d84aa", "metadata": {}, "source": [ "In which direction is the histogram skewed? While code is not necessary for this question, you **may** use the following code block if desired. Answer in the space provided." ] }, { "cell_type": "code", "execution_count": 101, "id": "fdd6c02f", "metadata": {}, "outputs": [], "source": [ "# Place Code Here" ] }, { "cell_type": "markdown", "id": "7938b1cd", "metadata": {}, "source": [ "**Answer -**" ] }, { "cell_type": "code", "execution_count": 102, "id": "0f9b46b5", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# NOTE: This code block is optional. It should not be graded.\n", "\n", "# Plots the histogram with mean and median, which could be used to determine skew following the definition in the book\n", "spotify.hist('danceability', bins=np.linspace(0, 1, 21))\n", "plots.vlines(np.mean(spotify['danceability']), 0, 3, colors='Red', label='Mean')\n", "plots.vlines(np.median(spotify['danceability']), 0, 3, colors='Green', label='Median')\n", "plots.legend();" ] }, { "cell_type": "markdown", "id": "0eb842e4", "metadata": {}, "source": [ "**Sample Answer -** Left" ] }, { "cell_type": "markdown", "id": "117b4c8b", "metadata": {}, "source": [ "## Question 2a - 2 Points" ] }, { "cell_type": "markdown", "id": "17d0f4c7", "metadata": {}, "source": [ "The \"informal statement\" in section 14.2.4 of the book says \"In all numerical data sets, the bulk of the entries are within the range *average \n", " $\\pm$ a few SDs*\". Finding this proportion helps us view the spread of the data.\n", " \n", "Create a function that calculates the proportion of data that falls within **2 Standard Deviations** of the mean. 
Print out this proportion in the following format: *74.68% of the data is within 2 standard deviations of the mean.* (Note: the number will be different in your answer.)" ] }, { "cell_type": "code", "execution_count": 103, "id": "1b94c85e", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 104, "id": "d994912a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "95.76% of the data is within 2 standard deviations of the mean.\n" ] } ], "source": [ "def prop_in_bounds(col):\n", "    '''Calculates the proportion of the data within 2 SDs of the mean'''\n", "    mean = np.mean(col)\n", "    sd = np.std(col)\n", "    # Takes the column length, necessary to calculate the proportion\n", "    col_len = len(col)\n", "\n", "    # Calculates the upper and lower bounds\n", "    lower_bound = mean - (2 * sd)\n", "    upper_bound = mean + (2 * sd)\n", "\n", "    number = 0\n", "    # Counts the number of datapoints within the bounds\n", "    for i in col:\n", "        if lower_bound <= i <= upper_bound:\n", "            number += 1\n", "\n", "    # Returns the proportion\n", "    return number/col_len\n", "\n", "# Calculates the proportion and prints in the desired format\n", "chebychevs_2 = prop_in_bounds(spotify['danceability'])\n", "print(f'{round(chebychevs_2*100, 2)}% of the data is within 2 standard deviations of the mean.')" ] }, { "cell_type": "markdown", "id": "9f62b853", "metadata": {}, "source": [ "## Question 2b - 1 Point" ] }, { "cell_type": "markdown", "id": "518f6740", "metadata": {}, "source": [ "Since this distribution is not normal, we cannot rely on the known proportions that hold for bell-shaped data. However, we can use **Chebychev's bounds**. Is the proportion from Question 2a consistent with Chebychev's bounds?" 
] }, { "cell_type": "markdown", "id": "57d6c6fe", "metadata": {}, "source": [ "**Answer -**" ] }, { "cell_type": "markdown", "id": "9478609f", "metadata": {}, "source": [ "**Sample Answer -** Yes, because Chebychev's bounds provide a lower bound for the proportion. At least 75% of the data must fall within 2 standard deviations of the mean, and we have ~96% of the data within 2 standard deviations. " ] }, { "cell_type": "markdown", "id": "773dc313", "metadata": {}, "source": [ "## Question 3 - 1 Point" ] }, { "cell_type": "markdown", "id": "47d0eccb", "metadata": {}, "source": [ "Converting data to *Standard Units* is a helpful technique when performing data analysis. This technique is also known as **Z-Score Normalizing**. Add a column called *norm_dance*, containing the danceability scores in standard units, to the **spotify** table. Then, display a histogram of the new column." ] }, { "cell_type": "code", "execution_count": 105, "id": "7cd3b59b", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 106, "id": "9b096436", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def normalize(col):\n", "    '''z-score normalizes the data'''\n", "    return (col - np.mean(col))/np.std(col)\n", "\n", "# Adds the column to the table and displays the histogram\n", "spotify = spotify.with_column('norm_dance', normalize(spotify['danceability']))\n", "spotify.hist('norm_dance')" ] }, { "cell_type": "markdown", "id": "1f5560b5", "metadata": {}, "source": [ "## Question 4 - 1 Point" ] }, { "cell_type": "markdown", "id": "7fb6e9c7", "metadata": {}, "source": [ "Print the mean and the standard deviation of the *norm_dance* column of the **spotify** table in the following format: *The (mean/standard deviation) is: 1.7*. (Note: the number will change)" ] }, { "cell_type": "code", "execution_count": 107, "id": "0f0f5490", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 108, "id": "6a66207b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean is: -0.0\n", "The standard deviation is: 1.0\n" ] } ], "source": [ "print(f'The mean is: {round(np.mean(spotify['norm_dance']), 2)}')\n", "print(f'The standard deviation is: {round(np.std(spotify['norm_dance']), 2)}')" ] }, { "cell_type": "markdown", "id": "f7efb687", "metadata": {}, "source": [ "## Question 5a - 2 Points" ] }, { "cell_type": "markdown", "id": "5bf46700", "metadata": {}, "source": [ "Now that we have standardized data, we can attempt to understand the Central Limit Theorem. Create a function called **simulate** that, 500 times, draws a sample of a given size from the **spotify** table, computes the mean *norm_dance* of that sample, and adds it to an array. Once the array has 500 means in it, return the array. \n", "\n", "Use this function to create two tables: one with a sample size of 5 called *means_5*, and one with a sample size of 250 called *means_250*. Each table should have 500 rows. Display a histogram of both tables." 
] }, { "cell_type": "code", "execution_count": 109, "id": "5cf5dc1e", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 110, "id": "0896f6ca", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def simulate(col, sample_size):\n", "    '''Simulates 500 sample means for a given sample size'''\n", "    sample_means = make_array()\n", "\n", "    # Gets 500 means, adds each to the array\n", "    for _ in range(500):\n", "        song_sample = spotify.sample(sample_size)\n", "        sample_means = np.append(sample_means, np.mean(song_sample[col]))\n", "\n", "    return sample_means\n", "\n", "# Creates two tables, one with a sample size of 5 and one with a sample size of 250\n", "means_5 = Table().with_column('Sample Mean', simulate('norm_dance', 5))\n", "means_250 = Table().with_column('Sample Mean', simulate('norm_dance', 250))\n", "\n", "# Displays the histograms of the means\n", "means_5.hist('Sample Mean')\n", "means_250.hist('Sample Mean')" ] }, { "cell_type": "markdown", "id": "78e7c850", "metadata": {}, "source": [ "## Question 5b - 1 Point" ] }, { "cell_type": "markdown", "id": "5c389647", "metadata": {}, "source": [ "Print the mean of the **norm_dance** column of the *spotify* table, and the means of the *means_5* and *means_250* tables, in the following format: *The mean of (Replace with table name) is: -4.5* (Note: The number will change). Round these values to two decimal places. Which table has a mean closest to that of the *spotify* table? Does this make sense? Answer in the space provided." 
] }, { "cell_type": "code", "execution_count": 111, "id": "29f1452e", "metadata": {}, "outputs": [], "source": [ "# Place Code Here" ] }, { "cell_type": "markdown", "id": "fe97b3ee", "metadata": {}, "source": [ "**Answer -**" ] }, { "cell_type": "code", "execution_count": 112, "id": "110b0697", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The mean of the spotify table is: -0.0\n", "The mean of the means_5 table is: -0.03\n", "The mean of the means_250 table is: -0.0\n" ] } ], "source": [ "# Prints the means of all the tables to two decimal places\n", "print(f'The mean of the spotify table is: {round(np.mean(spotify['norm_dance']), 2)}')\n", "print(f'The mean of the means_5 table is: {round(np.mean(means_5['Sample Mean']), 2)}')\n", "print(f'The mean of the means_250 table is: {round(np.mean(means_250['Sample Mean']), 2)}')" ] }, { "cell_type": "markdown", "id": "faaa2af2", "metadata": {}, "source": [ "**Sample Answer -** The table with a sample size of 250 has a mean that is closest to that of the spotify table. This makes sense because the means of larger samples vary less around the population mean: the SD of the sample means is the population SD divided by the square root of the sample size." ] }, { "cell_type": "markdown", "id": "a0d68ab2", "metadata": {}, "source": [ "## Bonus Question" ] }, { "cell_type": "markdown", "id": "b7d0b4f1", "metadata": {}, "source": [ "A two-sample t-test is a way of comparing two groups and seeing if their means are \"substantially\" different. This may sound similar to the A/B testing explored in chapter 12 of the text, but while A/B testing checks if two samples come from the same underlying distribution, t-tests examine whether one group's mean is different from the other's. In our case, we can use this test to see if the *danceability* of one country's top songs is higher or lower than that of another country's.\n", "\n", "T-tests have a set of assumptions that need to be met in order for them to work. 
These assumptions are that the observations are **Independent**, the samples have **Equal Variance**, and the distribution of each sample is **Normal**." ] }, { "cell_type": "markdown", "id": "e55984e0", "metadata": {}, "source": [ "## Bonus Question (a) - 1 Point" ] }, { "cell_type": "markdown", "id": "29dfe6e9", "metadata": {}, "source": [ "The first step in performing a t-test is creating the two groups. Then, we must check if the groups have nearly **Equal Variance**. This is one of the assumptions that must be met to perform a t-test.\n", "\n", "Create two tables, one only containing rows from the *spotify* table where the country is **FR**, and one only containing rows from the *spotify* table where the country is **US**. Call these tables *spotify_fr* and *spotify_us*. Print the standard deviation of the **norm_dance** columns of these tables in the following format: *The standard deviation of norm_dance for top songs in (country) is: 5.68* (Note: The number will change)" ] }, { "cell_type": "code", "execution_count": null, "id": "69a30f1b", "metadata": {}, "outputs": [], "source": [ "# Place Answer Here" ] }, { "cell_type": "code", "execution_count": 113, "id": "43a9aa36", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The standard deviation of norm_dance for top songs in France is: 0.84\n", "The standard deviation of norm_dance for top songs in America is: 1.06\n" ] } ], "source": [ "# Separates the table into two by country\n", "spotify_fr = spotify.where('country', 'FR')\n", "spotify_us = spotify.where('country', 'US')\n", "\n", "# Displays the standard deviation of norm_dance in each new table\n", "print(f'The standard deviation of norm_dance for top songs in France is: {round(np.std(spotify_fr['norm_dance']), 2)}')\n", "print(f'The standard deviation of norm_dance for top songs in America is: {round(np.std(spotify_us['norm_dance']), 2)}')" ] }, { "cell_type": "markdown", "id": "af217fdf", "metadata": {}, "source": [ "If the ratio of the (larger standard deviation)/(smaller standard 
deviation) is less than 2, we can claim there is only **weak evidence** against the Equal Variance assumption. Whenever the evidence against an assumption is weak, we can continue with the test.\n", "\n", "Regardless of the value of the ratio, we will continue as if there is weak evidence against the equal variance assumption." ] }, { "cell_type": "markdown", "id": "d1e580a4", "metadata": {}, "source": [ "### Other Assumptions" ] }, { "cell_type": "markdown", "id": "c3c8b9cd", "metadata": {}, "source": [ "**Independence of Observations -** We can say observations are not independent if knowledge about one observation allows us to make an improved guess about another observation. In this case, knowing one song on the top 10 list of a country probably wouldn't help in determining the rest of the list, so we can say there is **weak evidence** against the assumption of independence.\n", "\n", "**Normality -** As we can see in the figures below, neither country's scores follow a normal distribution. However, this is OK! Thanks to the **Central Limit Theorem**, we can still use t-tests on samples that have non-normal distributions, as long as the samples are large. Each sample has over 1300 observations, which is more than enough for us to use a t-test. Therefore, we have **weak evidence** against the assumption of Normality!" ] }, { "cell_type": "code", "execution_count": 121, "id": "6e0dc2af", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Displays the histogram for each country\n", "spotify_fr.hist('norm_dance')\n", "spotify_us.hist('norm_dance')" ] }, { "cell_type": "markdown", "id": "50b0c34d", "metadata": {}, "source": [ "## Bonus Question (b) - 2 Points" ] }, { "cell_type": "markdown", "id": "910e5008", "metadata": {}, "source": [ "Now that all assumptions are met, it is time to perform the actual t-test. To do this, we need to decide on the hypotheses to test. We will use the following hypotheses:\n", "\n", "$H_0$: The mean danceability of songs popular in France is the same as the mean danceability of songs popular in America\n", "\n", "$H_a$: The mean danceability of songs popular in France is higher than the mean danceability of songs popular in America\n", "\n", "While the math behind a t-test is simple, it is beyond the scope of this class. (Here is a resource if you are interested: [Statology](https://www.statology.org/two-sample-t-test/)) Fill in the *None* values of the following code block with the correct values. The *a_value* and *b_value* should be columns of the two tables created in part (a) of the bonus question. *equal_variance* should be a boolean (True or False). 
(Note: Order matters for the *a_value* and *b_value*!)" ] }, { "cell_type": "code", "execution_count": null, "id": "303743e7", "metadata": {}, "outputs": [], "source": [ "#TODO: Fill in the None values in this block!\n", "\n", "import scipy.stats as stats\n", "\n", "a_value = None # This value should contain a column of a table\n", "b_value = None # This value should contain a column of a table\n", "equal_variance = None # This value
should be True or False\n", "\n", "# Performs the t-test and displays the results\n", "t_test = stats.ttest_ind(a=a_value, b=b_value, equal_var=equal_variance)\n", "\n", "print(f'The results of the t-test are: {t_test}')" ] }, { "cell_type": "code", "execution_count": 94, "id": "f062baa3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The results of the t-test are: TtestResult(statistic=15.827670381970748, pvalue=4.3120559806621321e-54, df=2779.0)\n" ] } ], "source": [ "import scipy.stats as stats\n", "\n", "# Assigns the correct values to the variables (NOTE: Order matters for a and b values)\n", "a_value = spotify_fr['norm_dance']\n", "b_value = spotify_us['norm_dance']\n", "equal_variance = True\n", "\n", "# Performs the t-test and displays the results\n", "t_test = stats.ttest_ind(a=a_value, b=b_value, equal_var=equal_variance)\n", "print(f'The results of the t-test are: {t_test}')" ] }, { "cell_type": "markdown", "id": "d4479d06", "metadata": {}, "source": [ "## Bonus Question (c) - 2 Points" ] }, { "cell_type": "markdown", "id": "4710b935", "metadata": {}, "source": [ "Now that we have completed the t-test, we can draw a conclusion about our hypotheses. Use the following format, replacing the values with the ones from the previous questions. (Hint: If we have a p-value less than 0.05, we can say there is **strong evidence** for the alternative hypothesis!)\n", "\n", "Format: *There is (amount of evidence) that the true mean danceability of popular French songs is higher than the true mean danceability of popular American songs. 
(t=(test statistic), p-value=(p value), df=(degrees of freedom))*" ] }, { "cell_type": "markdown", "id": "dd5f2b22", "metadata": {}, "source": [ "**Answer -**" ] }, { "cell_type": "markdown", "id": "6d0cc0d9", "metadata": {}, "source": [ "**Sample Answer -** There is strong evidence that the true mean danceability of popular French songs is higher than the true mean danceability of popular American songs. (t=15.827670381970748, p-value=4.3120559806621321 $\\times 10^{-54}$, df=2779)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }