{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Science Self Assessment\n",
"\n",
"This is a short self assessment to help you decide if you have the pre-requisite knowledge for Data Science 1.\n",
"\n",
"This is not a timed test, there are no passing or failing scores. Feel free to use the internet (especially existing StackOverflow answers and package documentation) to help you get through this assessment. If you can answer all of these questions, you will have the appropriate pre-requisite knowledge to be successful in the course."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Programming Questions\n",
"\n",
"The class requires basic programming skills in Python and some familiarity with a number of Python libraries. Among these are: \n",
"\n",
"- Basic Python syntax: Data Types, Lists, Dictionaries, Operators (http://openbookproject.net/thinkcs/python/english3e/index.html)\n",
"- Flow control: Loops, conditional statements, Function (http://openbookproject.net/thinkcs/python/english3e/index.html)\n",
"- Numpy: Arrays, indexing, slicing, vectorization of operations (http://openbookproject.net/thinkcs/python/english3e/index.html)\n",
"- Pandas: Data Frames, FileIO, Selection, Statistics, Grouping, Tables (https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) \n",
"- Plotting with Matplotlib and Seaborn (https://matplotlib.org/) \n",
"- Scipy: Probability functions (`scipy.stats`), Optimization (`optim`)\n",
"\n",
"If you need a python installation, we recommend [anaconda](https://www.anaconda.com/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 1\n",
"\n",
"Below is some sample data about students selling different types of fruit. `sales` is a list of lists. The lists in `sales` contain the name of the student who sold the fruit and the type of fruit which was sold. So if Bernard sold an apple, then `([\"Bernard\",\"Apple\"] in sales)==True`. Turn the data into a pandas dataframe and then use pandas to answer the following:\n",
"\n",
"* How many apples did Anna sell?\n",
"\n",
"* Who sold more Watermelons: Bernard or Daisy\n",
"\n",
"* Who sold the most fruit?\n",
"\n",
"* Which fruit was sold the most?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"np.random.seed(0)\n",
"N = 1000\n",
"students = ['Anna','Bernard','Charlie','Daisy']\n",
"fruits = ['Apple','Peach','Watermelon']\n",
"\n",
"sales = [ [np.random.choice(students), np.random.choice(fruits)] for j in range(N)]\n",
"\n",
"\n",
"#Solution below\n",
"import pandas as pd\n",
"from IPython.display import display\n",
"\n",
"df = pd.DataFrame(sales, columns=['students','fruit'])\n",
"\n",
"ctabed = df.groupby(['students','fruit']).size().unstack()\n",
"\n",
"display(ctabed)\n",
"\n",
"display(ctabed.sum(axis = 1))\n",
"\n",
"display(ctabed.sum(axis = 0))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"Shown below is data relating to the position of a car in meters. The data was recorded at the indicated times below (so at time = 1, the car was 1 meter from the starting position). Load the data as a numpy array. Calculate the average speed at which the car was traveling between time points. Do this with a loop and again using array slicing.\n",
"\n",
"Hint: Speed = (Distance Travelled)/(Time To Travel Distance)\n",
"\n",
"\n",
"speeds: [0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3]\n",
"\n",
"times: [0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"position = np.array([0, 1, 1.2, 1.8, 2.0, 1.7, 1.5, 1.9, 2.1, 2.3])\n",
"times = np.array([0, 1, 1.5, 1.9, 2.3, 2.7, 3.8, 4.8, 5.4, 7.0])\n",
"\n",
"#With slicing\n",
"speed_vectorized = (position[1:] - position[:-1])/(times[1:] - times[:-1])\n",
"\n",
"#Without slicing\n",
"speed = np.zeros(position.size - 1)\n",
"for i in range(speed.size):\n",
" speed[i] = (position[i+1] - position[i])/(times[i+1] - times[i])\n",
" \n",
"#Both are equivalent\n",
"np.isclose(speed,speed_vectorized, rtol = 1e-8).all()\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a random 100 by 100 2d array of integers using `numpy.random.randint` ranging from 1 to 100. To ensure your answer is the same as ours, set the random seed to `19920908`. \n",
"\n",
"Which row has the largest mean?\n",
"\n",
"Which column has the smallest sum?\n",
"\n",
"Which is the first column (from left to right) to have sum exceding 600?\n",
"\n",
"Answer these questions without the use of a loop.\n",
"\n",
"Hint: The `argmin`, `argmax`, and `argwhere` functions may be useful.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(19920908)\n",
"\n",
"#The plus 1 here is tricky.\n",
"X = np.random.randint(low = 1, high = 10+1, size = (100,100))\n",
"\n",
"#Which row has largest mean?\n",
"\n",
"print(X.mean(axis = 1).argmax())\n",
"\n",
"#Which column has smallest sum?\n",
"\n",
"print(X.sum(axis = 0).argmin())\n",
"\n",
"#Which is the first column (from left to right) to have sum exceeding 600?\n",
"\n",
"print(np.argwhere((X.sum(axis = 0)>600)).min())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"Newton's method is a numerical method finding the roots of a function. Newton's method is\n",
"\n",
"$$ x_{n+1} = x_{n} - \\dfrac{f(x_n)}{f'(x_n)} $$\n",
"\n",
"Below, I've written a function to try to use Newton's method to find the two roots of the function $f(x) = \\exp(-x)\\ln(x+1) - 0.25$.\n",
"\n",
"My function should:\n",
"\n",
"* Terminate when $\\vert f(x_n) \\vert < 1\\times10^{-8}$ or when the number of iterations exceeds 1000.\n",
"\n",
"* Take as its first argument the starting point for the method (i.e $x_0$)\n",
"\n",
"* Take as its second argument the function $f$\n",
"\n",
"* Take as its third argument the function $f'$\n",
"\n",
"My code, as it stands, does not return the right answer. Look through the code and debug the function so that it returns answers similar to `scipy.optimize.newton`. Please don't completely rewrite the code (I spent a long time on it and want to learn what I messed up!).\n",
"\n",
"\n",
"Don't worry about `f` and `fprime`. I've ensured those are correct.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f = lambda x: np.exp(-x)*np.log(x+1) - 0.25\n",
"fprime = lambda x: -np.exp(-x)*np.log(x+1) + np.exp(-x)/(x+1)\n",
"\n",
"def broken_newtons_method(x0,f, fprime, tol = 1e-8, maxiter = 1000):\n",
" \n",
" res = float('inf')\n",
" iters = 0\n",
" x_n = x0\n",
" \n",
" while (restol) and (iters...If you see 15 games a year, there is a 40% chance that a .275 hitter will have more hits than a .300 hitter.\n",
"\n",
"Bill refers to players by their *batting average* (i.e. .275 means the hitter will hit the ball 275 times for every 1000 times they come at bat). The actual probability is quite smaller than that. Bill wrote this in the late 1970s without the ubiquity of computers to perform the simulations we can. It is quite plausible that Bill used a Normal approximation to arrive at this conclusion.\n",
"\n",
"Assuming that every batter appears 3 times per game for 15 games (for a total of 45 at bats), use a Normal approximation to estimate the probability that a .275 batter hits more hits than a .300 batter. Assume the batters are independent. You can use python to evaluate any complicated functions, but do not estimate the probability via simulation.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Solution\n",
"\n",
"Let $A \\sim \\mbox{Binom}(0.275,45)$ and $B \\sim \\mbox{Binom}(0.300,45)$. \n",
"\n",
"We are looking for $p(A>B)$ or alternatively $p(00$ is $1- \\mathbf{\\Phi}(0) \\approx 0.4$\n",
"\n",
"Where $\\mathbf{\\Phi}$ is the CDF for our normal approximation.\n",
"\n"
]
},
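{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough check on the value above, the approximation can be worked by hand. This is only a sketch: it assumes the usual Binomial moments (mean $np$, variance $np(1-p)$), applies no continuity correction, and introduces $\\Phi_Z$ for the standard Normal CDF, a symbol not used elsewhere in this notebook.\n",
"\n",
"$$ E[A-B] = 45(0.275 - 0.300) = -1.125, \\qquad \\mbox{Var}(A-B) = 45(0.275)(0.725) + 45(0.300)(0.700) \\approx 18.42 $$\n",
"\n",
"$$ p(A-B>0) \\approx 1 - \\Phi_Z\\left(\\dfrac{0 - (-1.125)}{\\sqrt{18.42}}\\right) \\approx 1 - \\Phi_Z(0.262) \\approx 0.40 $$\n",
"\n",
"The variances add because the two batters are assumed independent."
]
},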
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import norm\n",
"\n",
"\n",
"norm(loc = 1.12, scale = np.sqrt(18.42)).cdf(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3\n",
"\n",
"A diagnostic test has a 99% chance of correctly labeling a person as sick if they are truly sick. The probability that the test labels someone as sick, regardless of disease status is 50%. Approximately 1% of the population has the disease. \n",
"\n",
"a) what is the joint probability of having the disease and a positive test? \n",
"\n",
"b) what is the marginal probability that a test comes back positive? \n",
"\n",
"c) what is the conditional probability that a person has the disease if their test comes back positive?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Solution\n",
"\n",
"a) The joint probability of two events happening is $p(A,B) = p(A)p(B|A)$\n",
"\n",
"$$ p(D+ , T+) = p(D+) p(T+ \\vert D+) = 0.01 \\times 0.99 = 0.0099 $$\n",
"\n",
"b) The marginal probability is \n",
"\n",
"$$ p(T+) = p(D+)p(T+|D+) + p(D-) p(T+|D-) = 0.5049$$\n",
"\n",
"c) The conditional probability of this even can be obtained by Bayes Rule: \n",
"\n",
"$$ p(D+ \\vert T+) = \\dfrac{p(T+ \\vert D+) p(D+) }{p(T+)} = \\dfrac{0.0099}{0.5049} = 0.0196$$\n",
"\n",
"The probability of a positive disease state after a positive test is $~1.96\\%$ "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"Why might someone want to know the median rather than the mean of their data?\n",
"\n",
"# Solution\n",
"\n",
"The median is far less sensitive to outliers than the mean. If the data have many outliers, then the mean might not be a good measure of central tendency."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 5\n",
"\n",
"You obtain a dataset with $n$ rows and $n$ columns (the same number of rows and columns). Each column houses numeric data (no categories, just numbers). You're asked to perform a linear regression this data (the outcome is in a different file. It is not one of the $n$ columns). Assume that the data matrix is full rank.\n",
"\n",
"What will the $R^2$ of this regression be?\n",
"\n",
"# Solution\n",
"\n",
"$R^2$ will be one since the problem is perfectly determined."
]
},
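{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see this is sketched below. The sketch assumes a regression without an intercept and an outcome that is not constant; $X$, $y$, $\\hat{\\beta}$ and $\\hat{y}$ are notation introduced here for the data matrix, outcome, fitted coefficients and fitted values. Since $X$ is square and full rank, it is invertible, so\n",
"\n",
"$$ \\hat{\\beta} = X^{-1} y, \\qquad \\hat{y} = X\\hat{\\beta} = y $$\n",
"\n",
"$$ R^2 = 1 - \\dfrac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2} = 1 - \\dfrac{0}{\\sum_i (y_i - \\bar{y})^2} = 1 $$\n",
"\n",
"Adding an intercept only adds a parameter, so an exact fit remains available."
]
},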
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Linear Algebra Questions\n",
"\n",
"For the class we require some basic linear algebra\n",
"- Vectors, matrices, inner products, outer products, matrix multiplication\n",
"- Eigenvectors, eigenvalues, rank \n",
"- Matrix inversion \n",
"- Norms \n",
"\n",
"Gilbert Strang's book (http://math.mit.edu/~gs/linearalgebra/) might be a good refresher should you need it. Here (http://vmls-book.stanford.edu/vmls.pdf) is another book which may cover the topics you need, though we have not verified its quality. If you have taken MATH 1600 and/or AMATH 2811, that should be enough."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 1\n",
"\n",
"If $A$ $n \\times n$ is a matrix, and $A$ has full rank, is $A$ invertible?\n",
"\n",
"Answer: Yes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 2\n",
"\n",
"If a matrix, $A$, is positive definite, which of the following is false:\n",
"\n",
"A) $\\mathbf{x}^T A \\mathbf{x} >0 $ for every vector which is not 0\n",
"\n",
"B) Every element of A is positive\n",
"\n",
"C) The Eigenvalues of A are positive\n",
"\n",
"D) A is symmetric\n",
"\n",
"Answer: B"
]
},
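{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small example may help show why B can fail. The matrix below is our own illustration, not part of the original question:\n",
"\n",
"$$ A = \\begin{bmatrix} 2 & -1 \\\\ -1 & 2 \\end{bmatrix}, \\qquad \\mathbf{x}^T A \\mathbf{x} = 2x_1^2 - 2x_1x_2 + 2x_2^2 = x_1^2 + x_2^2 + (x_1 - x_2)^2 $$\n",
"\n",
"The right-hand side is strictly positive for every nonzero $\\mathbf{x}$, and $A$ is symmetric with eigenvalues 1 and 3, so $A$ is positive definite even though its off-diagonal entries are negative."
]
},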
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 3\n",
"\n",
"Let $x$ and $y$ be vectors such that $\\vert x \\vert = 3$ and $\\vert y \\vert = 4$. Use the triangle inequality to put an upper bound on the length of $\\vert x+y \\vert$.\n",
"\n",
"Answer: $\\vert x+y \\vert \\leq \\vert x \\vert + \\vert y \\vert =7$\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Question 4\n",
"\n",
"Let $A$ be a matrix, and let $\\mathbf{x},\\mathbf{y}$ be vectors. If $A\\mathbf{x} = [4,3,2]^T$ and $A\\mathbf{y} = [-1,2,0]^T$ what is $A(2\\mathbf{x} - \\mathbf{y})$?\n",
"\n",
"Answer: $A(2\\mathbf{x} - \\mathbf{y}) = 2A\\mathbf{x} - A\\mathbf{y} = [8,6,4]^T - [-1,2,0]^T = [9,8,4]^T$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}