Question 1: Python Skills [20 marks]
Given the Python list data = [2, 4, 6, 8, 10, 12, 14], write the value that results from
evaluating each of the following expressions:
a) len(data)
b) sum(data)
c) data[2]
d) data[-2]
e) data[2:6]
f) If someone wanted to use Python to analyse their customer data and find the five
largest clusters of customers with similar purchasing behaviour, which Python library
would you recommend that they use: pandas, matplotlib, sklearn, or csv?
g) Write a Python function called ‘clean’ that takes a string as input, removes all
commas from that string, converts the result to an integer, and returns it. For
example, clean("4,567,000") should return the integer 4567000.
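One way to check answers to parts a) to e) and g) is to run them directly. The following is a minimal sketch rather than a definitive solution; the use of str.replace followed by int is one possible approach to ‘clean’, assumed here for illustration.

```python
data = [2, 4, 6, 8, 10, 12, 14]

length = len(data)        # number of items in the list
total = sum(data)         # sum of all items
third = data[2]           # indexing starts at 0, so this is the third item
second_last = data[-2]    # negative indices count back from the end
middle = data[2:6]        # slice from index 2 up to, but not including, index 6


def clean(text):
    """Remove all commas from text and return the result as an integer."""
    return int(text.replace(",", ""))


print(length, total, third, second_last, middle)
print(clean("4,567,000"))
```

Note that int raises a ValueError if the string still contains non-digit characters after the commas are removed, such as a currency symbol or a decimal point.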
Question 2: Pandas Proficiency [20 marks]
Below are the first few rows of a dataset of house prices. There are thousands of rows of data
in the full dataset.
id  bathr24rooms  bed $$ rooms  finished 44 sqmeter  lastsolddate  lastsoldprice   neighborhood             totalrooms  year && built
1   2             2             1043                 2/17/2016     $ 1,300,000.00  South of Market          4           2007
2   1             1             903                  2/17/2016     $ 750,000.00    South of Market          3           2004
3   4             3             1425                 2/17/2016     $ 1,495,000.00  Potrero Hill             N/A         2003
4   3             3             2231                 2/17/2016     $ 2,700,000.00  Potrero Hill             10          1927
5   3             3             1300                 2/17/2016     $ 1,530,000.00  Bernal Heights           4           1900
6   1             2             1250                 2/17/2016     $ 460,000.00    Crocker Amazon           5           1924
7   1             3             ###1032              2/17/2016     $ 532,000.00    Oceanview                No          1939
8   1             2             1200                 2/17/2016     $ 1,050,000.00  Mission Terrace          5           1924
9   3.5           4             2700                 2/17/2016     $ 3,500,000.00  Noe Valley               9           1912
10  2             3             2016                 2/17/2016     $ 1,500,000.00  Hayes Valley             7           1890
11  1             3             ????1798             2/17/2016     $ 848,000.00    Portola                  Yes         1953
12  1             1             761                  2/17/2016     $ 1,000,000.00  South of Market          4           2008
13  1             1             780                  8/12/2015     $ 863,000.00    Eureka Valley            4           1981
14  5             5             5786                 2/16/2016     $ 4,888,000.00  Lake                     12          1926
15  2             2             1688                 2/16/2016     $ 1,000,000.00  Inner Sunset             6           1927
16  3             4             1619                 2/16/2016     $ 210,000.00    Sunnyside                7           1966
17  1             0             ***398               2/12/2016     $ 525,000.00    Van Ness – Civic Center  4           2008
18  4.5           4             2615                 2/12/2016     $ 2,300,000.00  Mission                  9           1906
19  2             2             1252                 2/12/2016     $ 1,450,000.00  Nob Hill                 4           2002
20  2             3             1444                 2/12/2016     $ 2,500,000.00  Nob Hill                 6           2009
21  2             3             1441                 8/6/2013      $ 630,000.00    Oceanview                5           1955
a) Identify any issues in the above data description and data values. Explain how you
would correct each of the issues. [5 marks]
Now assume that this table has already been read into a Pandas DataFrame called houses,
and that all the data issues you have identified have been fixed, so that all the columns are
numeric, except for neighborhood which is a string column and lastsolddate which is a
Python date column.
b) Write a Pandas expression that returns the average house price.
c) Write a Pandas expression that returns the median house price.
d) Write a Pandas expression that will return the average price of houses in the “Nob
Hill” neighbourhood.
e) We want to investigate the relationship between the size of houses and their prices.
Draw a scatter graph that plots the house price (“lastsoldprice”) on the Y axis
against the house size (“finished 44 sqmeter”) on the X axis. Show the scale of
both axes clearly, and plot only the first SIX houses on your graph.
f) Write some Python code that will draw this scatter graph for ALL the houses.
g) Add a new column called “price/room” to the houses table that is the ‘price per
room’. For each house, this is the price of the house (“lastsoldprice”) divided by the
number of rooms in the house (“totalrooms”).
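The expressions asked for in parts b), c), d) and g) can be sketched as follows. This is an illustrative sketch only: the three-row DataFrame is a hypothetical stand-in for the cleaned houses table, built from rows 1, 19 and 20 above, and it assumes lastsoldprice and totalrooms are already numeric.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `houses` DataFrame (rows 1, 19 and 20).
houses = pd.DataFrame({
    "lastsoldprice": [1300000, 1450000, 2500000],
    "neighborhood": ["South of Market", "Nob Hill", "Nob Hill"],
    "totalrooms": [4, 4, 6],
})

# b) average house price
mean_price = houses["lastsoldprice"].mean()

# c) median house price
median_price = houses["lastsoldprice"].median()

# d) average price of houses in the "Nob Hill" neighbourhood
nob_hill_mean = houses.loc[
    houses["neighborhood"] == "Nob Hill", "lastsoldprice"
].mean()

# g) new column holding the price per room
houses["price/room"] = houses["lastsoldprice"] / houses["totalrooms"]

print(mean_price, median_price, nob_hill_mean)
print(houses)
```

Because the column name “price/room” contains a slash, it has to be accessed with bracket notation (houses["price/room"]) rather than attribute access.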
Question 3: Machine Learning Concepts [20 marks]
There are many different types of machine learning algorithms.
a) Explain the difference between regression and classification in machine learning.
b) Give two examples of machine learning algorithms that are suitable for
regression problems.
c) Give two examples of machine learning algorithms that are suitable for
classification problems.
d) Give a business example where a regression learning approach would be
appropriate.
e) Give a business example where a classification learning approach would be
appropriate.
Question 4: Machine Learning Process [20 marks]
This question will use the same house-price dataset as Question 2, cleaned and loaded into
a Pandas table called houses.
Your manager asks you to use machine learning (and Python) to build a model that will
predict the expected sale price for a house with a given number of bathrooms, total size,
neighbourhood, and construction year, etc. That is, use all the available data to build a
model that predicts house sale prices.
a) Explain the typical process you would follow to use Python to build a model to
predict the house prices. Give a number and a title for each of the steps that you
would take, and briefly explain each step.
b) Sketch out some example Python code that you would use to implement the above
steps using the LinearRegression learning algorithm (from the sklearn.linear_model
library). Use Python comments to break your code into the steps you discussed
above, with the number and title of each step.
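One possible shape for such an answer is sketched below. It is a sketch under stated assumptions, not a model solution: the step titles are one reasonable choice among many, the eight-row DataFrame is a hypothetical stand-in for the cleaned houses table, and the column names bathrooms, sqmeter and yearbuilt are assumed cleaned names that do not appear in the original data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned `houses` DataFrame (assumed column names).
houses = pd.DataFrame({
    "bathrooms": [2, 1, 4, 3, 3, 1, 2, 2],
    "sqmeter": [1043, 903, 1425, 2231, 1300, 1250, 1252, 1444],
    "neighborhood": ["South of Market", "South of Market", "Potrero Hill",
                     "Potrero Hill", "Bernal Heights", "Crocker Amazon",
                     "Nob Hill", "Nob Hill"],
    "yearbuilt": [2007, 2004, 2003, 1927, 1900, 1924, 2002, 2009],
    "lastsoldprice": [1300000, 750000, 1495000, 2700000,
                      1530000, 460000, 1450000, 2500000],
})

# Step 1: Choose features and target
X = houses.drop(columns=["lastsoldprice"])
y = houses["lastsoldprice"]

# Step 2: Encode categorical features (one 0/1 column per neighbourhood)
X = pd.get_dummies(X, columns=["neighborhood"], dtype=float)

# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 4: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Evaluate on the held-out test set
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print("Mean absolute error:", mae)
```

With only eight rows the error value is not meaningful; the point is the order of the steps. On the full dataset, the date column would also need converting to numeric features (for example, year and month) before it could be used by LinearRegression.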
Question 5: Evaluation of Models [20 marks]
Socks4You is an online company that sells socks by subscription ($10/month, or
$120/year). Each month, they send one pair of high-quality designer socks to each
customer. They want to analyse their customer base, and their ‘churn rates’ (customers
who decide to stop subscribing to their service).
The following confusion matrix shows the results of applying a Decision Tree machine
learning algorithm to 1000 historical examples of customer churn from the previous
year. The columns show whether the customer did really leave (‘Churn=Yes’) or stay
(‘Churn=No’). The rows show the prediction output from the learned Decision Tree
model.
                     Churn=Yes  Churn=No
Model predicted Yes  500        200
Model predicted No   100        200
Calculate the values of the following evaluation metrics for this model (since you do not
have a calculator, you can write them as a fraction) [2 marks each]:
a) number of true positives?
b) number of false positives?
c) accuracy?
d) precision?
e) recall?
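The metrics in parts a) to e) all follow from the four cells of the confusion matrix. The sketch below is illustrative: it treats ‘Churn=Yes’ as the positive class and uses Python's fractions module so the results stay as exact fractions. The variable names are chosen for this example only.

```python
from fractions import Fraction

# Cells of the confusion matrix (rows = model prediction, columns = actual churn).
true_positives = 500    # predicted Yes, Churn=Yes
false_positives = 200   # predicted Yes, Churn=No
false_negatives = 100   # predicted No,  Churn=Yes
true_negatives = 200    # predicted No,  Churn=No

total = true_positives + false_positives + false_negatives + true_negatives

# accuracy: fraction of all predictions that were correct
accuracy = Fraction(true_positives + true_negatives, total)

# precision: of the customers predicted to churn, the fraction that did churn
precision = Fraction(true_positives, true_positives + false_positives)

# recall: of the customers that did churn, the fraction the model predicted
recall = Fraction(true_positives, true_positives + false_negatives)

print(accuracy, precision, recall)
```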
Socks4You is considering using this model as the basis of a new marketing campaign to
better retain their existing customers. They will send special offers to all the people for
whom the model predicts Yes (that is, the customers that are in danger of ‘churning’ away
from Socks4You). The cost of these discounts will average $10 per customer, but the
campaign is expected to halve the churn rate, which will save on average half of the annual
subscription of each customer who is persuaded not to churn. The following cost-benefit
matrix summarises the annual costs and expected benefits of this campaign for
each group of customers.
                     Churn=Yes             Churn=No
Model predicted Yes  ¾ * $120 – $10 = $80  $120 – $10 = $110
Model predicted No   $120 / 2 = $60        $120
f) Calculate the expected income after this marketing campaign. Show your
working. [4 marks]
The cost-benefit matrix for the next year without the marketing campaign is:
                     Churn=Yes       Churn=No
Model predicted Yes  $120 / 2 = $60  $120
Model predicted No   $120 / 2 = $60  $120
g) Calculate the expected annual income WITHOUT the marketing campaign. [4 marks]
h) Would you recommend that Socks4You go ahead with the marketing
campaign? Explain your reasoning. [2 marks]
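The calculation behind parts f) and g) can be sketched by weighting each cell of a cost-benefit matrix by the number of customers in the matching cell of the confusion matrix. This sketch assumes that ‘expected income’ means the total over 1000 customers distributed as in the historical confusion matrix; the function name expected_income and the cell labels are hypothetical.

```python
# Customers per cell of the confusion matrix: (model prediction, actual churn).
counts = {
    ("pred_yes", "churn_yes"): 500,
    ("pred_yes", "churn_no"): 200,
    ("pred_no", "churn_yes"): 100,
    ("pred_no", "churn_no"): 200,
}

# Annual income per customer in dollars, taken from the two cost-benefit matrices.
with_campaign = {
    ("pred_yes", "churn_yes"): 80,
    ("pred_yes", "churn_no"): 110,
    ("pred_no", "churn_yes"): 60,
    ("pred_no", "churn_no"): 120,
}
without_campaign = {
    ("pred_yes", "churn_yes"): 60,
    ("pred_yes", "churn_no"): 120,
    ("pred_no", "churn_yes"): 60,
    ("pred_no", "churn_no"): 120,
}


def expected_income(counts, income_per_customer):
    """Sum of (customers in cell) * (income per customer in that cell)."""
    return sum(counts[cell] * income_per_customer[cell] for cell in counts)


income_with = expected_income(counts, with_campaign)
income_without = expected_income(counts, without_campaign)
print(income_with, income_without, income_with - income_without)
```

Dividing either total by 1000 gives the expected income per customer, if a per-customer figure is preferred.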
Sample Solution