Mathematics, Computing and Technology Introducing statistics

The end-of-module assessment (EMA) consists of two parts: this written EMA and the iCME (called ‘iCME 81’).
Before you begin work on this written EMA, please read the instructions on the module website for preparing and submitting the EMA. Note in particular:
• you should send this part of the EMA to the address provided by the Assessment Handling Operations Office at The Open University (not your tutor) or submit it electronically using the University’s online TMA/EMA service
• your tutor is not permitted to grant an extension for your EMA.
If you submit the EMA electronically, then you must submit it in PDF format.
This part of the EMA is marked out of 70. The marks allocated to each part of each question are indicated in brackets in the margin.
The Minitab files that you require for this part of the EMA should be downloaded from the module website.
Unless otherwise specified in the question, it is up to you whether you use Minitab to arrive at your answer.
Remember that this is just one part of the EMA, and you must submit both parts of the EMA by the respective cut-off dates.

Copyright © 2016 The Open University WEB 04408 3
M140 Written EMA
This assignment covers the whole module.
In this written EMA you will analyse some data from a study of certain snail populations in Europe.
The snails were all from the same species, Cepaea nemoralis (brown-lipped snail). These snails are genetically quite diverse, and that shows up in their shells. An individual snail’s shell can be one of three different colours (yellow, pink, brown) and can have different numbers of dark bands running round it (usually none, one or five). Researchers gathered samples of snails from a number of different locations, and for each location (which was identified by an ID number) counted how many were of each of the colours and banding patterns.
Question 1 – 3 marks
Data on a sample of locations in Germany and Switzerland are given in the file snails.mtw. In this file there are the following variables.
• id: a number identifying the location where the snails were found.
• lat: the (geographical) latitude of the location, in degrees north (so the larger the number, the further north the location is, and the smaller the number, the further south it is).
• long: the (geographical) longitude of the location, in degrees east (so the larger the number, the further east the location is, and the smaller the number, the further west it is).
• country: ‘1’ – the location is in Germany, ‘2’ – the location is in Switzerland.
• habitat: the type of vegetation at the location, coded as ‘1’ – grassland, ‘2’ – hedgerow or tall herbaceous plants, ‘3’ – woodland or scrub.
• total: the total number of Cepaea nemoralis snails counted at the location.
• pcty: the percentage of the snails that had yellow shells.
• pctnb: the percentage of the snails that had no dark bands on their shells.
(a) Give the id of the location that was the furthest west. That is, find
the location for which the longitude was smallest. [1]
(b) What was the percentage of yellow snails at the location that you identified in part (a)? Was it also the location that had the highest
percentage of yellow snails? [2]
Question 2 – 5 marks
Using the data given in snails.mtw, investigate the distribution of the percentage of snails with no dark bands, by doing the following.
(a) Using Minitab, produce a stemplot of the percentage of snails with no dark bands. Include this stemplot in your answer. (In doing this, you
should leave the Increment field blank and the Trim outliers
option unselected.) [1]
(b) Use your stemplot to describe the shape of the distribution of the percentage of snails with no dark bands in this sample. Justify your
answers. [4]
Question 3 – 15 marks
One theory about the evolution of these snails predicted that the further east one goes within Germany and Switzerland, the lower will be the proportion of snails with yellow shells. On the basis of the data in snails.mtw, a researcher wishes to construct a model to predict the percentage of yellow snails for a location on the basis of its longitude.
(a) Briefly explain why, in this context, the researcher should treat the longitude of the location as an explanatory variable, and the
percentage of yellow snails as the response variable. [1]
(b) Obtain a scatterplot of the percentage of yellow snails against the longitude, and include it in your answer. Interpret this scatterplot,
giving reasons for your interpretation. [9]
(c) What is the equation of the least squares regression line fitted to these
data? [1]
(d) Using the line fitted in part (c), predict the percentage of snails with yellow shells in a location with longitude 14.5 degrees east. Also state the 95% confidence interval for the mean percentage of yellow snails
for locations at longitude 14.5 degrees east. Give both of your answers
rounded to one decimal place. [2]
(e) Ljubljana, which is in the country of Slovenia, lies at longitude 14.5 degrees east. Would your calculations in part (d) be of use in
predicting the percentage of yellow snails at a location near
Ljubljana? Briefly explain why or why not. [2]
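The least squares calculation behind parts (c) and (d) can be sketched in a few lines of Python. This is only an illustration of the standard formulas, not the Minitab procedure the module expects. The function names and the five data points are hypothetical placeholders and are not taken from snails.mtw. The sketch does not cover the confidence interval asked for in part (d).

```python
# Least squares regression sketch.
# ASSUMPTION: the data below are made-up placeholder values,
# NOT the contents of snails.mtw.

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted response at explanatory value x."""
    return intercept + slope * x


# Hypothetical longitudes (degrees east) and percentages of yellow snails.
example_long = [7.0, 8.0, 9.0, 10.0, 11.0]
example_pcty = [60.0, 55.0, 50.0, 45.0, 40.0]

a, b = least_squares(example_long, example_pcty)
print(f"fitted line: pcty = {a:.1f} + ({b:.1f}) * long")
print(f"prediction at long = 14.5: {predict(a, b, 14.5):.1f}")
```

Replacing the two placeholder lists with the long and pcty columns exported from the worksheet would give the fitted line for the real data.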

Question 4 – 14 marks
For the data in snails.mtw, a contingency table of the countries of the locations and the type of habitat is given below.
Country       Grassland   Hedgerow or tall     Woodland or   Total
                          herbaceous plants    scrub
Germany       13          22                   1             36
Switzerland   3           13                   1             17
Total         16          35                   2             53
(a) Explain why this table is correctly described as a contingency table. [3]
(b) What is the probability that a randomly selected location from this sample was from a grassland habitat? Give your answer to three
decimal places. [1]
(c) If it is known that a randomly selected location from this sample is on grassland, what is the probability that it is in Switzerland? Give your
answer to three decimal places. [1]
(d) Suppose now that a researcher is interested in using this sample to investigate the following question:
Does the pattern of habitats at sites where Cepaea nemoralis snails were sampled differ between Germany and Switzerland?
(i) Write down suitable null and alternative hypotheses. [2]
(ii) Without doing any detailed calculations, explain why it is not valid to apply the χ² test for contingency tables to the data as
they are in the table at the start of this question. [1]
(iii) It is valid to apply the χ² test for contingency tables if the different types of habitat are combined by leaving the grassland type as it is, and combining the hedgerow etc. type with the woodland etc. type.
With the habitat types combined in the way described, use the χ² test for contingency tables to test the hypotheses that you wrote down in part (d)(i). To help you, the table with the habitat types combined is given in the file habitats.mtw. Make sure to include
in your answer: [6]
• a table of the expected values (with the expected values given to two decimal places)
• the value of the test statistic
• the degrees of freedom
• the p-value or the values of CV5 and CV1
• your conclusion from the test.
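The arithmetic for part (d)(iii) can be sketched as follows. This is a hedged illustration, not a model answer. The observed counts are derived from the table at the start of this question by merging the hedgerow and woodland columns, on the assumption that habitats.mtw holds the same combined counts. The function names are hypothetical. The critical values quoted are the standard ones for one degree of freedom.

```python
# Chi-squared test sketch for the combined 2 x 2 habitat table.
# ASSUMPTION: counts are derived from the Question 4 table by merging
# the hedgerow and woodland columns; habitats.mtw is assumed to match.

observed = [
    [13, 22 + 1],  # Germany: grassland, other habitats
    [3, 13 + 1],   # Switzerland: grassland, other habitats
]


def expected_counts(table):
    """Expected value for each cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


expected = expected_counts(observed)
chi2 = chi_squared(observed)
df = degrees_of_freedom(observed)

# Standard chi-squared critical values for 1 degree of freedom.
CV5, CV1 = 3.841, 6.635

for row in expected:
    print([round(e, 2) for e in row])
print(f"chi-squared = {chi2:.3f}, df = {df}")
print("reject H0 at the 5% level:", chi2 >= CV5)
print("reject H0 at the 1% level:", chi2 >= CV1)
```

The comparison with CV5 and CV1 gives only the mechanical outcome of the test. The conclusion still has to be stated in terms of habitats and countries.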
Question 5 – 11 marks
An ecologist has a theory that there will tend to be fewer yellow snails in open locations like grassland, because yellow snails are easier to see, and will be subject to more predation by birds in places where they can easily be seen.
One way of investigating this is to see whether, on the basis of the sample of data in snails.mtw, the percentage of yellow snails in locations with grassland habitat is smaller than that in the other habitat types.
The percentages of yellow snails at each of the locations in the file snails.mtw are also given in the file yellow.mtw. In yellow.mtw the percentages of yellow snails in grassland habitats are in the column grassland, and the percentages in the other habitat types (hedgerow, woodland etc.) are in the column other. Using this file, or otherwise, do the following.
(a) Obtain boxplots of the percentages of yellow snails in grassland habitats and the percentages of yellow snails in non-grassland habitats. You should ensure that the boxplots are horizontal, have the same scale and are drawn on one diagram. You should prepare these boxplots ready for insertion in a report. That is, ensure that the title
and the label for the horizontal axis are clear and informative. Include
the finished boxplots in your answer. [5]
(b) Complete the following table of summary statistics for the percentages of yellow snails at different locations. (You should give the exact
numbers of locations, and all the other values rounded to one decimal
place.) [2]

                       Grassland habitat   Other habitats
Number
Mean
Median
Standard deviation
Interquartile range
Range

(c) Using your answers to both parts (a) and (b), does the percentage of yellow snails in grassland habitats appear to be lower, on average,
than the percentage of yellow snails in the other habitat types? Justify your
opinion. [3]
(d) Even though the ecologist’s theory predicted that the mean percentage of yellow snails would differ between the habitat types in a particular direction, it was decided to perform a two-sided two-sample t-test assuming a pooled variance using the data in yellow.mtw, with
hypotheses H0 : µG = µO and H1 : µG ≠ µO, where µG and µO are
respectively the population mean percentages of yellow snails in
grassland habitats and in other habitats. How many degrees of
freedom are there for the relevant t distribution? [1]
Question 6 – 9 marks
The ecologist has data from large numbers of surveys of Cepaea nemoralis snails in Germany carried out in the nineteenth century, in locations generally similar to those represented in snails.mtw. In those data, the mean percentage of snails with no bands on their shells was 20.1.
In this question you will investigate whether the data in snails.mtw provide evidence that the mean percentage of unbanded snails at locations in Germany at the time when the data in snails.mtw were obtained (around 2008) had changed from the observed level in the nineteenth century. (Note that in the sample of 36 locations in Germany given in snails.mtw, the mean of the percentages of unbanded snails is 27.55%, and the standard deviation of these percentages is 21.40%.)
(a) Write down suitable null and alternative hypotheses, stating clearly
the meanings of any symbols that you use. [2]
(b) You are going to use a one-sample z-test to test the hypotheses that you wrote down in part (a), using the data in snails.mtw. Explain
why any necessary assumptions are valid regardless of the distribution
of percentages of unbanded snails across locations. [2]
(c) Use the one-sample z-test to test the hypotheses that you wrote down in part (a).
In your answer, make sure to include the following:
• the estimated standard error
• the value of the test statistic
• the p-value or the values of CV5 and CV1
• what conclusions can be drawn from the results of this test. [5]
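The calculation in part (c) can be sketched from the summary figures quoted at the start of this question (36 locations, sample mean 27.55, sample standard deviation 21.40, nineteenth-century mean 20.1). This is a hedged illustration of the one-sample z-test arithmetic only. The variable names are hypothetical, and the critical values used are the usual two-sided ones for the standard normal distribution.

```python
# One-sample z-test sketch using the summary statistics quoted in Question 6.
from math import sqrt
from statistics import NormalDist

n = 36               # number of German locations in snails.mtw
sample_mean = 27.55  # mean percentage of unbanded snails in the sample
sample_sd = 21.40    # standard deviation of those percentages
mu0 = 20.1           # nineteenth-century mean (value under the null hypothesis)

# Estimated standard error of the sample mean.
ese = sample_sd / sqrt(n)

# Test statistic: distance of the sample mean from mu0 in standard errors.
z = (sample_mean - mu0) / ese

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Usual two-sided critical values at the 5% and 1% levels.
CV5, CV1 = 1.96, 2.58

print(f"ESE = {ese:.3f}")
print(f"z = {z:.3f}")
print(f"p-value = {p_value:.4f}")
print("reject H0 at the 5% level:", abs(z) >= CV5)
print("reject H0 at the 1% level:", abs(z) >= CV1)
```

The printed comparisons give only the mechanical outcome. The written conclusion should still say what the result implies about change since the nineteenth century.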
Question 7 – 9 marks
For the purpose of checking the data quality, the organisation that sponsored the research that produced the data in snails.mtw wants to investigate in more detail a sample of ten of the locations represented in the file.
(a) Using simple random sampling, select the sample of ten locations to be studied. You should use the random number table given in the appendix to Unit 4, starting at row 80. In your answer, you should outline the method that you used to obtain the sample, and detail which random numbers you generate (including any that you discard). [3]
(b) Why is the sample that you selected in part (a) not guaranteed to be a representative sample with respect to the countries of the locations? [2]
(c) Outline a sampling method that could be used that would guarantee that the sample is as representative as possible with respect to the
countries. [2]
(d) One variable about each location that could usefully be collected in such a study is how easy it was to get access to the site for sampling snails, split into the following categories: ‘very easy’, ‘easy’, ‘moderate’, ‘hard’, ‘very hard’.
(i) Give a reason why this variable could be regarded as subjective
data. [1]
(ii) Give a reason why this variable would be regarded as ordinal data
but not interval scale data. [1]
Question 8 – 4 marks
The ecologist would like to carry out an experiment over a number of years to see whether the percentage of yellow snails in a grassland location will increase over time if the vegetation in the area is allowed to grow up to provide more cover. The ecologist needs to decide which of the following to use.
• A matched pairs design, where the ecologist uses a number of grassland locations where there are Cepaea nemoralis snails, records the percentage of yellow snails at each, plants shrubs at the locations to provide more cover, and comes back five years later after the shrubs have grown up and again records the percentages of yellow snails.
• A group comparative design, where the ecologist uses a number of grassland locations where there are Cepaea nemoralis snails, chooses half of them at random and plants shrubs there to provide more cover, and comes back five years later and records the percentages of yellow snails at all the locations (whether or not shrubs were planted).
(a) Give one advantage and one disadvantage of the matched pairs design
compared to the group comparative design. [2]
(b) The ecologist decides to use the matched pairs design. Suggest a statistical test that is likely to be suitable for analysing the data from
this experiment. [1]
(c) Why might the test that you suggested in part (b) turn out not to be
suitable after all? [1]
